1. 25 Jul 2022, 3 commits
  2. 16 Jul 2022, 2 commits
  3. 09 Jul 2022, 1 commit
  4. 21 Jun 2022, 2 commits
    • btrfs: zoned: prevent allocation from previous data relocation BG · 343d8a30
      Authored by Naohiro Aota
      After commit 5f0addf7 ("btrfs: zoned: use dedicated lock for data
      relocation"), we observe IO errors on e.g. btrfs/232 like below.
      
        [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
        <snip>
        [09.9][T4038707] Call Trace:
        [09.5][T4038707]  <TASK>
        [09.3][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
        [09.6][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
        [09.2][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
        [09.3][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
        [09.5][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
        [09.2][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
        [09.4][T4038707]  ? lock_downgrade+0x6a0/0x6a0
        [09.2][T4038707]  ? orc_find.part.0+0x1ed/0x300
        [09.5][T4038707]  ? __module_address.part.0+0x25/0x300
        [09.0][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
        <snip>
        [09.4][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
        [09.9][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
        [09.4][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
        [09.9][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      
      The IO errors occur when we allocate a regular extent in a previous data
      relocation block group.
      
      On zoned btrfs, we use a dedicated block group to relocate a data
      extent. Thus, we allocate relocating data extents (pre-alloc) only from
      the dedicated block group and vice versa. Once the free space in the
      dedicated block group gets tight, a relocating extent may not fit into
      the block group. In that case, we need to switch the dedicated block
      group to the next one. Then, the previous one is now freed up for
      allocating a regular extent. The BG no longer has enough space for the
      relocating extent, but there is still room to allocate a smaller
      extent. Now the problem happens. By allocating a regular extent while
      nocow IOs for the relocation are still on-going, we will issue WRITE IOs
      (for relocation) and ZONE APPEND IOs (for the regular writes) at the
      same time. Those mixed IOs confuse the write pointer and cause the
      unaligned write errors.
      
      This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
      btrfs_block_group. We set this bit before releasing the dedicated block
      group, and no extents are allocated from a block group having this bit
      set. This bit is similar to setting block_group->ro, but differs from
      it by still allowing nocow writes to start.
      
      Once all the nocow IO for relocation is done (hooked from
      btrfs_finish_ordered_io), we reset the bit to release the block group for
      further allocation.
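      The gating described above can be sketched in plain C (a toy model, not
      btrfs code; the struct and helpers below are hypothetical stand-ins for
      btrfs_block_group and the allocator checks):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Toy model: the flag is set before the dedicated relocation BG is
       * released, the allocator refuses extents from a flagged BG, and the
       * flag is cleared once the relocation ordered IO has finished. */
      struct block_group {
          bool ro;                        /* read-only: blocks all writes  */
          bool zoned_data_reloc_ongoing;  /* blocks new allocations only   */
      };

      /* May a fresh extent be allocated from this block group? */
      static bool can_allocate_from(const struct block_group *bg)
      {
          return !bg->ro && !bg->zoned_data_reloc_ongoing;
      }

      /* May an already prepared nocow write start here?  Unlike bg->ro,
       * the relocation flag must not stop it. */
      static bool can_start_nocow_write(const struct block_group *bg)
      {
          return !bg->ro;
      }

      int main(void)
      {
          struct block_group bg = { 0 };

          /* Relocation BG gets too full: mark it before releasing it. */
          bg.zoned_data_reloc_ongoing = true;
          assert(!can_allocate_from(&bg));     /* no regular extents here   */
          assert(can_start_nocow_write(&bg));  /* pending nocow IO still ok */

          /* Hooked from ordered IO completion: release the BG again. */
          bg.zoned_data_reloc_ongoing = false;
          assert(can_allocate_from(&bg));
          printf("ok\n");
          return 0;
      }
      ```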
      
      Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      343d8a30
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Authored by Filipe Manana
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. This is because if the transaction
      we used for partially modifying the inode gets committed by someone after
      we release it and before we finish the rest of the range, a power failure
      happens, then after mounting the filesystem our inode has an outdated
      iversion and mtime/ctime, corresponding to the values it had before we
      changed it.
      
      So add the missing iversion and mtime/ctime updates.
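      The per-iteration update can be sketched as follows (userspace model;
      `inode_model` and the helpers are hypothetical, not the btrfs API):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>

      /* Toy model: every transaction handle that drops/replaces file extent
       * items must also bump the iversion and refresh mtime/ctime before
       * updating the inode item, so a commit plus power failure between
       * iterations never persists stale values. */
      struct inode_model {
          uint64_t iversion;
          time_t mtime, ctime;
      };

      static void update_inode_times(struct inode_model *inode, time_t now)
      {
          inode->iversion++;
          inode->mtime = now;
          inode->ctime = now;
      }

      /* One iteration of replacing extents; returns 1 while more batches
       * remain (a stand-in for the -ENOSPC release-and-retry loop). */
      static int replace_extents_step(struct inode_model *inode, int *left,
                                      time_t now)
      {
          (*left)--;                      /* drop one batch of extent items */
          update_inode_times(inode, now); /* the fix: update on EVERY step  */
          return *left > 0;
      }

      int main(void)
      {
          struct inode_model inode = { .iversion = 7 };
          int batches = 3;

          while (replace_extents_step(&inode, &batches, 1000))
              ;
          assert(inode.iversion == 10);   /* bumped once per iteration */
          assert(inode.mtime == 1000 && inode.ctime == 1000);
          printf("iversion=%llu\n", (unsigned long long)inode.iversion);
          return 0;
      }
      ```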
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      983d8209
  5. 18 May 2022, 1 commit
  6. 16 May 2022, 31 commits
    • btrfs: allocate the btrfs_dio_private as part of the iomap dio bio · 642c5d34
      Authored by Christoph Hellwig
      Create a new bio_set that contains all the per-bio private data needed
      by btrfs for direct I/O and tell the iomap code to use that instead of
      separately allocating the btrfs_dio_private structure.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      642c5d34
    • btrfs: move struct btrfs_dio_private to inode.c · a3e171a0
      Authored by Christoph Hellwig
      The btrfs_dio_private structure is only used in inode.c, so move the
      definition there.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a3e171a0
    • btrfs: remove the disk_bytenr in struct btrfs_dio_private · acb8b52a
      Authored by Christoph Hellwig
      This field is never used, so remove it. Last use was probably in
      23ea8e5a ("Btrfs: load checksum data once when submitting a direct
      read io").
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      acb8b52a
    • btrfs: allocate dio_data on stack · 491a6d01
      Authored by Christoph Hellwig
      Make use of the new iomap_iter->private field to avoid a memory
      allocation per iomap range.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      491a6d01
    • iomap: add per-iomap_iter private data · 786f847f
      Authored by Christoph Hellwig
      Allow the file system to keep state for all iterations.  For now only
      wire it up for direct I/O as there is an immediate need for it there.
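      A minimal model of the idea, assuming nothing about the real iomap
      structures (`iter_ctx` and `fs_dio_state` are made up for illustration):

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Sketch (not the real iomap API): an iteration context carries a
       * caller-owned `private` pointer, so the filesystem can keep per-DIO
       * state on the caller's stack instead of allocating per mapped range. */
      struct iter_ctx {
          int range;      /* which mapping range we are on              */
          void *private;  /* filesystem state, valid across all ranges  */
      };

      struct fs_dio_state {
          int ranges_seen;
      };

      static void iterate_ranges(struct iter_ctx *it, int nr_ranges)
      {
          for (it->range = 0; it->range < nr_ranges; it->range++) {
              /* The fs callback would run here; it sees the same state
               * object every time, no allocation per range required. */
              struct fs_dio_state *st = it->private;
              st->ranges_seen++;
          }
      }

      int main(void)
      {
          struct fs_dio_state state = { 0 };        /* lives on the stack */
          struct iter_ctx it = { .private = &state };

          iterate_ranges(&it, 4);
          assert(state.ranges_seen == 4);
          printf("%d\n", state.ranges_seen);
          return 0;
      }
      ```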
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      786f847f
    • btrfs: add a btrfs_dio_rw wrapper · 36e8c622
      Authored by Christoph Hellwig
      Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
      isolated in inode.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36e8c622
    • btrfs: rename bio_flags in parameters and switch type · cb3a12d9
      Authored by David Sterba
      Several functions take a bio_flags parameter that was simplified to just
      the compress type; unify it and change the type accordingly.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb3a12d9
    • btrfs: simplify handling of bio_ctrl::bio_flags · 2a5232a8
      Authored by David Sterba
      The bio_flags are used only to encode the compression and there are no
      other EXTENT_BIO_* flags, so the compress type can be stored directly.
      The struct member name is left unchanged and will be cleaned in later
      patches.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2a5232a8
    • btrfs: fix deadlock between concurrent dio writes when low on free data space · f5585f4f
      Authored by Filipe Manana
      When reserving data space for a direct IO write we can end up deadlocking
      if we have multiple tasks attempting a write to the same file range, there
      are multiple extents covered by that file range, we are low on available
      space for data and the writes don't expand the inode's i_size.
      
      The deadlock can happen like this:
      
      1) We have a file with an i_size of 1M, at offset 0 it has an extent with
         a size of 128K and at offset 128K it has another extent also with a
         size of 128K;
      
      2) Task A does a direct IO write against file range [0, 256K), and because
         the write is within the i_size boundary, it takes the inode's lock (VFS
         level) in shared mode;
      
      3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
         then gets the extent map for the extent covering the range [0, 128K).
         At btrfs_get_blocks_direct_write(), it creates an ordered extent for
         that file range ([0, 128K));
      
      4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
         range [0, 256K);
      
      5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
         range [128K, 256K), and locks the file range [128K, 256K);
      
      6) Task B starts a direct IO write against file range [0, 256K) as well.
         It also locks the inode in shared mode, as it's within the i_size limit,
         and then tries to lock file range [0, 256K). It is able to lock the
         subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
         as it is currently locked by task A;
      
      7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
         space. Because we are low on available free space, it triggers the
         async data reclaim task, and waits for it to reserve data space;
      
      8) The async reclaim task decides to wait for all existing ordered extents
         to complete (through btrfs_wait_ordered_roots()).
         It finds the ordered extent previously created by task A for the file
         range [0, 128K) and waits for it to complete;
      
      9) The ordered extent for the file range [0, 128K) can not complete
         because it blocks at btrfs_finish_ordered_io() when trying to lock the
         file range [0, 128K).
      
         This results in a deadlock, because:
      
         - task B is holding the file range [0, 128K) locked, waiting for the
           range [128K, 256K) to be unlocked by task A;
      
         - task A is holding the file range [128K, 256K) locked and it's waiting
           for the async data reclaim task to satisfy its space reservation
           request;
      
         - the async data reclaim task is waiting for ordered extent [0, 128K)
           to complete, but the ordered extent can not complete because the
           file range [0, 128K) is currently locked by task B, which is waiting
           on task A to unlock file range [128K, 256K) and task A waiting
           on the async data reclaim task.
      
         This results in a deadlock between 4 tasks: task A, task B, the async
         data reclaim task and the task doing ordered extent completion (a work
         queue task).
      
      This type of deadlock can sporadically be triggered by the test case
      generic/300 from fstests, and results in a stack trace like the following:
      
      [12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
      [12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
      [12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
      [12084.036599] Call Trace:
      [12084.036601]  <TASK>
      [12084.036606]  __schedule+0x3cb/0xed0
      [12084.036616]  schedule+0x4e/0xb0
      [12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
      [12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
      [12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.036719]  ? lock_is_held_type+0xe8/0x140
      [12084.036727]  process_one_work+0x252/0x5a0
      [12084.036736]  ? process_one_work+0x5a0/0x5a0
      [12084.036738]  worker_thread+0x52/0x3b0
      [12084.036743]  ? process_one_work+0x5a0/0x5a0
      [12084.036745]  kthread+0xf2/0x120
      [12084.036747]  ? kthread_complete_and_exit+0x20/0x20
      [12084.036751]  ret_from_fork+0x22/0x30
      [12084.036765]  </TASK>
      [12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
      [12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
      [12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
      [12084.039551] Call Trace:
      [12084.039553]  <TASK>
      [12084.039557]  __schedule+0x3cb/0xed0
      [12084.039566]  schedule+0x4e/0xb0
      [12084.039569]  schedule_timeout+0xed/0x130
      [12084.039573]  ? mark_held_locks+0x50/0x80
      [12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
      [12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
      [12084.039585]  __wait_for_common+0xaf/0x1f0
      [12084.039587]  ? usleep_range_state+0xb0/0xb0
      [12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
      [12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
      [12084.039670]  flush_space+0x25b/0x630 [btrfs]
      [12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
      [12084.039747]  process_one_work+0x252/0x5a0
      [12084.039756]  ? process_one_work+0x5a0/0x5a0
      [12084.039758]  worker_thread+0x52/0x3b0
      [12084.039762]  ? process_one_work+0x5a0/0x5a0
      [12084.039765]  kthread+0xf2/0x120
      [12084.039766]  ? kthread_complete_and_exit+0x20/0x20
      [12084.039770]  ret_from_fork+0x22/0x30
      [12084.039783]  </TASK>
      [12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
      [12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
      [12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [12084.042461] Call Trace:
      [12084.042463]  <TASK>
      [12084.042471]  __schedule+0x3cb/0xed0
      [12084.042485]  schedule+0x4e/0xb0
      [12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
      [12084.042656]  ? lock_is_held_type+0xe8/0x140
      [12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [12084.042716]  ? lock_is_held_type+0xe8/0x140
      [12084.042727]  process_one_work+0x252/0x5a0
      [12084.042742]  worker_thread+0x52/0x3b0
      [12084.042750]  ? process_one_work+0x5a0/0x5a0
      [12084.042754]  kthread+0xf2/0x120
      [12084.042757]  ? kthread_complete_and_exit+0x20/0x20
      [12084.042763]  ret_from_fork+0x22/0x30
      [12084.042783]  </TASK>
      [12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
      [12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
      [12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
      [12084.045248] Call Trace:
      [12084.045250]  <TASK>
      [12084.045254]  __schedule+0x3cb/0xed0
      [12084.045263]  schedule+0x4e/0xb0
      [12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
      [12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
      [12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
      [12084.045370]  ? lock_is_held_type+0xe8/0x140
      [12084.045378]  iomap_iter+0x184/0x4c0
      [12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
      [12084.045406]  iomap_dio_rw+0xa/0x30
      [12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
      [12084.045440]  aio_write+0xfa/0x2c0
      [12084.045448]  ? __might_fault+0x2a/0x70
      [12084.045451]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045455]  ? lock_release+0x153/0x4a0
      [12084.045463]  io_submit_one+0x615/0x9f0
      [12084.045467]  ? __might_fault+0x2a/0x70
      [12084.045469]  ? kvm_sched_clock_read+0x14/0x40
      [12084.045478]  __x64_sys_io_submit+0x83/0x160
      [12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
      [12084.045489]  do_syscall_64+0x3b/0x90
      [12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [12084.045521] RIP: 0033:0x7fa76511af79
      [12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
      [12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
      [12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
      [12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
      [12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
      [12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
      [12084.045561]  </TASK>
      
      Fix this issue by always reserving data space before locking a file range
      at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't
      error out immediately - instead after locking the file range, check if we
      can do a NOCOW write, and if we can we don't error out since we don't need
      to allocate a data extent, however if we can't NOCOW then error out with
      -ENOSPC. This also implies that we may end up reserving space when it's
      not needed because the write will end up being done in NOCOW mode - in that
      case we just release the space after we noticed we did a NOCOW write - this
      is the same type of logic that is done in the path for buffered IO writes.
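      The reworked ordering can be sketched as follows (a toy model;
      `dio_begin` is a hypothetical stand-in for the btrfs_dio_iomap_begin()
      logic, and -28 models -ENOSPC):

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      static bool reserve_data_space(bool space_available)
      {
          return space_available;
      }

      /* Returns 0 on success, -28 (ENOSPC) if a COW write is needed but
       * nothing could be reserved.  The key point: reservation happens
       * BEFORE the file range would be locked, so we never wait on the
       * reclaim task while holding the range lock. */
      static int dio_begin(bool space_available, bool can_nocow,
                           bool *release_unused)
      {
          bool reserved = reserve_data_space(space_available); /* pre-lock */
          /* ... lock_file_range() would happen here ... */
          if (can_nocow) {
              /* NOCOW needs no new data extent; give back an unneeded
               * reservation, same logic as the buffered IO path. */
              *release_unused = reserved;
              return 0;
          }
          if (!reserved)
              return -28; /* COW write and no space reserved: -ENOSPC */
          *release_unused = false;
          return 0;
      }

      int main(void)
      {
          bool release;

          assert(dio_begin(true, false, &release) == 0 && !release);
          assert(dio_begin(false, true, &release) == 0);   /* NOCOW saves us */
          assert(dio_begin(true, true, &release) == 0 && release);
          assert(dio_begin(false, false, &release) == -28);
          printf("ok\n");
          return 0;
      }
      ```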
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f5585f4f
    • btrfs: derive compression type from extent map during reads · 1d8fa2e2
      Authored by Goldwyn Rodrigues
      Derive the compression type from extent map as opposed to the bio flags
      passed. This makes it more precise and not reliant on function
      parameters.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d8fa2e2
    • btrfs: turn fs_roots_radix in btrfs_fs_info into an XArray · 48b36a60
      Authored by Gabriel Niebler
      … rename it to simply fs_roots and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand, as
      it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also do some refactoring, esp. where the API change requires largely
      rewriting some functions, anyway.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      48b36a60
    • btrfs: turn delayed_nodes_tree into an XArray · 253bf575
      Authored by Gabriel Niebler
      … in the btrfs_root struct and adjust all usages of this object to use
      the XArray API, because it is notionally easier to use and understand,
      as it provides array semantics, and also takes care of locking for us,
      further simplifying the code.
      
      Also use the opportunity to do some light refactoring.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      253bf575
    • btrfs: do not return errors from submit_bio_hook_t instances · ad357938
      Authored by Christoph Hellwig
      Both btrfs_repair_one_sector and submit_bio_one, as the direct callers
      of the instances, ignore errors because they expect the methods
      themselves to call ->bi_end_io on error.  Remove the unused and
      dangerous return value.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad357938
    • btrfs: do not return errors from btrfs_submit_compressed_read · cb4411dd
      Authored by Christoph Hellwig
      btrfs_submit_compressed_read already calls ->bi_end_io on error and
      the caller must ignore the return value, so remove it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb4411dd
    • btrfs: move btrfs_readpage to extent_io.c · 7aab8b32
      Authored by Christoph Hellwig
      Keep btrfs_readpage next to btrfs_do_readpage and the other address
      space operations.  This allows keeping submit_one_bio and
      struct btrfs_bio_ctrl file-local in extent_io.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7aab8b32
    • btrfs: avoid double search for block group during NOCOW writes · 2306e83e
      Authored by Filipe Manana
      When doing a NOCOW write, either through direct IO or buffered IO, we do
      two lookups for the block group that contains the target extent: once
      when we call btrfs_inc_nocow_writers() and then later again when we call
      btrfs_dec_nocow_writers() after creating the ordered extent.
      
      The lookups require taking a lock and navigating the red black tree used
      to track all block groups, which can take a non-negligible amount of time
      for a large filesystem with thousands of block groups, as well as lock
      contention and cache line bouncing.
      
      Improve on this by having a single block group search: making
      btrfs_inc_nocow_writers() return the block group to its caller and then
      have the caller pass that block group to btrfs_dec_nocow_writers().
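      The pattern can be sketched as follows (illustrative userspace C; the
      lookup counter and structs are made up, not btrfs code):

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Sketch: instead of looking the block group up once in inc() and
       * again in dec(), inc() returns the group and dec() takes it. */
      struct block_group { int nocow_writers; };

      static int lookups;   /* counts the modeled rb-tree walks */

      static struct block_group *find_block_group(struct block_group *tree)
      {
          lookups++;        /* models the rb-tree search under a lock */
          return tree;
      }

      static struct block_group *inc_nocow_writers(struct block_group *tree)
      {
          struct block_group *bg = find_block_group(tree);

          bg->nocow_writers++;
          return bg;        /* hand it to the caller for the dec side */
      }

      static void dec_nocow_writers(struct block_group *bg)
      {
          bg->nocow_writers--;  /* no second search needed */
      }

      int main(void)
      {
          struct block_group bg = { 0 };
          struct block_group *held = inc_nocow_writers(&bg);

          /* ... create the ordered extent ... */
          dec_nocow_writers(held);
          assert(lookups == 1);           /* one search, not two */
          assert(bg.nocow_writers == 0);
          printf("lookups=%d\n", lookups);
          return 0;
      }
      ```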
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: remove search start argument from first_logical_byte()
        btrfs: use rbtree with leftmost node cached for tracking lowest block group
        btrfs: use a read/write lock for protecting the block groups tree
        btrfs: return block group directly at btrfs_next_block_group()
        btrfs: avoid double search for block group during NOCOW writes
      
      The following test was used to test these changes from a performance
      perspective:
      
         $ cat test.sh
         #!/bin/bash
      
         modprobe null_blk nr_devices=0
      
         NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
         mkdir $NULL_DEV_PATH
         if [ $? -ne 0 ]; then
             echo "Failed to create nullb0 directory."
             exit 1
         fi
         echo 2 > $NULL_DEV_PATH/submit_queues
         echo 16384 > $NULL_DEV_PATH/size # 16G
         echo 1 > $NULL_DEV_PATH/memory_backed
         echo 1 > $NULL_DEV_PATH/power
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
         LOOP_MNT="$MNT/loop"
         MOUNT_OPTIONS="-o ssd -o nodatacow"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [io_uring_writes]
         rw=randwrite
         fsync=0
         fallocate=posix
         group_reporting=1
         direct=1
         ioengine=io_uring
         iodepth=64
         bs=64k
         filesize=1g
         runtime=300
         time_based
         directory=$LOOP_MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         mkdir $LOOP_MNT
      
         truncate -s 4T $MNT/loopfile
         mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
         mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
      
         # Trigger the allocation of about 3500 data block groups, without
         # actually consuming space on underlying filesystem, just to make
         # the tree of block group large.
         fallocate -l 3500G $LOOP_MNT/filler
      
         fio /tmp/fio-job.ini
      
         umount $LOOP_MNT
         umount $MNT
      
         echo 0 > $NULL_DEV_PATH/power
         rmdir $NULL_DEV_PATH
      
      The test was run on a non-debug kernel (Debian's default kernel config),
      the result were the following.
      
      Before patchset:
      
        WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
      
      After patchset:
      
        WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
      
        +3.3% write throughput and +3.3% IO done in the same time period.
      
      The test has somewhat limited coverage scope, as with only NOCOW writes
      we get less contention on the red black tree of block groups, since we
      don't have the extra contention caused by COW writes, namely when
      allocating data extents, pinning and unpinning data extents; but on the
      other hand there's access to the tree in the NOCOW path, when
      incrementing a block group's number of NOCOW writers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2306e83e
    • btrfs: avoid double clean up when submit_one_bio() failed · c9583ada
      Authored by Qu Wenruo
      [BUG]
      When running generic/475 with 64K page size and 4K sector size, it has a
      very high chance (almost 100%) to hang, with mostly data page locked but
      no one is going to unlock it.
      
      [CAUSE]
      With commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads"), if we failed to lookup checksum due to metadata IO error, we
      will return error for btrfs_submit_data_bio().
      
      This will cause the page to be unlocked twice in btrfs_do_readpage():
      
       btrfs_do_readpage()
       |- submit_extent_page()
       |  |- submit_one_bio()
       |     |- btrfs_submit_data_bio()
       |        |- if (ret) {
       |        |-     bio->bi_status = ret;
       |        |-     bio_endio(bio); }
       |               In the endio function, we will call end_page_read()
       |               and unlock_extent() to cleanup the subpage range.
       |
       |- if (ret) {
       |-        unlock_extent(); end_page_read() }
                 Here we unlock the extent and cleanup the subpage range
                 again.
      
      For unlock_extent(), double unlocking is mostly safe.
      
      But for end_page_read() it is not, especially for the subpage case,
      as for subpage we call btrfs_subpage_end_reader() to decrease the
      reader number, and use that number to determine if we need to unlock
      the full page.
      
      If double accounted, the number can underflow and leave the page
      locked without anyone to unlock it.
      
      [FIX]
      The commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads") itself is completely fine; it's our existing code that does not
      properly handle the error from the bio submission hook.
      
      This patch makes submit_one_bio() return void so that the callers
      will never be able to do cleanup when the bio submission hook fails.
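      The underflow can be demonstrated with a toy counter (plain C model;
      `subpage` here is not the btrfs struct):

      ```c
      #include <assert.h>
      #include <stdio.h>

      /* Toy model of the subpage reader count: each completion decrements
       * it, and the last reader unlocks the page.  Double-accounting one
       * range makes the (unsigned) counter skip or wrap past zero. */
      struct subpage { unsigned readers; int page_locked; };

      static void end_page_read(struct subpage *sp)
      {
          sp->readers--;
          if (sp->readers == 0)
              sp->page_locked = 0;  /* last reader unlocks the page */
      }

      int main(void)
      {
          /* Correct accounting: two ranges, two completions. */
          struct subpage ok = { .readers = 2, .page_locked = 1 };
          end_page_read(&ok);
          end_page_read(&ok);
          assert(ok.readers == 0 && ok.page_locked == 0);

          /* Buggy double cleanup: range A is accounted twice, so the page
           * unlocks while range B is still outstanding, and B's completion
           * then wraps the counter instead of reaching zero. */
          struct subpage bad = { .readers = 2, .page_locked = 1 };
          end_page_read(&bad);   /* endio path for range A    */
          end_page_read(&bad);   /* caller cleans A up again  */
          end_page_read(&bad);   /* range B: counter wraps    */
          assert(bad.readers != 0);  /* never hits zero again */
          printf("ok\n");
          return 0;
      }
      ```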
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9583ada
    • btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode() · 49024388
      Authored by Filipe Manana
      We are still using the magic value of 2 at btrfs_create_new_inode(), but
      there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
      introduced in commit 528ee697 ("btrfs: put initial index value of a
      directory in a constant"). So change that to use the constant.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49024388
    • btrfs: do not test for free space inode during NOCOW check against file extent · a7bb6bd4
      Authored by Filipe Manana
      When checking if we can do a NOCOW write against a range covered by a file
      extent item, we do a quick check to determine if the inode's root was
      snapshotted in a generation older than the generation of the file extent
      item or not. This is to quickly determine if the extent is likely shared
      and avoid the expensive check for cross references (this was added in
      commit 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in
      nocow path")).
      
      We restrict that check to the case where the inode is not a free space
      inode (since commit 27a7ff55 ("btrfs: skip file_extent generation
      check for free_space_inode in run_delalloc_nocow")). That is because when
      we had the inode cache feature, inode caches were backed by a free space
      inode that belonged to the inode's root.
      
However, we don't have support for the inode cache feature since kernel
5.11, so we don't need this check anymore since free space inodes are
now always related to free space caches, which are always associated
with the root tree (which can't be snapshotted, and its last_snapshot
field is always 0).
      
      So remove that condition.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      a7bb6bd4
    • F
      btrfs: move common NOCOW checks against a file extent into a helper · 619104ba
Committed by Filipe Manana
      Verifying if we can do a NOCOW write against a range fully or partially
      covered by a file extent item requires verifying several constraints, and
      these are currently duplicated at two different places: can_nocow_extent()
      and run_delalloc_nocow().
      
      This change moves those checks into a common helper function to avoid
      duplication. It adds some comments and also preserves all existing
      behaviour like for example can_nocow_extent() treating errors from the
calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning
we cannot NOCOW, instead of propagating the error back to the caller.
      That specific behaviour is questionable but also reasonable to some
      degree.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      619104ba
    • S
      btrfs: factor out allocating an array of pages · dd137dd1
Committed by Sweet Tea Dorminy
      Several functions currently populate an array of page pointers one
      allocated page at a time. Factor out the common code so as to allow
      improvements to all of the sites at once.
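A minimal userspace sketch of the factored-out pattern, assuming an illustrative helper name and page size (this is not the kernel's btrfs_alloc_page_array() signature): allocate an array of page-sized buffers, unwinding the partial allocation on failure so the caller sees all-or-nothing.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE_SKETCH 4096	/* illustrative page size */

/* Allocate n page-sized buffers into pages[]; on failure, free the
 * buffers allocated so far and report an error (the kernel helper
 * would return -ENOMEM here). */
static int alloc_page_array(void **pages, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		pages[i] = malloc(PAGE_SIZE_SKETCH);
		if (!pages[i]) {
			/* Unwind the partial allocation. */
			while (i--) {
				free(pages[i]);
				pages[i] = NULL;
			}
			return -1;
		}
	}
	return 0;
}
```

Centralizing this loop means the error-unwind logic is written (and fixed) once instead of at every call site.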
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      dd137dd1
    • Y
      btrfs: remove unnecessary type casts · 0d031dc4
Committed by Yu Zhe
Explicit type casts are not necessary when converting from void * to
another pointer type.
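A small illustrative example (`struct item` and `make_item()` are made up for this snippet): in C, void * converts implicitly to any object pointer type, so a cast on the result of malloc() is redundant noise.

```c
#include <assert.h>
#include <stdlib.h>

struct item { int id; };

static struct item *make_item(int id)
{
	/* No cast needed: malloc() returns void *, which converts
	 * implicitly to struct item *. Writing
	 * (struct item *)malloc(...) compiles identically. */
	struct item *it = malloc(sizeof(*it));

	if (it)
		it->id = id;
	return it;
}
```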
Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      0d031dc4
    • Q
      btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine · fbca46eb
Committed by Qu Wenruo
The reason we only support a 64K page size for subpage is that with a
64K page size we can ensure that, no matter what the nodesize is, it
fits into one page.

When other page sizes come along, especially 16K, this restriction
becomes too limiting.

To remove the limitation, we allow the nodesize >= PAGE_SIZE case to go
through the non-subpage routine.  With this, we can allow a 4K
sectorsize on a 16K page size.

Although this introduces another, smaller limitation, that metadata can
not cross a page boundary, it is already met by most recent mkfs
versions.

Another small improvement is that we can avoid the subpage overhead for
metadata if nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.

Please note that this patch does not yet enable support for other page
sizes.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      fbca46eb
    • Q
      btrfs: replace memset with memzero_page in data checksum verification · b06660b5
Committed by Qu Wenruo
The original code resets the page to 0x1 for no apparent reason; it's
been like that since the initial 2007 code added in commit 07157aac
("Btrfs: Add file data csums back in via hooks in the extent map code").
      
      It could mean that a failed buffer can be detected from the data but
      that's just a guess and any value is good.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
      b06660b5
    • F
btrfs: avoid blocking on space reservation when doing nowait dio writes · d4135134
Committed by Filipe Manana
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, wait for ordered extents to complete, trigger transaction
      commits, etc, going against the semantics of a NOWAIT write.
      
So make the NOWAIT write path try to reserve all the metadata it needs
without blocking - if we get -ENOSPC or -EDQUOT then return -EAGAIN to
make the caller fall back to a blocking direct IO write.
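The error translation above can be sketched as a tiny helper (an illustrative sketch, not btrfs's actual code): reservation failures that would normally be handled by flushing space are mapped to -EAGAIN so the caller retries as a regular blocking write.

```c
#include <assert.h>
#include <errno.h>

/* Map a metadata/qgroup reservation result for the NOWAIT path:
 * -ENOSPC and -EDQUOT would normally trigger blocking flushing, so
 * report -EAGAIN instead; success and other errors pass through. */
static int nowait_map_reserve_error(int ret)
{
	if (ret == -ENOSPC || ret == -EDQUOT)
		return -EAGAIN;	/* fall back to blocking direct IO write */
	return ret;
}
```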
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
  btrfs: avoid blocking on space reservation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
The test was run on a 12 core box with 64G of RAM, using a non-debug
kernel config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      d4135134
    • F
      btrfs: avoid blocking when allocating context for nowait dio read/write · 4f208dcc
Committed by Filipe Manana
      When doing a NOWAIT direct IO read/write, we allocate a context object
      (struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
      waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
This is undesirable for the NOWAIT semantics, so do the allocation with
GFP_NOWAIT if we are serving a NOWAIT request, and if the allocation
fails return -EAGAIN, so that the caller can fall back to a blocking
context and retry the request.
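A userspace sketch of the same pattern (the kernel allocates struct btrfs_dio_data with GFP_NOWAIT vs GFP_NOFS; `alloc_dio_ctx()` and the calloc() stand-in for the kernel allocator are illustrative): a failed context allocation on a NOWAIT request is reported as -EAGAIN rather than blocking on reclaim.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct dio_ctx { int flags; };	/* stands in for struct btrfs_dio_data */

/* Allocate the per-request context; when serving a NOWAIT request, an
 * allocation failure becomes -EAGAIN so the caller retries in a
 * blocking context instead of waiting for memory here. */
static struct dio_ctx *alloc_dio_ctx(int nowait, int *err)
{
	struct dio_ctx *ctx = calloc(1, sizeof(*ctx)); /* ~ kzalloc() */

	if (!ctx) {
		*err = nowait ? -EAGAIN : -ENOMEM;
		return NULL;
	}
	*err = 0;
	return ctx;
}
```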
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      4f208dcc
    • F
      btrfs: release path earlier at can_nocow_extent() · 59d35c51
Committed by Filipe Manana
      At can_nocow_extent(), we are releasing the path only after checking if
      the block group that has the target extent is read only, and after
      checking if there's delalloc in the range in case our extent is a
preallocated extent. The read-only block group check can be expensive
if we have a very large filesystem with many block groups, and so can
the check for delalloc in the inode's io_tree in case the io_tree is
big due to IO on other file ranges.
      
      Our path is holding a read lock on a leaf and there's no need to keep
      the lock while doing those two checks, so release the path before doing
      them, immediately after the last use of the leaf.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      59d35c51
    • F
      btrfs: free path at can_nocow_extent() before checking for checksum items · c1a548db
Committed by Filipe Manana
      When we look for checksum items, through csum_exist_in_range(), at
      can_nocow_extent(), we no longer need the path that we have previously
      allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
      we also end up allocating a path, so we are adding unnecessary extra
      memory usage. So free the path before calling csum_exist_in_range().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      c1a548db
    • F
      btrfs: stop allocating a path when checking if cross reference exists · 1a89f173
Committed by Filipe Manana
      At btrfs_cross_ref_exist() we always allocate a path, but we really don't
      need to because all its callers (only 2) already have an allocated path
      that is not being used when they call btrfs_cross_ref_exist(). So change
      btrfs_cross_ref_exist() to take a path as an argument and update both
      its callers to pass in the unused path they have when they call
      btrfs_cross_ref_exist().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      1a89f173
    • F
      btrfs: avoid double nocow check when doing nowait dio writes · d7a8ab4e
Committed by Filipe Manana
When doing a NOWAIT direct IO write we are checking twice if we can NOCOW
the target file range using can_nocow_extent() - once at the very
beginning of the write path, at btrfs_write_check() via
check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
      
      The can_nocow_extent() function does a lot of expensive things - searching
      for the file extent item in the inode's subvolume tree, searching for the
      extent item in the extent tree, checking delayed references, etc, so it
      isn't a very cheap call.
      
We can remove the first check at btrfs_write_check(), and add there a
quick check to verify if the inode has the NODATACOW or PREALLOC flags,
and quickly bail out if it has neither of those flags, as that means we
have to COW and therefore can't comply with the NOWAIT semantics.
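The cheap pre-check can be sketched as a flag test (a hypothetical sketch; the SKETCH_* flag values and `nowait_write_precheck()` are made up for this snippet, not the kernel's inode flag bits): with neither NODATACOW nor PREALLOC set, a COW is unavoidable, so the NOWAIT write bails out with -EAGAIN before the expensive can_nocow_extent() machinery runs.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_INODE_NODATACOW 0x1u	/* illustrative flag values */
#define SKETCH_INODE_PREALLOC  0x2u

/* Quick NOWAIT pre-check: -EAGAIN when a COW is certain, 0 when NOCOW
 * is still possible and the full (expensive) check should run later. */
static int nowait_write_precheck(unsigned int inode_flags)
{
	if (!(inode_flags & (SKETCH_INODE_NODATACOW | SKETCH_INODE_PREALLOC)))
		return -EAGAIN;	/* must COW: cannot honor NOWAIT */
	return 0;
}
```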
      
      After this we do only one call to can_nocow_extent(), while we are at
      btrfs_get_blocks_direct_write(), where we have already locked the file
      range and we did a try lock on the range before, at
      btrfs_dio_iomap_begin() (since the previous patch in the series).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
      d7a8ab4e