1. 23 8月, 2021 3 次提交
    • B
      btrfs: initial fsverity support · 14605409
      Boris Burkov 提交于
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verity pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: NEric Biggers <ebiggers@google.com>
      Co-developed-by: NChris Mason <clm@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      14605409
    • B
      btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov 提交于
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum getting corrupted exactly badly).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading zeroes
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      77eea05e
    • Q
      btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo 提交于
      Since now we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mount.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There are still quite some PAGE_SIZE hard coded and quite some call
        sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This will screw up subpage helpers.
      
        Now for subpage RW mount, no matter what mount option or inode attr is
        set, all writes will not be compressed.  Although reading compressed
        data has no problem.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger more write than the range specified.
        In fallocate calls, this behavior can make us to writeback which can
        be inlined, before we enlarge the i_size.
      
        This is a very special corner case, and even current btrfs check won't
        report error on such inline extent + regular extent.
        But considering how much effort has been put to prevent such inline +
        regular, I'd prefer to cut off inline extent completely until we have
        a good solution.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      95ea0486
  2. 22 6月, 2021 1 次提交
  3. 21 6月, 2021 7 次提交
  4. 11 5月, 2021 1 次提交
    • R
      btrfs: handle transaction start error in btrfs_fileattr_set · 9b8a233b
      Ritesh Harjani 提交于
      Add error handling in btrfs_fileattr_set in case of an error while
      starting a transaction. This fixes btrfs/232 which otherwise used to
      fail with below signature on Power.
      
        btrfs/232 [ 1119.474650] run fstests btrfs/232 at 2021-04-21 02:21:22
        <...>
        [ 1366.638585] BUG: Unable to handle kernel data access on read at 0xffffffffffffff86
        [ 1366.638768] Faulting instruction address: 0xc0000000009a5c88
        cpu 0x0: Vector: 380 (Data SLB Access) at [c000000014f177b0]
            pc: c0000000009a5c88: btrfs_update_root_times+0x58/0xc0
            lr: c0000000009a5c84: btrfs_update_root_times+0x54/0xc0
            <...>
            pid   = 24881, comm = fsstress
      	   btrfs_update_inode+0xa0/0x140
      	   btrfs_fileattr_set+0x5d0/0x6f0
      	   vfs_fileattr_set+0x2a8/0x390
      	   do_vfs_ioctl+0x1290/0x1ac0
      	   sys_ioctl+0x6c/0x120
      	   system_call_exception+0x3d4/0x410
      	   system_call_common+0xec/0x278
      
      Fixes: 97fc2977 ("btrfs: convert to fileattr")
      Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9b8a233b
  5. 29 4月, 2021 1 次提交
    • F
      btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Filipe Manana 提交于
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
      
      The special cases for cloning inline extents were added in kernel 5.7 by
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: NWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f9baa501
  6. 21 4月, 2021 1 次提交
    • F
      btrfs: fix metadata extent leak after failure to create subvolume · 67addf29
      Filipe Manana 提交于
      When creating a subvolume we allocate an extent buffer for its root node
      after starting a transaction. We setup a root item for the subvolume that
      points to that extent buffer and then attempt to insert the root item into
      the root tree - however if that fails, due to ENOMEM for example, we do
      not free the extent buffer previously allocated and we do not abort the
      transaction (as at that point we did nothing that can not be undone).
      
      This means that we effectively do not return the metadata extent back to
      the free space cache/tree and we leave a delayed reference for it which
      causes a metadata extent item to be added to the extent tree, in the next
      transaction commit, without having backreferences. When this happens
      'btrfs check' reports the following:
      
        $ btrfs check /dev/sdi
        Opening filesystem to check...
        Checking filesystem on /dev/sdi
        UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
        [1/7] checking root items
        [2/7] checking extents
        ref mismatch on [30425088 16384] extent item 1, found 0
        backref 30425088 root 256 not referenced back 0x564a91c23d70
        incorrect global backref count on 30425088 found 1 wanted 0
        backpointer mismatch on [30425088 16384]
        owner ref check failed [30425088 16384]
        ERROR: errors found in extent allocation tree or chunk allocation
        [3/7] checking free space cache
        [4/7] checking fs roots
        [5/7] checking only csums items (without verifying data)
        [6/7] checking root refs
        [7/7] checking quota groups skipped (not enabled on this FS)
        found 212992 bytes used, error(s) found
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124669
        file data blocks allocated: 65536
         referenced 65536
      
      So fix this by freeing the metadata extent if btrfs_insert_root() returns
      an error.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      67addf29
  7. 19 4月, 2021 2 次提交
  8. 12 4月, 2021 1 次提交
  9. 02 3月, 2021 1 次提交
  10. 09 2月, 2021 7 次提交
  11. 24 1月, 2021 3 次提交
  12. 18 12月, 2020 1 次提交
    • F
      btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Filipe Manana 提交于
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d45f221
  13. 10 12月, 2020 2 次提交
  14. 08 12月, 2020 4 次提交
  15. 05 11月, 2020 1 次提交
  16. 07 10月, 2020 4 次提交