1. 26 Sep, 2022 2 commits
  2. 25 Jul, 2022 3 commits
  3. 21 Jun, 2022 2 commits
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Committed by Filipe Manana
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. Otherwise, if the transaction we used
      for partially modifying the inode gets committed by someone after we
      release it and before we finish the rest of the range, and a power
      failure then happens, after mounting the filesystem our inode has an
      outdated iversion and mtime/ctime, corresponding to the values it had
      before we changed it.
      
      So add the missing iversion and mtime/ctime updates.

      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between reflinking and ordered extent completion · d4597898
      Committed by Filipe Manana
      While doing a reflink operation, if an ordered extent for a file range
      that does not overlap with the source and destination ranges of the
      reflink operation completes, the reflink operation can fail and return
      -EINVAL to user space.
      
      The following sequence of steps explains how this can happen:
      
      1) We have the page at file offset 315392 dirty (under delalloc);
      
      2) A reflink operation for this file starts, using the same file as both
         source and destination, the source range is [372736, 409600) (length of
         36864 bytes) and the destination range is [208896, 245760);
      
      3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
         and destination ranges, and wait for any ordered extents in those
         ranges to complete;

      4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
         the inode, but we wait neither for it nor for any ordered extents to
         complete. This results in starting delalloc for the page at file
         offset 315392 and creating an ordered extent for that single page
         range;
      
      5) We then move to btrfs_clone() and enter the loop to find file extent
         items to copy from the source range to destination range;
      
      6) In the first iteration we end up at the last file extent item stored
         in leaf A:
      
         (...)
         item 131 key (143616 108 315392) itemoff 5101 itemsize 53
                  extent data disk bytenr 1903988736 nr 73728
                  extent data offset 12288 nr 61440 ram 73728
      
         This represents the file range [315392, 376832), which overlaps with
         the source range to clone.
      
         @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
         is therefore set to 376832 (315392 + 61440).
      
         @off (372736) is > key.offset (315392), so @new_key.offset is set to
         the value of @destoff (208896).
      
         @new_key.offset == @last_dest_end (208896) so @drop_start is set to
         208896 (@new_key.offset).
      
         @datal is adjusted to 4096, as @off is > @key.offset.
      
         So in this iteration we call btrfs_replace_file_extents() for the range
         [208896, 212991] (a single page, which is
         [@drop_start, @new_key.offset + @datal - 1]).
      
         @last_dest_end is set to 212992 (@new_key.offset + @datal =
         208896 + 4096 = 212992).
      
         Before the next iteration of the loop, @key.offset is set to the value
         376832, which is @next_key_min_offset;
      
      7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
         but this time pointing beyond the last slot of leaf A, as that's where
         a key with offset 376832 would be if it existed. So we end up calling
         btrfs_next_leaf();
      
      8) btrfs_next_leaf() releases the path, but before it searches again the
         tree for the next key/leaf, the ordered extent for the single page
         range at file offset 315392 completes. That results in trimming the
         file extent item we processed before, adjusting its key offset from
         315392 to 319488, reducing its length from 61440 to 57344 and inserting
         a new file extent item for that single page range, with a key offset of
         315392 and a length of 4096.
      
         Leaf A now looks like:
      
           (...)
           item 132 key (143616 108 315392) itemoff 4995 itemsize 53
                    extent data disk bytenr 1801666560 nr 4096
                    extent data offset 0 nr 4096 ram 4096
           item 133 key (143616 108 319488) itemoff 4942 itemsize 53
                    extent data disk bytenr 1903988736 nr 73728
                    extent data offset 16384 nr 57344 ram 73728
      
      9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
         at slot 133, since it's the first key that follows what was the last
         key we saw (143616 108 315392). In fact it's the same item we processed
         before, but its key offset was changed, so it counts as a new key;
      
      10) So now we have:
      
          @key.offset == 319488
          @datal == 57344
      
          @off (372736) is > key.offset (319488), so @new_key.offset is set to
          208896 (@destoff value).
      
          @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
          is set to 212992 (@last_dest_end value).
      
          @datal is adjusted to 4096 because @off > @key.offset.
      
          So in this iteration we call btrfs_replace_file_extents() for the
          invalid range of [212992, 212991] (which is
          [@drop_start, @new_key.offset + @datal - 1]).
      
          This range is empty, the end offset is smaller than the start offset
          so btrfs_replace_file_extents() returns -EINVAL, which we end up
          returning to user space and fail the reflink operation.
      
          This all happens because the range of this file extent item was
          already processed in the previous iteration.
      
      This scenario can be triggered very sporadically by fsx from fstests, for
      example with test case generic/522.
      
      So fix this by having btrfs_clone() skip file extent items that cover a
      file range that we have already processed.
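
The arithmetic in steps 6 and 10 can be replayed with a small Python model. This is only a sketch of the range computation as described in this changelog, not the kernel code: the function names are hypothetical, sector alignment of @last_dest_end is omitted because the values here are already aligned, and the skip test (extent end <= end of the previously processed extent) is an assumption about how "already processed" might be detected.

```python
def drop_range(key_offset, datal, off, length, destoff, last_dest_end):
    """Model one btrfs_clone() loop iteration for a regular extent.

    Returns (drop_start, drop_end, new_last_dest_end), where the range passed
    to btrfs_replace_file_extents() is [drop_start, drop_end] (inclusive).
    """
    new_key_offset = destoff if off > key_offset else key_offset + destoff - off
    drop_start = last_dest_end if new_key_offset != last_dest_end else new_key_offset
    # Trim the extent length to the part that overlaps the source range.
    if key_offset + datal > off + length:
        datal = off + length - key_offset
    if off > key_offset:
        datal -= off - key_offset
    return drop_start, new_key_offset + datal - 1, new_key_offset + datal


def already_processed(key_offset, datal, prev_extent_end):
    """Hypothetical skip test: the item ends at or before what was handled."""
    return key_offset + datal <= prev_extent_end


OFF, LEN, DESTOFF = 372736, 36864, 208896

# Step 6: item with key offset 315392 and length 61440.
first = drop_range(315392, 61440, OFF, LEN, DESTOFF, last_dest_end=DESTOFF)
print(first)   # (208896, 212991, 212992)

# Step 10: the same extent, now keyed at 319488 with length 57344.
second = drop_range(319488, 57344, OFF, LEN, DESTOFF, last_dest_end=first[2])
print(second)  # (212992, 212991, 212992): end < start, hence -EINVAL

# With the skip test, the second item is never turned into a drop range.
print(already_processed(319488, 57344, prev_extent_end=315392 + 61440))  # True
```

The second call reproduces the empty range [212992, 212991] from step 10, and the skip test shows why ignoring an item that ends at 376832 (the end of the extent processed in the first iteration) avoids it.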
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 16 May, 2022 2 commits
    • btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Committed by Filipe Manana
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove inode_dio_wait() calls when starting reflink operations · 1c6cbbbe
      Committed by Filipe Manana
      When starting a reflink operation we have these calls to inode_dio_wait()
      which used to be needed because direct IO writes that don't cross the
      i_size boundary did not take the inode's VFS lock, so we could race with
      them and end up with ordered extents in the target range after calling
      btrfs_wait_ordered_range().
      
      However that is not the case anymore, because the inode's VFS lock was
      changed from a mutex to a rw semaphore, by commit 9902af79
      ("parallel lookups: actual switch to rwsem"), and several years later we
      started to lock the inode's VFS lock in shared mode for direct IO writes
      that don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
      shared lock for direct writes within EOF")).
      
      So remove those inode_dio_wait() calls.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 02 Apr, 2022 1 commit
  6. 14 Mar, 2022 5 commits
    • btrfs: remove the cross file system checks from remap · ae460f05
      Committed by Josef Bacik
      The sb check is already done in do_clone_file_range, and the mnt check
      (which will hopefully go away in a subsequent patch) is done in
      ioctl_file_clone().  Remove the check in our code and put an ASSERT() to
      make sure it doesn't change underneath us.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: deal with unexpected extent type during reflinking · b2d9f2dc
      Committed by Filipe Manana
      Smatch complains about a possible dereference of a pointer that was not
      initialized:
      
          CC [M]  fs/btrfs/reflink.o
          CHECK   fs/btrfs/reflink.c
        fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'.
      
      This is because we are not dealing with the case where the type of a file
      extent has an unexpected value (not regular, not prealloc and not inline),
      in which case the transaction handle pointer is not initialized.
      
      Such unexpected type should be impossible, except in case of some memory
      corruption caused either by bad hardware or some software bug causing
      something like a buffer overrun.
      
      So ASSERT that if the extent type is neither regular nor prealloc, then
      it must be inline. Bail out with -EUCLEAN and a warning in case it is
      not. This silences smatch.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix unexpected error path when reflinking an inline extent · 1f4613cd
      Committed by Filipe Manana
      When reflinking an inline extent, we assert that its file offset is 0 and
      that its uncompressed length is not greater than the sector size. We then
      return an error if one of those conditions is not satisfied. However we
      use a return statement, which results in returning from btrfs_clone()
      without freeing the path and buffer that were allocated before, as well as
      not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
      inode.
      
      Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
      for each condition so that in case assertions are disabled, we get to
      know which of the unexpected conditions triggered the error.
      
      Fixes: a61e1e0d ("Btrfs: simplify inline extent handling when doing reflinks")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reset last_reflink_trans after fsyncing inode · 23e3337f
      Committed by Filipe Manana
      When an inode has a last_reflink_trans matching the current transaction,
      we have to take special care when logging its checksums in order to
      avoid getting checksum items with overlapping ranges in a log tree,
      which could result in missing checksums after log replay (more on that
      in the changelogs of commit 40e046ac ("Btrfs: fix missing data
      checksums after replaying a log tree") and commit e289f03e ("btrfs:
      fix corrupt log due to concurrent fsync of inodes with shared extents")).
      We also need to make sure a full fsync will copy all old file extent
      items it finds in modified leaves, because they might have been copied
      from some other inode.
      
      However once we fsync an inode, we don't need to keep paying the price of
      that extra special care in future fsyncs done in the same transaction,
      unless the inode is used for another reflink operation or the full sync
      flag is set on it (truncate, failure to allocate extent maps for holes,
      and other exceptional and infrequent cases).
      
      So after we fsync an inode reset its last_reflink_trans to zero. In case
      another reflink happens, we continue to update the last_reflink_trans of
      the inode, just as before. Also set last_reflink_trans to the generation
      of the last transaction that modified the inode whenever we need to set
      the full sync flag on the inode, just like when we need to load an inode
      from disk after eviction.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: stop copying old file extents when doing a full fsync · 7f30c072
      Committed by Filipe Manana
      When logging an inode in full sync mode, we go over every leaf that was
      modified in the current transaction and has items associated to our inode,
      and then copy all those items into the log tree. This includes copying
      file extent items that were created and added to the inode in past
      transactions, which is useless and only makes us use more leaf space in the
      log tree.
      
      It's common to have a file with many file extent items spanning many
      leaves where only a few file extent items are new and need to be logged,
      and in such case we log all the file extent items we find in the modified
      leaves.
      
      So change the full sync behaviour to skip over file extent items that are
      not needed. Those are the ones that match the following criteria:
      
      1) Have a generation older than the current transaction and the inode
         was not a target of a reflink operation, as that can copy file extent
         items from a past generation from some other inode into our inode, so
         we have to log them;
      
      2) Start at an offset within i_size - we must log anything at or beyond
         i_size, otherwise we would lose prealloc extents after log replay.
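
The two criteria above can be read as a single predicate. The following Python sketch only restates them; the function and parameter names are hypothetical and do not correspond to identifiers in the kernel.

```python
def can_skip_file_extent(extent_generation, extent_offset, current_transid,
                         inode_was_reflink_target, i_size):
    """Return True if a full fsync does not need to log this file extent item.

    An item is skipped only when it is old (criterion 1, including the
    reflink exception) and starts within i_size (criterion 2).
    """
    is_old = extent_generation < current_transid
    starts_within_i_size = extent_offset < i_size
    return is_old and not inode_was_reflink_target and starts_within_i_size


# An old extent inside i_size on an inode that was never a reflink target.
print(can_skip_file_extent(5, 4096, 10, False, 1 << 20))     # True
# An old prealloc extent starting at i_size must still be logged.
print(can_skip_file_extent(5, 1 << 20, 10, False, 1 << 20))  # False
```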
      
      The following script exercises a scenario where this happens, and it's
      somewhat close to what happened often on a SQL Server workload which
      I had to debug some time ago to fix an issue where a pattern of writes to
      prealloc extents and fsync resulted in fsync failing with -EIO (that was
      commit ea7036de ("btrfs: fix fsync failure and transaction abort
      after writes to prealloc extents")). In that particular case, we had large
      files that had random writes and were often truncated, which made the
      next fsync be a full sync.
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        MOUNT_OPTIONS="-o ssd"
      
        FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
        # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
        # FILE_SIZE=$((512 * 1024 * 1024)) # 512M
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create a file with many extents. Use direct IO to make it faster
        # to create the file - using buffered IO we would have to fsync
        # after each write (terribly slow).
        echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
        xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
      
        # Commit the transaction, so every extent after this is from an
        # old generation.
        sync
      
        # Now rewrite only a few extents, which are all far spread apart from
        # each other (e.g. 1G / 32M = 32 extents).
        # After this only a few extents have a new generation, while all other
        # ones have an old generation.
        echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
        for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
            xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
        done
      
        # Fsync, the inode is logged in full sync mode since it was never
        # fsynced before.
        echo "Fsyncing file..."
        xfs_io -c "fsync" $MNT/foobar
      
        umount $MNT
      
      And the following bpftrace program was running when executing the test
      script:
      
        $ cat bpf-script.sh
        #!/usr/bin/bpftrace
      
        k:btrfs_log_inode
        {
            @start_log_inode[tid] = nsecs;
        }
      
        kr:btrfs_log_inode
        /@start_log_inode[tid]/
        {
            @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
            delete(@start_log_inode[tid]);
        }
      
        k:btrfs_sync_log
        {
            @start_sync_log[tid] = nsecs;
        }
      
        kr:btrfs_sync_log
        /@start_sync_log[tid]/
        {
            $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
            printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
            printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
            delete(@start_sync_log[tid]);
            delete(@log_inode_dur[tid]);
            exit();
        }
      
      With 512M test file, before this patch:
      
        btrfs_log_inode() took 15218 us
        btrfs_sync_log()  took 1328 us
      
        Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
      
      With 512M test file, after this patch:
      
        btrfs_log_inode() took 14760 us
        btrfs_sync_log()  took 588 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 1G test file, before this patch:
      
        btrfs_log_inode() took 27301 us
        btrfs_sync_log()  took 1767 us
      
        Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
      
      With 1G test file, after this patch:
      
        btrfs_log_inode() took 26166 us
        btrfs_sync_log()  took 593 us
      
        Log tree has a single leaf, its total size is 16K.
      
      With 2G test file, before this patch:
      
        btrfs_log_inode() took 50892 us
        btrfs_sync_log()  took 3127 us
      
        Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
      
      With 2G test file, after this patch:
      
        btrfs_log_inode() took 50126 us
        btrfs_sync_log()  took 586 us
      
        Log tree has a single leaf, its total size is 16K.
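
As a cross-check of the reported log tree sizes, the byte totals follow from the leaf and node counts if every tree block is 16K. That block size is an assumption taken from the "single leaf ... 16K" figures above; the helper name is hypothetical.

```python
NODESIZE = 16 * 1024  # assumed metadata block size, in bytes


def log_tree_bytes(leaves, nodes):
    """Total log tree size when every leaf and node occupies one block."""
    return (leaves + nodes) * NODESIZE


# (leaves, nodes) reported before the patch for each test file size.
before = {"512M": (17, 1), "1G": (33, 1), "2G": (65, 1)}
sizes = {label: log_tree_bytes(*counts) for label, counts in before.items()}
print(sizes)  # {'512M': 294912, '1G': 557056, '2G': 1081344}

# After the patch the log tree is a single leaf in all three cases.
print(log_tree_bytes(1, 0))  # 16384
```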

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 03 Jan, 2022 1 commit
  8. 27 Oct, 2021 2 commits
    • btrfs: subpage: add bitmap for PageChecked flag · e4f94347
      Committed by Qu Wenruo
      Although in btrfs we have very limited usage of the PageChecked flag, it
      is still a page flag that is not yet subpage compatible.

      Fix it by introducing btrfs_subpage::checked_offset to do the conversion.

      For most call sites, especially for free-space cache, COW fixup and
      btrfs_invalidatepage(), they all work in full page mode anyway.

      For other call sites, they work in subpage compatible mode.
      
      Some call sites need extra modification:
      
      - btrfs_drop_pages()
        Needs extra parameter to get the real range we need to clear checked
        flag.
      
        Also since btrfs_drop_pages() will accept pages beyond the dirtied
        range, update btrfs_subpage_clamp_range() to handle such case
        by setting @len to 0 if the page is beyond target range.
      
      - btrfs_invalidatepage()
        We need to call subpage helper before calling __btrfs_releasepage(),
        or it will trigger ASSERT() as page->private will be cleared.
      
      - btrfs_verify_data_csum()
        In theory we don't need the io_bio->csum check anymore, but it
        won't hurt.  Just change the comment.
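
The clamping behaviour described for btrfs_subpage_clamp_range() can be sketched in Python as follows. This models only the description above (clamp the range to the page, and report a zero length when the page lies beyond the target range); the function name, its parameters and the return convention are hypothetical, and the sketch assumes the page does not end before the range starts.

```python
def clamp_range(page_start, page_size, start, length):
    """Clamp [start, start + length) to the page at page_start.

    Returns (clamped_start, clamped_length). The length is 0 when the page
    begins at or after the end of the target range.
    """
    clamped_start = max(page_start, start)
    if page_start >= start + length:
        return clamped_start, 0
    clamped_end = min(page_start + page_size, start + length)
    return clamped_start, clamped_end - clamped_start


PAGE = 64 * 1024  # example page size, with a 4K sector size in mind

# The page at 64K is entirely beyond a dirtied range of [0, 4K).
print(clamp_range(PAGE, PAGE, 0, 4096))             # (65536, 0)
# A range that starts 4K before the page and ends 4K into it.
print(clamp_range(PAGE, PAGE, PAGE - 4096, 8192))   # (65536, 4096)
```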

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reflink: initialize return value to 0 in btrfs_extent_same() · 44bee215
      Committed by Sidong Yang
      Fix a warning reported by smatch that ret could be returned without
      being initialized.  The dedupe operations are supposed to return 0 for
      a 0 length range but the caller does not pass olen == 0. To keep this
      behaviour and also fix the warning, initialize ret to 0.

      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Sidong Yang <realwakka@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 21 Jun, 2021 1 commit
    • btrfs: reflink: make copy_inline_to_page() to be subpage compatible · 3115deb3
      Committed by Qu Wenruo
      The modifications are:
      
      - Page copy destination
        For subpage case, one page can contain multiple sectors, thus we can
        no longer expect the memcpy_to_page()/btrfs_decompress() to copy
        data into page offset 0.
        The correct offset is offset_in_page(file_offset) now, which should
        handle both regular sectorsize and subpage cases well.
      
      - Page status update
        Now we need to use subpage helper to handle the page status update.
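
To illustrate the "page copy destination" point, here is a small Python model of what offset_in_page() computes: the byte offset of a file position within its page. The Python function takes the page size as an explicit parameter for illustration, and the page and sector sizes in the example are assumptions.

```python
def offset_in_page(file_offset, page_size):
    """Byte offset of file_offset inside its page (page_size a power of two)."""
    return file_offset & (page_size - 1)


# With 4K pages and 4K sectors, every sector starts at page offset 0.
print(offset_in_page(8192, 4096))       # 0
# With 64K pages and 4K sectors (subpage), the third sector sits 8K in.
print(offset_in_page(8192, 64 * 1024))  # 8192
```

When the sector size equals the page size the result is always 0 for sector-aligned offsets, which is why copying to page offset 0 was sufficient before subpage support.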
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 28 May, 2021 1 commit
    • btrfs: fix deadlock when cloning inline extents and low on available space · 76a6d5cd
      Committed by Filipe Manana
      There are a few cases where cloning an inline extent requires copying data
      into a page of the destination inode. For these cases we are allocating
      the required data and metadata space while holding a leaf locked. This can
      result in a deadlock when we are low on available space because allocating
      the space may flush delalloc and two deadlock scenarios can happen:
      
      1) When starting writeback for an inode with a very small dirty range that
         fits in an inline extent, we deadlock during the writeback when trying
         to insert the inline extent, at cow_file_range_inline(), if the extent
         is going to be located in the leaf for which we are already holding a
         read lock;
      
      2) After successfully starting writeback, for non-inline extent cases,
         the async reclaim thread will hang waiting for an ordered extent to
         complete if the ordered extent completion needs to modify the leaf
         for which the clone task is holding a read lock (for adding or
         replacing file extent items). So the cloning task will wait forever
         on the async reclaim thread to make progress, which in turn is
         waiting for the ordered extent completion which in turn is waiting
         to acquire a write lock on the same leaf.
      
      So fix this by making sure we release the path (and therefore the leaf)
      every time we need to copy the inline extent's data into a page of the
      destination inode, as by that time we do not need to have the leaf locked.
      
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 17 May, 2021 1 commit
    • btrfs: release path before starting transaction when cloning inline extent · 6416954c
      Committed by Filipe Manana
      When cloning an inline extent there are a few cases, such as when we have
      an implicit hole at file offset 0, where we start a transaction while
      holding a read lock on a leaf. Starting the transaction results in a call
      to sb_start_intwrite(), which results in doing a read lock on a percpu
      semaphore. Lockdep doesn't like this and complains about it:
      
        [46.580704] ======================================================
        [46.580752] WARNING: possible circular locking dependency detected
        [46.580799] 5.13.0-rc1 #28 Not tainted
        [46.580832] ------------------------------------------------------
        [46.580877] cloner/3835 is trying to acquire lock:
        [46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0
        [46.581167]
        [46.581167] but task is already holding lock:
        [46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.581293]
        [46.581293] which lock already depends on the new lock.
        [46.581293]
        [46.581351]
        [46.581351] the existing dependency chain (in reverse order) is:
        [46.581410]
        [46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}:
        [46.581464]        down_read_nested+0x68/0x200
        [46.581536]        __btrfs_tree_read_lock+0x70/0x1d0
        [46.581577]        btrfs_read_lock_root_node+0x88/0x200
        [46.581623]        btrfs_search_slot+0x298/0xb70
        [46.581665]        btrfs_set_inode_index+0xfc/0x260
        [46.581708]        btrfs_new_inode+0x26c/0x950
        [46.581749]        btrfs_create+0xf4/0x2b0
        [46.581782]        lookup_open.isra.57+0x55c/0x6a0
        [46.581855]        path_openat+0x418/0xd20
        [46.581888]        do_filp_open+0x9c/0x130
        [46.581920]        do_sys_openat2+0x2ec/0x430
        [46.581961]        do_sys_open+0x90/0xc0
        [46.581993]        system_call_exception+0x3d4/0x410
        [46.582037]        system_call_common+0xec/0x278
        [46.582078]
        [46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}:
        [46.582135]        __lock_acquire+0x1e90/0x2c50
        [46.582176]        lock_acquire+0x2b4/0x5b0
        [46.582263]        start_transaction+0x3cc/0x950
        [46.582308]        clone_copy_inline_extent+0xe4/0x5a0
        [46.582353]        btrfs_clone+0x5fc/0x880
        [46.582388]        btrfs_clone_files+0xd8/0x1c0
        [46.582434]        btrfs_remap_file_range+0x3d8/0x590
        [46.582481]        do_clone_file_range+0x10c/0x270
        [46.582558]        vfs_clone_file_range+0x1b0/0x310
        [46.582605]        ioctl_file_clone+0x90/0x130
        [46.582651]        do_vfs_ioctl+0x874/0x1ac0
        [46.582697]        sys_ioctl+0x6c/0x120
        [46.582733]        system_call_exception+0x3d4/0x410
        [46.582777]        system_call_common+0xec/0x278
        [46.582822]
        [46.582822] other info that might help us debug this:
        [46.582822]
        [46.582888]  Possible unsafe locking scenario:
        [46.582888]
        [46.582942]        CPU0                    CPU1
        [46.582984]        ----                    ----
        [46.583028]   lock(btrfs-tree-00);
        [46.583062]                                lock(sb_internal#2);
        [46.583119]                                lock(btrfs-tree-00);
        [46.583174]   lock(sb_internal#2);
        [46.583212]
        [46.583212]  *** DEADLOCK ***
        [46.583212]
        [46.583266] 6 locks held by cloner/3835:
        [46.583299]  #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130
        [46.583382]  #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0
        [46.583477]  #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0
        [46.583574]  #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590
        [46.583657]  #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590
        [46.583743]  #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0
        [46.583828]
        [46.583828] stack backtrace:
        [46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28
        [46.583931] Call Trace:
        [46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable)
        [46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400
        [46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190
        [46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50
        [46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0
        [46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950
        [46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0
        [46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880
        [46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0
        [46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590
        [46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270
        [46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310
        [46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130
        [46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0
        [46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120
        [46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410
        [46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278
        [46.585114] --- interrupt: c00 at 0x7ffff7e22990
        [46.585160] NIP:  00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000
        [46.585224] REGS: c0000000167c7e80 TRAP: 0c00   Not tainted  (5.13.0-rc1)
        [46.585280] MSR:  800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE>  CR: 28000244  XER: 00000000
        [46.585374] IRQMASK: 0
        [46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004
        [46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000
        [46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000
        [46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        [46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000
        [46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004
        [46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f
        [46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990
        [46.585964] LR [00000001000010ec] 0x1000010ec
        [46.586010] --- interrupt: c00
      
      This should be a false positive, as both locks are acquired in read mode.
      Nevertheless, we don't need to hold a leaf locked when we start the
      transaction, so just release the leaf (path) before starting it.
      Reported-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6416954c
  12. 06 May, 2021 1 commit
    • I
      btrfs: use memzero_page() instead of open coded kmap pattern · d048b9c2
      Authored by Ira Weiny
      There are many places where kmap/memset/kunmap patterns occur.
      
      Use the newly lifted memzero_page() to eliminate direct uses of kmap and
      leverage the new core function's use of kmap_local_page().
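
      As a rough illustration of what such a helper hides, here is a minimal
      userspace sketch. Everything in it is hypothetical: struct mock_page,
      mock_map() and mock_unmap() merely stand in for a page and its mapping
      calls, and mock_memzero_page() only mirrors the (page, offset, len)
      argument order used in the script below. It is not the kernel
      implementation.

```c
#include <stddef.h>
#include <string.h>

#define MOCK_PAGE_SIZE 4096

/* Hypothetical userspace stand-in for a page; not the kernel's struct page. */
struct mock_page {
	unsigned char data[MOCK_PAGE_SIZE];
	int mapped;	/* outstanding mappings, to show map/unmap pairing */
};

/* Stand-ins for the map/unmap calls; assumptions, not the kernel API. */
static unsigned char *mock_map(struct mock_page *page)
{
	page->mapped++;
	return page->data;
}

static void mock_unmap(struct mock_page *page)
{
	page->mapped--;
}

/*
 * One call replaces the open coded map/memset/unmap sequence.
 * Returns 0 on success, -1 if the range does not fit in the page.
 */
static int mock_memzero_page(struct mock_page *page, size_t offset, size_t len)
{
	unsigned char *addr;

	if (offset > MOCK_PAGE_SIZE || len > MOCK_PAGE_SIZE - offset)
		return -1;

	addr = mock_map(page);
	memset(addr + offset, 0, len);
	mock_unmap(page);
	return 0;
}
```

      A call site shrinks from three statements to one, and the mapping can
      no longer be left open by mistake.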
      
      The development of this patch was aided by the following coccinelle
      script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/memset/kunmap pattern and replace with memset*page calls
      //
      // NOTE: Offsets and other expressions may be more complex than what the script
      // will automatically generate.  Therefore a catchall rule is provided to find
      // the pattern which then must be evaluated by hand.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // Then the memset pattern
      //
      @ memset_rule1 @
      expression page, V, L, Off;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      -memset(ptr, 0, L);
      +memzero_page(page, 0, L);
      |
      -memset(ptr + Off, 0, L);
      +memzero_page(page, Off, L);
      |
      -memset(ptr, V, L);
      +memset_page(page, V, 0, L);
      |
      -memset(ptr + Off, V, L);
      +memset_page(page, V, Off, L);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memset_rule1
      @
      identifier memset_rule1.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      //
      // Catch all
      //
      @ memset_rule2 @
      expression page;
      identifier ptr;
      expression GenTo, GenSize, GenValue;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      //
      // Some call sites have complex expressions within the memset/memcpy
      // The following are catch-alls which need to be evaluated by hand.
      //
      -memset(GenTo, 0, GenSize);
      +memzero_pageExtra(page, GenTo, GenSize);
      |
      -memset(GenTo, GenValue, GenSize);
      +memset_pageExtra(page, GenValue, GenTo, GenSize);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memset_rule2
      @
      identifier memset_rule2.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      // </smpl>
      
      Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d048b9c2
  13. 19 Apr, 2021 4 commits
    • F
      btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags · b7a7a834
      Authored by Filipe Manana
      If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
      file that has the S_SYNC attribute set, we totally ignore that and do not
      durably persist the reflink changes. Since a reflink can change the data
      readable from a file (and mtime/ctime, or a file size), it makes sense to
      durably persist (fsync) the source and destination files/ranges.
      
      This was previously discussed at:
      
      https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
      
      The recently introduced test case generic/628, from fstests, exercises
      these scenarios and currently fails without this change.
      
      So make sure we fsync the source and destination files/ranges when either
      of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
      just like XFS already does.
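
      The decision described above can be sketched in userspace C as follows.
      This is a minimal model under stated assumptions: struct remap_side and
      both functions are hypothetical names invented for illustration, and the
      inode_sync field simply stands in for the S_SYNC attribute. It is not
      the code added by this commit.

```c
#define _POSIX_C_SOURCE 200809L	/* make sure O_DSYNC is visible */
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical model of one side of a reflink: the open(2) flags of the
 * file descriptor and whether the inode has the S_SYNC attribute set.
 */
struct remap_side {
	int open_flags;
	bool inode_sync;
};

/* A side wants durability if it was opened with O_SYNC/O_DSYNC or is S_SYNC. */
static bool side_wants_sync(const struct remap_side *s)
{
	return (s->open_flags & (O_SYNC | O_DSYNC)) != 0 || s->inode_sync;
}

/* True when the reflink result should be fsynced on both files/ranges. */
static bool remap_needs_sync(const struct remap_side *src,
			     const struct remap_side *dst)
{
	return side_wants_sync(src) || side_wants_sync(dst);
}
```

      Either side requesting synchronous semantics is enough to make both
      files/ranges durable, which matches the "either of them" wording above.
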
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b7a7a834
    • J
      btrfs: exclude mmaps while doing remap · 8c99516a
      Authored by Josef Bacik
      Darrick reported a potential issue to me where we could allow mmap
      writes after validating a page range matched in the case of dedupe.
      Generally we rely on lock page -> lock extent with the ordered flush to
      protect us, but this is done after we check the pages because we use the
      generic helpers, so we could modify the page in between doing the check
      and locking the range.
      
      There also exists a deadlock, as described by Filipe:
      
      """
      When cloning a file range, we lock the inodes, flush any delalloc within
      the respective file ranges, wait for any ordered extents and then lock the
      file ranges in both inodes. This means that right after we flush delalloc
      and before we lock the file ranges, memory mapped writes can come in and
      dirty pages in the file ranges of the clone operation.
      
      Most of the time this is harmless and causes no problems. However, if we
      are low on available metadata space, we can later end up in a deadlock
      when starting a transaction to replace file extent items. This happens if,
      when allocating metadata space for the transaction, we need to wait for
      the async reclaim thread to release space and the reclaim thread needs to
      flush delalloc for the inode that got the memory mapped write and has its
      range locked by the clone task.
      
      Basically what happens is the following:
      
      1) A clone operation locks inodes A and B, flushes delalloc for both
         inodes in the respective file ranges and waits for any ordered extents
         in those ranges to complete;
      
      2) Before the clone task locks the file ranges, another task does a
         memory mapped write (which does not lock the inode) for one of the
         inodes of the clone operation. So now we have a dirty page in one of
         the ranges used by the clone operation;
      
      3) The clone operation locks the file ranges for inodes A and B;
      
      4) Later, when iterating over the file extents of inode A, the clone
         task attempts to start a transaction. There's not enough available
         free metadata space, so the async reclaim task is started (if not
         running already) and we wait for someone to wake us up on our
         reservation ticket;
      
      5) The async reclaim task is not able to release space by any other
         means and decides to flush delalloc for the inode of the clone
         operation;
      
      6) The workqueue job used to flush the inode blocks when starting
         delalloc for the inode, since the file range is currently locked by
         the clone task;
      
      7) But the clone task is waiting on its reservation ticket and the async
         reclaim task is waiting on the flush job to complete, which can't
         progress since the clone task has the file range locked. So unless
         some other task is able to release space, for example an ordered
         extent for some other inode completes, we have a deadlock between all
         these tasks;
      
      When this happens stack traces like the following show up in dmesg/syslog:
      
       INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
       Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        schedule+0x45/0xe0
        lock_extent_bits+0x1e6/0x2d0 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_invalidatepage+0x32c/0x390 [btrfs]
        ? __mod_memcg_state+0x8e/0x160
        __extent_writepage+0x2d4/0x400 [btrfs]
        extent_write_cache_pages+0x2b2/0x500 [btrfs]
        ? lock_release+0x20e/0x4c0
        ? trace_hardirqs_on+0x1b/0xf0
        extent_writepages+0x43/0x90 [btrfs]
        ? lock_acquire+0x1a3/0x490
        do_writepages+0x43/0xe0
        ? __filemap_fdatawrite_range+0xa4/0x100
        __filemap_fdatawrite_range+0xc5/0x100
        btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        btrfs_work_helper+0xf1/0x600 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x50/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
       INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
             Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
       Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? kvm_clock_read+0x14/0x30
        ? wait_for_completion+0x81/0x110
        schedule+0x45/0xe0
        schedule_timeout+0x30c/0x580
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? lock_acquire+0x1a3/0x490
        ? try_to_wake_up+0x7a/0xa20
        ? lock_release+0x20e/0x4c0
        ? lock_acquired+0x199/0x490
        ? wait_for_completion+0x81/0x110
        wait_for_completion+0xab/0x110
        start_delalloc_inodes+0x2af/0x390 [btrfs]
        btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        flush_space+0x24f/0x660 [btrfs]
        btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        process_one_work+0x24e/0x5e0
        worker_thread+0x20f/0x3b0
        ? process_one_work+0x5e0/0x5e0
        kthread+0x153/0x170
        ? kthread_mod_delayed_work+0xc0/0xc0
        ret_from_fork+0x22/0x30
      (...)
      several other tasks blocked on inode locks held by the clone task below
      (...)
       RIP: 0033:0x7f61efe73fff
       Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
       RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
       RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
       RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
       RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
       R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
       R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
       task: fdm-stress        state:D stack:    0 pid:2508234 ppid:2508153 flags:0x00004000
       Call Trace:
        __schedule+0x5d1/0xcf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        schedule+0x45/0xe0
        __reserve_bytes+0x4a4/0xb10 [btrfs]
        ? finish_wait+0x90/0x90
        btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        start_transaction+0x2d1/0x760 [btrfs]
        btrfs_replace_file_extents+0x120/0x930 [btrfs]
        ? lock_release+0x20e/0x4c0
        btrfs_clone+0x3e4/0x7e0 [btrfs]
        ? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
        btrfs_clone_files+0xf6/0x150 [btrfs]
        btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        do_clone_file_range+0xd4/0x1f0
        vfs_clone_file_range+0x4d/0x230
        ? lock_release+0x20e/0x4c0
        ioctl_file_clone+0x8f/0xc0
        do_vfs_ioctl+0x342/0x750
        __x64_sys_ioctl+0x62/0xb0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      """
      
      Fix both of these issues by excluding mmaps from happening while we are
      doing any sort of remap, which prevents this race completely.
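
      The exclusion can be pictured with a small single-threaded model. All
      names here (struct mmap_excl, remap_try_begin(), fault_try_begin() and
      friends) are hypothetical and chosen for illustration; a real lock would
      block the loser instead of returning false, and this sketch does not
      claim to match the lock the patch actually uses.

```c
#include <stdbool.h>

/*
 * Hypothetical per-inode lock, modeled as a try-only reader/writer lock.
 * Page faults for mmap writes are the readers; a remap is the writer.
 */
struct mmap_excl {
	int faulting;	/* mmap write faults currently in progress */
	bool remapping;	/* a clone/dedupe holds the lock exclusively */
};

/* A remap may start only when no mmap write fault is in flight. */
static bool remap_try_begin(struct mmap_excl *l)
{
	if (l->remapping || l->faulting > 0)
		return false;
	l->remapping = true;
	return true;
}

static void remap_end(struct mmap_excl *l)
{
	l->remapping = false;
}

/* An mmap write fault may dirty a page only when no remap is running. */
static bool fault_try_begin(struct mmap_excl *l)
{
	if (l->remapping)
		return false;
	l->faulting++;
	return true;
}

static void fault_end(struct mmap_excl *l)
{
	l->faulting--;
}
```

      While the remap side holds the lock, no page in the range can become
      dirty behind its back, so neither the dedupe comparison nor the
      delalloc flush can be invalidated before the file ranges are locked.
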
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8c99516a
    • J
      btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers · 64708539
      Authored by Josef Bacik
      In a few places we intermix btrfs_inode_lock with an inode_unlock, and in
      some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
      
      None of these places are using this incorrectly, but as we adjust some
      of these callers it would be nice to keep everything consistent, so
      convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      64708539
    • N
  14. 26 Feb, 2021 1 commit
    • I
      btrfs: use memcpy_[to|from]_page() and kmap_local_page() · 3590ec58
      Authored by Ira Weiny
      There are many places where the pattern kmap/memcpy/kunmap occurs.
      
      This pattern was lifted to the core common functions
      memcpy_[to|from]_page().
      
      Use these new functions to reduce the code, eliminate direct uses of
      kmap, and leverage the new core functions' use of kmap_local_page().
      
      Also, there is 1 place where a kmap/memcpy is followed by an
      optional memset.  Here we leave the kmap open coded to avoid remapping
      the page but use kmap_local_page() directly.
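
      For readers unfamiliar with the helpers, the following userspace sketch
      shows the shape of the two copy directions. struct demo_page and the
      demo_* functions are hypothetical stand-ins that only mirror the
      argument order seen in the script below; the mapping step is reduced to
      a comment because there is nothing to map in userspace.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096

/* Hypothetical userspace stand-in for a page; not the kernel's struct page. */
struct demo_page {
	char data[DEMO_PAGE_SIZE];
};

/* Copy len bytes from a buffer into the page at the given offset. */
static void demo_memcpy_to_page(struct demo_page *page, size_t offset,
				const char *from, size_t len)
{
	char *addr = page->data;	/* the kernel would map the page here */

	memcpy(addr + offset, from, len);
	/* ... and unmap it here */
}

/* Copy len bytes out of the page, starting at the given offset. */
static void demo_memcpy_from_page(char *to, struct demo_page *page,
				  size_t offset, size_t len)
{
	const char *addr = page->data;	/* mapped for reading only */

	memcpy(to, addr + offset, len);
}
```

      The caller never sees the temporary pointer, which is why the script
      can also delete the now unused ptr declarations.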
      
      Development of this patch was aided by the coccinelle script:
      
      // <smpl>
      // SPDX-License-Identifier: GPL-2.0-only
      // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls
      //
      // NOTE: Offsets and other expressions may be more complex than what the script
      // will automatically generate.  Therefore a catchall rule is provided to find
      // the pattern which then must be evaluated by hand.
      //
      // Confidence: Low
      // Copyright: (C) 2021 Intel Corporation
      // URL: http://coccinelle.lip6.fr/
      // Comments:
      // Options:
      
      //
      // simple memcpy version
      //
      @ memcpy_rule1 @
      expression page, T, F, B, Off;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      -memcpy(ptr + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(ptr, F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, ptr + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, ptr, B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule1
      @
      identifier memcpy_rule1.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      //
      // Some callers kmap without a temp pointer
      //
      @ memcpy_rule2 @
      expression page, T, Off, F, B;
      @@
      
      <+...
      (
      -memcpy(kmap(page) + Off, F, B);
      +memcpy_to_page(page, Off, F, B);
      |
      -memcpy(kmap(page), F, B);
      +memcpy_to_page(page, 0, F, B);
      |
      -memcpy(T, kmap(page) + Off, B);
      +memcpy_from_page(T, page, Off, B);
      |
      -memcpy(T, kmap(page), B);
      +memcpy_from_page(T, page, 0, B);
      )
      ...+>
      -kunmap(page);
      // No need for the ptr variable removal
      
      //
      // Catch all
      //
      @ memcpy_rule3 @
      expression page;
      expression GenTo, GenFrom, GenSize;
      identifier ptr;
      type VP;
      @@
      
      (
      -VP ptr = kmap(page);
      |
      -ptr = kmap(page);
      |
      -VP ptr = kmap_atomic(page);
      |
      -ptr = kmap_atomic(page);
      )
      <+...
      (
      //
      // Some call sites have complex expressions within the memcpy
      // match a catch all to be evaluated by hand.
      //
      -memcpy(GenTo, GenFrom, GenSize);
      +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize);
      +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize);
      )
      ...+>
      (
      -kunmap(page);
      |
      -kunmap_atomic(ptr);
      )
      
      // Remove any pointers left unused
      @
      depends on memcpy_rule3
      @
      identifier memcpy_rule3.ptr;
      type VP, VP1;
      @@
      
      -VP ptr;
      	... when != ptr;
      ? VP1 ptr;
      
      // </smpl>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3590ec58
  15. 23 Feb, 2021 1 commit
    • F
      btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled · 3660d0bc
      Authored by Filipe Manana
      When using the NO_HOLES feature, if we clone a file range that spans only
      a hole into a range that is at or beyond the current i_size of the
      destination file, we end up not setting the full sync runtime flag on the
      inode. As a result, if we then fsync the destination file and have a power
      failure, after log replay we can end up exposing stale data instead of
      having a hole for that range.
      
      The conditions for this to happen are the following:
      
      1) We have a file with a size of, for example, 1280K;
      
      2) There is a written (non-prealloc) extent for the file range from 1024K
         to 1280K with a length of 256K;
      
      3) This particular file extent layout is durably persisted, so that the
         existing superblock persisted on disk points to a subvolume root where
         the file has that exact file extent layout and state;
      
      4) The file is truncated to a smaller size, to an offset lower than the
         start offset of its last extent, for example to 800K. The truncate sets
         the full sync runtime flag on the inode;
      
      5) Fsync the file to log it and clear the full sync runtime flag;
      
      6) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
         into the file with a destination offset that starts at or beyond the
         256K file extent item we had - for example to offset 1024K;
      
      7) Since the clone operation does not find extents in the source range,
         we end up in the if branch at the bottom of btrfs_clone() where we
         punch a hole for the file range starting at offset 1024K by calling
         btrfs_replace_file_extents(). There we end up not setting the full
         sync flag on the inode, because we don't know we are being called in
         a clone context (and not fallocate's punch hole operation), and
         neither do we create an extent map to represent a hole because the
         requested range is beyond eof;
      
      8) A further fsync to the file will be a fast fsync, since the clone
         operation did not set the full sync flag, and therefore it relies on
         modified extent maps to correctly log the file layout. But since
         it does not find any extent map marking the range from 1024K (the
         previous eof) to the new eof, it does not log a file extent item
         for that range representing the hole;
      
      9) After a power failure no hole for the range starting at 1024K is
         punched and we end up exposing stale data from the old 256K extent.
      
      Turning this into exact steps:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdi
        $ mount /dev/sdi /mnt
      
        # Create our test file with 4 extents of 256K and a 256K hole at offset
        # 256K. The file has a size of 1280K.
        $ xfs_io -f -s \
                    -c "pwrite -S 0xab -b 256K 0 256K" \
                    -c "pwrite -S 0xcd -b 256K 512K 256K" \
                    -c "pwrite -S 0xef -b 256K 768K 256K" \
                    -c "pwrite -S 0x73 -b 256K 1024K 256K" \
                    /mnt/foobar
      
        # Make sure it's durably persisted. We want the last committed super
        # block to point to this particular file extent layout.
        $ sync
      
        # Now truncate our file to a smaller size, falling within the range of
        # the third extent. This sets the full sync runtime flag on the inode.
        # Then fsync the file to log it and clear the full sync flag from the
        # inode. The fourth extent is no longer part of the file and therefore
        # it is not logged.
        $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
      
        # Now do a clone operation that only clones the hole and sets back the
        # file size to match the size it had before the truncate operation
        # (1280K).
        $ xfs_io \
              -c "reflink /mnt/foobar 256K 1024K 256K" \
              -c "fsync" \
              /mnt/foobar
      
        # File data before power failure:
        $ od -A d -t x1 /mnt/foobar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
        *
        0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
        *
        0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        1310720
      
        <power fail>
      
        # Mount the fs again to replay the log tree.
        $ mount /dev/sdi /mnt
      
        # File data after power failure:
        $ od -A d -t x1 /mnt/foobar
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
        *
        0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
        *
        0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
        *
        1310720
      
      The range from 1024K to 1280K should correspond to a hole but instead it
      points to stale data, to the 256K extent that should not exist after the
      truncate operation.
      
      The issue does not exist when not using NO_HOLES, because for that case
      we use file extent items to represent holes, these are found and copied
      during the loop that iterates over extents at btrfs_clone(), and that
      causes btrfs_replace_file_extents() to be called with a non-NULL
      extent_info argument and therefore set the full sync runtime flag on the
      inode.
      
      So fix this by making the code that deals with a trailing hole during
      cloning, at btrfs_clone(), set the full sync flag on the inode if the
      range starts at or beyond the current i_size.
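
      The rule can be expressed as a tiny model. The struct and function below
      are hypothetical simplifications written for this explanation: the
      need_full_sync field stands in for the full sync runtime flag, and the
      fixed parameter switches between the old and the new behaviour. This is
      a sketch of the condition only, not the kernel change itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inode state; not the kernel's btrfs_inode. */
struct model_inode {
	uint64_t i_size;	/* current file size in bytes */
	bool need_full_sync;	/* stands in for the full sync runtime flag */
};

/*
 * Model of cloning a range that contains only a hole into the destination
 * at offset dest_off. With fixed == true the rule from the text applies:
 * a hole that starts at or beyond i_size requests a full fsync.
 */
static void model_clone_hole(struct model_inode *inode, uint64_t dest_off,
			     uint64_t len, bool fixed)
{
	if (fixed && dest_off >= inode->i_size)
		inode->need_full_sync = true;

	/* The clone may extend the file, as in the 800K -> 1280K example. */
	if (dest_off + len > inode->i_size)
		inode->i_size = dest_off + len;
}
```

      With the numbers from the reproducer (i_size of 800K, destination offset
      of 1024K, length of 256K) the fixed variant sets the flag, so the next
      fsync would not take the fast path that misses the hole.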
      
      A test case for fstests will follow soon.
      
      Backporting notes: for kernel 5.4 the change goes to ioctl.c into
      btrfs_clone before the last call to btrfs_punch_hole_range.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3660d0bc
  16. 09 Feb, 2021 1 commit
    • Q
      btrfs: introduce btrfs_subpage for data inodes · 32443de3
      Authored by Qu Wenruo
      To support subpage sector size, data pages also need extra info to track
      which sectors in a page are uptodate/dirty/...
      
      This patch will make pages for data inodes get btrfs_subpage structure
      attached, and detached when the page is freed.
      
      This patch also slightly changes the timing when
      set_page_extent_mapped() is called to make sure:
      
      - We have page->mapping set
        page->mapping->host is used to grab btrfs_fs_info, thus we can only
        call this function after page is mapped to an inode.
      
        One call site attaches pages to inode manually, thus we have to modify
        the timing of set_page_extent_mapped() a bit.
      
      - As soon as possible, before other operations
        Since memory allocation can fail, we have to do extra error handling.
        Calling set_page_extent_mapped() as soon as possible can simplify the
        error handling for several call sites.
      
      The idea is pretty much the same as iomap_page, but with more bitmaps
      for btrfs specific cases.
      
      Currently the plan is to switch to iomap if iomap can provide sector
      aligned write back (only write back dirty sectors, but not the full
      page; data balance requires this feature).
      
      So we will stick to btrfs specific bitmap for now.
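
      To make the idea of per-sector bitmaps concrete, here is a minimal
      sketch assuming a 64K page and 4K sectors, so 16 bits per bitmap. The
      struct demo_subpage type and the demo_* helpers are hypothetical and
      written only for illustration; they do not reproduce the layout of the
      btrfs_subpage structure this patch introduces. Offsets and lengths are
      assumed to be sector aligned and to stay within one page.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE   65536u	/* assumed 64K page */
#define DEMO_SECTOR_SIZE 4096u	/* assumed 4K sector */

/* Hypothetical per-page state: one bit per sector; not the kernel struct. */
struct demo_subpage {
	uint16_t uptodate;
	uint16_t dirty;
};

/* Build a mask covering the sectors of [offset, offset + len) in the page. */
static uint16_t demo_range_mask(uint32_t offset, uint32_t len)
{
	uint32_t first = offset / DEMO_SECTOR_SIZE;
	uint32_t count = len / DEMO_SECTOR_SIZE;

	return (uint16_t)(((1u << count) - 1u) << first);
}

static void demo_set_dirty(struct demo_subpage *sp, uint32_t offset,
			   uint32_t len)
{
	sp->dirty |= demo_range_mask(offset, len);
}

static void demo_set_uptodate(struct demo_subpage *sp, uint32_t offset,
			      uint32_t len)
{
	sp->uptodate |= demo_range_mask(offset, len);
}

/* True if any sector in the range is dirty. */
static bool demo_range_dirty(const struct demo_subpage *sp, uint32_t offset,
			     uint32_t len)
{
	return (sp->dirty & demo_range_mask(offset, len)) != 0;
}

/* The page as a whole is uptodate only when every sector is. */
static bool demo_page_uptodate(const struct demo_subpage *sp)
{
	return sp->uptodate == UINT16_MAX;
}
```

      Tracking state per sector is what would allow writing back only the
      dirty sectors of a page rather than the full page.
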
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      32443de3
  17. 18 Dec, 2020 1 commit
    • F
      btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Authored by Filipe Manana
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
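
      The circular wait in the list above can be checked mechanically with a
      wait-for chain. The sketch below is a hypothetical model, not kernel
      code: the enum names the three tasks from the list, and waits_on[]
      records which task each one is blocked on (-1 meaning it can run).

```c
#include <stdbool.h>

enum task { CLONE, RECLAIM, FLUSH_WORK, NR_TASKS };

/*
 * waits_on[t] is the task that t is blocked on, or -1 if t can run.
 * Returns true if following the chain from start leads back to start.
 */
static bool has_deadlock(const int waits_on[NR_TASKS], int start)
{
	int cur = start;
	int steps;

	/* A chain longer than NR_TASKS hops cannot contain start anymore. */
	for (steps = 0; steps < NR_TASKS; steps++) {
		cur = waits_on[cur];
		if (cur < 0)
			return false;	/* someone in the chain can progress */
		if (cur == start)
			return true;	/* we came back: circular wait */
	}
	return false;
}
```

      Filling the table from the list gives clone -> reclaim (reservation
      ticket), reclaim -> flush work (delalloc completion) and flush work ->
      clone (locked range), which forms a cycle. If the flush work were not
      blocked on the locked range, the chain would end and the check would
      report no deadlock.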
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
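
      The skip-on-flag idea described above can be sketched in a small userspace
      model. This is only an illustration under stated assumptions: the flag
      name, the struct layout and all function names below are hypothetical
      stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical runtime flag bit; the kernel's actual flag may be named differently. */
#define INODE_NO_DELALLOC_FLUSH (1u << 0)

/* Minimal stand-in for an inode that may have delalloc (dirty) data. */
struct model_inode {
    unsigned int runtime_flags; /* bitmask of runtime flags */
    bool has_delalloc;          /* true while dirty data awaits flushing */
    bool flushed;               /* set once the flusher has processed it */
};

/* Clone path: set the flag before dirtying a page to copy an inline extent. */
static void clone_begin_inline_copy(struct model_inode *inode)
{
    inode->runtime_flags |= INODE_NO_DELALLOC_FLUSH;
}

/* Clone path: clear the flag when the clone operation finishes. */
static void clone_end(struct model_inode *inode)
{
    inode->runtime_flags &= ~INODE_NO_DELALLOC_FLUSH;
}

/*
 * Flusher model: flush every inode with delalloc, except those that carry
 * the "do not flush" flag. Returns the number of inodes flushed.
 */
static size_t start_delalloc_inodes_model(struct model_inode *inodes, size_t n)
{
    size_t flushed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!inodes[i].has_delalloc)
            continue;
        if (inodes[i].runtime_flags & INODE_NO_DELALLOC_FLUSH)
            continue; /* skipped: a clone operation is in progress on it */
        inodes[i].has_delalloc = false;
        inodes[i].flushed = true;
        flushed++;
    }
    return flushed;
}
```

      In this model the flusher never waits on an inode that the clone task has
      flagged, so the two can no longer block each other; the skipped inode is
      picked up by a later flush once the flag is cleared.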
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d45f221
  18. 08 Dec, 2020 6 commits
    • F
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana committed
      There are several occasions where we do not update the inode's number of
      used bytes atomically, which allows a concurrent stat(2) syscall to report
      a number of used blocks that is not valid, that is, a value that matches
      neither what we had before the operation nor what we get after the
      operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for userspace tools that treat a file with a non-zero
      size and zero used blocks as completely sparse and skip reading its data,
      as reported and discussed a long time ago in threads like the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes. It should
        # not, because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
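
      The numbers in the output above are consistent with exactly one 64 KiB
      extent being counted twice. The following is a worked sketch of that
      arithmetic, assuming a simplified model in which the reported block count
      is the sum of the bytes already accounted in the inode and the bytes of
      new delalloc, in 512-byte units. The function name is hypothetical and the
      model ignores block size alignment.

```c
#include <stdint.h>

/* stat(2) on Linux reports st_blocks in units of 512 bytes. */
#define STAT_UNIT 512u

/*
 * Simplified model (an assumption, not the kernel code): the reported block
 * count covers the bytes already accounted in the inode plus the bytes of
 * new delalloc that have not been accounted in the inode yet.
 */
static uint64_t reported_blocks(uint64_t inode_bytes, uint64_t new_delalloc_bytes)
{
    return (inode_bytes + new_delalloc_bytes) / STAT_UNIT;
}
```

      With 2072 completed 64 KiB extents and one more under delalloc, the model
      gives 265344 blocks. If the inode's byte count is incremented for the new
      extent before 'new_delalloc_bytes' is decremented, the same call yields
      265472 blocks, which is 128 blocks (64 KiB) too many and matches the first
      error line of the reproducer.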
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents, we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be and that matches neither
      the number of used blocks before nor after the clone/deduplication/zero
      range operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has far fewer extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progress,
        # the number of used blocks/file size reported by stat should not change.
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
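
      Points 1) and 2) above can be sketched as a userspace model. This is a
      minimal sketch under stated assumptions: a C11 atomic_flag stands in for
      the btrfs inode's spinlock, and every name below is hypothetical rather
      than taken from the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Minimal stand-in for an inode with a spinlock-protected byte count. */
struct model_inode {
    atomic_flag lock; /* stand-in for the btrfs inode's spinlock */
    uint64_t bytes;   /* stand-in for the inode's number of used bytes */
};

static void inode_lock(struct model_inode *inode)
{
    while (atomic_flag_test_and_set_explicit(&inode->lock, memory_order_acquire))
        ; /* spin until the lock is free */
}

static void inode_unlock(struct model_inode *inode)
{
    atomic_flag_clear_explicit(&inode->lock, memory_order_release);
}

/*
 * The drop step only reports how many bytes it dropped. The caller then
 * applies the decrement and the increment in a single critical section, so
 * a reader never observes the intermediate (too small) value.
 */
static void inode_replace_bytes(struct model_inode *inode,
                                uint64_t dropped, uint64_t added)
{
    inode_lock(inode);
    inode->bytes = inode->bytes - dropped + added;
    inode_unlock(inode);
}

/* Reader side (the stat(2) path in this model) takes the same lock. */
static uint64_t inode_get_bytes(struct model_inode *inode)
{
    uint64_t value;

    inode_lock(inode);
    value = inode->bytes;
    inode_unlock(inode);
    return value;
}
```

      Overwriting a whole 64 KiB file in this model means calling
      inode_replace_bytes(inode, 65536, 65536), so a concurrent reader sees
      65536 bytes both before and after, and never 0.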
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and lock the whole range (0 to (u64)-1) while it
      computes the number of blocks used. But that would mean blocking
      stat(2), which is a very frequently used syscall and is expected to be
      fast, while it waits for writes, clone/dedupe, fallocate, page reads,
      fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2766ff61
    • F
      btrfs: refactor btrfs_drop_extents() to make it easier to extend · 5893dfb9
      Filipe Manana committed
      There are many arguments for __btrfs_drop_extents() and its wrapper
      btrfs_drop_extents(), which makes it hard to add more arguments and
      requires changing every caller. I added a couple myself back in 2014, in
      commit 1acae57b ("Btrfs: faster file extent item replace operations"),
      and therefore know firsthand that it is cumbersome to add additional
      arguments to these functions.
      
      Since I will need to add more arguments in a subsequent bug fix, this
      change is preparatory work and adds a data structure that holds all the
      arguments, for both input and output, that are passed to this function,
      with some comments in the structure's definition mentioning what each
      field is and how it relates to other fields.
      
      Callers of this function only need to zero out the content of the
      structure and set up the fields they need. This also removes the
      need to have both __btrfs_drop_extents() and btrfs_drop_extents(), so
      now we have a single function named btrfs_drop_extents() that takes a
      pointer to this new data structure (struct btrfs_drop_extents_args).
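
      The argument-structure pattern described above can be sketched as follows.
      This is a minimal userspace illustration, not the kernel's definition: the
      field names, the helper punch_range() and the behavior of the model
      function are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fields; the real struct btrfs_drop_extents_args may differ. */
struct drop_extents_args {
    /* Input fields */
    uint64_t start;       /* first byte of the range to drop */
    uint64_t end;         /* end of the range (exclusive) */
    bool drop_cache;      /* also drop cached extent mappings */
    /* Output fields */
    uint64_t bytes_found; /* number of bytes dropped from the range */
};

/* Model of the single entry point: reads inputs, fills in outputs. */
static int drop_extents(struct drop_extents_args *args)
{
    if (args->end < args->start)
        return -1;
    /* Model only: pretend the whole range was backed by extents. */
    args->bytes_found = args->end - args->start;
    return 0;
}

/* A caller zeroes the structure and sets only the fields it needs. */
static int punch_range(uint64_t start, uint64_t end, uint64_t *dropped)
{
    struct drop_extents_args args = { 0 };
    int ret;

    args.start = start;
    args.end = end;
    ret = drop_extents(&args);
    if (ret == 0)
        *dropped = args.bytes_found;
    return ret;
}
```

      Adding a new input or output later only means adding a field: callers that
      do not care about it keep compiling unchanged because they zero the whole
      structure first.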
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5893dfb9
    • J
      btrfs: locking: rip out path->leave_spinning · b9729ce0
      Josef Bacik committed
      We no longer distinguish between blocking and spinning, so rip out all
      this code.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b9729ce0
  19. 07 Oct, 2020 4 commits
    • J
      btrfs: reschedule when cloning lots of extents · 6b613cc9
      Johannes Thumshirn committed
      We have several occurrences of a soft lockup from the fstests generic/175
      test case, which look more or less like this one:
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030]
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 10030 Comm: xfs_io Tainted: G             L    5.9.0-rc5+ #768
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         <IRQ>
         dump_stack+0x77/0xa0
         panic+0xfa/0x2cb
         watchdog_timer_fn.cold+0x85/0xa5
         ? lockup_detector_update_enable+0x50/0x50
         __hrtimer_run_queues+0x99/0x4c0
         ? recalibrate_cpu_khz+0x10/0x10
         hrtimer_run_queues+0x9f/0xb0
         update_process_times+0x28/0x80
         tick_handle_periodic+0x1b/0x60
         __sysvec_apic_timer_interrupt+0x76/0x210
         asm_call_on_stack+0x12/0x20
         </IRQ>
         sysvec_apic_timer_interrupt+0x7f/0x90
         asm_sysvec_apic_timer_interrupt+0x12/0x20
        RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs]
        RSP: 0018:ffffc90007123a58 EFLAGS: 00000282
        RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000
        RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0
        RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032
        R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000
        R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0
         ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs]
         btrfs_release_path+0x67/0x80 [btrfs]
         btrfs_insert_replace_extent+0x177/0x2c0 [btrfs]
         btrfs_replace_file_extents+0x472/0x7c0 [btrfs]
         btrfs_clone+0x9ba/0xbd0 [btrfs]
         btrfs_clone_files.isra.0+0xeb/0x140 [btrfs]
         ? file_update_time+0xcd/0x120
         btrfs_remap_file_range+0x322/0x3b0 [btrfs]
         do_clone_file_range+0xb7/0x1e0
         vfs_clone_file_range+0x30/0xa0
         ioctl_file_clone+0x8a/0xc0
         do_vfs_ioctl+0x5b2/0x6f0
         __x64_sys_ioctl+0x37/0xa0
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f87977fc247
        RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247
        RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003
        RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000
        R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000
        R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: softlockup: hung tasks ]---
      
      All of these lockup reports have the call chain btrfs_clone_files() ->
      btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with
      both source and destination extents locked and loops over the source
      extent to create the clones.
      
      Conditionally reschedule in the btrfs_clone() loop, to give some time back
      to other processes.
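
      The idea of yielding inside a long-running loop can be sketched in
      userspace. This is only an analogy under stated assumptions: POSIX
      sched_yield() stands in for the kernel's conditional reschedule, and the
      function names and the per-extent work are hypothetical placeholders.

```c
#include <sched.h>
#include <stddef.h>

/* Userspace stand-in for a conditional reschedule point. */
static void maybe_reschedule(void)
{
    sched_yield(); /* let other runnable tasks use the CPU */
}

/*
 * Model of a clone loop over many extents. After each extent the loop offers
 * the CPU to other tasks instead of running uninterrupted until the end.
 * Returns the number of extents processed; *yields counts the yield points.
 */
static size_t clone_extents_model(size_t nr_extents, size_t *yields)
{
    size_t cloned = 0;

    for (size_t i = 0; i < nr_extents; i++) {
        cloned++; /* placeholder for cloning one extent */
        maybe_reschedule();
        (*yields)++;
    }
    return cloned;
}
```

      The work done is unchanged; the loop simply gains one point per iteration
      where the scheduler may run something else, which is what keeps a soft
      lockup watchdog from firing on very long clone ranges.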
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6b613cc9
    • F
      btrfs: rename btrfs_punch_hole_range() to a more generic name · 306bfec0
      Filipe Manana committed
      The function btrfs_punch_hole_range() is now used to replace all the file
      extents in a given file range with an extent described in the given struct
      btrfs_replace_extent_info argument. This extent can either be an existing
      extent that is being cloned or it can be a new extent (namely a prealloc
      extent). When that argument is NULL it only punches a hole (drops all the
      existing extents) in the file range.
      
      So rename the function to btrfs_replace_file_extents().
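
      The "NULL means punch a hole" convention described above can be sketched
      with a toy model in which a file is an array of blocks and each block
      records the id of the extent backing it. All names and the data layout are
      hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_BLOCKS 8
#define HOLE 0u /* block is not backed by any extent */

/* Hypothetical description of the extent to place in the range. */
struct replace_extent_info_model {
    uint32_t extent_id; /* non-zero id of a cloned or preallocated extent */
};

/*
 * Replace blocks [first, last) with the extent described by 'info'.
 * When 'info' is NULL the existing extents are only dropped, which leaves
 * a hole in the range.
 */
static int replace_file_extents_model(uint32_t file[NR_BLOCKS],
                                      size_t first, size_t last,
                                      const struct replace_extent_info_model *info)
{
    if (first > last || last > NR_BLOCKS)
        return -1;
    for (size_t b = first; b < last; b++)
        file[b] = info ? info->extent_id : HOLE;
    return 0;
}
```

      One function therefore serves cloning, preallocation and hole punching,
      which is why a name tied to hole punching alone no longer fits.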
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      306bfec0
    • F
      btrfs: rename struct btrfs_clone_extent_info to a more generic name · bf385648
      Filipe Manana committed
      Now that we can use btrfs_clone_extent_info to convey information for a
      new prealloc extent as well, and not just for existing extents that are
      being cloned, rename it to btrfs_replace_extent_info, which reflects the
      fact that this is now more generic and it is used to replace all existing
      extents in a file range with the extent described by the structure.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bf385648
    • F
      btrfs: remove item_size member of struct btrfs_clone_extent_info · fb870f6c
      Filipe Manana committed
      The value of item_size of struct btrfs_clone_extent_info is always set to
      the size of a non-inline file extent item, and in fact the infrastructure
      that uses this structure (btrfs_punch_hole_range()) does not work with
      inline file extents at all (and it is not supposed to).
      
      So just remove that field from the structure and use directly
      sizeof(struct btrfs_file_extent_item) instead. Also assert that the
      file extent type is not inline at btrfs_insert_clone_extent().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb870f6c