1. 16 May 2022, 35 commits
    • btrfs: do not return errors from btrfs_submit_compressed_read · cb4411dd
      Christoph Hellwig authored
      btrfs_submit_compressed_read already calls ->bi_end_io on error and
      the caller must ignore the return value, so remove it.
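
      A minimal sketch of the convention relied on here (illustration only, with
      placeholder helper names, not code from the commit): on error the
      submission path completes the bio itself and has nothing useful to return.

        static void example_submit_read(struct bio *bio)
        {
                int ret = example_prepare(bio);   /* placeholder for the real checks */

                if (ret) {
                        bio->bi_status = errno_to_blk_status(ret);
                        bio_endio(bio);           /* ->bi_end_io does the cleanup */
                        return;                   /* nothing meaningful to return */
                }
                submit_bio(bio);
        }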
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb4411dd
    • btrfs: move btrfs_readpage to extent_io.c · 7aab8b32
      Christoph Hellwig authored
      Keep btrfs_readpage next to btrfs_do_readpage and the other address
      space operations.  This allows keeping submit_one_bio and
      struct btrfs_bio_ctrl file local in extent_io.c.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7aab8b32
    • btrfs: avoid double search for block group during NOCOW writes · 2306e83e
      Filipe Manana authored
      When doing a NOCOW write, either through direct IO or buffered IO, we do
      two lookups for the block group that contains the target extent: once
      when we call btrfs_inc_nocow_writers() and then later again when we call
      btrfs_dec_nocow_writers() after creating the ordered extent.
      
      The lookups require taking a lock and navigating the red black tree used
      to track all block groups, which can take a non-negligible amount of time
      on a large filesystem with thousands of block groups, and also causes lock
      contention and cache line bouncing.
      
      Improve on this by having a single block group search: making
      btrfs_inc_nocow_writers() return the block group to its caller and then
      have the caller pass that block group to btrfs_dec_nocow_writers().
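
      A sketch of the single-lookup pattern described above (the exact
      signatures are an assumption here, for illustration only):

        struct btrfs_block_group *bg;

        /* one search of the block group tree */
        bg = btrfs_inc_nocow_writers(fs_info, file_extent_start);
        if (!bg)
                return -ENOSPC;               /* illustrative error handling */

        /* ... create the ordered extent for the NOCOW write ... */

        /* reuse the cached pointer, no second search */
        btrfs_dec_nocow_writers(bg);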
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: remove search start argument from first_logical_byte()
        btrfs: use rbtree with leftmost node cached for tracking lowest block group
        btrfs: use a read/write lock for protecting the block groups tree
        btrfs: return block group directly at btrfs_next_block_group()
        btrfs: avoid double search for block group during NOCOW writes
      
      The following test was used to test these changes from a performance
      perspective:
      
         $ cat test.sh
         #!/bin/bash
      
         modprobe null_blk nr_devices=0
      
         NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
         mkdir $NULL_DEV_PATH
         if [ $? -ne 0 ]; then
             echo "Failed to create nullb0 directory."
             exit 1
         fi
         echo 2 > $NULL_DEV_PATH/submit_queues
         echo 16384 > $NULL_DEV_PATH/size # 16G
         echo 1 > $NULL_DEV_PATH/memory_backed
         echo 1 > $NULL_DEV_PATH/power
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
         LOOP_MNT="$MNT/loop"
         MOUNT_OPTIONS="-o ssd -o nodatacow"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [io_uring_writes]
         rw=randwrite
         fsync=0
         fallocate=posix
         group_reporting=1
         direct=1
         ioengine=io_uring
         iodepth=64
         bs=64k
         filesize=1g
         runtime=300
         time_based
         directory=$LOOP_MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         mkdir $LOOP_MNT
      
         truncate -s 4T $MNT/loopfile
         mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
         mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
      
         # Trigger the allocation of about 3500 data block groups, without
         # actually consuming space on the underlying filesystem, just to make
         # the tree of block groups large.
         fallocate -l 3500G $LOOP_MNT/filler
      
         fio /tmp/fio-job.ini
      
         umount $LOOP_MNT
         umount $MNT
      
         echo 0 > $NULL_DEV_PATH/power
         rmdir $NULL_DEV_PATH
      
      The test was run on a non-debug kernel (Debian's default kernel config);
      the results were the following.
      
      Before patchset:
      
        WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
      
      After patchset:
      
        WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
      
        +3.3% write throughput and +3.3% IO done in the same time period.
      
      The test has somewhat limited coverage scope, as with only NOCOW writes
      we get less contention on the red black tree of block groups, since we
      don't have the extra contention caused by COW writes, namely when
      allocating data extents, pinning and unpinning data extents; but on the
      other hand there's access to the tree in the NOCOW path, when incrementing
      a block group's number of NOCOW writers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2306e83e
    • btrfs: avoid double clean up when submit_one_bio() failed · c9583ada
      Qu Wenruo authored
      [BUG]
      When running generic/475 with 64K page size and 4K sector size, there is
      a very high chance (almost 100%) of hanging, with most data pages left
      locked and nothing ever unlocking them.
      
      [CAUSE]
      With commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads"), if we fail to look up checksums due to a metadata IO error, we
      will return an error from btrfs_submit_data_bio().
      
      This will cause the page to be unlocked twice in btrfs_do_readpage():
      
       btrfs_do_readpage()
       |- submit_extent_page()
       |  |- submit_one_bio()
       |     |- btrfs_submit_data_bio()
       |        |- if (ret) {
       |        |-     bio->bi_status = ret;
       |        |-     bio_endio(bio); }
       |               In the endio function, we will call end_page_read()
       |               and unlock_extent() to cleanup the subpage range.
       |
       |- if (ret) {
       |-        unlock_extent(); end_page_read() }
                 Here we unlock the extent and cleanup the subpage range
                 again.
      
      For unlock_extent(), it's mostly double unlock safe.
      
      But for end_page_read() it's not, especially for the subpage case, as
      there we call btrfs_subpage_end_reader() to decrease the reader count,
      and use that number to determine if we need to unlock the full page.
      
      If double accounted, it can underflow the number and leave the page
      locked without anyone to unlock it.
      
      [FIX]
      The commit 1784b7d5 ("btrfs: handle csum lookup errors properly on
      reads") itself is completely fine; it's our existing code that does not
      properly handle errors from the bio submission hook.
      
      This patch makes submit_one_bio() return void so that callers can no
      longer do their own cleanup when the bio submission hook fails.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c9583ada
    • btrfs: use BTRFS_DIR_START_INDEX at btrfs_create_new_inode() · 49024388
      Filipe Manana authored
      We are still using the magic value of 2 at btrfs_create_new_inode(), but
      there's now a constant for that, named BTRFS_DIR_START_INDEX, which was
      introduced in commit 528ee697 ("btrfs: put initial index value of a
      directory in a constant"). So change that to use the constant.
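
      For illustration, the change amounts to replacing the magic value with
      the named constant (BTRFS_DIR_START_INDEX is 2 because indexes 0 and 1
      are reserved for "." and ".."); the field shown here is an assumption:

        BTRFS_I(inode)->index_cnt = BTRFS_DIR_START_INDEX;   /* was: = 2 */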
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49024388
    • btrfs: do not test for free space inode during NOCOW check against file extent · a7bb6bd4
      Filipe Manana authored
      When checking if we can do a NOCOW write against a range covered by a file
      extent item, we do a quick check to determine if the inode's root was
      snapshotted in a generation older than the generation of the file extent
      item or not. This is to quickly determine if the extent is likely shared
      and avoid the expensive check for cross references (this was added in
      commit 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in
      nocow path").
      
      We restrict that check to the case where the inode is not a free space
      inode (since commit 27a7ff55 ("btrfs: skip file_extent generation
      check for free_space_inode in run_delalloc_nocow")). That is because when
      we had the inode cache feature, inode caches were backed by a free space
      inode that belonged to the inode's root.
      
      However we don't have support for the inode cache feature since kernel
      5.11, so we don't need this check anymore since free space inodes are
      now always related to free space caches, which are always associated to
      the root tree (which can't be snapshotted, and its last_snapshot field
      is always 0).
      
      So remove that condition.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a7bb6bd4
    • btrfs: move common NOCOW checks against a file extent into a helper · 619104ba
      Filipe Manana authored
      Verifying if we can do a NOCOW write against a range fully or partially
      covered by a file extent item requires verifying several constraints, and
      these are currently duplicated at two different places: can_nocow_extent()
      and run_delalloc_nocow().
      
      This change moves those checks into a common helper function to avoid
      duplication. It adds some comments and also preserves all existing
      behaviour like for example can_nocow_extent() treating errors from the
      calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning
      we can not NOCOW, instead of propagating the error back to the caller.
      That specific behaviour is questionable but also reasonable to some
      degree.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      619104ba
    • btrfs: factor out allocating an array of pages · dd137dd1
      Sweet Tea Dorminy authored
      Several functions currently populate an array of page pointers one
      allocated page at a time. Factor out the common code so as to allow
      improvements to all of the sites at once.
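
      A minimal sketch of what such a factored-out helper could look like (the
      name, flags and exact behaviour are assumptions, not the patch itself):

        static int example_alloc_page_array(unsigned int nr_pages,
                                            struct page **pages)
        {
                unsigned int i;

                for (i = 0; i < nr_pages; i++) {
                        pages[i] = alloc_page(GFP_NOFS);
                        if (!pages[i])
                                return -ENOMEM;   /* caller frees what was allocated */
                }
                return 0;
        }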
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dd137dd1
    • btrfs: remove unnecessary type casts · 0d031dc4
      Yu Zhe authored
      Explicit type casts are not necessary when it's void* to another pointer
      type.
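
      For example (hypothetical variable names, illustration only):

        void *priv = bio->bi_private;

        /* unnecessary cast: */
        struct btrfs_inode *bi = (struct btrfs_inode *)priv;
        /* sufficient, since void * converts implicitly in C: */
        struct btrfs_inode *bi2 = priv;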
      Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0d031dc4
    • btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine · fbca46eb
      Qu Wenruo authored
      The reason why we only support 64K page size for subpage is, for 64K
      page size we can ensure no matter what the nodesize is, we can fit it
      into one page.
      
      When other page sizes come into play, especially 16K, that limitation
      becomes restrictive.
      
      To remove such a limitation, we allow the nodesize >= PAGE_SIZE case to go
      through the non-subpage routine.  This allows 4K sectorsize on 16K page
      size.
      
      Although this introduces another, smaller limitation - metadata can not
      cross a page boundary - that is already satisfied by most recent mkfs
      versions.
      
      Another small improvement is that we can avoid the overhead for metadata
      if nodesize >= PAGE_SIZE.
      For 4K sector size and 64K page size/node size, or 4K sector size and
      16K page size/node size, we don't need to allocate extra memory for the
      metadata pages.
      
      Please note that this patch does not yet enable support for other page
      sizes.
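
      A hedged sketch of the kind of check this enables (not the exact helper
      from the patch):

        /* metadata only needs the subpage machinery when a tree block is
         * smaller than a page */
        static inline bool example_meta_needs_subpage(const struct btrfs_fs_info *fs_info)
        {
                return fs_info->nodesize < PAGE_SIZE;
        }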
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fbca46eb
    • btrfs: replace memset with memzero_page in data checksum verification · b06660b5
      Qu Wenruo authored
      The original code resets the page to 0x1 for no apparent reason; it's
      been like that since the initial 2007 code added in commit 07157aac
      ("Btrfs: Add file data csums back in via hooks in the extent map code").
      
      It could mean that a failed buffer can be detected from the data but
      that's just a guess and any value is good.
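
      For illustration, the change swaps an open-coded mapping plus memset for
      the highmem helper (the offset and length shown are placeholders):

        /* before: map the page and fill the sector with a magic value */
        void *kaddr = kmap_atomic(page);
        memset(kaddr + pgoff, 1, fs_info->sectorsize);
        kunmap_atomic(kaddr);

        /* after: zero the range without an explicit mapping */
        memzero_page(page, pgoff, fs_info->sectorsize);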
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      b06660b5
    • btrfs: avoid blocking on space revervation when doing nowait dio writes · d4135134
      Filipe Manana authored
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However, reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, waiting for ordered extents to complete, triggering transaction
      commits, etc. - going against the semantics of a NOWAIT write.
      
      So make the NOWAIT write path try to reserve all the metadata it needs
      without blocking - if we get -ENOSPC or -EDQUOT then return -EAGAIN so
      that the caller falls back to a blocking direct IO write.
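
      A sketch of the fallback mapping described above (the reservation helper
      shown is a hypothetical wrapper, not the real function):

        ret = example_reserve_metadata_nowait(inode, len);
        if (ret == -ENOSPC || ret == -EDQUOT)
                return -EAGAIN;   /* caller retries as a blocking dio write */
        if (ret)
                return ret;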
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
        btrfs: avoid blocking on space revervation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test was run on a 12 core box with 64G of RAM, using a non-debug
      kernel config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4135134
    • btrfs: avoid blocking when allocating context for nowait dio read/write · 4f208dcc
      Filipe Manana authored
      When doing a NOWAIT direct IO read/write, we allocate a context object
      (struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
      waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
      This is undesirable for the NOWAIT semantics, so do the allocation with
      GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails
      return -EAGAIN, so that the caller can fallback to a blocking context and
      retry with a non-blocking write.
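
      An assumed shape of the allocation pattern described above (variable and
      flag handling simplified for illustration):

        const bool nowait = (flags & IOMAP_NOWAIT);
        struct btrfs_dio_data *dio_data;

        dio_data = kzalloc(sizeof(*dio_data), nowait ? GFP_NOWAIT : GFP_NOFS);
        if (!dio_data)
                return nowait ? -EAGAIN : -ENOMEM;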
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4f208dcc
    • btrfs: release path earlier at can_nocow_extent() · 59d35c51
      Filipe Manana authored
      At can_nocow_extent(), we are releasing the path only after checking if
      the block group that has the target extent is read only, and after
      checking if there's delalloc in the range in case our extent is a
      preallocated extent. The read only extent check can be expensive if we
      have a very large filesystem with many block groups, as well as the
      check for delalloc in the inode's io_tree in case the io_tree is big
      due to IO on other file ranges.
      
      Our path is holding a read lock on a leaf and there's no need to keep
      the lock while doing those two checks, so release the path before doing
      them, immediately after the last use of the leaf.
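
      Conceptually (a simplified sketch, not the exact diff):

        /* done with the leaf: drop the read lock before the slow checks */
        btrfs_release_path(path);

        if (btrfs_extent_readonly(fs_info, disk_bytenr))
                goto out;          /* read-only block group, cannot NOCOW */
        /* ... the delalloc check on the io_tree also runs without the leaf lock ... */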
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      59d35c51
    • btrfs: free path at can_nocow_extent() before checking for checksum items · c1a548db
      Filipe Manana authored
      When we look for checksum items, through csum_exist_in_range(), at
      can_nocow_extent(), we no longer need the path that we have previously
      allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
      we also end up allocating a path, so we are adding unnecessary extra
      memory usage. So free the path before calling csum_exist_in_range().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1a548db
    • btrfs: stop allocating a path when checking if cross reference exists · 1a89f173
      Filipe Manana authored
      At btrfs_cross_ref_exist() we always allocate a path, but we really don't
      need to because all its callers (only 2) already have an allocated path
      that is not being used when they call btrfs_cross_ref_exist(). So change
      btrfs_cross_ref_exist() to take a path as an argument and update both
      its callers to pass in the unused path they have when they call
      btrfs_cross_ref_exist().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a89f173
    • btrfs: avoid double nocow check when doing nowait dio writes · d7a8ab4e
      Filipe Manana authored
      When doing a NOWAIT direct IO write we are checking twice if we can NOCOW
      into the target file range using can_nocow_extent() - once at the very
      beginning of the write path, at btrfs_write_check() via
      check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
      
      The can_nocow_extent() function does a lot of expensive things - searching
      for the file extent item in the inode's subvolume tree, searching for the
      extent item in the extent tree, checking delayed references, etc, so it
      isn't a very cheap call.
      
      We can remove the first check at btrfs_write_check(), and add a quick
      check there to verify if the inode has the NODATACOW or PREALLOC flags,
      bailing out quickly if it has neither of those flags, as that means we
      have to COW and therefore can't comply with the NOWAIT semantics.
      
      After this we do only one call to can_nocow_extent(), while we are at
      btrfs_get_blocks_direct_write(), where we have already locked the file
      range and we did a try lock on the range before, at
      btrfs_dio_iomap_begin() (since the previous patch in the series).
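
      An assumed form of the cheap flag test mentioned above (the real check in
      the patch may differ in detail):

        if (!(BTRFS_I(inode)->flags &
              (BTRFS_INODE_NODATACOW | BTRFS_INODE_PREALLOC)))
                return -EAGAIN;   /* must COW, cannot honour NOWAIT */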
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d7a8ab4e
    • btrfs: avoid blocking nowait dio when locking file range · 59094403
      Filipe Manana authored
      If we are doing a NOWAIT direct IO read/write, we can block when locking
      the file range at btrfs_dio_iomap_begin(), as it's possible the range (or
      a part of it) is already locked by another task (mmap writes, another
      direct IO read/write racing with us, fiemap, etc). We are also waiting for
      completion of any ordered extent we find in the range, which also can
      block us for a significant amount of time.
      
      There's also the incorrect fallback to buffered IO (returning -ENOTBLK)
      when we are dealing with a NOWAIT request and we can't proceed. In this
      case we should be returning -EAGAIN, as falling back to buffered IO can
      result in blocking for many different reasons, so that the caller can
      delegate a retry to a context where blocking is more acceptable.
      
      Fix these cases by:
      
      1) Doing a try lock on the file range and failing with -EAGAIN if we
         can not lock right away;
      
      2) Fail with -EAGAIN if we find an ordered extent;
      
      3) Return -EAGAIN instead of -ENOTBLK when we need to fallback to
         buffered IO and we have a NOWAIT request.
      
      This will also allow us to avoid a duplicated check that verifies if we
      are able to do a NOCOW write for NOWAIT direct IO writes, done in the
      next patch.
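
      A hedged sketch of points 1) and 3) above (simplified, not the exact
      code):

        if (nowait) {
                if (!try_lock_extent(io_tree, lockstart, lockend))
                        return -EAGAIN;   /* range busy, let caller retry blocking */
        } else {
                lock_extent_bits(io_tree, lockstart, lockend, &cached_state);
        }

        /* on a NOWAIT request, never fall back to buffered IO with -ENOTBLK */
        return nowait ? -EAGAIN : -ENOTBLK;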
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      59094403
    • btrfs: avoid blocking on page locks with nowait dio on compressed range · b023e675
      Filipe Manana authored
      If we are doing NOWAIT direct IO read/write and our inode has compressed
      extents, we call filemap_fdatawrite_range() against the range in order
      to wait for compressed writeback to complete, since the generic code at
      iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not
      enough to wait for compressed writeback to complete.
      
      This call to filemap_fdatawrite_range() can block on page locks, since
      the first writepages() on a range that we will try to compress results
      only in queuing a work to compress the data while holding the pages
      locked.
      
      Even though the generic code at iomap_dio_rw() will do the right thing
      and return -EAGAIN for NOWAIT requests in case there are pages in the
      range, we can still end up at btrfs_dio_iomap_begin() with pages in the
      range because either of the following can happen:
      
      1) Memory mapped writes, as we haven't locked the range yet;
      
      2) Buffered reads might have started, which lock the pages, and we do
         the filemap_fdatawrite_range() call before locking the file range.
      
      So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we
      are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback()
      to check if there are any locked, dirty, or under writeback pages, and
      return -EAGAIN if that's the case.
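
      Roughly (a simplified sketch of the behaviour described above):

        if (nowait) {
                if (filemap_range_needs_writeback(inode->i_mapping,
                                                  lockstart, lockend))
                        return -EAGAIN;   /* dirty/locked/writeback pages present */
        } else {
                ret = filemap_fdatawrite_range(inode->i_mapping,
                                               lockstart, lockend);
                if (ret)
                        return ret;
        }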
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b023e675
    • btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Filipe Manana authored
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive mode, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.
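
      A sketch of what such an assertion helper can look like (name and details
      are assumptions, for illustration only):

        static void example_assert_range_clean(struct btrfs_inode *inode,
                                               u64 start, u64 end)
        {
                struct btrfs_ordered_extent *ordered;

                ordered = btrfs_lookup_first_ordered_range(inode, start,
                                                           end + 1 - start);
                ASSERT(ordered == NULL);
                if (ordered)
                        btrfs_put_ordered_extent(ordered);
        }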
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      63c34cb4
    • btrfs: restore inode creation before xattr setting · 6c3636eb
      Sweet Tea Dorminy authored
      According to the tree checker, "all xattrs with a given objectid follow
      the inode with that objectid in the tree" is an invariant. This was
      broken by the recent change "btrfs: move common inode creation code into
      btrfs_create_new_inode()", which moved acl creation and property
      inheritance (stored in xattrs) to before inode insertion into the tree.
      As a result, under certain timings, the xattrs could be written to the
      tree before the inode, causing the tree checker to report violation of
      the invariant.
      
      Move property inheritance and acl creation back to their old ordering
      after the inode insertion.
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6c3636eb
    • btrfs: move common inode creation code into btrfs_create_new_inode() · caae78e0
      Omar Sandoval authored
      All of our inode creation code paths duplicate the calls to
      btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
      additionally duplicates property inheritance and the call to
      btrfs_set_inode_index(). Fix this by moving the common code into
      btrfs_create_new_inode(). This accomplishes a few things at once:
      
      1. It reduces code duplication.
      
      2. It allows us to set up the inode completely before inserting the
         inode item, removing calls to btrfs_update_inode().
      
      3. It fixes a leak of an inode on disk in some error cases. For example,
         in btrfs_create(), if btrfs_new_inode() succeeds, then we have
         inserted an inode item and its inode ref. However, if something after
         that fails (e.g., btrfs_init_inode_security()), then we end the
         transaction and then decrement the link count on the inode. If the
         transaction is committed and the system crashes before the failed
         inode is deleted, then we leak that inode on disk. Instead, this
         refactoring aborts the transaction when we can't recover more
         gracefully.
      
      4. It exposes various ways that subvolume creation diverges from mkdir
         in terms of inheriting flags, properties, permissions, and POSIX
         ACLs, a lot of which appears to be accidental. This patch explicitly
         does _not_ change the existing non-standard behavior, but it makes
         those differences more clear in the code and documents them so that
         we can discuss whether they should be changed.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      caae78e0
    • btrfs: reserve correct number of items for inode creation · 3538d68d
      Omar Sandoval authored
      The various inode creation code paths do not account for the compression
      property, POSIX ACLs, or the parent inode item when starting a
      transaction. Fix it by refactoring all of these code paths to use a new
      function, btrfs_new_inode_prepare(), which computes the correct number
      of items. To do so, it needs to know whether POSIX ACLs will be created,
      so move the ACL creation into that function. To reduce the number of
      arguments that need to be passed around for inode creation, define
      struct btrfs_new_inode_args containing all of the relevant information.
      
      btrfs_new_inode_prepare() will also be a good place to set up the
      fscrypt context and encrypted filename in the future.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3538d68d
    • btrfs: factor out common part of btrfs_{mknod,create,mkdir}() · 5f465bf1
      Omar Sandoval authored
      btrfs_{mknod,create,mkdir}() are now identical other than the inode
      initialization and some inconsequential function call order differences.
      Factor out the common code to reduce code duplication.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f465bf1
    • btrfs: allocate inode outside of btrfs_new_inode() · a1fd0c35
      Omar Sandoval authored
      Instead of calling new_inode() and inode_init_owner() inside of
      btrfs_new_inode(), do it in the callers. This allows us to pass in just
      the inode instead of the mnt_userns and mode and removes the need for
      memalloc_nofs_{save,restore}() since we do it before starting a
      transaction. In create_subvol(), it also means we no longer have to look
      up the inode again to instantiate it. This also paves the way for some
      more cleanups in later patches.
      
      This also removes the comments about Smack checking i_op, which are no
      longer true since commit 5d6c3191 ("xattr: Add
      __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
      IOP_XATTR, which is set based on sb->s_xattr.
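
      Illustratively, callers now do the generic VFS setup themselves before
      handing the inode over (a simplified sketch, not the exact diff):

        struct inode *inode = new_inode(dir->i_sb);

        if (!inode)
                return -ENOMEM;
        inode_init_owner(mnt_userns, inode, dir, mode);
        /* ... pass the pre-allocated inode to btrfs_new_inode() ... */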
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a1fd0c35
    • btrfs: use btrfs_for_each_slot in btrfs_real_readdir · a8ce68fd
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a8ce68fd
    • btrfs: set inode flags earlier in btrfs_new_inode() · 305eaac0
      Omar Sandoval authored
      btrfs_new_inode() inherits the inode flags from the parent directory and
      the mount options _after_ we fill the inode item. This works because all
      of the callers of btrfs_new_inode() make further changes to the inode
      and then call btrfs_update_inode(). It'd be better to fully initialize
      the inode once to avoid the extra update, so as a first step, set the
      inode flags _before_ filling the inode item.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      305eaac0
    • btrfs: move btrfs_get_free_objectid() call into btrfs_new_inode() · 6437d458
      Omar Sandoval authored
      Every call of btrfs_new_inode() is immediately preceded by a call to
      btrfs_get_free_objectid(). Since getting an inode number is part of
      creating a new inode, this is better off being moved into
      btrfs_new_inode(). While we're here, get rid of the comment about
      reclaiming inode numbers, since we only did that when using the ino
      cache, which was removed by commit 5297199a ("btrfs: remove inode
      number cache feature").
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6437d458
    • btrfs: don't pass parent objectid to btrfs_new_inode() explicitly · 23c24ef8
      Omar Sandoval authored
      For everything other than a subvolume root inode, we get the parent
      objectid from the parent directory. For the subvolume root inode, the
      parent objectid is the same as the inode's objectid. We can find this
      within btrfs_new_inode() instead of passing it.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      23c24ef8
    • btrfs: remove unnecessary set_nlink() in btrfs_create_subvol_root() · c51fa511
      Omar Sandoval authored
      btrfs_new_inode() already returns an inode with nlink set to 1 (via
      inode_init_always()). Get rid of the unnecessary set.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c51fa511
    • btrfs: remove unnecessary inode_set_bytes(0) call · 6d831f7e
      Omar Sandoval authored
      new_inode() always returns an inode with i_blocks and i_bytes set to 0
      (via inode_init_always()). Remove the unnecessary call to
      inode_set_bytes() in btrfs_new_inode().
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d831f7e
    • btrfs: remove unnecessary btrfs_i_size_write(0) calls · 9124e15f
      Omar Sandoval authored
      btrfs_new_inode() always returns an inode with i_size and disk_i_size
      set to 0 (via inode_init_always() and btrfs_alloc_inode(),
      respectively). Remove the unnecessary calls to btrfs_i_size_write() in
      btrfs_mkdir() and btrfs_create_subvol_root().
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9124e15f
    • btrfs: get rid of btrfs_add_nondir() · 81512e89
      Omar Sandoval authored
      This is a trivial wrapper around btrfs_add_link(). The only thing it
      does other than moving arguments around is translating a > 0 return
      value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
      0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
      be broken). The check itself dates back to commit 2c90e5d6 ("Btrfs:
      still corruption hunting"), so it's probably left over from debugging.
      Let's just get rid of btrfs_add_nondir().
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81512e89
    • btrfs: reserve correct number of items for rename · c1621871
      Omar Sandoval authored
      btrfs_rename() and btrfs_rename_exchange() don't account for enough
      items. Replace the incorrect explanations with a specific breakdown of
      the number of items and account them accurately.
      
      Note that this glosses over RENAME_WHITEOUT because the next commit is
      going to rework that, too.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c1621871
    • btrfs: reserve correct number of items for unlink and rmdir · bca4ad7c
      Omar Sandoval authored
      __btrfs_unlink_inode() calls btrfs_update_inode() on the parent
      directory in order to update its size and sequence number. Make sure we
      account for it.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bca4ad7c
  2. 28 April 2022, 1 commit
  3. 19 April 2022, 2 commits
  4. 06 April 2022, 2 commits
    • btrfs: release correct delalloc amount in direct IO write path · 6d82ad13
      Naohiro Aota authored
      Running generic/406 causes the following WARNING in btrfs_destroy_inode(),
      which tells us there are outstanding extents left.
      
      In btrfs_get_blocks_direct_write(), we reserve temporary outstanding
      extents with btrfs_delalloc_reserve_metadata() (or indirectly through
      btrfs_delalloc_reserve_space()). We then release the outstanding extents
      with btrfs_delalloc_release_extents(). However, the "len" can be modified
      in the COW case, which releases fewer outstanding extents than expected.
      
      Fix it by calling btrfs_delalloc_release_extents() for the original length.
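
      A minimal sketch of the fix described above (variable names are
      illustrative):

        const u64 prev_len = len;

        /* ... the reservation / allocation below may shrink 'len' in the
         *     COW case ... */

        /* release what was actually reserved, not the shrunken length */
        btrfs_delalloc_release_extents(BTRFS_I(inode), prev_len);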
      
      To reproduce the warning, the filesystem should be 1 GiB.  It's
      triggering a short-write, due to not being able to allocate a large
      extent and instead allocating a smaller one.
      
        WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs]
        Modules linked in: btrfs blake2b_generic xor lzo_compress
        lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram
        zsmalloc
        CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
        RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs]
        RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206
        RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000
        RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78
        RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8
        R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800
        R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68
        FS:  00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <TASK>
         destroy_inode+0x33/0x70
         dispose_list+0x43/0x60
         evict_inodes+0x161/0x1b0
         generic_shutdown_super+0x2d/0x110
         kill_anon_super+0xf/0x20
         btrfs_kill_super+0xd/0x20 [btrfs]
         deactivate_locked_super+0x27/0x90
         cleanup_mnt+0x12c/0x180
         task_work_run+0x54/0x80
         exit_to_user_mode_prepare+0x152/0x160
         syscall_exit_to_user_mode+0x12/0x30
         do_syscall_64+0x42/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         RIP: 0033:0x7f854a000fb7
      
      Fixes: f0bfa76a ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
      CC: stable@vger.kernel.org # 5.17
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d82ad13
    • btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range · 9435be73
      Haowen Bai authored
      The logic !A || (A && B) is equivalent to !A || B, so we can make the
      code clearer.

      Note: though the more human-readable form is preferred, there have been
      repeated reports and patches because the expression is detected by tools,
      so apply it to reduce the load.
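
      For example (illustrative condition and function names only):

        /* before */
        if (!zoned || (zoned && can_submit))
                run_range();
        /* after: equivalent, simpler */
        if (!zoned || can_submit)
                run_range();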
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Haowen Bai <baihaowen@meizu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      9435be73