1. 10 Jun, 2020 1 commit
  2. 28 May, 2020 6 commits
    • btrfs: fix space_info bytes_may_use underflow during space cache writeout · 2166e5ed
      Committed by Filipe Manana
      We always preallocate a data extent for writing a free space cache, which
      causes writeback to always try the nocow path first, since the free space
      inode has the prealloc bit set in its flags.
      
      However, if the block group that contains the data extent for the space
      cache has been turned to RO mode, due to a running scrub or balance for
      example, we have to fall back to the cow path. In that case, once a new data
      extent is allocated we end up calling btrfs_add_reserved_bytes(), which
      decrements the counter named bytes_may_use from the data space_info object
      with the expectation that this counter was previously incremented by the
      same amount (the size of the data extent).
      
      However, when we started writeout of the space cache at cache_save_setup(),
      we incremented the value of the bytes_may_use counter through a call to
      btrfs_check_data_free_space() and then decremented it through a call to
      btrfs_prealloc_file_range_trans() immediately after. So when starting the
      writeback, if we fall back to cow mode we have to increment the counter
      bytes_may_use of the data space_info again to compensate for the extent
      allocation done by the cow path.
      
      When this issue happens we are incorrectly decrementing the bytes_may_use
      counter, and when its current value is smaller than the amount we try to
      subtract we end up with the following warning:
      
       ------------[ cut here ]------------
       WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1591)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0000:ffffa41608f13660 EFLAGS: 00010287
       RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410
       RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400
       R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400
       FS:  0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a52 ]---
       ------------[ cut here ]------------
      
      So fix this by incrementing the bytes_may_use counter of the data
      space_info when we fall back to the cow path. If the cow path is successful
      the counter is decremented after extent allocation (by
      btrfs_add_reserved_bytes()); if it fails, it ends up being decremented as
      well when clearing the delalloc range (extent_clear_unlock_delalloc()).
      
      This could be triggered sporadically by the test case btrfs/061 from
      fstests.
      
      Fixes: 82d5902d ("Btrfs: Support reading/writing on disk free ino cache")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
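The accounting described in this commit can be replayed with a small userspace model. This is only a sketch: `toy_space_info` and every `toy_*` helper are hypothetical stand-ins, not the kernel structures or functions, and the `underflowed` flag marks the point where the kernel would emit the warning shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the data space_info (hypothetical, not the kernel struct). */
struct toy_space_info {
    uint64_t bytes_may_use;   /* space promised to pending writes */
    uint64_t bytes_reserved;  /* space backing allocated extents */
    bool underflowed;         /* set where the kernel would WARN */
};

/* Hypothetical helper: account a reservation made ahead of a write. */
static void toy_may_use_add(struct toy_space_info *s, uint64_t n)
{
    s->bytes_may_use += n;
}

/* Hypothetical helper: drop a reservation, flagging an underflow. */
static void toy_may_use_sub(struct toy_space_info *s, uint64_t n)
{
    if (s->bytes_may_use < n) {
        s->underflowed = true;
        s->bytes_may_use = 0;
        return;
    }
    s->bytes_may_use -= n;
}

/* Models the effect of an extent allocation: may_use moves to reserved. */
static void toy_add_reserved(struct toy_space_info *s, uint64_t n)
{
    toy_may_use_sub(s, n);
    s->bytes_reserved += n;
}

/*
 * Replays cache setup followed by writeback that falls back to COW.
 * Returns true if the accounting stayed balanced.
 */
static bool toy_cache_writeout(struct toy_space_info *s, uint64_t len,
                               bool with_fix)
{
    assert(len > 0);

    /* Setup: reserve, then the preallocation consumes the reservation. */
    toy_may_use_add(s, len);
    toy_add_reserved(s, len);

    /* Writeback: the block group is RO, so we fall back to COW. */
    if (with_fix)
        toy_may_use_add(s, len);  /* the compensation the commit adds */
    toy_add_reserved(s, len);     /* new extent allocated by the COW path */

    return !s->underflowed;
}
```

Without the compensation the second allocation subtracts from a counter that is already zero; with it, the counter ends at zero and both extents are accounted as reserved.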
    • btrfs: fix space_info bytes_may_use underflow after nocow buffered write · 467dc47e
      Committed by Filipe Manana
      When doing a buffered write we always try to reserve data space for it,
      even when the file has the NOCOW bit set or the write falls into a file
      range covered by a prealloc extent. This is done both because it is
      expensive to check if we can do a nocow write (checking if an extent is
      shared through reflinks or if there's a hole in the range for example),
      and because when writeback starts we might actually need to fall back to
      COW mode (for example the block group containing the target extents was
      turned into RO mode due to a scrub or balance).
      
      When we are unable to reserve data space we check if we can do a nocow
      write, and if we can, we proceed with dirtying the pages and setting up
      the range for delalloc. In this case the bytes_may_use counter of the
      data space_info object is not incremented, unlike in the case where we
      are able to reserve data space (done through btrfs_check_data_free_space()
      which calls btrfs_alloc_data_chunk_ondemand()).
      
      Later when running delalloc we attempt to start writeback in nocow mode
      but we might fall back to cow mode, for example because in the meantime
      a block group was turned into RO mode by a scrub or relocation. The cow
      path, after successfully allocating an extent, ends up calling
      btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of
      the data space_info object to have been incremented before - but we did
      not do it when the buffered write started, since there was not enough
      available data space. So btrfs_add_reserved_bytes() ends up decrementing
      the bytes_may_use counter anyway, and when the counter's current value
      is smaller than the size of the allocated extent we get a stack trace
      like the following:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...)
       CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       Workqueue: writeback wb_workfn (flush-btrfs-1754)
       RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs]
       Code: ff ff 48 (...)
       RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287
       RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410
       RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000
       R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400
       R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800
       FS:  0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        find_free_extent+0x4a0/0x16c0 [btrfs]
        btrfs_reserve_extent+0x91/0x180 [btrfs]
        cow_file_range+0x12d/0x490 [btrfs]
        run_delalloc_nocow+0x341/0xa40 [btrfs]
        btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs]
        ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        writepage_delalloc+0xe8/0x150 [btrfs]
        __extent_writepage+0xe8/0x4c0 [btrfs]
        extent_write_cache_pages+0x237/0x530 [btrfs]
        ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs]
        extent_writepages+0x44/0xa0 [btrfs]
        do_writepages+0x23/0x80
        __writeback_single_inode+0x59/0x700
        writeback_sb_inodes+0x267/0x5f0
        __writeback_inodes_wb+0x87/0xe0
        wb_writeback+0x382/0x590
        ? wb_workfn+0x4a2/0x6c0
        wb_workfn+0x4a2/0x6c0
        process_one_work+0x26d/0x6a0
        worker_thread+0x4f/0x3e0
        ? process_one_work+0x6a0/0x6a0
        kthread+0x103/0x140
        ? kthread_create_worker_on_cpu+0x70/0x70
        ret_from_fork+0x3a/0x50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffff94ebdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace f9f6ef8ec4cd8ec9 ]---
      
      So to fix this, when falling back into cow mode check whether space was
      reserved, by testing for the bit EXTENT_NORESERVE in the respective file
      range, and if it was not, increment the bytes_may_use counter for the data
      space_info object. Also clear the EXTENT_NORESERVE bit from the range, so
      that if the cow path fails it decrements the bytes_may_use counter when
      clearing the delalloc range (through the btrfs_clear_delalloc_extent()
      callback).
      
      Fixes: 7ee9e440 ("Btrfs: check if we can nocow if we don't have data space")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
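The role of the EXTENT_NORESERVE bit in this fix can be illustrated with a toy model. Everything here is an assumption made for illustration: `TOY_EXTENT_NORESERVE`, `toy_range` and `toy_counter` are hypothetical names, and the bit value is arbitrary rather than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_EXTENT_NORESERVE (1u << 0)  /* hypothetical bit value */

/* Hypothetical file range carrying per-range state bits. */
struct toy_range {
    uint64_t len;
    unsigned int bits;
};

/* Hypothetical stand-in for the data space_info counter. */
struct toy_counter {
    uint64_t bytes_may_use;
    bool underflowed;  /* set where the kernel would WARN */
};

/* Buffered write: if reserving fails but NOCOW is possible, mark the range. */
static void toy_buffered_write(struct toy_counter *c, struct toy_range *r,
                               bool can_reserve)
{
    assert(r->len > 0);
    if (can_reserve)
        c->bytes_may_use += r->len;
    else
        r->bits |= TOY_EXTENT_NORESERVE;
}

/* Writeback could not use NOCOW and falls back to COW. */
static void toy_fallback_to_cow(struct toy_counter *c, struct toy_range *r,
                                bool with_fix)
{
    /* The fix: account the space now and clear the marker bit. */
    if (with_fix && (r->bits & TOY_EXTENT_NORESERVE)) {
        c->bytes_may_use += r->len;
        r->bits &= ~TOY_EXTENT_NORESERVE;
    }

    /* Extent allocation consumes the reservation. */
    if (c->bytes_may_use < r->len) {
        c->underflowed = true;
        c->bytes_may_use = 0;
    } else {
        c->bytes_may_use -= r->len;
    }
}
```

Clearing the bit matters as much as the increment: once it is cleared, a later failure in the COW path releases the range through the normal path that decrements the counter.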
    • btrfs: fix wrong file range cleanup after an error filling dealloc range · e2c8e92d
      Committed by Filipe Manana
      If an error happens while running delalloc in COW mode for a range, we can
      end up calling extent_clear_unlock_delalloc() for a range that goes beyond
      our range's end offset by 1 byte, which affects 1 extra page. This results
      in clearing bits and doing page operations (such as a page unlock) outside
      our target range.
      
      Fix that by calling extent_clear_unlock_delalloc() with an inclusive end
      offset, instead of an exclusive end offset, at cow_file_range().
      
      Fixes: a315e68f ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
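The off-by-one is easy to see with a little arithmetic. The helper below is a hypothetical illustration, assuming a 4096-byte page; it counts the pages covered by a byte range whose end offset is inclusive.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed page size for the illustration */

/* Number of pages touched by the byte range [start, end_inclusive]. */
static uint64_t toy_pages_in_range(uint64_t start, uint64_t end_inclusive)
{
    assert(end_inclusive >= start);
    return end_inclusive / TOY_PAGE_SIZE - start / TOY_PAGE_SIZE + 1;
}
```

For a range of 8192 bytes starting at offset 0, the inclusive end is 8191 and two pages are covered. Passing the exclusive end 8192 to a function that expects an inclusive end covers three pages, which is the extra page the commit message refers to.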
    • btrfs: split btrfs_direct_IO to read and write part · d8f3e735
      Committed by Christoph Hellwig
      The read and write versions don't have anything in common except for the
      call to iomap_dio_rw.  So split this function, and merge each half into
      its only caller.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK · 5f008163
      Committed by Goldwyn Rodrigues
      Since we now perform direct reads using i_rwsem, we can remove this
      inode flag used to co-ordinate unlocked reads.
      
      The truncate call takes i_rwsem. This means it is correctly synchronized
      with concurrent direct reads.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <jth@kernel.org>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch to iomap_dio_rw() for dio · a43a67a2
      Committed by Goldwyn Rodrigues
      Switch from __blockdev_direct_IO() to iomap_dio_rw().
      Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it
      as iomap_begin() for iomap direct I/O functions. This function
      allocates and locks all the blocks required for the I/O.
      btrfs_submit_direct() is used as the submit_io() hook for direct I/O
      ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore so set it to noop.
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      BTRFS direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
       - for writes, call __endio_write_update_ordered()
       - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes the last use of struct buffer_head from btrfs.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 25 May, 2020 18 commits
    • btrfs: simplify iget helpers · 0202e83f
      Committed by David Sterba
      The inode lookup starting at btrfs_iget takes the full location key,
      while only the objectid is used to match the inode, because the lookup
      happens inside the given root thus the inode number is unique.
      The entire location key is properly set up in btrfs_init_locked_inode.
      
      Simplify the helpers and pass only the inode number, renaming it to 'ino'
      instead of 'objectid'. This allows removing the temporary key variables,
      saving some stack space.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify root lookup by id · 56e9357a
      Committed by David Sterba
      The main function to look up a root by its id, btrfs_get_fs_root, takes the
      whole key while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root, which does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: inode: cleanup the log-tree exceptions in btrfs_truncate_inode_items() · 82028e0a
      Committed by Qu Wenruo
      There are a lot of root owner checks in btrfs_truncate_inode_items()
      like:
      
      	if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) ||
      	    root == fs_info->tree_root)
      
      But consider that only these trees can have INODE_ITEMs:
      
      - tree root (for v1 space cache)
      - subvolume trees
      - tree reloc trees
      - data reloc tree
      - log trees
      
      Since subvolume, tree reloc and data reloc trees all have the SHAREABLE
      bit, and we're checking the tree root manually, the above check is just
      excluding log trees.
      
      This patch replaces two such checks with a simpler one:
      
      	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
      
      This also merges the btrfs_drop_extent_cache() and lock_extent_bits() calls
      into the same if branch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
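The claim that the old compound check and the new single check select the same trees can be verified by enumerating the five tree kinds listed in the message. The enum and helpers below are hypothetical; they only encode what the commit message states about which trees carry the SHAREABLE bit.

```c
#include <assert.h>
#include <stdbool.h>

/* The tree kinds that can hold INODE_ITEMs, per the commit message. */
enum toy_root_kind {
    TOY_TREE_ROOT,
    TOY_SUBVOLUME,
    TOY_TREE_RELOC,
    TOY_DATA_RELOC,
    TOY_LOG_TREE,
    TOY_NR_KINDS
};

/* Subvolume, tree reloc and data reloc trees have the SHAREABLE bit. */
static bool toy_is_shareable(enum toy_root_kind kind)
{
    assert(kind < TOY_NR_KINDS);
    return kind == TOY_SUBVOLUME || kind == TOY_TREE_RELOC ||
           kind == TOY_DATA_RELOC;
}

/* Shape of the old check: SHAREABLE bit set, or the tree root. */
static bool toy_old_check(enum toy_root_kind kind)
{
    return toy_is_shareable(kind) || kind == TOY_TREE_ROOT;
}

/* Shape of the new check: anything that is not a log tree. */
static bool toy_new_check(enum toy_root_kind kind)
{
    return kind != TOY_LOG_TREE;
}

/* True if both checks agree for every tree kind. */
static bool toy_checks_agree(void)
{
    for (int k = 0; k < TOY_NR_KINDS; k++) {
        if (toy_old_check((enum toy_root_kind)k) !=
            toy_new_check((enum toy_root_kind)k))
            return false;
    }
    return true;
}
```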
    • btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Committed by Qu Wenruo
      The name BTRFS_ROOT_REF_COWS is not very clear about its meaning.
      
      In fact, that bit can only be set on these trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      No other tree gets this bit set. So it follows that roots with this bit
      set can have tree blocks shared with other trees, either by snapshots or
      by reloc roots (a special snapshot created by relocation).
      
      This patch renames BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and updates all comments mentioning
      "reference counted" to follow the rename.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop eb parameter from set/get token helpers · cc4c13d5
      Committed by David Sterba
      Now that all set/get helpers use the eb from the token, we don't need to
      pass it to many btrfs_token_*/btrfs_set_token_* helpers, saving some
      stack space.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: speedup dead root detection during orphan cleanup · a619b3c7
      Committed by Robbie Ko
      When mounting, we handle deleted subvolumes and orphan items. First, we
      find all orphan roots and add them to the fs_roots radix tree. Second, in
      the tree root, we process each orphan item, skipping it if it is a dead
      root.
      
      The original algorithm walks the list of dead_roots one by one to check
      whether the objectid matches, so the time complexity is O(n^2). When
      processing 50000 deleted subvols, it takes about 120s.
      
      btrfs_find_orphan_roots has already run before us and added the deleted
      subvols to the fs_roots radix tree.
      
      The fs root will only be removed from the fs_roots radix tree after the
      cleaner process is started, and the cleaner will only start execution
      after the mount is complete.
      
      btrfs_orphan_cleanup can be called during the whole filesystem mount
      lifetime, but only "tree root" will be used in this section of code, and
      the tree root is only passed in at mount time.
      
      So we can quickly check whether the orphan item is a dead root through the
      fs_roots radix tree.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
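The difference between the two lookup strategies can be sketched in userspace. This is not the kernel code: a sorted array searched with the standard `bsearch()` stands in for the fs_roots radix tree, and the `toy_*` names are hypothetical. The point is only that replacing a per-item list walk with an indexed lookup removes the quadratic behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Old shape: walk the whole list for every orphan item, O(n) per lookup. */
static bool toy_is_dead_root_scan(const uint64_t *dead, size_t n,
                                  uint64_t objectid)
{
    assert(dead != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (dead[i] == objectid)
            return true;
    }
    return false;
}

/* Comparison callback for bsearch() over uint64_t keys. */
static int toy_cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * New shape: query an index. A sorted array with binary search is used
 * here as a stand-in for the radix tree lookup.
 */
static bool toy_is_dead_root_indexed(const uint64_t *sorted, size_t n,
                                     uint64_t objectid)
{
    assert(sorted != NULL || n == 0);
    if (n == 0)
        return false;
    return bsearch(&objectid, sorted, n, sizeof(*sorted),
                   toy_cmp_u64) != NULL;
}
```

Both functions return the same answers for a sorted input; only the cost per lookup differs.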
    • btrfs: use crypto_shash_digest() instead of open coding · fd08001f
      Committed by Eric Biggers
      Use crypto_shash_digest() instead of crypto_shash_init() +
      crypto_shash_update() + crypto_shash_final().  This is more efficient.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unify buffered and direct I/O read repair · 77d5d689
      Committed by Omar Sandoval
      Currently, direct I/O has its own versions of bio_readpage_error() and
      btrfs_check_repairable() (dio_read_error() and
      btrfs_check_dio_repairable(), respectively). The main difference is that
      the direct I/O version doesn't do read validation. The rework of direct
      I/O repair makes it possible to do validation, so we can get rid of
      btrfs_check_dio_repairable() and combine bio_readpage_error() and
      dio_read_error() into a new helper, btrfs_submit_read_repair().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of endio_repair_workers · 5c047a69
      Committed by Omar Sandoval
      This was originally added in commit 8b110e39 ("Btrfs: implement
      repair function when direct read fails") to avoid a deadlock. In that
      commit, the direct I/O read endio executes on the endio_workers
      workqueue, submits a repair bio, and waits for it to complete. The
      repair bio endio must execute on a different workqueue, otherwise it
      could block on the endio_workers workqueue becoming available, which
      won't happen because the original endio is blocked on the repair bio.
      
      As of the previous commit, the original endio doesn't wait for the
      repair bio, so this separate workqueue is unnecessary.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify direct I/O read repair · fd9d6670
      Committed by Omar Sandoval
      Direct I/O read repair was originally implemented in commit 8b110e39
      ("Btrfs: implement repair function when direct read fails"). This
      implementation is unnecessarily complicated. There is major code
      duplication between __btrfs_subio_endio_read() (checks checksums and
      handles I/O errors for files with checksums),
      __btrfs_correct_data_nocsum() (handles I/O errors for files without
      checksums), btrfs_retry_endio() (checks checksums and handles I/O errors
      for retries of files with checksums), and btrfs_retry_endio_nocsum()
      (handles I/O errors for retries of files without checksums). If it sounds
      like these should be one function, that's because they should.
      Additionally, these functions are very hard to follow due to their
      excessive use of goto.
      
      This commit replaces the original implementation. After the previous
      commit getting rid of orig_bio, we can reuse the same endio callback for
      repair I/O and the original I/O, we just need to track the file offset
      and original iterator in the repair bio. We can also unify the handling
      of files with and without checksums and simplify the control flow. We
      also no longer have to wait for each repair I/O to complete one by one.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of one layer of bios in direct I/O · 769b4f24
      Committed by Omar Sandoval
      In the worst case, there are _4_ layers of bios in the Btrfs direct I/O
      path:
      
      1. The bio created by the generic direct I/O code (dio_bio).
      2. A clone of dio_bio we create in btrfs_submit_direct() to represent
         the entire direct I/O range (orig_bio).
      3. A partial clone of orig_bio limited to the size of a RAID stripe that
         we create in btrfs_submit_direct_hook().
      4. Clones of each of those split bios for each RAID stripe that we
         create in btrfs_map_bio().
      
      As of the previous commit, the second layer (orig_bio) is no longer
      needed for anything: we can split dio_bio instead, and complete dio_bio
      directly when all of the cloned bios complete. This lets us clean up a
      bunch of cruft, including dip->subio_endio and dip->errors (we can use
      dio_bio->bi_status instead). It also enables the next big cleanup of
      direct I/O read repair.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: put direct I/O checksums in btrfs_dio_private instead of bio · 85879573
      Committed by Omar Sandoval
      The next commit will get rid of btrfs_dio_private->orig_bio. The only
      thing we really need it for is containing all of the checksums, but we
      can easily put the checksum array in btrfs_dio_private and have the
      submitted bios reference the array. We can also look the checksums up
      while we're setting up instead of the current awkward logic that looks
      them up for orig_bio when the first split bio is submitted.
      
      (Interestingly, btrfs_dio_private did contain the
      checksums before commit 23ea8e5a ("Btrfs: load checksum data once
      when submitting a direct read io"), but it didn't look them up up
      front.)
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert btrfs_dio_private->pending_bios to refcount_t · e3b318d1
      Committed by Omar Sandoval
      This is really a reference count now, so convert it to refcount_t and
      rename it to refs.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused btrfs_dio_private::private · 2390a6da
      Committed by Omar Sandoval
      We haven't used this since commit 9be3395b ("Btrfs: use a btrfs
      bioset instead of abusing bio internals").
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename __readpage_endio_check to check_data_csum · 47df7765
      Committed by Omar Sandoval
      __readpage_endio_check() is also used from the direct I/O read code, so
      give it a more descriptive name.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix double __endio_write_update_ordered in direct I/O · c36cac28
      Committed by Omar Sandoval
      In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
      we complete the ordered extent range. However, we don't mark that the
      range doesn't need to be cleaned up from btrfs_direct_IO() until later.
      Therefore, if we fail to allocate the btrfs_dio_private, we complete the
      ordered extent range twice. We could fix this by updating
      unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
      that creating the btrfs_dio_private and submitting the bios are
      separate, and once the btrfs_dio_private is created, cleanup always
      happens through the btrfs_dio_private.
      
      The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
      is really subtle. We have the following:
      
        1. btrfs_direct_IO sets those two to the same value.
      
        2. When we call __blockdev_direct_IO, unless
           btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
           modify unsubmitted_oe_range_start so that start < end, cleanup
           won't happen.
      
        3. We come into btrfs_submit_direct - if dip allocation fails we'd
           return with oe_range_end now modified so cleanup will happen.
      
        4. If we manage to allocate the dip we reset the unsubmitted range
           members to be equal so that cleanup happens from
           btrfs_endio_direct_write.
      
      This 4-step logic is not really obvious, especially given it's scattered
      across 3 functions.
      
      Fixes: f28a4928 ("Btrfs: fix leaking of ordered extents after direct IO write error")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      [ add range start/end logic explanation from Nikolay ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix error handling when submitting direct I/O bio · 6d3113a1
      Committed by Omar Sandoval
      In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID
      stripe or chunk, we submit orig_bio without cloning it. In this case, we
      don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we
      decrement pending_bios to -1, and we never complete orig_bio. Fix it by
      initializing pending_bios to 1 instead of incrementing later.
      
      Fixing this exposes another bug: we put orig_bio prematurely and then
      put it again from end_io. Fix it by not putting orig_bio.
      
      After this change, pending_bios is really more of a reference count, but
      I'll leave that cleanup separate to keep the fix small.
      
      Fixes: e65e1535 ("btrfs: fix panic caused by direct IO")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
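The idea of starting the counter at 1 instead of incrementing it later can be shown with a minimal model. The names `toy_dip`, `toy_dip_put` and `toy_submit` are hypothetical, and a plain `int` replaces the kernel's atomic counter since this sketch is single-threaded. The submitter holds one reference of its own, so a failed submission can never drive the count below zero and the original bio is completed exactly once.

```c
#include <assert.h>

/* Hypothetical stand-in for the direct I/O private structure. */
struct toy_dip {
    int refs;         /* plain int here; the kernel uses an atomic type */
    int completions;  /* how many times the original bio was completed */
};

/* Drop one reference; the last put completes the original bio. */
static void toy_dip_put(struct toy_dip *dip)
{
    assert(dip->refs > 0);
    if (--dip->refs == 0)
        dip->completions++;
}

/*
 * Submit nr_bios bios; the bio at index fail_at fails at submission time
 * (pass -1 for no failure). Returns the number of completions observed.
 */
static int toy_submit(int nr_bios, int fail_at)
{
    struct toy_dip dip = { .refs = 1, .completions = 0 };  /* submitter's ref */
    int submitted = 0;

    for (int i = 0; i < nr_bios; i++) {
        dip.refs++;              /* reference held by this bio */
        if (i == fail_at) {
            toy_dip_put(&dip);   /* failed submission drops its own ref */
            break;
        }
        submitted++;
    }

    /* Each submitted bio's end_io drops its reference. */
    for (int i = 0; i < submitted; i++)
        toy_dip_put(&dip);

    toy_dip_put(&dip);           /* submitter drops the initial reference */
    return dip.completions;
}
```

Whether every bio succeeds, one fails midway, or the only bio fails, the completion count is 1.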
    • btrfs: improve global reserve stealing logic · 7f9fe614
      Committed by Josef Bacik
      For unlink transactions and block group removal
      btrfs_start_transaction_fallback_global_rsv will first try to start an
      ordinary transaction and if it fails it will fall back to reserving the
      required amount by stealing from the global reserve. This is problematic
      for the same reason as previous iterations of the ENOSPC handling: the
      thundering herd.  We get a bunch of failures all at once, everybody tries
      to allocate from the global reserve, some win and some lose, and we get
      an ENOSPC.
      
      Fix this behavior by introducing BTRFS_RESERVE_FLUSH_ALL_STEAL. It's
      used to mark unlink reservation. To fix this we need to integrate this
      logic into the normal ENOSPC infrastructure.  We still go through all of
      the normal flushing work, and at the moment we begin to fail all the
      tickets we try to satisfy any tickets that are allowed to steal by
      stealing from the global reserve.  If this works we start the flushing
      system over again just like we would with a normal ticket satisfaction.
      This serializes our global reserve stealing, so we don't have the
      thundering herd problem.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
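The serialization described in this commit can be pictured as a single loop that decides, ticket by ticket, who may steal. The sketch below is a simplified assumption, not the kernel logic: `toy_ticket` and `toy_fail_tickets` are hypothetical, the global reserve is a bare byte count, and the real code applies further conditions before allowing a steal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reservation ticket; not the kernel's ticket structure. */
struct toy_ticket {
    uint64_t bytes;  /* bytes the waiter still needs */
    bool may_steal;  /* models a BTRFS_RESERVE_FLUSH_ALL_STEAL reservation */
    int error;       /* 0 once satisfied, -ENOSPC once failed */
};

/*
 * Runs once flushing has no options left. Tickets are handled in queue
 * order from one place, so stealing from the global reserve is serialized.
 * Returns how many tickets were satisfied by stealing.
 */
static size_t toy_fail_tickets(struct toy_ticket *tickets, size_t n,
                               uint64_t *global_rsv)
{
    size_t satisfied = 0;

    assert(tickets != NULL && global_rsv != NULL);
    for (size_t i = 0; i < n; i++) {
        struct toy_ticket *t = &tickets[i];

        if (t->may_steal && *global_rsv >= t->bytes) {
            *global_rsv -= t->bytes;  /* steal from the global reserve */
            t->bytes = 0;
            t->error = 0;
            satisfied++;
        } else {
            t->error = -ENOSPC;
        }
    }
    return satisfied;
}
```

Because one caller walks the queue, waiters no longer race each other for the reserve; each ticket either gets its bytes or fails deterministically.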
  4. 24 Mar, 2020 15 commits