1. 24 3月, 2020 14 次提交
    • J
      btrfs: move the root freeing stuff into btrfs_put_root · 8c38938c
      Josef Bacik 提交于
      There are a few different ways to free roots, either you allocated them
      yourself and you just do
      
      free_extent_buffer(root->node);
      free_extent_buffer(root->commit_node);
      btrfs_put_root(root);
      
      Which is the pattern for log roots.  Or for snapshots/subvolumes that
      are being dropped you simply call btrfs_free_fs_root() which does all
      the cleanup for you.
      
      Unify this all into btrfs_put_root(), so that we don't free up things
      associated with the root until the last reference is dropped.  This
      makes the root freeing code much more significant.
      
      The only caveat is at close_ctree() time we have to free the extent
      buffers for all of our main roots (extent_root, chunk_root, etc) because
      we have to drop the btree_inode and we'll run into issues if we hold
      onto those nodes until ->kill_sb() time.  This will be addressed in the
      future when we kill the btree_inode.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8c38938c
    • J
      btrfs: make the extent buffer leak check per fs info · 3fd63727
      Josef Bacik 提交于
      I'm going to make the entire destruction of btrfs_root's controlled by
      their refcount, so it will be helpful to notice if we're leaking their
      eb's on umount.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3fd63727
    • Q
      btrfs: Don't submit any btree write bio if the fs has errors · b3ff8f1d
      Qu Wenruo 提交于
      [BUG]
      There is a fuzzed image which could cause KASAN report at unmount time.
      
        BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
        Read of size 8 at addr ffff888067cf6848 by task umount/1922
      
        CPU: 0 PID: 1922 Comm: umount Tainted: G        W         5.0.21 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
        Call Trace:
         dump_stack+0x5b/0x8b
         print_address_description+0x70/0x280
         kasan_report+0x13a/0x19b
         btrfs_queue_work+0x2c1/0x390
         btrfs_wq_submit_bio+0x1cd/0x240
         btree_submit_bio_hook+0x18c/0x2a0
         submit_one_bio+0x1be/0x320
         flush_write_bio.isra.41+0x2c/0x70
         btree_write_cache_pages+0x3bb/0x7f0
         do_writepages+0x5c/0x130
         __writeback_single_inode+0xa3/0x9a0
         writeback_single_inode+0x23d/0x390
         write_inode_now+0x1b5/0x280
         iput+0x2ef/0x600
         close_ctree+0x341/0x750
         generic_shutdown_super+0x126/0x370
         kill_anon_super+0x31/0x50
         btrfs_kill_super+0x36/0x2b0
         deactivate_locked_super+0x80/0xc0
         deactivate_super+0x13c/0x150
         cleanup_mnt+0x9a/0x130
         task_work_run+0x11a/0x1b0
         exit_to_usermode_loop+0x107/0x130
         do_syscall_64+0x1e5/0x280
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [CAUSE]
      The fuzzed image has a completely screwd up extent tree:
      
        leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
        refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
                item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 259 offset 0 count 1
                item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 271 offset 0 count 1
                item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
                        extent refs 1 gen 9 flags 1
                        ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
                item 3 key (29360128 169 0) itemoff 3803 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
                item 4 key (29368320 169 1) itemoff 3770 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
                item 5 key (29372416 169 0) itemoff 3737 itemsize 33
                        extent refs 1 gen 9 flags 2
                        ref#0: tree block backref root 5
      
      Note that leaf 29421568 doesn't have its backref in the extent tree.
      Thus extent allocator can re-allocate leaf 29421568 for other trees.
      
      In short, the bug is caused by:
      
      - Existing tree block gets allocated to log tree
        This got its generation bumped.
      
      - Log tree balance cleaned dirty bit of offending tree block
        It will not be written back to disk, thus no WRITTEN flag.
      
      - Original owner of the tree block gets COWed
        Since the tree block has higher transid, no WRITTEN flag, it's reused,
        and not traced by transaction::dirty_pages.
      
      - Transaction aborted
        Tree blocks get cleaned according to transaction::dirty_pages. But the
        offending tree block is not recorded at all.
      
      - Filesystem unmount
        All pages are assumed to be are clean, destroying all workqueue, then
        call iput(btree_inode).
        But offending tree block is still dirty, which triggers writeback, and
        causes use-after-free bug.
      
      The detailed sequence looks like this:
      
      - Initial status
        eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
            not traced by any dirty extent_iot_tree.
      
      - New tree block is allocated
        Since there is no backref for 29421568, it's re-allocated as new tree
        block.
        Keep in mind that tree block 29421568 is still referred by extent
        tree.
      
      - Tree block 29421568 is filled for log tree
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
            traced by btrfs_root::dirty_log_pages
      
      - Some log tree operations
        Since the fs is using node size 4096, the log tree can easily go a
        level higher.
      
      - Log tree needs balance
        Tree block 29421568 gets all its content pushed to right, thus now
        it is empty, and we don't need it.
        btrfs_clean_tree_block() from __push_leaf_right() get called.
      
        eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
            traced by btrfs_root::dirty_log_pages
      
      - Log tree write back
        btree_write_cache_pages() goes through dirty pages ranges, but since
        page of tree block 29421568 gets cleaned already, it's not written
        back to disk. Thus it doesn't have WRITTEN bit set.
        But ranges in dirty_log_pages are cleared.
      
        eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
            not traced by any dirty extent_iot_tree.
      
      - Extent tree update when committing transaction
        Since tree block 29421568 has transid equal to running trans, and has
        no WRITTEN bit, should_cow_block() will use it directly without adding
        it to btrfs_transaction::dirty_pages.
      
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
            not traced by any dirty extent_iot_tree.
      
        At this stage, we're doomed. We have a dirty eb not tracked by any
        extent io tree.
      
      - Transaction gets aborted due to corrupted extent tree
        Btrfs cleans up dirty pages according to transaction::dirty_pages and
        btrfs_root::dirty_log_pages.
        But since tree block 29421568 is not tracked by neither of them, it's
        still dirty.
      
        eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
            not traced by any dirty extent_iot_tree.
      
      - Filesystem unmount
        Since all cleanup is assumed to be done, all workqueus are destroyed.
        Then iput(btree_inode) is called, expecting no dirty pages.
        But tree 29421568 is still dirty, thus triggering writeback.
        Since all workqueues are already freed, we cause use-after-free.
      
      This shows us that, log tree blocks + bad extent tree can cause wild
      dirty pages.
      
      [FIX]
      To fix the problem, don't submit any btree write bio if the filesytem
      has any error.  This is the last safe net, just in case other cleanup
      haven't caught catch it.
      
      Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b3ff8f1d
    • J
      btrfs: Add missing lock annotation for release_extent_buffer() · 5ce48d0f
      Jules Irenge 提交于
      Sparse reports a warning at release_extent_buffer()
      warning: context imbalance in release_extent_buffer() - unexpected unlock
      
      The root cause is the missing annotation at release_extent_buffer()
      Add the missing __releases(&eb->refs_lock) annotation
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJules Irenge <jbi.octave@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5ce48d0f
    • F
      Btrfs: avoid unnecessary splits when setting bits on an extent io tree · 55ffaabe
      Filipe Manana 提交于
      When attempting to set bits on a range of an exent io tree that already
      has those bits set we can end up splitting an extent state record, use
      the preallocated extent state record, insert it into the red black tree,
      do another search on the red black tree, merge the preallocated extent
      state record with the previous extent state record, remove that previous
      record from the red black tree and then free it. This is all unnecessary
      work that consumes time.
      
      This happens specifically at the following case at __set_extent_bit():
      
        $ cat -n fs/btrfs/extent_io.c
         957  static int __must_check
         958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
        (...)
        1044          /*
        1045           *     | ---- desired range ---- |
        1046           * | state |
        1047           *   or
        1048           * | ------------- state -------------- |
        1049           *
        (...)
        1060          if (state->start < start) {
        1061                  if (state->state & exclusive_bits) {
        1062                          *failed_start = start;
        1063                          err = -EEXIST;
        1064                          goto out;
        1065                  }
        1066
        1067                  prealloc = alloc_extent_state_atomic(prealloc);
        1068                  BUG_ON(!prealloc);
        1069                  err = split_state(tree, state, prealloc, start);
        1070                  if (err)
        1071                          extent_io_tree_panic(tree, err);
        1072
        1073                  prealloc = NULL;
      
      So if our extent state represents a range from 0 to 1MiB for example, and
      we want to set bits in the range 128KiB to 256KiB for example, and that
      extent state record already has all those bits set, we end up splitting
      that record, so we end up with extent state records in the tree which
      represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
      temporary because a subsequent iteration in that function will end up
      merging the records.
      
      The splitting requires using the preallocated extent state record, so
      a future iteration that needs to do another split will need to allocate
      another extent state record in an atomic context, something not ideal
      that we try to avoid as much as possible. The splitting also requires
      an insertion in the red black tree, and a subsequent merge will require
      a deletion from the red black tree and freeing an extent state record.
      
      This change just skips the splitting of an extent state record when it
      already has all the bits the we need to set.
      
      Setting a bit that is already set for a range is very common in the
      inode's 'file_extent_tree' extent io tree for example, where we keep
      setting the EXTENT_DIRTY bit every time we replace an extent.
      
      This change also fixes a bug that happens after the recent patchset from
      Josef that avoids having implicit holes after a power failure when not
      using the NO_HOLES feature, more specifically the patch with the subject:
      
        "btrfs: introduce the inode->file_extent_tree"
      
      This patch introduced an extent io tree per inode to keep track of
      completed ordered extents and figure out at any time what is the safe
      value for the inode's disk_i_size. This assumes that for contiguous
      ranges in a file we always end up with a single extent state record in
      the io tree, but that is not the case, as there is a short time window
      where we can have two extent state records representing contiguous
      ranges. When this happens we end setting up an incorrect value for the
      inode's disk_i_size, resulting in data loss after a clean unmount
      of the filesystem. The following example explains how this can happen.
      
      Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
      the inode's file_extent_tree we have a single extent state record that
      represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
      following steps happen:
      
      1) A buffered write against file range [512KiB, 768KiB) is made. At this
         point delalloc was not flushed yet;
      
      2) Deduplication from some other inode into this inode's range
         [128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
         to be called, from btrfs_insert_clone_extent(), to mark the range
         [128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
      
      3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
         end up at __set_extent_bit(). In the first iteration of that function's
         loop we end up in the following branch:
      
         $ cat -n fs/btrfs/extent_io.c
          957  static int __must_check
          958  __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
         (...)
         1044          /*
         1045           *     | ---- desired range ---- |
         1046           * | state |
         1047           *   or
         1048           * | ------------- state -------------- |
         1049           *
         (...)
         1060          if (state->start < start) {
         1061                  if (state->state & exclusive_bits) {
         1062                          *failed_start = start;
         1063                          err = -EEXIST;
         1064                          goto out;
         1065                  }
         1066
         1067                  prealloc = alloc_extent_state_atomic(prealloc);
         1068                  BUG_ON(!prealloc);
         1069                  err = split_state(tree, state, prealloc, start);
         1070                  if (err)
         1071                          extent_io_tree_panic(tree, err);
         1072
         1073                  prealloc = NULL;
         (...)
         1089                  goto search_again;
      
         This splits the state record into two, one for range [0, 128KiB) and
         another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
         bit set. Then we jump to the 'search_again' label, where we unlock the
         the spinlock protecting the extent io tree before jumping to the
         'again' label to perform the next iteration;
      
      4) In the meanwhile, delalloc is flushed, the ordered extent for the range
         [512KiB, 768KiB) is created and when it completes, at
         btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
         with a value of 0 for its 'new_size' argument;
      
      5) Before the deduplication task currently at __set_extent_bit() moves to
         the next iteration, the task finishing the ordered extent calls
         find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
         and gets 'start' set to 0 and 'end' set to 128KiB - because at this
         moment the io tree has two extent state records, one representing the
         range [0, 128KiB) and another representing the range [128KiB, 1MiB),
         both with EXTENT_DIRTY set. Then we set 'isize' to:
      
         isize = min(isize, end + 1)
               = min(1MiB, 128KiB - 1 + 1)
               = 128KiB
      
         Then we set the inode's disk_i_size to 128KiB (isize).
      
         After a clean unmount of the filesystem and mounting it again, we have
         the file with a size of 128KiB, and effectively lost all the data it
         had before in the range from 128KiB to 1MiB.
      
      This change fixes that issue too, as we never end up splitting extent
      state records when they already have all the bits we want set.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55ffaabe
    • D
      btrfs: sink argument tree to __do_readpage · f657a31c
      David Sterba 提交于
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f657a31c
    • D
      btrfs: sink arugment tree to contiguous_readpages · b6660e80
      David Sterba 提交于
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b6660e80
    • D
      btrfs: sink argument tree to __extent_read_full_page · 0d44fea7
      David Sterba 提交于
      The tree pointer can be safely read from the inode, use it and drop the
      redundant argument.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d44fea7
    • D
      btrfs: sink argument tree to extent_read_full_page · 71ad38b4
      David Sterba 提交于
      The tree pointer can be safely read from the page's inode, use it and
      drop the redundant argument.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      71ad38b4
    • D
      btrfs: drop argument tree from btrfs_lock_and_flush_ordered_range · b272ae22
      David Sterba 提交于
      The tree pointer can be safely read from the inode so we can drop the
      redundant argument from btrfs_lock_and_flush_ordered_range.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b272ae22
    • D
      btrfs: add assertions for tree == inode->io_tree to extent IO helpers · ae6957eb
      David Sterba 提交于
      Add assertions to all helpers that get tree as argument and verify that
      it's the same that can be obtained from the inode or from its pages. In
      followup patches the redundant arguments and assertions will be removed
      one by one.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ae6957eb
    • D
      btrfs: drop argument tree from submit_extent_page · 0ceb34bf
      David Sterba 提交于
      Now that we're sure the tree from argument is same as the one we can get
      from the page's inode io_tree, drop the redundant argument.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0ceb34bf
    • D
      btrfs: remove extent_page_data::tree · 45b08405
      David Sterba 提交于
      All functions that set up extent_page_data::tree set it to the inode
      io_tree. That's passed down the callstack that accesses either the same
      inode or its pages. In the end submit_extent_page can pull the tree out
      of the page and we don't have to store it in the structure.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      45b08405
    • J
      btrfs: introduce per-inode file extent tree · 41a2ee75
      Josef Bacik 提交于
      In order to keep track of where we have file extents on disk, and thus
      where it is safe to adjust the i_size to, we need to have a tree in
      place to keep track of the contiguous areas we have file extents for.
      
      Add helpers to use this tree, as it's not required for NO_HOLES file
      systems.  We will use this by setting DIRTY for areas we know we have
      file extent item's set, and clearing it when we remove file extent items
      for truncation.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      41a2ee75
  2. 31 1月, 2020 3 次提交
    • J
      btrfs: drop the -EBUSY case in __extent_writepage_io · 5ab58055
      Josef Bacik 提交于
      Now that we only return 0 or -EAGAIN from btrfs_writepage_cow_fixup, we
      do not need this -EBUSY case.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5ab58055
    • N
      btrfs: Correctly handle empty trees in find_first_clear_extent_bit · 5750c375
      Nikolay Borisov 提交于
      Raviu reported that running his regular fs_trim segfaulted with the
      following backtrace:
      
      [  237.525947] assertion failed: prev, in ../fs/btrfs/extent_io.c:1595
      [  237.525984] ------------[ cut here ]------------
      [  237.525985] kernel BUG at ../fs/btrfs/ctree.h:3117!
      [  237.525992] invalid opcode: 0000 [#1] SMP PTI
      [  237.525998] CPU: 4 PID: 4423 Comm: fstrim Tainted: G     U     OE     5.4.14-8-vanilla #1
      [  237.526001] Hardware name: ASUSTeK COMPUTER INC.
      [  237.526044] RIP: 0010:assfail.constprop.58+0x18/0x1a [btrfs]
      [  237.526079] Call Trace:
      [  237.526120]  find_first_clear_extent_bit+0x13d/0x150 [btrfs]
      [  237.526148]  btrfs_trim_fs+0x211/0x3f0 [btrfs]
      [  237.526184]  btrfs_ioctl_fitrim+0x103/0x170 [btrfs]
      [  237.526219]  btrfs_ioctl+0x129a/0x2ed0 [btrfs]
      [  237.526227]  ? filemap_map_pages+0x190/0x3d0
      [  237.526232]  ? do_filp_open+0xaf/0x110
      [  237.526238]  ? _copy_to_user+0x22/0x30
      [  237.526242]  ? cp_new_stat+0x150/0x180
      [  237.526247]  ? do_vfs_ioctl+0xa4/0x640
      [  237.526278]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
      [  237.526283]  do_vfs_ioctl+0xa4/0x640
      [  237.526288]  ? __do_sys_newfstat+0x3c/0x60
      [  237.526292]  ksys_ioctl+0x70/0x80
      [  237.526297]  __x64_sys_ioctl+0x16/0x20
      [  237.526303]  do_syscall_64+0x5a/0x1c0
      [  237.526310]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      That was due to btrfs_fs_device::aloc_tree being empty. Initially I
      thought this wasn't possible and as a percaution have put the assert in
      find_first_clear_extent_bit. Turns out this is indeed possible and could
      happen when a file system with SINGLE data/metadata profile has a 2nd
      device added. Until balance is run or a new chunk is allocated on this
      device it will be completely empty.
      
      In this case find_first_clear_extent_bit should return the full range
      [0, -1ULL] and let the caller handle this i.e for trim the end will be
      capped at the size of actual device.
      
      Link: https://lore.kernel.org/linux-btrfs/izW2WNyvy1dEDweBICizKnd2KDwDiDyY2EYQr4YCwk7pkuIpthx-JRn65MPBde00ND6V0_Lh8mW0kZwzDiLDv25pUYWxkskWNJnVP0kgdMA=@protonmail.com/
      Fixes: 45bfcfc1 ("btrfs: Implement find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5750c375
    • J
      btrfs: flush write bio if we loop in extent_write_cache_pages · 42ffb0bf
      Josef Bacik 提交于
      There exists a deadlock with range_cyclic that has existed forever.  If
      we loop around with a bio already built we could deadlock with a writer
      who has the page locked that we're attempting to write but is waiting on
      a page in our bio to be written out.  The task traces are as follows
      
        PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
         #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
         #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
         #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
         #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
         #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
         #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
         #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
         #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
         #8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
         #9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd
      
        PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
        "aio-dio-invalid"
         #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
         #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
         #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
         #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
         #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
         #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
         #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
         #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
         #8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
         #9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032
      
      I used drgn to find the respective pages we were stuck on
      
      page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
      page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874
      
      As you can see the kworker is waiting for bit 0 (PG_locked) on index
      7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
      8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
      following
      
        crash> struct extent_page_data ffffc900297bbbb0
        struct extent_page_data {
          bio = 0xffff889f747ed830,
          tree = 0xffff889eed6ba448,
          extent_locked = 0,
          sync_io = 0
        }
      
      Probably worth mentioning as well that it waits for writeback of the
      page to complete while holding a lock on it (at prepare_pages()).
      
      Using drgn I walked the bio pages looking for page
      0xffffea00fbfc7500 which is the one we're waiting for writeback on
      
        bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
        for i in range(0, bio.bi_vcnt.value_()):
            bv = bio.bi_io_vec[i]
            if bv.bv_page.value_() == 0xffffea00fbfc7500:
      	  print("FOUND IT")
      
      which validated what I suspected.
      
      The fix for this is simple, flush the epd before we loop back around to
      the beginning of the file during writeout.
      
      Fixes: b293f02e ("Btrfs: Add writepages support")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      42ffb0bf
  3. 20 1月, 2020 6 次提交
  4. 13 12月, 2019 1 次提交
  5. 19 11月, 2019 5 次提交
    • D
      btrfs: drop bdev argument from submit_extent_page · fa17ed06
      David Sterba 提交于
      After previous patches removing bdev being passed around to set it to
      bio, it has become unused in submit_extent_page. So it now has "only" 13
      parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa17ed06
    • D
      btrfs: remove extent_map::bdev · a019e9e1
      David Sterba 提交于
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a019e9e1
    • D
      btrfs: drop bio_set_dev where not needed · 1a418027
      David Sterba 提交于
      bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
      also changing some state bits if there was a different bdev set before.
      This is one thing that's not needed.
      
      Another thing is that setting a bdev at bio allocation time is too early
      and actually does not work with plain redundancy profiles, where each
      time we submit a bio to a device, the bdev is set correctly.
      
      In many places the bio bdev is set to latest_bdev that seems to serve as
      a stub pointer "just to put something to bio". But we don't have to do
      that.
      
      Where do we know which bdev to set:
      
      * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
      
      * repair IO: repair_io_failure, read or write from specific device
      
      * super block write (using buffer_heads but uses raw bdev) and barriers
      
      * scrub: this does not use all regular IO paths as it needs to reach all
        copies, verify and fixup eventually, and for that all bdev management
        is independent
      
      * raid56: rbio_add_io_page, for the RMW write
      
      * integrity-checker: does it's own low-level block tracking
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1a418027
    • D
      btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0
      David Sterba 提交于
      This is preparatory patch to remove @bdev parameter from
      submit_extent_page. It can't be removed completely, because the cgroups
      need it for wbc when initializing the bio
      
      wbc_init_bio
        bio_associate_blkg_from_css
          dereference bdev->bi_disk->queue
      
      The bdev pointer is the same as latest_bdev, thus no functional change.
      We can retrieve it from fs_devices that's reachable through several
      dereferences. The local variable shadows the parameter, but that's only
      temporary.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      429aebc0
    • D
      btrfs: sink write_flags to __extent_writepage_io · 57e5ffeb
      David Sterba 提交于
      __extent_writepage reads write flags from wbc and passes both to
      __extent_writepage_io. This makes write_flags redundant and we can
      remove it.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      57e5ffeb
  6. 18 11月, 2019 9 次提交
    • T
      btrfs: Avoid getting stuck during cyclic writebacks · f7bddf1e
      Tejun Heo 提交于
      During a cyclic writeback, extent_write_cache_pages() uses done_index
      to update the writeback_index after the current run is over.  However,
      instead of current index + 1, it gets to to the current index itself.
      
      Unfortunately, this, combined with returning on EOF instead of looping
      back, can lead to the following pathlogical behavior.
      
      1. There is a single file which has accumulated enough dirty pages to
         trigger balance_dirty_pages() and the writer appending to the file
         with a series of short writes.
      
      2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
      
      3. Writeback kicks in and the cursor is on the last page of the dirty
         file.  Writeback is started or skipped if already in progress.  As
         it's EOF, extent_write_cache_pages() returns and the cursor is set
         to done_index which is pointing to the last page.
      
      4. Writeback is done.  Nothing happens till balance_dirty_pages
         finishes, at which point we go back to #1.
      
      This can almost completely stall out writing back of the file and keep
      the system over dirty threshold for a long time which can mess up the
      whole system.  We encountered this issue in production with a package
      handling application which can reliably reproduce the issue when
      running under tight memory limits.
      
      Reading the comment in the error handling section, this seems to be to
      avoid accidentally skipping a page in case the write attempt on the
      page doesn't succeed.  However, this concern seems bogus.
      
      On each page, the code either:
      
      * Skips and moves onto the next page.
      
      * Fails issue and sets done_index to index + 1.
      
      * Successfully issues and continue to the next page if budget allows
        and not EOF.
      
      IOW, as long as it's not EOF and there's budget, the code never
      retries writing back the same page.  Only when a page happens to be
      the last page of a particular run, we end up retrying the page, which
      can't possibly guarantee anything data integrity related.  Besides,
      cyclic writes are only used for non-syncing writebacks meaning that
      there's no data integrity implication to begin with.
      
      Fix it by always setting done_index past the current page being
      processed.
      
      Note that this problem exists in other writepages too.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7bddf1e
    • C
      Btrfs: extent_write_locked_range() should attach inode->i_wb · dbb70bec
      Chris Mason 提交于
      extent_write_locked_range() is used when we're falling back to buffered
      IO from inside of compression.  It allocates its own wbc and should
      associate it with the inode's i_wb to make sure the IO goes down from
      the correct cgroup.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dbb70bec
    • C
      Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Chris Mason 提交于
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      Modified-and-reviewed-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ec39f769
    • C
      Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Chris Mason 提交于
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racey.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page() and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobufers() on chunkB->locked_page.
      
      Without cgroups in use, this creates as a phantom dirty page, which
      isn't great but isn't the end of the world. What can happen, it can go
      through the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make asyc_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1d53c9e6
    • J
      btrfs: move the failrec tree stuff into extent-io-tree.h · b3f167aa
      Josef Bacik 提交于
      This needs to be cleaned up in the future, but for now it belongs to the
      extent-io-tree stuff since it uses the internal tree search code.
      Needed to export get_state_failrec and set_state_failrec as well since
      we're not going to move the actual IO part of the failrec stuff out at
      this point.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b3f167aa
    • J
      btrfs: export find_delalloc_range · 083e75e7
      Josef Bacik 提交于
      This utilizes internal stuff to the extent_io_tree, so we need to export
      it before we move it.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      083e75e7
    • J
      btrfs: move extent_io_tree defs to their own header · 9c7d3a54
      Josef Bacik 提交于
      extent_io.c/h are huge, encompassing a bunch of different things.  The
      extent_io_tree code can live on its own, so separate this out.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c7d3a54
    • J
      btrfs: separate out the extent io init function · 6f0d04f8
      Josef Bacik 提交于
      We are moving extent_io_tree into it's on file, so separate out the
      extent_state init stuff from extent_io_tree_init().
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f0d04f8
    • J
      btrfs: separate out the extent leak code · 33ca832f
      Josef Bacik 提交于
      We check both extent buffer and extent state leaks in the same function,
      separate these two functions out so we can move them around.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      33ca832f
  7. 24 9月, 2019 2 次提交