1. 21 Feb 2020, 1 commit
    • Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof · a5ae50de
      Committed by Filipe Manana
      While logging the prealloc extents of an inode during a fast fsync we call
      btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
      holding a read lock on a leaf of the inode's root (not the log root, the
      fs/subvol root), and then that function locks the file range in the inode's
      iotree. This can lead to a deadlock when:
      
      * the fsync is ranged
      
      * the file has prealloc extents beyond eof
      
      * writeback for a range different from the fsync range starts
        during the fsync
      
      * the size of the file is not sector size aligned
      
      Because when finishing an ordered extent we lock first a file range and
      then try to COW the fs/subvol tree to insert an extent item.
      
      The following diagram shows how the deadlock can happen.
      
                 CPU 1                                        CPU 2
      
        btrfs_sync_file()
          --> for range [0, 1MiB)
      
          --> inode has a size of
              1MiB and has 1 prealloc
              extent beyond the
              i_size, starting at offset
              4MiB
      
          flushes all delalloc for the
          range [0MiB, 1MiB) and waits
          for the respective ordered
          extents to complete
      
                                                    --> before task at CPU 1 locks the
                                                        inode, a write into file range
                                                        [1MiB, 2MiB + 1KiB) is made
      
                                                    --> i_size is updated to 2MiB + 1KiB
      
                                                    --> writeback is started for that
                                                        range, [1MiB, 2MiB + 4KiB)
                                                        --> end offset rounded up to
                                                            be sector size aligned
      
          btrfs_log_dentry_safe()
            btrfs_log_inode_parent()
              btrfs_log_inode()
      
                btrfs_log_changed_extents()
                  btrfs_log_prealloc_extents()
                    --> does a search on the
                        inode's root
                    --> holds a read lock on
                        leaf X
      
                                                    btrfs_finish_ordered_io()
                                                      --> locks range [1MiB, 2MiB + 4KiB)
                                                          --> end offset rounded up
                                                              to be sector size aligned
      
                                                      --> tries to cow leaf X, through
                                                          insert_reserved_file_extent()
                                                          --> already locked by the
                                                              task at CPU 1
      
                    btrfs_truncate_inode_items()
      
                      --> gets an i_size of
                          2MiB + 1KiB, which is
                          not sector size
                          aligned
      
                      --> tries to lock file
                          range [2MiB, (u64)-1)
                          --> the start range
                              is rounded down
                              from 2MiB + 1K
                              to 2MiB to be sector
                              size aligned
      
                          --> but the subrange
                              [2MiB, 2MiB + 4KiB) is
                              already locked by
                              task at CPU 2 which
                              is waiting to get a
                              write lock on leaf X
                              for which we are
                              holding a read lock
      
                                      *** deadlock ***
      
      This results in a stack trace like the following, triggered by test case
      generic/561 from fstests:
      
        [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
        [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
        [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [ 2779.990466] Call Trace:
        [ 2779.990491]  ? __schedule+0x384/0xa30
        [ 2779.990521]  schedule+0x33/0xe0
        [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
        [ 2779.990632]  ? remove_wait_queue+0x60/0x60
        [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
        [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
        [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
        [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
        [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
        [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
        [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
        [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
        [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
        [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
        [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [ 2779.991432]  process_one_work+0x26d/0x6a0
        [ 2779.991460]  worker_thread+0x4f/0x3e0
        [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
        [ 2779.991489]  kthread+0x103/0x140
        [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
        [ 2779.991515]  ret_from_fork+0x3a/0x50
        (...)
        [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
        [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
        [ 2780.030038] Call Trace:
        [ 2780.030044]  ? __schedule+0x384/0xa30
        [ 2780.030052]  schedule+0x33/0xe0
        [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
        [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
        [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
        [ 2780.030102]  ? remove_wait_queue+0x60/0x60
        [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
        [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
        [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
        [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
        [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
        [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
        [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
        [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        [ 2780.030282]  ? dget_parent+0xa1/0x370
        [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
        [ 2780.030339]  do_fsync+0x38/0x60
        [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
        [ 2780.030345]  do_syscall_64+0x5c/0x280
        [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
        [ 2780.030361] Code: Bad RIP value.
        [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
        [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
        [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
        [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
        [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
      
      So fix this by making btrfs_truncate_inode_items() not lock the range in
      the inode's iotree when the target root is a log root, since for log
      roots the inode's lock and the log_mutex already provide all the
      protection that is needed.
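      
      As a rough sketch of that idea (the truncate_needs_extent_lock() helper
      and the exact placement are illustrative, not the verbatim patch;
      lock_start and cached_state are the locals introduced by the range
      locking added in 28553fa9, shown further below):
      
        /* Only fs/subvol roots need the io tree range lock; log tree items
         * are already serialized by the inode's lock and root->log_mutex. */
        static bool truncate_needs_extent_lock(const struct btrfs_root *root)
        {
        	return root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID;
        }
      
        	/* inside btrfs_truncate_inode_items() */
        	if (truncate_needs_extent_lock(root))
        		lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start,
        				 (u64)-1, &cached_state);
        	/* ... drop file extent items past the new size ... */
        	if (truncate_needs_extent_lock(root))
        		unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
        				     (u64)-1, &cached_state);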
      
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5ae50de
  2. 19 Feb 2020, 1 commit
    • btrfs: fix bytes_may_use underflow in prealloc error condition · b778cf96
      Committed by Josef Bacik
      I hit the following warning while running my error injection stress
      testing:
      
        WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        Call Trace:
        btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
        __btrfs_prealloc_file_range+0x378/0x470 [btrfs]
        elfcorehdr_read+0x40/0x40
        ? elfcorehdr_read+0x40/0x40
        ? btrfs_commit_transaction+0xca/0xa50 [btrfs]
        ? dput+0xb4/0x2a0
        ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
        ? btrfs_sync_file+0x30e/0x420 [btrfs]
        ? do_fsync+0x38/0x70
        ? __x64_sys_fdatasync+0x13/0x20
        ? do_syscall_64+0x5b/0x1b0
        ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens if we fail to insert our reserved file extent.  At this
      point we've already converted our reservation from ->bytes_may_use to
      ->bytes_reserved.  However once we break we will attempt to free
      everything from [cur_offset, end] from ->bytes_may_use, but our extent
      reservation will overlap part of this.
      
      Fix this problem by adding ins.offset (our extent allocation size) to
      cur_offset so we remove the actual remaining part from ->bytes_may_use.
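      
      In sketch form (the insert_reserved_file_extent() arguments are elided
      and data_reserved is the extent_changeset used elsewhere in
      __btrfs_prealloc_file_range(); not the verbatim patch):
      
        	ret = insert_reserved_file_extent(trans, inode, cur_offset, ...);
        	if (ret) {
        		btrfs_free_reserved_extent(fs_info, ins.objectid,
        					   ins.offset, 0);
        		/*
        		 * [cur_offset, cur_offset + ins.offset) was already moved
        		 * from ->bytes_may_use to ->bytes_reserved, so step past
        		 * it before the cleanup below releases ->bytes_may_use.
        		 */
        		cur_offset += ins.offset;
        		break;
        	}
      
        	/* after the loop */
        	if (ret && cur_offset < end)
        		btrfs_free_reserved_data_space(inode, data_reserved,
        					       cur_offset,
        					       end - cur_offset + 1);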
      
      I validated this fix using my inject-error.py script
      
      python inject-error.py -o should_fail_bio -t cache_save_setup -t \
      	__btrfs_prealloc_file_range \
      	-t insert_reserved_file_extent.constprop.0 \
      	-r "-5" ./run-fsstress.sh
      
      where run-fsstress.sh simply mounts and runs fsstress on a disk.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b778cf96
  3. 17 Feb 2020, 1 commit
    • btrfs: don't set path->leave_spinning for truncate · 52e29e33
      Committed by Josef Bacik
      The only time we actually leave the path spinning is if we're truncating
      a small amount and don't actually free an extent, which is not a common
      occurrence.  We have to set the path blocking in order to add the
      delayed ref anyway, so as soon as we find the first extent we set the
      path to blocking and stay blocking for the duration of the operation.
      With the upcoming file extent map stuff there will be another case where
      we have to have the path blocking, so just switch to always blocking.
      
      Note: this patch also fixes a warning that appeared after 28553fa9
      ("Btrfs: fix race between shrinking truncate and fiemap") got merged;
      that commit inserts extent locks around truncation, so the path must not
      hold spinning locks after btrfs_search_slot().
      
        [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
        [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
        [70.794863] 5 locks held by rsync/1141:
        [70.794876]  #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
        [70.795030]  #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
        [70.795051]  #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
        [70.795124]  #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795203]  #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
        [70.795362] Call Trace:
        [70.795374]  dump_stack+0x71/0xa0
        [70.795445]  ___might_sleep.part.96.cold.106+0xa6/0xb6
        [70.795459]  kmem_cache_alloc+0x1d3/0x290
        [70.795471]  alloc_extent_state+0x22/0x1c0
        [70.795544]  __clear_extent_bit+0x3ba/0x580
        [70.795557]  ? _raw_spin_unlock_irq+0x24/0x30
        [70.795569]  btrfs_truncate_inode_items+0x339/0xe50
        [70.795647]  btrfs_evict_inode+0x269/0x540
        [70.795659]  ? dput.part.38+0x29/0x460
        [70.795671]  evict+0xcd/0x190
        [70.795682]  __dentry_kill+0xd6/0x180
        [70.795754]  dput.part.38+0x2ad/0x460
        [70.795765]  do_renameat2+0x3cb/0x540
        [70.795777]  __x64_sys_rename+0x1c/0x20
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      52e29e33
  4. 13 Feb 2020, 1 commit
    • Btrfs: fix race between shrinking truncate and fiemap · 28553fa9
      Committed by Filipe Manana
      When there is a fiemap executing in parallel with a shrinking truncate
      we can end up in a situation where we have extent maps for which we no
      longer have corresponding file extent items. This is generally harmless
      and at the moment the only consequences are missing file extent items
      representing holes after we expand the file size again after the
      truncate operation removed the prealloc extent items, and stale
      information for future fiemap calls (reporting extents that no longer
      exist or may have been reallocated to other files for example).
      
      Consider the following example:
      
      1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
         and a 1MiB prealloc extent at file offset 128KiB;
      
      2) Task A starts doing a shrinking truncate of our inode to reduce it to
         a size of 64KiB. Before it searches the subvolume tree for file
         extent items to delete, it drops all the extent maps in the range
         from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
      
      3) Task B starts doing a fiemap against our inode. When looking up for
         the inode's extent maps in the range from 128KiB to (u64)-1, it
         doesn't find any in the inode's extent map tree, since they were
         removed by task A.  Because it didn't find any in the extent map
         tree, it scans the inode's subvolume tree for file extent items, and
         it finds the 1MiB prealloc extent at file offset 128KiB, then it
         creates an extent map based on that file extent item and adds it to
         inode's extent map tree (this ends up being done by
         btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
         get_extent_skip_holes());
      
      4) Task A then drops the prealloc extent at file offset 128KiB and
         shrinks the 128KiB extent at file offset 0 to a length of 64KiB. The
         truncation operation finishes and we end up with an extent map
         representing a 1MiB prealloc extent at file offset 128KiB, even
         though that extent no longer exists;
      
      After this the two types of problems we have are:
      
      1) Future calls to fiemap always report that a 1MiB prealloc extent
         exists at file offset 128KiB. This is stale information, no longer
         correct;
      
      2) If the size of the file is increased, by a truncate operation that
         increases the file size or by a write into a file offset > 64KiB for
         example, we end up not inserting file extent items to represent holes
         for any range between 128KiB and 128KiB + 1MiB, since the hole
         expansion function, btrfs_cont_expand() will skip hole insertion for
         any range for which an extent map exists that represents a prealloc
         extent. This causes fsck to complain about missing file extent items
         when not using the NO_HOLES feature.
      
      The second issue could be often triggered by test case generic/561 from
      fstests, which runs fsstress and duperemove in parallel, and duperemove
      does frequent fiemap calls.
      
      Essentially the problem happens because fiemap does not acquire the
      inode's lock while truncate does, and fiemap locks the file range in the
      inode's iotree while truncate does not. So fix the issue by making
      btrfs_truncate_inode_items() lock the file range from the new file size
      to (u64)-1, so that it serializes with fiemap.
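      
      A minimal sketch of the serialization (names follow the btrfs code of
      this period; the truncate loop body is elided):
      
        	/* btrfs_truncate_inode_items(), shrinking truncate */
        	struct extent_state *cached_state = NULL;
        	const u64 lock_start = round_down(new_size, fs_info->sectorsize);
      
        	lock_extent_bits(&BTRFS_I(inode)->io_tree, lock_start, (u64)-1,
        			 &cached_state);
        	/* ... search the subvolume tree and drop file extent items
        	 *     beyond the new size ... */
        	unlock_extent_cached(&BTRFS_I(inode)->io_tree, lock_start,
        			     (u64)-1, &cached_state);
      
        	/* fiemap takes lock_extent_bits() on the same range before it
        	 * looks up file extent items, so it can no longer cache an
        	 * extent map for items that the truncate is deleting. */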
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28553fa9
  5. 31 Jan 2020, 2 commits
    • btrfs: do not do delalloc reservation under page lock · f4b1363c
      Committed by Josef Bacik
      We ran into a deadlock in production with the fixup worker.  The stack
      traces were as follows:
      
      Thread responsible for the writeout, waiting on the page lock
      
        [<0>] io_schedule+0x12/0x40
        [<0>] __lock_page+0x109/0x1e0
        [<0>] extent_write_cache_pages+0x206/0x360
        [<0>] extent_writepages+0x40/0x60
        [<0>] do_writepages+0x31/0xb0
        [<0>] __writeback_single_inode+0x3d/0x350
        [<0>] writeback_sb_inodes+0x19d/0x3c0
        [<0>] __writeback_inodes_wb+0x5d/0xb0
        [<0>] wb_writeback+0x231/0x2c0
        [<0>] wb_workfn+0x308/0x3c0
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thread of the fixup worker who is holding the page lock
      
        [<0>] start_delalloc_inodes+0x241/0x2d0
        [<0>] btrfs_start_delalloc_roots+0x179/0x230
        [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0
        [<0>] btrfs_check_data_free_space+0x53/0xa0
        [<0>] btrfs_delalloc_reserve_space+0x20/0x70
        [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0
        [<0>] normal_work_helper+0x11c/0x360
        [<0>] process_one_work+0x1e0/0x390
        [<0>] worker_thread+0x2b/0x3c0
        [<0>] kthread+0x113/0x130
        [<0>] ret_from_fork+0x35/0x40
        [<0>] 0xffffffffffffffff
      
      Thankfully the stars have to align just right to hit this.  First you
      have to end up in the fixup worker, which is tricky by itself (my
      reproducer does DIO reads into a MMAP'ed region, so not a common
      operation).  Then you have to have less than a page size of free data
      space and 0 unallocated space so you go down the "commit the transaction
      to free up pinned space" path.  This was accomplished by a random
      balance that was running on the host.  Then you get this deadlock.
      
      I'm still in the process of trying to force the deadlock to happen on
      demand, but I've hit other issues.  I can still trigger the fixup worker
      path itself so this patch has been tested in that regard, so the normal
      case is fine.
      
      Fixes: 87826df0 ("btrfs: delalloc for page dirtied out-of-band in fixup worker")
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f4b1363c
    • Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker · 25f3c502
      Committed by Chris Mason
      For COW, btrfs expects dirty pages to have been through a few setup
      steps.  This includes reserving space for the new block allocations and marking
      the range in the state tree for delayed allocation.
      
      A few places outside btrfs will dirty pages directly, especially when unmapping
      mmap'd pages.  In order for these to properly go through COW, we run them
      through a fixup worker to wait for stable pages, and do the delalloc prep.
      
      87826df0 added a window where the dirty pages were cleaned, but pending
      more action from the fixup worker.  We clear_page_dirty_for_io() before
      we call into writepage, so the page is no longer dirty.  The commit
      changed it so now we leave the page clean between unlocking it here and
      the fixup worker starting at some point in the future.
      
      During this window, page migration can jump in and relocate the page.  Once our
      fixup work actually starts, it finds page->mapping is NULL and we end up
      freeing the page without ever writing it.
      
      This leads to crc errors and other exciting problems, since it screws up the
      whole state machine for waiting for ordered extents.  The fix here is to keep
      the page dirty while we're waiting for the fixup worker to get to work.
      This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup
      if we queued the page up for fixup, which will cause the writepage
      function to redirty the page.
      
      Because we now expect the page to be dirty once it gets to the fixup
      worker we must adjust the error cases to call clear_page_dirty_for_io()
      on the page.  That is the bulk of the patch, but it is not the fix, the
      fix is the -EAGAIN from btrfs_writepage_cow_fixup.  We cannot separate
      these two changes out because the error conditions change with the new
      expectations.
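      
      The caller side, sketched (the call site in __extent_writepage() is
      simplified and the bookkeeping around it is elided):
      
        	ret = btrfs_writepage_cow_fixup(page, start, page_end);
        	if (ret == -EAGAIN) {
        		/*
        		 * The page was handed to the fixup worker: keep it dirty
        		 * so page migration cannot free it before the worker
        		 * runs, and skip writing it out now.
        		 */
        		redirty_page_for_writepage(wbc, page);
        		unlock_page(page);
        		return 1;	/* page handled, nothing written */
        	}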
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25f3c502
  6. 20 Jan 2020, 9 commits
  7. 08 Jan 2020, 2 commits
    • btrfs: fix invalid removal of root ref · d49d3287
      Committed by Josef Bacik
      If we have the following sequence of events
      
        btrfs sub create A
        btrfs sub create A/B
        btrfs sub snap A C
        mkdir C/foo
        mv A/B C/foo
        rm -rf *
      
      We will end up with a transaction abort.
      
      The reason for this is because we create a root ref for B pointing to A.
      When we create a snapshot of C we still have B in our tree, but because
      the root ref points to A and not C we will make it appear to be empty.
      
      The problem happens when we move B into C.  This removes the root ref
      for B pointing to A and adds a ref of B pointing to C.  When we rmdir C
      we'll see that we have a ref to our root and remove the root ref,
      despite not actually matching our reference name.
      
      Now btrfs_del_root_ref() allowing this to work is a bug as well, but
      since we know that this inode does not actually point to a root ref in
      the first place, we shouldn't be calling btrfs_del_root_ref() at all;
      instead simply look up our dir index for this item and do the rest of
      the removal.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d49d3287
    • btrfs: rework arguments of btrfs_unlink_subvol · 045d3967
      Committed by Josef Bacik
      btrfs_unlink_subvol takes the name of the dentry and the root objectid
      based on what kind of inode this is, either a real subvolume link or an
      empty one that we inherited as a snapshot.  We need to fix how we unlink
      in the case for BTRFS_EMPTY_SUBVOL_DIR_OBJECTID in the future, so rework
      btrfs_unlink_subvol to just take the dentry and handle getting the right
      objectid given the type of inode this is.  There is no functional change
      here, simply pushing the work into btrfs_unlink_subvol() proper.
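      
      Roughly, the reworked helper derives the objectid itself; a sketch with
      error handling simplified and the rest of the unlink logic elided:
      
        static int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
        			       struct inode *dir, struct dentry *dentry)
        {
        	struct inode *inode = d_inode(dentry);
        	u64 objectid;
      
        	if (btrfs_ino(BTRFS_I(inode)) == BTRFS_FIRST_FREE_OBJECTID) {
        		/* real subvolume link */
        		objectid = BTRFS_I(inode)->root->root_key.objectid;
        	} else if (btrfs_ino(BTRFS_I(inode)) ==
        		   BTRFS_EMPTY_SUBVOL_DIR_OBJECTID) {
        		/* empty dir inherited from a snapshot */
        		objectid = BTRFS_I(inode)->location.objectid;
        	} else {
        		return -EINVAL;
        	}
      
        	/* ... the existing unlink logic, now using 'objectid' ... */
        	return 0;
        }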
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      045d3967
  8. 30 Dec 2019, 1 commit
    • Btrfs: fix infinite loop during nocow writeback due to race · de7999af
      Committed by Filipe Manana
      When starting writeback for a range that covers part of a preallocated
      extent, due to a race with writeback for another range that also covers
      another part of the same preallocated extent, we can end up in an infinite
      loop.
      
      Consider the following example where for inode 280 we have two dirty
      ranges:
      
        range A, from 294912 to 303103, 8192 bytes
        range B, from 348160 to 438271, 90112 bytes
      
      and we have the following file extent item layout for our inode:
      
        leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
            (...)
            item 27 key (280 108 200704) itemoff 14598 itemsize 53
                extent data disk bytenr 0 nr 0 type 1 (regular)
                extent data offset 0 nr 94208 ram 94208
            item 28 key (280 108 294912) itemoff 14545 itemsize 53
                extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
                extent data offset 0 nr 81920 ram 81920
      
      Then the following happens:
      
      1) Writeback starts for range B (from 348160 to 438271), execution of
         run_delalloc_nocow() starts;
      
      2) The first iteration of run_delalloc_nocow()'s while loop leaves us at
         the extent item at slot 28, pointing to the prealloc extent item
         covering the range from 294912 to 376831. This extent covers part of
         our range;
      
      3) An ordered extent is created against that extent, covering the file
         range from 348160 to 376831 (28672 bytes);
      
      4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
         the while loop;
      
      5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
         pointing to slot 29, 1 slot after the last item (the extent item
         we processed in the previous iteration);
      
      6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
         which releases the search path before doing another search for the
         last key of the leaf (280 108 294912);
      
      7) Right after btrfs_next_leaf() released the path, and before it did
         another search for the last key of the leaf, writeback for the range
         A (from 294912 to 303103) completes (it was previously started at
         some point);
      
      8) Upon completion of the ordered extent for range A, the prealloc extent
         we previously found got split into two extent items, one covering the
         range from 294912 to 303103 (8192 bytes), with a type of regular extent
         (and no longer prealloc) and another covering the range from 303104 to
         376831 (73728 bytes), with a type of prealloc and an offset of 8192
         bytes. So our leaf now has the following layout:
      
           leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
               (...)
               item 27 key (280 108 200704) itemoff 14598 itemsize 53
                   extent data disk bytenr 0 nr 0 type 1
                   extent data offset 0 nr 8192 ram 94208
               item 28 key (280 108 208896) itemoff 14545 itemsize 53
                   extent data disk bytenr 10433142784 nr 86016 type 1
                   extent data offset 0 nr 86016 ram 86016
               item 29 key (280 108 294912) itemoff 14492 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 1
                   extent data offset 0 nr 8192 ram 81920
               item 30 key (280 108 303104) itemoff 14439 itemsize 53
                   extent data disk bytenr 10433052672 nr 81920 type 2
                   extent data offset 8192 nr 73728 ram 81920
      
      9) After btrfs_next_leaf() returns, we have our path pointing to that same
         leaf and at slot 30, since it has a key we didn't have before and it's
         the first key greater than the key that was previously the last key of
         the leaf (key (280 108 294912));
      
      10) The extent item at slot 30 covers the range from 303104 to 376831
          which is in our target range, so we process it, despite having already
          created an ordered extent against this extent for the file range from
          348160 to 376831. This is because we skip to the next extent item only
          if its end is less than or equal to the start of our delalloc range,
          and not less than or equal to the current offset ('cur_offset');
      
      11) As a result we compute 'num_bytes' as:
      
          num_bytes = min(end + 1, extent_end) - cur_offset;
                    = min(438271 + 1, 376832) - 376832 = 0
      
      12) We then call create_io_em() for a 0 bytes range starting at offset
          376832;
      
      13) Then create_io_em() enters an infinite loop because its calls to
          btrfs_drop_extent_cache() do nothing due to the 0 length range
          passed to it. So no existing extent maps that cover the offset
          376832 get removed, and therefore calls to add_extent_mapping()
          return -EEXIST, resulting in an infinite loop. This loop from
          create_io_em() is the following:
      
          do {
              btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
                                      em->start + em->len - 1, 0);
              write_lock(&em_tree->lock);
              ret = add_extent_mapping(em_tree, em, 1);
              write_unlock(&em_tree->lock);
              /*
               * The caller has taken lock_extent(), who could race with us
               * to add em?
               */
          } while (ret == -EEXIST);
      
      Also, each call to btrfs_drop_extent_cache() triggers a warning because
      the end offset passed to it (376832 - 1) is smaller than the start
      offset passed to it (376832), a consequence of the 0 length:
      
        [258532.052621] ------------[ cut here ]------------
        [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G        W         5.4.0-rc7-btrfs-next-64 #1
        [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
        (...)
        [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
        [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
        [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
        [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
        [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
        [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
        [258532.052705] FS:  00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
        [258532.052707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
        [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [258532.052717] Call Trace:
        [258532.052718]  ? preempt_schedule_common+0x32/0x70
        [258532.052722]  ? ___preempt_schedule+0x16/0x20
        [258532.052741]  create_io_em+0xff/0x180 [btrfs]
        [258532.052767]  run_delalloc_nocow+0x942/0xb10 [btrfs]
        [258532.052791]  btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
        [258532.052812]  ? find_lock_delalloc_range+0x221/0x250 [btrfs]
        [258532.052834]  writepage_delalloc+0xe4/0x140 [btrfs]
        [258532.052855]  __extent_writepage+0x110/0x4e0 [btrfs]
        [258532.052876]  extent_write_cache_pages+0x21c/0x480 [btrfs]
        [258532.052906]  extent_writepages+0x52/0xb0 [btrfs]
        [258532.052911]  do_writepages+0x23/0x80
        [258532.052915]  __filemap_fdatawrite_range+0xd2/0x110
        [258532.052938]  btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
        [258532.052954]  start_ordered_ops+0x57/0xa0 [btrfs]
        [258532.052973]  ? btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052988]  btrfs_sync_file+0x225/0x490 [btrfs]
        [258532.052997]  __x64_sys_msync+0x199/0x200
        [258532.053004]  do_syscall_64+0x5c/0x250
        [258532.053007]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [258532.053010] RIP: 0033:0x7f86a7dfd760
        (...)
        [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
        [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
        [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
        [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
        [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
        [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
        [258532.053032] irq event stamp: 1653450494
        [258532.053035] hardirqs last  enabled at (1653450493): [<ffffffff9dec69f9>] _raw_spin_unlock_irq+0x29/0x50
        [258532.053037] hardirqs last disabled at (1653450494): [<ffffffff9d4048ea>] trace_hardirqs_off_thunk+0x1a/0x20
        [258532.053039] softirqs last  enabled at (1653449852): [<ffffffff9e200466>] __do_softirq+0x466/0x6bd
        [258532.053042] softirqs last disabled at (1653449845): [<ffffffff9d4c8a0c>] irq_exit+0xec/0x120
        [258532.053043] ---[ end trace 8476fce13d9ce20a ]---
      
      Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
      uses WARN_ON() and not WARN_ON_ONCE().
      
      So fix this issue by changing run_delalloc_nocow()'s loop to move to the
      next extent item when the current extent item ends at an offset less
      than or equal to the current offset ('cur_offset') instead of the start
      offset of the delalloc range.
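      
      The corrected check, sketched in the context of run_delalloc_nocow()'s
      loop (labels and surrounding code elided):
      
        	/*
        	 * Skip extent items that end at or before the position we have
        	 * already processed ('cur_offset'), not merely before the start
        	 * of the whole delalloc range; otherwise a re-read leaf can hand
        	 * us an extent we already covered and num_bytes becomes 0.
        	 */
        	if (extent_end <= cur_offset) {
        		path->slots[0]++;
        		goto next_slot;
        	}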
      
      Fixes: 80ff3856 ("Btrfs: update nodatacow code v2")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      de7999af
  9. 13 Dec 2019, 2 commits
    • btrfs: don't double lock the subvol_sem for rename exchange · 943eb3bf
      Committed by Josef Bacik
      If we're rename exchanging two subvols we'll try to lock this lock
      twice, which is bad.  Just lock once if either of the inodes is a
      subvol.
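      
      In sketch form (old_ino and new_ino are the btrfs inode numbers of the
      two inodes being exchanged):
      
        	/*
        	 * Take subvol_sem at most once: a second down_read() from the
        	 * same task can block forever behind a queued writer.
        	 */
        	if (old_ino == BTRFS_FIRST_FREE_OBJECTID ||
        	    new_ino == BTRFS_FIRST_FREE_OBJECTID)
        		down_read(&fs_info->subvol_sem);
      
        	/* ... rename exchange work ... */
      
        	if (old_ino == BTRFS_FIRST_FREE_OBJECTID ||
        	    new_ino == BTRFS_FIRST_FREE_OBJECTID)
        		up_read(&fs_info->subvol_sem);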
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      943eb3bf
    • btrfs: do not call synchronize_srcu() in inode_tree_del · f72ff01d
      Committed by Josef Bacik
      Testing with the new fsstress uncovered a pretty nasty deadlock with
      lookup and snapshot deletion.
      
      Process A
      unlink
       -> final iput
         -> inode_tree_del
           -> synchronize_srcu(subvol_srcu)
      
      Process B
      btrfs_lookup  <- srcu_read_lock() acquired here
        -> btrfs_iget
          -> find inode that has I_FREEING set
            -> __wait_on_freeing_inode()
      
      We're holding the srcu_read_lock() while doing the iget in order to make
      sure our fs root doesn't go away, and then we are waiting for the inode
      to finish freeing.  However, because the freeing process is doing a
      synchronize_srcu() we deadlock.
      
      Fix this by dropping the synchronize_srcu() in inode_tree_del().  We
      don't need people to stop accessing the fs root at this point, we're
      only adding our empty root to the dead roots list.
      
      A larger much more invasive fix is forthcoming to address how we deal
      with fs roots, but this fixes the immediate problem.
      
      Fixes: 76dda93c ("Btrfs: add snapshot/subvolume destroy ioctl")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f72ff01d
  10. 19 Nov 2019, 5 commits
    • btrfs: remove extent_map::bdev · a019e9e1
      Committed by David Sterba
      We can now remove the bdev from extent_map. Previous patches made sure
      that bio_set_dev is called correctly in all places and that we don't need to
      grab it from latest_bdev or pass it around inside the extent map.
      Signed-off-by: David Sterba <dsterba@suse.com>
      a019e9e1
    • btrfs: record all roots for rename exchange on a subvol · 3e174099
      Committed by Josef Bacik
      Testing with the new fsstress support for subvolumes uncovered a pretty
      bad problem with rename exchange on subvolumes.  We're modifying two
      different subvolumes, but we only start the transaction on one of them,
      so the other one is not added to the dirty root list.  This is caught by
      btrfs_cow_block() with a warning because the root has not been updated,
      however if we do not modify this root again we'll end up pointing at an
      invalid root because the root item is never updated.
      
      Fix this by making sure we add the destination root to the trans list,
      the same as we do with normal renames.  This fixes the corruption.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e174099
    • btrfs: rename btrfs_block_group_cache · 32da5386
      Committed by David Sterba
      The type name is misleading, a single entry is named 'cache' while this
      normally means a collection of objects. Rename that everywhere. Also the
      identifier was quite long, making function prototypes harder to format.
      Suggested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      32da5386
    • btrfs: sink write flags to cow_file_range_async · fac07d2b
      Committed by David Sterba
      In commit "Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios",
      cow_file_range_async gained wbc as a parameter and this makes passing
      write flags redundant. Set it inside the function and remove the
      parameter.
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fac07d2b
    • Btrfs: remove unnecessary delalloc mutex for inodes · 16ad3be1
      Committed by Filipe Manana
      The inode delalloc mutex was added a long time ago by commit f248679e
      ("Btrfs: add a delalloc mutex to inodes for delalloc reservations"), and
      the reason for its introduction is not very clear from the change log. It
      claims it solves bogus warnings from lockdep, however it lacks an example
      report/warning from lockdep, or any explanation.
      
      Since we have enough concurrency protection from the locks of the space
      info and block reserve objects, and such lockdep warnings don't seem to
      exist anymore (at least on a 5.3 kernel I couldn't get them with fstests,
      ltp, fs_mark, etc), remove it, simplifying things a bit and decreasing
      the size of the btrfs_inode structure. With some quick fio tests doing
      direct IO and mmap writes I couldn't observe any significant performance
      increase either (direct IO writes that don't increase the file's size
      don't hold the inode's lock for their entire duration and mmap writes
      don't hold the inode's lock at all), which are the only type of writes
      that could see any performance gain due to less serialization.
      
      Review feedback from Josef:
      
      The problem was taking the i_mutex in mmap, which is how I was
      protecting delalloc reservations originally.  The delalloc mutex didn't
      come with all of the other dependencies.  That's what the lockdep
      messages were about, removing the lock isn't going to make them appear
      again.
      
      We _had_ to lock around this because we used to do tricks to keep from
      over-reserving, and if we didn't serialize delalloc reservations we'd
      end up with ugly accounting problems when we tried to clean things up.
      
      However with my recentish changes this isn't the case anymore.  Every
      operation is responsible for reserving its space, and then adding it to
      the inode.  Then cleaning up is straightforward and can't be mucked up
      by other users.  So we no longer need the delalloc mutex to save us from
      ourselves.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      16ad3be1
  11. 18 Nov 2019, 8 commits
    • btrfs: get bdev from latest_dev for dio bh_result · 8530c37a
      Committed by David Sterba
      To remove use of extent_map::bdev we need to find a replacement, and the
      latest_bdev is the only one we can use here, because inode::i_bdev and
      superblock::s_bdev are NULL.
      
      The DIO code uses bdev in two places:
      
      * to read blocksize to perform alignment checks in
        do_blockdev_direct_IO, but we do them in btrfs code before any call to
        DIO
      
      * in the following call chain:
      
        do_direct_IO
          get_more_blocks
           sdio->get_block() <-- this is btrfs_get_blocks_direct
      
        subsequently the map_bh->b_dev member is used in clean_bdev_aliases
        and dio_new_bio to set the bio's bdev to that of the buffer_head.
        However, because we have provided a submit function dio_bio_submit
        calls our submission function and ignores the bdev.
      
      So it's safe to pass any valid bdev that's used within the filesystem.
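      
      Concretely, the DIO code can do something like the following when
      filling in the buffer_head (a sketch):
      
        	/* Any valid bdev of this filesystem will do: btrfs provides its
        	 * own submit hook that picks the real target device per bio, so
        	 * the buffer_head's bdev is only used for generic bookkeeping. */
        	bh_result->b_bdev = fs_info->fs_devices->latest_bdev;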
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8530c37a
    • Btrfs: fix metadata space leak on fixup worker failure to set range as delalloc · 53687007
      Committed by Filipe Manana
      In the fixup worker, if we fail to mark the range as delalloc in the io
      tree, we must release the previously reserved metadata, as well as update
      the outstanding extents counter for the inode, otherwise we leak metadata
      space.
      
      In practice we can't return an error from btrfs_set_extent_delalloc(),
      which is just a wrapper around __set_extent_bit(), as for most errors
      __set_extent_bit() does a BUG_ON() (or panics which hits a BUG_ON() as
      well) and returning an -EEXIST error doesn't happen in this case since
      the exclusive bits parameter always has a value of 0 through this code
      path. Nevertheless, just fix the error handling in the fixup worker,
      in case one day __set_extent_bit() can return an error to this code
      path.
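      
      The error handling, sketched (the reservation size and labels are
      illustrative; function names follow this era's btrfs code):
      
        	ret = btrfs_set_extent_delalloc(inode, page_start, page_end, 0,
        					&cached_state);
        	if (ret) {
        		/*
        		 * Undo the reservation taken earlier in the fixup worker,
        		 * otherwise the metadata space and the outstanding extent
        		 * accounting are leaked.
        		 */
        		btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
        		btrfs_delalloc_release_space(inode, data_reserved,
        					     page_start, PAGE_SIZE, true);
        		goto out_page;
        	}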
      
      Fixes: f3038ee3 ("btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      53687007
    • btrfs: Rename btrfs_join_transaction_nolock · 8d510121
      Committed by Nikolay Borisov
      This function is used only during the final phase of freespace cache
      writeout. This is necessary since using the plain btrfs_join_transaction
      api is deadlock prone. The deadlock looks like:
      
      T1:
      btrfs_commit_transaction
        commit_cowonly_roots
          btrfs_write_dirty_block_groups
            btrfs_wait_cache_io
              __btrfs_wait_cache_io
             btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
                                          inode and blocks transaction commit
      				    until freespace cache writeout
      
      T2: <-- after T1 has triggered the writeout
      finish_ordered_fn
        btrfs_finish_ordered_io
          btrfs_join_transaction <--- this would block waiting for current
                                      transaction to commit, but since trans
      				commit is waiting for this writeout to
      				finish
      
      The special purpose function prevents it by simply skipping the "wait
      for writeout" since it's guaranteed the transaction won't proceed until
      we are done.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d510121
    • Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Committed by Chris Mason
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
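      
      On the submission side this boils down to something like the following
      (blkcg_css stands for the css captured from the wbc when the async work
      was queued; a sketch, not the full patch):
      
        	if (blkcg_css) {
        		/* charge the IO to the cgroup that dirtied the pages and
        		 * let the block layer punt the submission to a per-blkcg
        		 * worker, so the helper thread causes no priority
        		 * inversion */
        		bio_associate_blkg_from_css(bio, blkcg_css);
        		bio->bi_opf |= REQ_CGROUP_PUNT;
        	}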
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec39f769
    • Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Committed by Chris Mason
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racy.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobuffers() on chunkB->locked_page.
      
      Without cgroups in use, this creates a phantom dirty page, which
      isn't great but isn't the end of the world. What can happen is that
      the page goes through the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make async_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
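      
      Sketched in cow_file_range_async() terms (bounds check simplified):
      
        	if (locked_page && page_offset(locked_page) >= start &&
        	    page_offset(locked_page) <= cur_end) {
        		/* only this chunk may unlock or redirty the page */
        		async_chunk[i].locked_page = locked_page;
        		locked_page = NULL;	/* hand ownership off once */
        	} else {
        		async_chunk[i].locked_page = NULL;
        	}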
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d53c9e6
    • Btrfs: stop using btrfs_schedule_bio() · 08635bae
      Committed by Chris Mason
      btrfs_schedule_bio() hands IO off to a helper thread to do the actual
      submit_bio() call.  This has been used to make sure async crc and
      compression helpers don't get stuck on IO submission.  To maintain good
      performance, over time the IO submission threads duplicated some IO
      scheduler characteristics such as high and low priority IOs and they
      also made some ugly assumptions about request allocation batch sizes.
      
      All of this cost at least one extra context switch during IO submission,
      and doesn't fit well with the modern blkmq IO stack.  So, this commit stops
      using btrfs_schedule_bio().  We may need to adjust the number of async
      helper threads for crcs and compression, but long term it's a better
      path.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      08635bae
    • btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      Committed by David Sterba
      The parameter is now always set to NULL and could be dropped. The last
      user was get_default_root but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c66e0d4
    • btrfs: get rid of unique workqueue helper functions · a0cac0ec
      Committed by Omar Sandoval
      Commit 9e0af237 ("Btrfs: fix task hang under heavy compressed
      write") worked around the issue that a recycled work item could get a
      false dependency on the original work item due to how the workqueue code
      guarantees non-reentrancy. It did so by giving different work functions
      to different types of work.
      
      However, the fixes in the previous few patches are more complete, as
      they prevent a work item from being recycled at all (except for a tiny
      window that the kernel workqueue code handles for us). This obsoletes
      the previous fix, so we don't need the unique helpers for correctness.
      The only other reason to keep them would be so they show up in stack
      traces, but they always seem to be optimized to a tail call, so they
      don't show up anyways. So, let's just get rid of the extra indirection.
      
      While we're here, rename normal_work_helper() to the more informative
      btrfs_work_helper().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0cac0ec
  12. 12 Nov 2019, 1 commit
    • Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Committed by Filipe Manana
      During rename exchange we might have successfully logged the new name in
      the source root's log tree, in which case we leave our log context
      (allocated on stack) in the root's list of log contexts. However we
      might fail to log the new name in the destination root, in which case we
      fall back to a transaction commit later and never sync the log of the
      source root, which causes the source root log context to remain in the
      list of log contexts. This later causes invalid memory accesses because
      the context
      was allocated on stack and after rename exchange finishes the stack gets
      reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from
      fstests, such as btrfs/004, because support for rename exchange was
      added last week to fsstress from fstests.
      
      So fix this by deleting the log context for the source root from the
      list if we have logged the new name in the source root, as in the
      sketch below.
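      
      A rough sketch of that idea (variable names are illustrative, not the
      literal patch): before falling back to the transaction commit, take the
      on-stack context off the source root's list so the list never points
      into a stack frame that is about to be reused.
      
        /* Illustrative sketch only; names approximate btrfs_rename_exchange(). */
        if (commit_transaction) {
                /*
                 * We logged the new name in the source root but failed for
                 * the destination root, so the on-stack context is still
                 * linked in the source root's list of log contexts. Unlink
                 * it before this stack frame goes away.
                 */
                if (logged_new_name_in_src_root) {
                        mutex_lock(&src_root->log_mutex);
                        list_del_init(&ctx_src.list);
                        mutex_unlock(&src_root->log_mutex);
                }
                ret = btrfs_commit_transaction(trans);
        }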
      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6c61710
  13. 05 Nov, 2019 1 commit
    • J
      btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Authored by Josef Bacik
      We hit a regression while rolling out 5.2 internally, where we were
      hitting the following panic:
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size, however a cleanup patch changed us to do
      
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because the compiler is free to load i_size twice,
      so i_size_read() can evaluate to different values for the comparison
      and for the assignment.  So either an expanding truncate or a fallocate
      could increase our i_size while we're doing writeout and actual_end
      would end up being past the range we have locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622; that would do an
        intermediate assignment and also avoid the double evaluation, but it
        is not future-proof, should the compiler merge the stores and call
        i_size_read anyway.
      
        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug. Instead, emulate
        READ_ONCE with two barrier()s, which is effectively what happens (a
        sketch of the resulting pattern follows this note). The assembly
        confirms a single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size, stored to %rax.
      ]
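      
      A sketch of the resulting pattern in compress_file_range(), assuming a
      local u64 i_size variable (simplified, not the full patch):
      
        u64 i_size;
      
        /*
         * Read i_size exactly once; the barrier()s emulate READ_ONCE so the
         * compiler cannot reload inode->i_size between the comparison and
         * the assignment, keeping actual_end within the locked range.
         */
        barrier();
        i_size = i_size_read(inode);
        barrier();
        actual_end = min_t(u64, i_size, end + 1);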
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      d98da499
  14. 16 Oct, 2019 1 commit
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Authored by Qu Wenruo
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However, there is an inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is that the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we calculate whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means that if btrfs_delalloc_release_extents() determines it
      should free some space, then that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The early transaction commit for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is
        running low, it will try to commit a transaction and free the wrongly
        converted PERTRANS space, so it's not that easy to hit such a bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
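      
      A rough before/after of the interface change (prototypes simplified
      from the description above; not the literal diff):
      
        /* before: every caller had to pick qgroup_free, and some picked wrongly */
        void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes,
                                            bool qgroup_free);
      
        /* after: the parameter is gone; the PREALLOC qgroup reserve is always freed */
        void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes);
      
        /* a caller such as btrfs_truncate_block() then becomes: */
        btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize);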
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8702ba93
  15. 02 Oct, 2019 1 commit
    • J
      btrfs: allocate new inode in NOFS context · 11a19a90
      Authored by Josef Bacik
      A user reported a lockdep splat
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       ------------------------------------------------------
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
       kmem_cache_alloc+0x1f/0x1c0
       btrfs_alloc_inode+0x1f/0x260
       alloc_inode+0x16/0xa0
       new_inode+0xe/0xb0
       btrfs_new_inode+0x70/0x610
       btrfs_symlink+0xd0/0x420
       vfs_symlink+0x9c/0x100
       do_symlinkat+0x66/0xe0
       do_syscall_64+0x55/0x1c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (sb_internal){.+.+}:
       __sb_start_write+0xf6/0x150
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       __list_lru_walk_one.isra.4+0x5c/0x100
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       kthread+0x147/0x160
       ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
        CPU0                    CPU1
        ----                    ----
        lock(fs_reclaim);
                                lock(sb_internal);
                                lock(fs_reclaim);
        lock(sb_internal);
       *** DEADLOCK ***
      
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       dump_stack+0x85/0xc7
       print_circular_bug.cold.40+0x1d9/0x235
       __lock_acquire+0x18b1/0x1f00
       lock_acquire+0xa6/0x170
       ? start_transaction+0x3a8/0x500
       __sb_start_write+0xf6/0x150
       ? start_transaction+0x3a8/0x500
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       ? var_wake_function+0x20/0x20
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       ? discard_new_inode+0xc0/0xc0
       __list_lru_walk_one.isra.4+0x5c/0x100
       ? discard_new_inode+0xc0/0xc0
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       ? __wake_up_common_lock+0x90/0x90
       kthread+0x147/0x160
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
       ret_from_fork+0x24/0x30
      
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() outside of this but
      for now just wrap it in memalloc_nofs_save().
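      
      A sketch of that wrapping inside btrfs_new_inode() (simplified; the
      surrounding code is omitted):
      
        #include <linux/sched/mm.h>   /* memalloc_nofs_save/restore */
      
        unsigned int nofs_flag;
      
        /*
         * We are inside a transaction (sb_internal held). A GFP_KERNEL
         * allocation here may enter direct reclaim, creating a
         * sb_internal -> fs_reclaim dependency that inverts against kswapd
         * evicting inodes, so make the allocation behave as GFP_NOFS for
         * this scope.
         */
        nofs_flag = memalloc_nofs_save();
        inode = new_inode(fs_info->sb);
        memalloc_nofs_restore(nofs_flag);
        if (!inode)
                return ERR_PTR(-ENOMEM);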
      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11a19a90
  16. 09 Sep, 2019 3 commits