1. 18 Nov 2019, 6 commits
    • btrfs: Rename btrfs_join_transaction_nolock · 8d510121
      Committed by Nikolay Borisov
      This function is used only during the final phase of freespace cache
      writeout, where using the plain btrfs_join_transaction API is
      deadlock-prone. The deadlock looks like:
      
      T1:
      btrfs_commit_transaction
        commit_cowonly_roots
          btrfs_write_dirty_block_groups
            btrfs_wait_cache_io
              __btrfs_wait_cache_io
                btrfs_wait_ordered_range <-- Triggers ordered IO for freespace
                                             inode and blocks transaction commit
                                             until freespace cache writeout
      
      T2: <-- after T1 has triggered the writeout
      finish_ordered_fn
        btrfs_finish_ordered_io
          btrfs_join_transaction <--- blocks waiting for the current
                                      transaction to commit, but that commit
                                      is itself waiting for this writeout to
                                      finish, hence the deadlock
      
      The special-purpose function prevents this by simply skipping the "wait
      for writeout", since it's guaranteed the transaction won't proceed until
      we are done.
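
      A minimal sketch of the special-purpose join, assuming the new name
      (btrfs_join_transaction_spacecache) and the start_transaction()
      internals; the patch itself is essentially the rename:

        /*
         * TRANS_JOIN_NOLOCK joins the running transaction without
         * waiting for a commit in progress -- safe here because that
         * commit is itself blocked on this freespace cache writeout.
         */
        struct btrfs_trans_handle *
        btrfs_join_transaction_spacecache(struct btrfs_root *root)
        {
                return start_transaction(root, 0, TRANS_JOIN_NOLOCK,
                                         BTRFS_RESERVE_NO_FLUSH, true);
        }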
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8d510121
    • Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios · ec39f769
      Committed by Chris Mason
      Async CRCs and compression submit IO through helper threads, which means
      they have IO priority inversions when cgroup IO controllers are in use.
      
      This flags all of the writes submitted by btrfs helper threads as
      REQ_CGROUP_PUNT.  submit_bio() will punt these to dedicated per-blkcg
      work items to avoid the priority inversion.
      
      For the compression code, we take a reference on the wbc's blkg css and
      pass it down to the async workers.
      
      For the async CRCs, the bio already has the correct css, we just need to
      tell the block layer to use REQ_CGROUP_PUNT.
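
      A hedged sketch of the pattern (the wrapper name is ours;
      REQ_CGROUP_PUNT and bio_associate_blkg_from_css() are the block layer
      interfaces referred to above):

        /* Mark a helper-thread bio so the block layer resubmits it from a
         * per-blkcg worker rather than from the helper's own context. */
        static void submit_punted_bio(struct bio *bio,
                                      struct cgroup_subsys_state *blkcg_css)
        {
                if (blkcg_css) {
                        bio->bi_opf |= REQ_CGROUP_PUNT;
                        /* charge the IO to the owning cgroup */
                        bio_associate_blkg_from_css(bio, blkcg_css);
                }
                submit_bio(bio);
        }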
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Modified-and-reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec39f769
    • Btrfs: only associate the locked page with one async_chunk struct · 1d53c9e6
      Committed by Chris Mason
      The btrfs writepages function collects a large range of pages flagged
      for delayed allocation, and then sends them down through the COW code
      for processing.  When compression is on, we allocate one async_chunk
      structure for every 512K, and then run those pages through the
      compression code for IO submission.
      
      writepages starts all of this off with a single page, locked by the
      original call to extent_write_cache_pages(), and it's important to keep
      track of this page because it has already been through
      clear_page_dirty_for_io().
      
      The btrfs async_chunk struct has a pointer to the locked_page, and when
      we're redirtying the page because compression had to fallback to
      uncompressed IO, we use page->index to decide if a given async_chunk
      struct really owns that page.
      
      But, this is racy.  If a given delalloc range is broken up into two
      async_chunks (chunkA and chunkB), we can end up with something like
      this:
      
       compress_file_range(chunkA)
       submit_compress_extents(chunkA)
       submit compressed bios(chunkA)
       put_page(locked_page)
      
      				 compress_file_range(chunkB)
      				 ...
      
      Or:
      
       async_cow_submit
        submit_compressed_extents <--- falls back to buffered writeout
         cow_file_range
          extent_clear_unlock_delalloc
           __process_pages_contig
             put_page(locked_pages)
      
      					    async_cow_submit
      
      The end result is that chunkA is completed and cleaned up before chunkB
      even starts processing.  This means we can free locked_page and reuse
      it elsewhere.  If we get really lucky, it'll have the same page->index
      in its new home as it did before.
      
      While we're processing chunkB, we might decide we need to fall back to
      uncompressed IO, and so compress_file_range() will call
      __set_page_dirty_nobuffers() on chunkB->locked_page.
      
      Without cgroups in use, this creates a phantom dirty page, which
      isn't great but isn't the end of the world. Worse, the page can go
      through the fixup worker and the whole COW machinery again:
      
      in submit_compressed_extents():
        while (async extents) {
        ...
          cow_file_range
          if (!page_started ...)
            extent_write_locked_range
          else if (...)
            unlock_page
          continue;
      
      This hasn't been observed in practice but is still possible.
      
      With cgroups in use, we might crash in the accounting code because
      page->mapping->i_wb isn't set.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP: percpu_counter_add_batch+0x11/0x70
        PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        CPU: 16 PID: 2172 Comm: rm Not tainted
        RIP: 0010:percpu_counter_add_batch+0x11/0x70
        RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
        RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
        RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
        RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
        R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
        R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
        FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         account_page_cleaned+0x15b/0x1f0
         __cancel_dirty_page+0x146/0x200
         truncate_cleanup_page+0x92/0xb0
         truncate_inode_pages_range+0x202/0x7d0
         btrfs_evict_inode+0x92/0x5a0
         evict+0xc1/0x190
         do_unlinkat+0x176/0x280
         do_syscall_64+0x63/0x1a0
         entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      The fix here is to make async_chunk->locked_page NULL everywhere but the
      one async_chunk struct that's allowed to do things to the locked page.
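
      A sketch of that ownership rule as we read it (loop context from the
      async chunk setup, simplified):

        /* Hand locked_page to exactly one chunk; every other chunk sees
         * NULL, which removes the racy page->index ownership check. */
        if (locked_page) {
                async_chunk[i].locked_page = locked_page;
                locked_page = NULL;
        } else {
                async_chunk[i].locked_page = NULL;
        }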
      
      Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
      Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      [ update changelog from mail thread discussion ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      1d53c9e6
    • Btrfs: stop using btrfs_schedule_bio() · 08635bae
      Committed by Chris Mason
      btrfs_schedule_bio() hands IO off to a helper thread to do the actual
      submit_bio() call.  This has been used to make sure async crc and
      compression helpers don't get stuck on IO submission.  To maintain good
      performance, over time the IO submission threads duplicated some IO
      scheduler characteristics such as high and low priority IOs and they
      also made some ugly assumptions about request allocation batch sizes.
      
      All of this costs at least one extra context switch during IO
      submission, and doesn't fit well with the modern blkmq IO stack.  So,
      this commit stops using btrfs_schedule_bio().  We may need to adjust
      the number of async helper threads for crcs and compression, but long
      term it's a better path.
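
      The shape of the change, sketched (the before branch is paraphrased,
      not quoted from the old code):

        /* before: async submissions bounced through a helper thread */
        if (async)
                btrfs_schedule_bio(dev, bio);
        else
                btrfsic_submit_bio(bio);

        /* after: always submit directly; blk-mq does the queueing */
        btrfsic_submit_bio(bio);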
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      08635bae
    • btrfs: drop unused parameter is_new from btrfs_iget · 4c66e0d4
      Committed by David Sterba
      The parameter is now always set to NULL and can be dropped. The last
      user was get_default_root, but that got reworked in 05dbe683 ("Btrfs:
      unify subvol= and subvolid= mounting") and the parameter became unused.
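
      The signature change, sketched from the description:

        /* before: the 'new' (is_new) out-parameter was always NULL */
        struct inode *btrfs_iget(struct super_block *s,
                                 struct btrfs_key *location,
                                 struct btrfs_root *root, int *new);

        /* after */
        struct inode *btrfs_iget(struct super_block *s,
                                 struct btrfs_key *location,
                                 struct btrfs_root *root);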
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c66e0d4
    • btrfs: get rid of unique workqueue helper functions · a0cac0ec
      Committed by Omar Sandoval
      Commit 9e0af237 ("Btrfs: fix task hang under heavy compressed
      write") worked around the issue that a recycled work item could get a
      false dependency on the original work item due to how the workqueue code
      guarantees non-reentrancy. It did so by giving different work functions
      to different types of work.
      
      However, the fixes in the previous few patches are more complete, as
      they prevent a work item from being recycled at all (except for a tiny
      window that the kernel workqueue code handles for us). This obsoletes
      the previous fix, so we don't need the unique helpers for correctness.
      The only other reason to keep them would be so they show up in stack
      traces, but they always seem to be optimized to a tail call, so they
      don't show up anyway. So, let's just get rid of the extra indirection.
      
      While we're here, rename normal_work_helper() to the more informative
      btrfs_work_helper().
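
      Sketched effect on one caller (argument lists approximate):

        /* before: a per-type helper existed only for non-reentrancy */
        btrfs_init_work(&async_chunk[i].work, btrfs_delalloc_helper,
                        async_cow_start, async_cow_submit, async_cow_free);

        /* after: every work item runs through btrfs_work_helper() */
        btrfs_init_work(&async_chunk[i].work, async_cow_start,
                        async_cow_submit, async_cow_free);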
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0cac0ec
  2. 12 Nov 2019, 1 commit
    • Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Committed by Filipe Manana
      During rename exchange we might have successfully logged the new name in
      the source root's log tree, in which case we leave our log context
      (allocated on stack) in the root's list of log contexts. However we
      might fail to log the new name in the destination root, in which case
      we fall back to a transaction commit later and never sync the log of
      the source root, which causes the source root's log context to remain
      in the list of log contexts. This later causes invalid memory accesses
      because the context was allocated on stack, and after rename exchange
      finishes the stack gets reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from
      fstests, such as btrfs/004, because support for rename exchange was
      added last week to fsstress from fstests.
      
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.
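
      A sketch of the fix as we understand it (variable names are ours):

        /*
         * The new name was logged in the source root but we are falling
         * back to a transaction commit, so the on-stack context must not
         * stay linked in the root's list of log contexts.
         */
        if (logged_in_source_root) {
                mutex_lock(&root->log_mutex);
                list_del_init(&ctx.list);
                mutex_unlock(&root->log_mutex);
        }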
      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6c61710
  3. 05 Nov 2019, 1 commit
    • btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Committed by Josef Bacik
      We hit a regression while rolling out 5.2 internally, where we saw the
      following panic:
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size; however, a cleanup patch changed us to do
      
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because i_size_read() can be evaluated twice, once
      for the comparison and once for the result, and return different values
      each time.  So either an expanding truncate or a fallocate could
      increase our i_size while we're doing writeout and actual_end would end
      up being past the range we have locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622, which would do an
        intermediate assignment and thereby also avoid the double
        evaluation, but that is not future-proof should the compiler merge
        the stores and call i_size_read anyway.
      
        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug. Instead, emulate
        READ_ONCE with two barrier()s, which is effectively what it does. The
        assembly confirms single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size, loaded once into %rax.
      ]
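
      The resulting fix, sketched:

        u64 i_size;

        /*
         * Emulate READ_ONCE(): the barriers force a single load of
         * i_size, so actual_end can never move past the locked range.
         */
        barrier();
        i_size = i_size_read(inode);
        barrier();
        actual_end = min_t(u64, i_size, end + 1);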
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      d98da499
  4. 16 Oct 2019, 1 commit
    • btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Committed by Qu Wenruo
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction(), while PREALLOC is for delalloc, where we
      reserve space before joining a transaction; it is finally converted
      to PERTRANS after the writeback is done.
      
      [Inconsistency]
      However, there is an inconsistency in how we handle PREALLOC metadata
      space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(): we
      should always free PREALLOC metadata space.
      
      The reason is that the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means that if btrfs_delalloc_release_extents() determines to free
      some space, then that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is low,
        it will try to commit the transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such a bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
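
      A sketch of the resulting function (internals simplified; see the
      patch for the final form):

        /* @qgroup_free is gone: releasing extents now always frees the
         * qgroup PREALLOC reservation instead of converting it. */
        void btrfs_delalloc_release_extents(struct btrfs_inode *inode,
                                            u64 num_bytes)
        {
                spin_lock(&inode->lock);
                btrfs_mod_outstanding_extents(inode,
                                              -count_max_extents(num_bytes));
                spin_unlock(&inode->lock);
                btrfs_inode_rsv_release(inode, true);
        }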
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8702ba93
  5. 02 Oct 2019, 1 commit
    • btrfs: allocate new inode in NOFS context · 11a19a90
      Committed by Josef Bacik
      A user reported a lockdep splat:
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       ------------------------------------------------------
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
       kmem_cache_alloc+0x1f/0x1c0
       btrfs_alloc_inode+0x1f/0x260
       alloc_inode+0x16/0xa0
       new_inode+0xe/0xb0
       btrfs_new_inode+0x70/0x610
       btrfs_symlink+0xd0/0x420
       vfs_symlink+0x9c/0x100
       do_symlinkat+0x66/0xe0
       do_syscall_64+0x55/0x1c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (sb_internal){.+.+}:
       __sb_start_write+0xf6/0x150
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       __list_lru_walk_one.isra.4+0x5c/0x100
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       kthread+0x147/0x160
       ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
       CPU0                     CPU1
       ----                     ----
       lock(fs_reclaim);
                                lock(sb_internal);
                                lock(fs_reclaim);
       lock(sb_internal);
      *** DEADLOCK ***
      
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       dump_stack+0x85/0xc7
       print_circular_bug.cold.40+0x1d9/0x235
       __lock_acquire+0x18b1/0x1f00
       lock_acquire+0xa6/0x170
       ? start_transaction+0x3a8/0x500
       __sb_start_write+0xf6/0x150
       ? start_transaction+0x3a8/0x500
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       ? var_wake_function+0x20/0x20
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       ? discard_new_inode+0xc0/0xc0
       __list_lru_walk_one.isra.4+0x5c/0x100
       ? discard_new_inode+0xc0/0xc0
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       ? __wake_up_common_lock+0x90/0x90
       kthread+0x147/0x160
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
       ret_from_fork+0x24/0x30
      
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() call outside of
      the transaction, but for now just wrap it in memalloc_nofs_save().
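
      The wrapper, sketched (this is the standard memalloc_nofs_save()/
      memalloc_nofs_restore() pattern the text describes):

        struct inode *inode;
        unsigned int nofs_flag;

        /* new_inode() allocates with GFP_KERNEL; forbid fs reclaim here
         * while we hold the transaction, breaking the cycle above */
        nofs_flag = memalloc_nofs_save();
        inode = new_inode(fs_info->sb);
        memalloc_nofs_restore(nofs_flag);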
      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11a19a90
  6. 09 Sep 2019, 18 commits
  7. 17 Jul 2019, 1 commit
    • btrfs: inode: Don't compress if NODATASUM or NODATACOW set · 42c16da6
      Committed by Qu Wenruo
      As btrfs(5) specified:
      
      	Note
      	If nodatacow or nodatasum are enabled, compression is disabled.
      
      If NODATASUM or NODATACOW is set, we should not compress the extent.
      
      Normally NODATACOW is detected properly in run_delalloc_range() so
      compression won't happen for NODATACOW.
      
      However for NODATASUM we don't have any check, and it can create
      compressed extents without csums pretty easily, just by:
        mkfs.btrfs -f $dev
        mount $dev $mnt -o nodatasum
        touch $mnt/foobar
        mount -o remount,datasum,compress $mnt
        xfs_io -f -c "pwrite 0 128K" $mnt/foobar
      
      And in fact, we have a bug report about a corrupted compressed extent
      without a proper data checksum, where even RAID1 can't recover the
      corruption.
      (https://bugzilla.kernel.org/show_bug.cgi?id=199707)
      
      Running compression without proper checksums could cause more damage
      when corruption happens, as compressed data could make the whole extent
      unreadable, so there is no need to allow compression for NODATASUM.
      
      The fix refactors the inode compression check into two parts:
      
      - inode_can_compress()
        As the hard requirement, checked at btrfs_run_delalloc_range(), so no
        compression will happen for NODATASUM inode at all.
      
      - inode_need_compress()
        As the soft requirement, checked at btrfs_run_delalloc_range() and
        compress_file_range().
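
      A sketch of the hard check (flag names from the btrfs inode flags;
      see the patch for the final form):

        /* Hard requirement: never compress NODATACOW/NODATASUM inodes. */
        static inline bool inode_can_compress(struct inode *inode)
        {
                if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW ||
                    BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
                        return false;
                return true;
        }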
      Reported-by: James Harvey <jamespharvey20@gmail.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42c16da6
  8. 04 Jul 2019, 1 commit
  9. 02 Jul 2019, 2 commits
    • btrfs: run delayed iput at unlink time · 63611e73
      Committed by Josef Bacik
      We have been seeing issues in production where a cleaner script will end
      up unlinking a bunch of files that have pending iputs.  This means they
      will get their final iputs run at btrfs-cleaner time and thus are not
      throttled, which impacts the workload.
      
      Since we are unlinking these files we can just drop the delayed iput at
      unlink time.  We are already holding a reference to the inode so this
      will not be the final iput and thus is completely safe to do at this
      point.  Doing this means we are more likely to be doing the final iput
      at unlink time, and thus will get the IO charged to the caller and get
      throttled appropriately without affecting the main workload.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      63611e73
    • btrfs: Use btrfs_get_io_geometry appropriately · 89b798ad
      Committed by Nikolay Borisov
      Presently btrfs_map_block is used not only to do everything necessary to
      map a bio to the underlying allocation profile but it's also used to
      identify how much data could be written based on btrfs' stripe logic
      without actually submitting anything. This is achieved by passing NULL
      for 'bbio_ret' parameter.
      
      This patch refactors all callers that require just the mapping length
      by switching them to using btrfs_get_io_geometry instead of calling
      btrfs_map_block with a special NULL value for 'bbio_ret'. No functional
      change.
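
      A sketch of a converted length-only caller (field and enum names as
      we read the series):

        struct btrfs_io_geometry geom;
        int ret;

        /* length-only query: no bbio allocation, no NULL special case */
        ret = btrfs_get_io_geometry(fs_info, BTRFS_MAP_WRITE, logical,
                                    len, &geom);
        if (ret < 0)
                return ret;
        map_length = min(len, geom.len);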
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      89b798ad
  10. 01 Jul 2019, 4 commits
  11. 29 May 2019, 1 commit
    • Btrfs: fix wrong ctime and mtime of a directory after log replay · 5338e43a
      Committed by Filipe Manana
      When replaying a log that contains a new file or directory name that
      needs to be added to its parent directory, we end up updating the mtime
      and the ctime of the parent directory to the current time after we have
      set their values to the correct ones (set at fsync time), effectively
      losing them.
      
      Sample reproducer:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/file
      
        # fsync of the directory is optional, not needed
        $ xfs_io -c fsync /mnt/dir
        $ xfs_io -c fsync /mnt/dir/file
      
        $ stat -c %Y /mnt/dir
        1557856079
      
        <power failure>
      
        $ sleep 3
        $ mount /dev/sdb /mnt
        $ stat -c %Y /mnt/dir
        1557856082
      
          --> should have been 1557856079, the mtime is updated to the current
              time when replaying the log
      
      Fix this by not updating the mtime and ctime to the current time at
      btrfs_add_link() when we are replaying a log tree.
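
      A sketch of the guard in btrfs_add_link() (simplified; the flag name
      is from the fs_info flags):

        /*
         * During log replay the mtime/ctime were already set to the
         * values captured at fsync time, so don't clobber them with the
         * current time.
         */
        if (!test_bit(BTRFS_FS_LOG_RECOVERING, &root->fs_info->flags)) {
                struct timespec64 now =
                        current_time(&parent_inode->vfs_inode);

                parent_inode->vfs_inode.i_mtime = now;
                parent_inode->vfs_inode.i_ctime = now;
        }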
      
      This could be triggered by my recent fsync fuzz tester for fstests, for
      which an fstests patch exists titled "fstests: generic, fsync fuzz tester
      with fsstress".
      
      Fixes: e02119d5 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5338e43a
  12. 02 May 2019, 3 commits