1. 18 Nov, 2019 5 commits
  2. 12 Nov, 2019 1 commit
    • F
      Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Filipe Manana committed
      During rename exchange we might have successfully logged the new name in
      the source root's log tree, in which case we leave our log context
      (allocated on stack) in the root's list of log contexts. However we might
      fail to log the new name in the destination root, in which case we fall
      back to a transaction commit later and never sync the log of the source
      root, which causes the source root log context to remain in the list of
      log contexts. This later causes invalid memory accesses because the
      context was allocated on stack and after rename exchange finishes the
      stack gets reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from fstests
      like btrfs/004 for example, because support for rename exchange was added
      last week to fsstress from fstests.
      
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.

      Reported-by: Su Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: Su Yue <Damenly_Su@gmx.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
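
The hazard described in e6c61710 can be sketched in user space. This is a minimal model, not the kernel code: `list_node`, `demo_log_ctx` and `demo_rename_exchange` are hypothetical names, and the list helpers only loosely imitate the kernel's `list_head` API. The point it illustrates is that a context living on the caller's stack has to be unlinked from a long-lived list on every exit path, including the one that falls back to a transaction commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list (hypothetical, user-space only). */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head;
	head->next = head;
}

static bool list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	list_init(node);
}

/* Hypothetical stand-in for a log context allocated on the stack. */
struct demo_log_ctx {
	struct list_node list;
};

/*
 * Returns 1 when the caller must fall back to a transaction commit,
 * 0 when the log was synced normally. In both cases the on-stack
 * context is unlinked before the function returns.
 */
static int demo_rename_exchange(struct list_node *root_ctx_list,
				bool dest_log_fails)
{
	struct demo_log_ctx ctx;

	list_init(&ctx.list);
	/* Logging the new name in the source root links the context. */
	list_add_tail_node(&ctx.list, root_ctx_list);

	if (dest_log_fails) {
		/* Nobody will sync the source log, so unlink it here. */
		list_del_node(&ctx.list);
		return 1;
	}

	/* Normal path: syncing the log consumes and unlinks the context. */
	list_del_node(&ctx.list);
	assert(list_is_empty(&ctx.list));
	return 0;
}
```

Calling `demo_rename_exchange()` with `dest_log_fails` set and then checking `list_is_empty()` on the root's list shows that no pointer into the dead stack frame survives the call.
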
  3. 05 Nov, 2019 2 commits
    • D
      btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC · a5009d3a
      David Sterba committed
      The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
      deprecated and scheduled for removal but we actually do use them for
      'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351
      should have been just the async flag for subvolume creation.
      
      The deprecation has been added in this development cycle, remove it
      until it's time.
      
      Fixes: ebc87351 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Josef Bacik committed
      We hit a regression while rolling out 5.2 internally where we were
      hitting the following panic
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size, however a cleanup patch changed us to do
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because i_size_read() can evaluate to different
      values in between checking and assigning.  So either an expanding
      truncate or a fallocate could increase our i_size while we're doing
      writeout and actual_end would end up being past the range we have
      locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622 that would do an
        intermediate assignment and this would also avoid the double
        evaluation but is not future-proof, should the compiler merge the
        stores and call i_size_read anyway.

        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug. Instead, emulate
        READ_ONCE by two barrier()s, which is what effectively happens. The
        assembly confirms single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size stored to %rax.
      ]
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: David Sterba <dsterba@suse.com>
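
The double evaluation behind d98da499 can be made visible with a deliberately exaggerated user-space model. Everything here is an assumption for illustration: `NAIVE_MIN` evaluates its arguments twice at the source level (in the kernel it was the compiler that emitted two loads, not `min_t` itself), and `fake_i_size_read()` is a hypothetical stand-in whose value grows by one page after every read, imitating a concurrent expanding truncate or fallocate.

```c
#include <assert.h>
#include <stdint.h>

/* Naive macro: the selected argument is evaluated a second time. */
#define NAIVE_MIN(a, b) ((a) < (b) ? (a) : (b))

static uint64_t fake_size;
static unsigned int fake_reads;

/* Hypothetical i_size_read(): the size grows by 4096 after each read. */
static uint64_t fake_i_size_read(void)
{
	uint64_t value = fake_size;

	fake_size += 4096;
	fake_reads++;
	return value;
}

static void fake_reset(uint64_t size)
{
	fake_size = size;
	fake_reads = 0;
}

/* Racy form: the size is read once to compare and once more to assign. */
static uint64_t actual_end_racy(uint64_t end)
{
	return NAIVE_MIN(fake_i_size_read(), end + 1);
}

/* Fixed form: snapshot the size once, then compare and assign the copy. */
static uint64_t actual_end_snapshot(uint64_t end)
{
	uint64_t i_size = fake_i_size_read();

	return NAIVE_MIN(i_size, end + 1);
}
```

With the size starting at 20000 and `end + 1` equal to 20480, the racy form compares against 20000 but returns the second read, which lies past the locked range, while the snapshot form stays within it.
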
  4. 26 Oct, 2019 3 commits
    • F
      Btrfs: fix race leading to metadata space leak after task received signal · 0cab7acc
      Filipe Manana committed
      When a task that is allocating metadata needs to wait for the async
      reclaim job to process its ticket and gets a signal (because it was killed
      for example) before doing the wait, the task ends up erroring out but
      with space reserved for its ticket, which never gets released, resulting
      in a metadata space leak (more specifically a leak in the bytes_may_use
      counter of the metadata space_info object).
      
      Here's the sequence of steps leading to the space leak:
      
      1) A task tries to create a file for example, so it ends up trying to
         start a transaction at btrfs_create();
      
      2) The filesystem is currently in a state where there is not enough
         metadata free space to satisfy the transaction's needs. So at
         space-info.c:__reserve_metadata_bytes() we create a ticket and
         add it to the list of tickets of the space info object. Also,
         because the metadata async reclaim job is not running, we queue
         a job to run metadata reclaim;
      
      3) In the meanwhile the task receives a signal (like SIGTERM from
         a kill command for example);
      
      4) After queueing the async reclaim job, at __reserve_metadata_bytes(),
         we unlock the metadata space info and call handle_reserve_ticket();
      
      5) That last function calls wait_reserve_ticket(), which acquires the
         lock from the metadata space info. Then in the first iteration of
         its while loop, it calls prepare_to_wait_event(), which returns
         -ERESTARTSYS because the task has a pending signal. As a result,
         we set the error field of the ticket to -EINTR and exit the while
         loop without deleting the ticket from the list of tickets (in the
         space info object). After exiting the loop we unlock the space info;
      
      6) The async reclaim job is able to release enough metadata, acquires
         the metadata space info's lock and then reserves space for the ticket,
         since the ticket is still in the list of (non-priority) tickets. The
         space reservation happens at btrfs_try_granting_tickets(), called from
         maybe_fail_all_tickets(). This increments the bytes_may_use counter
         from the metadata space info object, sets the ticket's bytes field to
         zero (meaning success, that space was reserved) and removes it from
         the list of tickets;
      
      7) wait_reserve_ticket() returns, with the error field of the ticket
         set to -EINTR. Then handle_reserve_ticket() just propagates that error
         to the caller. Because an error was returned, the caller does not
         release the reserved space, since the expectation is that any error
         means no space was reserved.
      
      Fix this by removing the ticket from the list, while holding the space
      info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
      an error.
      
      Also add some comments and an assertion to guarantee we never end up with
      a ticket that has an error set and a bytes counter field set to zero, to
      more easily detect regressions in the future.
      
      This issue could be triggered sporadically by some test cases from fstests
      such as generic/269 for example, which tries to fill a filesystem and then
      kills fsstress processes running in the background.
      
      When this issue happens, we get a warning in syslog/dmesg when unmounting
      the filesystem, like the following:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
        RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
        RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
        R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
        FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x1ad/0x390 [btrfs]
         generic_shutdown_super+0x6c/0x110
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0xa0 [btrfs]
         deactivate_locked_super+0x3a/0x70
         cleanup_mnt+0xb4/0x160
         task_work_run+0x7e/0xc0
         exit_to_usermode_loop+0xfa/0x100
         do_syscall_64+0x1cb/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f4274d2cb37
        (...)
        RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
        RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
        R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
        R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace bcf4b235461b26f6 ]---
        BTRFS info (device sdb): space_info 4 has 19116032 free, is full
        BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
        BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
      
      Fixes: 374bf9c5 ("btrfs: unify error handling for ticket flushing")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
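
The sequence of steps in 0cab7acc can be replayed in a single-threaded sketch. This is a simplified model under stated assumptions, not the btrfs implementation: `demo_ticket`, `demo_space_info` and the `demo_*` functions are hypothetical, the ticket list is reduced to one pending slot, and the 196608 byte figure is only chosen to echo the `may_use` value in the log above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_ticket {
	uint64_t bytes;	/* bytes still needed; 0 means space was granted */
	int error;
};

struct demo_space_info {
	uint64_t bytes_may_use;
	struct demo_ticket *pending;	/* one-slot stand-in for the list */
};

/* Step 2: not enough space, so a ticket is queued. */
static void demo_queue_ticket(struct demo_space_info *si,
			      struct demo_ticket *ticket, uint64_t bytes)
{
	ticket->bytes = bytes;
	ticket->error = 0;
	si->pending = ticket;
}

/* Step 5: the waiter has a pending signal and gives up. */
static void demo_wait_interrupted(struct demo_space_info *si,
				  struct demo_ticket *ticket,
				  bool remove_from_list)
{
	ticket->error = -EINTR;
	if (remove_from_list && si->pending == ticket)
		si->pending = NULL;	/* the fix: unlink under the lock */
}

/* Step 6: async reclaim grants space to whatever is still queued. */
static void demo_grant_pending(struct demo_space_info *si)
{
	struct demo_ticket *ticket = si->pending;

	if (ticket == NULL)
		return;
	si->bytes_may_use += ticket->bytes;
	ticket->bytes = 0;
	si->pending = NULL;
}

/* Returns the bytes left in bytes_may_use after the task errored out. */
static uint64_t demo_run(bool remove_from_list)
{
	struct demo_space_info si = { 0 };
	struct demo_ticket ticket;

	demo_queue_ticket(&si, &ticket, 196608);
	demo_wait_interrupted(&si, &ticket, remove_from_list);
	demo_grant_pending(&si);

	/* Step 7: the caller sees an error and releases nothing. */
	assert(ticket.error == -EINTR);
	return si.bytes_may_use;
}
```

`demo_run(false)` models the old behaviour and leaves the reservation behind, while `demo_run(true)` models the fix and leaves the counter at zero.
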
    • Q
      btrfs: tree-checker: Fix wrong check on max devid · 8bb177d1
      Qu Wenruo committed
      [BUG]
      The following script will cause a false alert on the devid check.

        #!/bin/bash
      
        dev1=/dev/test/test
        dev2=/dev/test/scratch1
        mnt=/mnt/btrfs
      
        umount $dev1 &> /dev/null
        umount $dev2 &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev1
      
        mount $dev1 $mnt
      
        _fail()
        {
                echo "!!! FAILED !!!"
                exit 1
        }
      
        for ((i = 0; i < 4096; i++)); do
                btrfs dev add -f $dev2 $mnt || _fail
                btrfs dev del $dev1 $mnt || _fail
                dev_tmp=$dev1
                dev1=$dev2
                dev2=$dev_tmp
        done
      
      [CAUSE]
      Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
      upper limit for devid.  But we can have devid holes, as the script
      above produces.

      So the check for devid is incorrect and could cause a false alert.
      
      [FIX]
      Just remove the whole devid check.  We don't have any hard requirement
      for devid assignment.
      
      Furthermore, even if devid gets corrupted by a bitflip, we still have
      dev extents verification at mount time, so corrupted data won't sneak
      in.
      
      This fixes fstests btrfs/194.

      Reported-by: Anand Jain <anand.jain@oracle.com>
      Fixes: ab4ba2e1 ("btrfs: tree-checker: Verify dev item")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
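
Why the add/delete loop in 8bb177d1 trips a devid upper bound can be shown with a few lines of arithmetic. The sketch assumes that the first device gets devid 1 and that every added device gets the highest existing devid plus one; both are assumptions of this model, and `demo_swap_devices()` and the limit passed to `demo_devid_check_rejects()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_result {
	uint64_t highest_devid;
	unsigned int live_devices;
};

/* Each cycle adds one device and deletes the previous one. */
static struct demo_result demo_swap_devices(unsigned int cycles)
{
	struct demo_result result;
	uint64_t current_devid = 1;	/* assumed devid of the first device */
	unsigned int i;

	for (i = 0; i < cycles; i++) {
		/* Assumed allocation rule: highest devid so far plus one. */
		uint64_t added_devid = current_devid + 1;

		/* The old device is removed, leaving a hole behind. */
		current_devid = added_devid;
	}

	result.highest_devid = current_devid;
	result.live_devices = 1;
	return result;
}

/* A devid upper-bound check of the kind the commit removes. */
static bool demo_devid_check_rejects(uint64_t devid, uint64_t max_devs)
{
	return devid > max_devs;
}
```

After 4096 cycles the filesystem still has a single device, yet its devid has climbed far past the device count, so any limit derived from the maximum number of devices can reject a perfectly valid item.
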
    • Q
      btrfs: Consider system chunk array size for new SYSTEM chunks · c17add7a
      Qu Wenruo committed
      For SYSTEM chunks, despite the regular chunk item size limit, there is
      another limit due to system chunk array size.
      
      The extra limit was removed in a refactoring, so add it back.
      
      Fixes: e3ecdb3f ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 18 Oct, 2019 2 commits
    • F
      Btrfs: check for the full sync flag while holding the inode lock during fsync · ba0b084a
      Filipe Manana committed
      We were checking for the full fsync flag in the inode before locking the
      inode, which is racy, since at that time it might not be set but
      after we acquire the inode lock some other task might have set it. One case where
      this can happen is on a system low on memory and some concurrent task
      failed to allocate an extent map and therefore set the full sync flag on
      the inode, to force the next fsync to work in full mode.
      
      A consequence of missing the full fsync flag set is hitting the problems
      fixed by commit 0c713cba ("Btrfs: fix race between ranged fsync and
      writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
      tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
      of weird inconsistencies after replaying a log due to file extents items
      representing ranges that overlap.
      
      So just move the check such that it's done after locking the inode and
      before starting writeback again.
      
      Fixes: 0c713cba ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
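
The check-then-lock race from ba0b084a can be modelled without threads by placing the competing update in the window between the early check and the lock. This is only a sketch: `demo_inode` and `demo_fsync_uses_full_mode()` are hypothetical, and the `racer_sets_flag` parameter stands in for a concurrent task that sets the full sync flag (for example after failing to allocate an extent map).

```c
#include <assert.h>
#include <stdbool.h>

struct demo_inode {
	bool locked;
	bool needs_full_sync;
};

/* Returns whether this fsync ends up running in full mode. */
static bool demo_fsync_uses_full_mode(struct demo_inode *inode,
				      bool check_under_lock,
				      bool racer_sets_flag)
{
	bool full_sync = false;

	if (!check_under_lock)
		full_sync = inode->needs_full_sync;	/* racy early check */

	/* Window before the lock is taken: another task may set the flag. */
	if (racer_sets_flag)
		inode->needs_full_sync = true;

	inode->locked = true;
	if (check_under_lock)
		full_sync = inode->needs_full_sync;	/* stable while locked */
	inode->locked = false;

	return full_sync;
}
```

With the early check the function misses a flag that is set just before the lock is taken; with the check moved under the lock it observes the flag and selects full mode.
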
    • F
      Btrfs: fix qgroup double free after failure to reserve metadata for delalloc · c7967fc1
      Filipe Manana committed
      If we fail to reserve metadata for delalloc operations we end up releasing
      the previously reserved qgroup amount twice, once explicitly under the
      'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
      again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
      value of 'true' for its 'qgroup_free' argument, which results in
      btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
      a double free.
      
      Also if we fail to reserve the necessary qgroup amount, we jump to the
      label 'out_fail', which calls btrfs_inode_rsv_release() and that in turn
      calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
      reserve any qgroup amount. So we freed some amount we never reserved.
      
      So fix this by removing the call to btrfs_inode_rsv_release() in the
      failure path, since it's not necessary at all as we haven't changed the
      inode's block reserve in any way at this point.
      
      Fixes: c8eaeac7 ("btrfs: reserve delalloc metadata differently")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
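
The two failure paths described in c7967fc1 follow a common goto-label cleanup shape, sketched below. The label names mirror the commit message, but `demo_qgroup`, `demo_delalloc_reserve()` and its boolean switches are hypothetical, and a signed counter is used only so that over-freeing shows up as a negative value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_qgroup {
	int64_t prealloc_reserved;
};

/*
 * release_in_out_fail == true models the old error path, where the
 * 'out_fail' label released the qgroup reservation again.
 */
static int demo_delalloc_reserve(struct demo_qgroup *qgroup, int64_t bytes,
				 bool qgroup_fails, bool meta_fails,
				 bool release_in_out_fail)
{
	int ret = 0;

	if (qgroup_fails) {
		ret = -1;
		goto out_fail;		/* nothing was reserved yet */
	}
	qgroup->prealloc_reserved += bytes;

	if (meta_fails) {
		ret = -1;
		goto out_qgroup;
	}
	return 0;

out_qgroup:
	qgroup->prealloc_reserved -= bytes;	/* explicit, correct free */
out_fail:
	if (release_in_out_fail)
		qgroup->prealloc_reserved -= bytes;	/* second or bogus free */
	return ret;
}
```

With the extra release enabled, both failure paths leave the counter at minus `bytes`; with it removed, the counter returns to its starting value in every failure case.
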
  6. 17 Oct, 2019 1 commit
  7. 16 Oct, 2019 1 commit
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo committed
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even if it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means that if btrfs_delalloc_release_extents() decides to free some
      space, that space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is low,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().

      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
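
The 64K buffered write walk-through in 8702ba93 can be turned into a small accounting model. It is a sketch under assumptions: `demo_qgroup_rsv`, `demo_release()` and `demo_buffered_write()` are hypothetical, `per_extent` stands for whatever calc_inode_reservations() would return, and "convert" is modelled simply as moving bytes from the PREALLOC counter to the PERTRANS counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_qgroup_rsv {
	uint64_t prealloc;	/* PREALLOC meta reservation */
	uint64_t pertrans;	/* PERTRANS meta reservation */
};

/* Release unused metadata: either free it now or convert it. */
static void demo_release(struct demo_qgroup_rsv *rsv, uint64_t bytes,
			 bool qgroup_free)
{
	rsv->prealloc -= bytes;
	if (!qgroup_free)
		rsv->pertrans += bytes;	/* stays charged until commit */
}

/* One reserve/release round per page, as in the walk-through above. */
static struct demo_qgroup_rsv demo_buffered_write(unsigned int pages,
						  uint64_t per_extent,
						  bool qgroup_free)
{
	struct demo_qgroup_rsv rsv = { 0, 0 };
	unsigned int i;

	for (i = 0; i < pages; i++) {
		/* Reserve first, even if it turns out not to be needed. */
		rsv.prealloc += per_extent;

		/* Only the first page creates a new outstanding extent. */
		if (i > 0)
			demo_release(&rsv, per_extent, qgroup_free);
	}
	return rsv;
}
```

For 16 pages, always freeing leaves exactly one `per_extent` in PREALLOC and nothing in PERTRANS, whereas converting leaves fifteen times `per_extent` wrongly charged to PERTRANS until the next transaction commit.
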
  8. 15 Oct, 2019 1 commit
  9. 12 Oct, 2019 2 commits
  10. 08 Oct, 2019 1 commit
    • A
      btrfs: silence maybe-uninitialized warning in clone_range · 431d3988
      Austin Kim committed
      GCC emits the following warning:
      
      ‘clone_src_i_size’ may be used uninitialized in this function
      [-Wmaybe-uninitialized]
       #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                             ^
      fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
       u64 clone_src_i_size;
         ^
      The clone_src_i_size is only used as call-by-reference
      in a call to get_inode_info().
      
      Silence the warning by initializing clone_src_i_size to 0.
      
      Note that the warning is a false positive and reported by older versions
      of GCC (e.g. 7.x) but not e.g. 9.x. As numerous people have hit it, the
      patch is applied. Setting clone_src_i_size to 0 does not otherwise make
      sense and would not do any action in case the code changes in the future.

      Signed-off-by: Austin Kim <austindh.kim@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 03 Oct, 2019 1 commit
  12. 02 Oct, 2019 4 commits
    • J
      btrfs: allocate new inode in NOFS context · 11a19a90
      Josef Bacik committed
      A user reported a lockdep splat
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       ------------------------------------------------------
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
       kmem_cache_alloc+0x1f/0x1c0
       btrfs_alloc_inode+0x1f/0x260
       alloc_inode+0x16/0xa0
       new_inode+0xe/0xb0
       btrfs_new_inode+0x70/0x610
       btrfs_symlink+0xd0/0x420
       vfs_symlink+0x9c/0x100
       do_symlinkat+0x66/0xe0
       do_syscall_64+0x55/0x1c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (sb_internal){.+.+}:
       __sb_start_write+0xf6/0x150
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       __list_lru_walk_one.isra.4+0x5c/0x100
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       kthread+0x147/0x160
       ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(fs_reclaim);
                                      lock(sb_internal);
                                      lock(fs_reclaim);
         lock(sb_internal);
      *** DEADLOCK ***
      
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       dump_stack+0x85/0xc7
       print_circular_bug.cold.40+0x1d9/0x235
       __lock_acquire+0x18b1/0x1f00
       lock_acquire+0xa6/0x170
       ? start_transaction+0x3a8/0x500
       __sb_start_write+0xf6/0x150
       ? start_transaction+0x3a8/0x500
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       ? var_wake_function+0x20/0x20
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       ? discard_new_inode+0xc0/0xc0
       __list_lru_walk_one.isra.4+0x5c/0x100
       ? discard_new_inode+0xc0/0xc0
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       ? __wake_up_common_lock+0x90/0x90
       kthread+0x147/0x160
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
       ret_from_fork+0x24/0x30
      
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() outside of this but
      for now just wrap it in memalloc_nofs_save().

      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Z
      btrfs: fix balance convert to single on 32-bit host CPUs · 7a547890
      Zygo Blaxell committed
      Currently, the command:
      
      	btrfs balance start -dconvert=single,soft .
      
      on a Raspberry Pi produces the following kernel message:
      
      	BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
      
      This fails because we use is_power_of_2(unsigned long) to validate
      the new data profile, the constant for 'single' profile uses bit 48,
      and there are only 32 bits in a long on ARM.
      
      Fix by open-coding the check using u64 variables.
      
      Tested by completing the original balance command on several Raspberry
      Pis.
      
      Fixes: 818255fe ("btrfs: use common helper instead of open coding a bit test")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
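
The truncation behind 7a547890 is easy to reproduce on any host by emulating a 32-bit `unsigned long` with `uint32_t`. The helper names and `DEMO_PROFILE_SINGLE` are hypothetical; the only detail taken from the commit message is that the constant for the 'single' profile uses bit 48.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the 'single' profile constant (bit 48). */
#define DEMO_PROFILE_SINGLE (UINT64_C(1) << 48)

/* Emulates a power-of-two test taking a 32-bit wide unsigned long. */
static bool is_power_of_2_long32(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Open-coded check on a 64-bit value, the approach the fix describes. */
static bool is_power_of_2_u64(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}
```

Passing `DEMO_PROFILE_SINGLE` through the 32-bit parameter drops bit 48 and leaves zero, which is not a power of two, so the profile is rejected; the 64-bit check accepts it.
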
    • J
      btrfs: fix incorrect updating of log root tree · 4203e968
      Josef Bacik committed
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modification of the individual log root items is serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Chris Mason <clm@fb.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4203e968
    • F
      Btrfs: fix memory leak due to concurrent append writes with fiemap · c67d970f
      Filipe Manana committed
      When we have a buffered write that starts at an offset greater than or
      equal to the file's size happening concurrently with a full ranged
      fiemap, we can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap and a buffered write, starting at offset 1Mb and with a length
      of 4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
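
The reference accounting behind this fix can be sketched in user space. This is a simplified model, not the btrfs code: the structure, the helpers and the with_fix switch are all hypothetical, and the locked branch merely stands in for the unlock path releasing the cached state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified extent state: only the reference count matters. */
struct sketch_extent_state {
	int refs;
};

/* Drops one reference; tolerates NULL so callers need no extra check. */
static void sketch_free_extent_state(struct sketch_extent_state *state)
{
	if (state == NULL)
		return;
	assert(state->refs > 0);
	state->refs--;
}

/* Models setting the delalloc bit: caches the state and takes a reference. */
static void sketch_set_delalloc(struct sketch_extent_state *state,
				struct sketch_extent_state **cached)
{
	if (*cached == NULL) {
		state->refs++;
		*cached = state;
	}
}

/*
 * Models the tail of a buffered write. If the write path locked the range,
 * the unlock step releases the cached reference. If it did not, the
 * reference is only released when with_fix is true.
 */
static void sketch_finish_write(bool range_was_locked, bool with_fix,
				struct sketch_extent_state **cached)
{
	if (range_was_locked)
		sketch_free_extent_state(*cached);	/* stands in for unlock */
	else if (with_fix)
		sketch_free_extent_state(*cached);	/* the added release */
	*cached = NULL;
}
```

With with_fix set to false and an unlocked range, the count stays one higher than before the write, which is the leak described above.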
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) the leak is detectable with
      kmemleak.
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c67d970f
  13. 27 Sep, 2019 2 commits
    • Q
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · d4e20494
      Qu Wenruo committed
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes so that the later fallocate fails
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allows multiple reservations to happen on
      a single extent_changeset, e.g.:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a twofold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() from catching the leak, as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
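
The intended error handling can be sketched as follows. This is a user-space model under loud assumptions: the structures and sketch_reserve_data() are hypothetical, ranges never overlap, and the limit check is reduced to a single counter. It mirrors the reproducer above, where a 128M limit is hit by the third 64M reservation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical qgroup: a byte limit and the bytes currently reserved. */
struct sketch_qgroup {
	uint64_t limit;
	uint64_t reserved;
};

/* Hypothetical changeset: bytes reserved so far through this changeset. */
struct sketch_changeset {
	uint64_t bytes_changed;
};

/*
 * Reserves len bytes. On failure, everything reserved earlier through the
 * same changeset is given back as well, instead of only forgetting about it.
 */
static int sketch_reserve_data(struct sketch_qgroup *qg,
			       struct sketch_changeset *cs, uint64_t len)
{
	if (qg->reserved + len > qg->limit) {
		assert(qg->reserved >= cs->bytes_changed);
		qg->reserved -= cs->bytes_changed;	/* free earlier bytes */
		cs->bytes_changed = 0;
		return -EDQUOT;
	}
	qg->reserved += len;
	cs->bytes_changed += len;
	return 0;
}
```

Without the subtraction in the error branch, the qgroup would keep counting bytes that no caller can release any more.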
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4e20494
    • Q
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · bab32fc0
      Qu Wenruo committed
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrfs_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      the EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree
      rather than in the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any reserved data space at all,
      causing a leak.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bab32fc0
  14. 25 Sep, 2019 2 commits
    • Q
      btrfs: Fix a regression which we can't convert to SINGLE profile · fab27359
      Qu Wenruo committed
      [BUG]
      With v5.3 kernel, we can't convert to SINGLE profile:
      
        # btrfs balance start -f -dconvert=single $mnt
        ERROR: error during balancing '/mnt/btrfs': Invalid argument
        # dmesg -t | tail
        validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
        BTRFS error (device dm-3): balance: invalid convert data profile single
      
      [CAUSE]
      With the extra debug output added, it shows that the @allowed bit is
      lacking the special in-memory only SINGLE profile bit.
      
      Thus we fail at that (profile & ~allowed) check.
      
      This regression is caused by commit 081db89b ("btrfs: use raid_attr
      to get allowed profiles for balance conversion") and the fact that we
      don't use any bit to indicate the SINGLE profile on-disk, but use a
      special in-memory only bit to help distinguish different profiles.
      
      [FIX]
      Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
      the same as it was and fix the regression.
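
The failing mask test can be illustrated with the values from the debug output above (profile=0x1000000000000, allowed=0x20). This is a minimal sketch: the macro and function names are hypothetical, and only the (profile & ~allowed) comparison is taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the debug output; the names are made up for the sketch. */
#define SKETCH_SINGLE_BIT	UINT64_C(0x1000000000000)
#define SKETCH_ALLOWED_BEFORE	UINT64_C(0x20)

/* A target profile is accepted only if it has no bit outside the mask. */
static bool sketch_convert_profile_allowed(uint64_t profile, uint64_t allowed)
{
	return (profile & ~allowed) == 0;
}

/* Models the fix: the in-memory SINGLE bit is OR-ed into the mask. */
static uint64_t sketch_allowed_with_single(uint64_t allowed)
{
	return allowed | SKETCH_SINGLE_BIT;
}
```

With the original mask the SINGLE bit falls outside of it and the conversion is rejected; once the bit is added to the mask the same request passes.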
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Fixes: 081db89b ("btrfs: use raid_attr to get allowed profiles for balance conversion")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fab27359
    • Q
      btrfs: relocation: fix use-after-free on dead relocation roots · 1fac4a54
      Qu Wenruo committed
      [BUG]
      One user reported a reproducible KASAN report about use-after-free:
      
        BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
        BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
        Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
      
        CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
        Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
        Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        Call Trace:
         dump_stack+0x7b/0xba
         print_address_description+0x6c/0x22e
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         __kasan_report.cold+0x1b/0x3b
         ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         kasan_report+0x12/0x17
         __asan_report_store8_noabort+0x17/0x20
         btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_join_transaction+0x1d/0x20 [btrfs]
         btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
         ? lock_repin_lock+0x400/0x400
         ? __kmem_cache_shutdown.cold+0x140/0x1ad
         ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
         finish_ordered_fn+0x15/0x20 [btrfs]
         normal_work_helper+0x1bd/0xca0 [btrfs]
         ? process_one_work+0x819/0x1720
         ? kasan_check_read+0x11/0x20
         btrfs_endio_write_helper+0x12/0x20 [btrfs]
         process_one_work+0x8c9/0x1720
         ? pwq_dec_nr_in_flight+0x2f0/0x2f0
         ? worker_thread+0x1d9/0x1030
         worker_thread+0x98/0x1030
         kthread+0x2bb/0x3b0
         ? process_one_work+0x1720/0x1720
         ? kthread_park+0x120/0x120
         ret_from_fork+0x35/0x40
      
        Allocated by task 369692:
         __kasan_kmalloc.part.0+0x44/0xc0
         __kasan_kmalloc.constprop.0+0xba/0xc0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x138/0x260
         btrfs_read_tree_root+0x92/0x360 [btrfs]
         btrfs_read_fs_root+0x10/0xb0 [btrfs]
         create_reloc_root+0x47d/0xa10 [btrfs]
         btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
         record_root_in_trans+0x2a0/0x370 [btrfs]
         btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
         start_transaction+0x1ab/0xe90 [btrfs]
         btrfs_start_transaction+0x1e/0x20 [btrfs]
         __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
         btrfs_prealloc_file_range+0x13/0x20 [btrfs]
         prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
         relocate_file_extent_cluster+0x193/0xc30 [btrfs]
         relocate_data_extent+0x1f8/0x490 [btrfs]
         relocate_block_group+0x600/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 369692:
         __kasan_slab_free+0x14f/0x210
         kasan_slab_free+0xe/0x10
         kfree+0xd8/0x270
         btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
         clean_dirty_subvols+0x227/0x340 [btrfs]
         relocate_block_group+0x972/0x1060 [btrfs]
         btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
         btrfs_relocate_chunk+0x9e/0x180 [btrfs]
         btrfs_balance+0x14e4/0x2fc0 [btrfs]
         btrfs_ioctl_balance+0x47f/0x640 [btrfs]
         btrfs_ioctl+0x119d/0x8380 [btrfs]
         do_vfs_ioctl+0x9f5/0x1060
         ksys_ioctl+0x67/0x90
         __x64_sys_ioctl+0x73/0xb0
         do_syscall_64+0xa5/0x370
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff88856f671100
         which belongs to the cache kmalloc-4k of size 4096
        The buggy address is located 1552 bytes inside of
         4096-byte region [ffff88856f671100, ffff88856f672100)
        The buggy address belongs to the page:
        page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
        flags: 0x2ffff0000010200(slab|head)
        raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
        raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                 ^
         ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        BTRFS info (device sdi1): 1 enospc errors during balance
        BTRFS info (device sdi1): balance: ended with status: -28
      
      [CAUSE]
      The problem happens when finish_ordered_io() gets called with balance
      still running, while the reloc root of that subvolume is already dead.
      (The tree swap is already done, but the tree is not yet deleted for
      possible qgroup usage.)
      
      That means root->reloc_root still exists, but that reloc_root can be
      under btrfs_drop_snapshot(), thus we shouldn't access it.
      
      The following race could cause the use-after-free problem:
      
                      CPU1              |                CPU2
      --------------------------------------------------------------------------
                                        | relocate_block_group()
                                        | |- unset_reloc_control(rc)
                                        | |- btrfs_commit_transaction()
      btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
      |- btrfs_join_transaction()       |    |
         |- record_root_in_trans()      |    |
            |- btrfs_init_reloc_root()  |    |
               |- if (root->reloc_root) |    |
               |                        |    |- root->reloc_root = NULL
               |                        |    |- btrfs_drop_snapshot(reloc_root);
               |- reloc_root->last_trans|
                       = trans->transid |
                  ^^^^^^^^^^^^^^^^^^^^^^
                  Use after free
      
      [FIX]
      Fix it by the following modifications:
      
      - Test if the root has a dead reloc tree before accessing root->reloc_root
        If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
        create or update root->reloc_root

      - Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag only after we have fully
        dropped the reloc tree
        This co-operates with the above modification: as long as
        BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
        the reloc tree at record_root_in_trans().
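
The first modification amounts to an early return guarded by the flag. The following is a single-threaded sketch with hypothetical types and names, with creation of a new reloc root left out. It shows only the ordering of the checks, not the kernel's locking or memory barriers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for BTRFS_ROOT_DEAD_RELOC_TREE. */
#define SKETCH_ROOT_DEAD_RELOC_TREE (1u << 0)

struct sketch_reloc_root {
	uint64_t last_trans;
};

struct sketch_root {
	unsigned int state;
	struct sketch_reloc_root *reloc_root;
};

/*
 * Returns true if the reloc root was updated. A root whose reloc tree is
 * marked dead is left alone, so a pointer that is about to be dropped is
 * never dereferenced.
 */
static bool sketch_init_reloc_root(struct sketch_root *root, uint64_t transid)
{
	if (root->state & SKETCH_ROOT_DEAD_RELOC_TREE)
		return false;
	if (root->reloc_root != NULL) {
		root->reloc_root->last_trans = transid;
		return true;
	}
	return false;	/* creating a new reloc root is not modelled here */
}
```

Because the flag stays set until the reloc tree is fully dropped, the guard covers the whole window in which the pointer may be stale.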
      Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
      Fixes: d2311e69 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
      CC: stable@vger.kernel.org # 5.1+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1fac4a54
  15. 24 Sep, 2019 4 commits
    • F
      Btrfs: fix race setting up and completing qgroup rescan workers · 13fc1d27
      Filipe Manana committed
      There is a race between setting up a qgroup rescan worker and completing
      a qgroup rescan worker that can cause callers of the qgroup rescan wait
      ioctl either to not wait for the rescan worker to complete or to hang
      forever due to missing wake ups. The following diagram shows a sequence
      of steps that illustrates the race.
      
              CPU 1                                                         CPU 2                                  CPU 3
      
       btrfs_ioctl_quota_rescan()
        btrfs_qgroup_rescan()
         qgroup_rescan_init()
          mutex_lock(&fs_info->qgroup_rescan_lock)
          spin_lock(&fs_info->qgroup_lock)
      
          fs_info->qgroup_flags |=
            BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
          init_completion(
            &fs_info->qgroup_rescan_completion)
      
          fs_info->qgroup_rescan_running = true
      
          mutex_unlock(&fs_info->qgroup_rescan_lock)
          spin_unlock(&fs_info->qgroup_lock)
      
          btrfs_init_work()
           --> starts the worker
      
                                                              btrfs_qgroup_rescan_worker()
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_flags &=
                                                                 ~BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               starts transaction, updates qgroup status
                                                               item, etc
      
                                                                                                                 btrfs_ioctl_quota_rescan()
                                                                                                                  btrfs_qgroup_rescan()
                                                                                                                   qgroup_rescan_init()
                                                                                                                    mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_lock(&fs_info->qgroup_lock)
      
                                                                                                                    fs_info->qgroup_flags |=
                                                                                                                      BTRFS_QGROUP_STATUS_FLAG_RESCAN
      
                                                                                                                    init_completion(
                                                                                                                      &fs_info->qgroup_rescan_completion)
      
                                                                                                                    fs_info->qgroup_rescan_running = true
      
                                                                                                                    mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                                    spin_unlock(&fs_info->qgroup_lock)
      
                                                                                                                    btrfs_init_work()
                                                                                                                     --> starts another worker
      
                                                               mutex_lock(&fs_info->qgroup_rescan_lock)
      
                                                               fs_info->qgroup_rescan_running = false
      
                                                               mutex_unlock(&fs_info->qgroup_rescan_lock)
      
                                                               complete_all(&fs_info->qgroup_rescan_completion)
      
      Before the rescan worker started by the task at CPU 3 completes, if
      another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
      because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
      fs_info->qgroup_flags, which is expected and correct behaviour.
      
      However if another task calls btrfs_ioctl_quota_rescan_wait() before the
      rescan worker started by the task at CPU 3 completes, it will return
      immediately without waiting for the new rescan worker to complete,
      because fs_info->qgroup_rescan_running is set to false by CPU 2.
      
      This race makes test case btrfs/171 (from fstests) fail often:
      
        btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
            --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
            +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
            @@ -1,2 +1,3 @@
             QA output created by 171
            +ERROR: quota rescan failed: Operation now in progress
             Silence is golden
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)
      
      That is because the test calls the btrfs-progs commands "qgroup quota
      rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
      calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
      commands 'qgroup assign' and 'qgroup remove' often call the rescan start
      ioctl after calling the qgroup assign ioctl,
      btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
      for a rescan worker to complete.
      
      Another problem the race can cause is missing wake ups for waiters,
      since the call to complete_all() happens outside a critical section and
      after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
      diagram above, if we have a waiter for the first rescan task (executed
      by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty. If,
      after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
      before it calls complete_all() against
      fs_info->qgroup_rescan_completion, the task at CPU 3 calls
      init_completion() against fs_info->qgroup_rescan_completion, the wait
      queue is re-initialized to an empty queue. The rescan worker at CPU 2
      then calls complete_all() against an empty queue and never wakes up the
      task waiting for that rescan worker.
      
      Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
      fs_info->qgroup_rescan_running to false in the same critical section,
      delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
      call to complete_all() in that same critical section. This gives the
      protection needed to avoid rescan wait ioctl callers not waiting for a
      running rescan worker and the lost wake ups problem, since setting that
      rescan flag and boolean as well as initializing the wait queue is done
      already in a critical section delimited by that mutex (at
      qgroup_rescan_init()).
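
The single critical section can be modelled in user space with POSIX threads. This is only a sketch: all names are hypothetical, and a condition variable replaces the kernel completion, so its semantics differ in detail. What it shows is that the flag, the boolean and the wake up all change under one lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#define SKETCH_FLAG_RESCAN 0x1u

struct sketch_rescan {
	pthread_mutex_t lock;		/* stands in for qgroup_rescan_lock */
	pthread_cond_t completion;	/* stands in for the completion */
	unsigned int flags;
	bool running;
};

#define SKETCH_RESCAN_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/* Starts a rescan unless one is already flagged as in progress. */
static int sketch_rescan_init(struct sketch_rescan *r)
{
	int ret = 0;

	pthread_mutex_lock(&r->lock);
	if (r->flags & SKETCH_FLAG_RESCAN) {
		ret = -EINPROGRESS;
	} else {
		r->flags |= SKETCH_FLAG_RESCAN;
		r->running = true;
	}
	pthread_mutex_unlock(&r->lock);
	return ret;
}

/* Worker exit: flag, boolean and wake up change in one critical section. */
static void sketch_rescan_finish(struct sketch_rescan *r)
{
	pthread_mutex_lock(&r->lock);
	r->flags &= ~SKETCH_FLAG_RESCAN;
	r->running = false;
	pthread_cond_broadcast(&r->completion);
	pthread_mutex_unlock(&r->lock);
}

/* Waiter: blocks for as long as a worker is marked as running. */
static void sketch_rescan_wait(struct sketch_rescan *r)
{
	pthread_mutex_lock(&r->lock);
	while (r->running)
		pthread_cond_wait(&r->completion, &r->lock);
	pthread_mutex_unlock(&r->lock);
}
```

Since a new rescan can only be set up while holding the same lock, no caller can observe the flag cleared while the boolean is still true, or re-arm the wait state between the clear and the wake up.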
      
      Fixes: 57254b6e ("Btrfs: add ioctl to wait for qgroup rescan completion")
      Fixes: d2c609b8 ("btrfs: properly track when rescan worker is running")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      13fc1d27
    • F
      Btrfs: fix missing error return if writeback for extent buffer never started · 0607eb1d
      Filipe Manana committed
      If lock_extent_buffer_for_io() fails, it returns a negative value, but its
      caller btree_write_cache_pages() ignores such errors. This means that a
      call to flush_write_bio(), from lock_extent_buffer_for_io(), might have
      failed. We should make btree_write_cache_pages() notice such error values
      and stop immediately, making sure filemap_fdatawrite_range() returns an
      error to the transaction commit path. A failure from flush_write_bio()
      should also result in the endio callback end_bio_extent_buffer_writepage()
      being invoked, which sets the BTRFS_FS_*_ERR bits appropriately, so that
      there's no risk a transaction or log commit doesn't catch a writeback
      failure.
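
The change in control flow can be sketched as a loop that stops at the first negative result. This is a simplified model with hypothetical names: the array of results stands in for successive calls that lock a buffer for I/O, and any non-negative value is treated as "carry on", which glosses over what the real return values mean.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * results[i] is the hypothetical outcome of locking buffer i for I/O:
 * a negative value is an error, anything else lets the loop continue.
 * The number of buffers handled before stopping is stored in *handled.
 */
static int sketch_write_buffers(const int *results, size_t count,
				size_t *handled)
{
	size_t i;

	*handled = 0;
	for (i = 0; i < count; i++) {
		int ret = results[i];

		if (ret < 0)
			return ret;	/* stop at once and report the error */
		(*handled)++;
	}
	return 0;
}
```

Returning the negative value instead of continuing is what lets the caller further up the chain see the failure.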
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0607eb1d
    • D
      btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer · eb5b64f1
      Dennis Zhou committed
      Before, if an eb failed to write out, we would end up triggering a
      BUG_ON(). As of f4340622 ("btrfs: extent_io: Move the BUG_ON() in
      flush_write_bio() one level up"), we no longer BUG_ON(), so we should
      make life consistent and add back the unwritten bytes to
      dirty_metadata_bytes.
      
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      CC: stable@vger.kernel.org # 5.2+
      Reviewed-by: Filipe Manana <fdmanana@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix selftests failure due to uninitialized i_mode in test inodes · 9f7fec0b
      Filipe Manana committed
      Some of the self tests create a test inode, setup some extents and then do
      calls to btrfs_get_extent() to test that the corresponding extent maps
      exist and are correct. However btrfs_get_extent(), since the 5.2 merge
      window, now errors out when it finds a regular or prealloc extent for an
      inode that does not correspond to a regular file (its ->i_mode is not
      S_IFREG). This causes the self tests to fail sometimes, especially when
      KASAN, slub_debug and page poisoning are enabled:
      
        $ modprobe btrfs
        modprobe: ERROR: could not insert 'btrfs': Invalid argument
      
        $ dmesg
        [ 9414.691648] Btrfs loaded, crc32c=crc32c-intel, debug=on, assert=on, integrity-checker=on, ref-verify=on
        [ 9414.692655] BTRFS: selftest: sectorsize: 4096  nodesize: 4096
        [ 9414.692658] BTRFS: selftest: running btrfs free space cache tests
        [ 9414.692918] BTRFS: selftest: running extent only tests
        [ 9414.693061] BTRFS: selftest: running bitmap only tests
        [ 9414.693366] BTRFS: selftest: running bitmap and extent tests
        [ 9414.696455] BTRFS: selftest: running space stealing from bitmap to extent tests
        [ 9414.697131] BTRFS: selftest: running extent buffer operation tests
        [ 9414.697133] BTRFS: selftest: running btrfs_split_item tests
        [ 9414.697564] BTRFS: selftest: running extent I/O tests
        [ 9414.697583] BTRFS: selftest: running find delalloc tests
        [ 9415.081125] BTRFS: selftest: running find_first_clear_extent_bit test
        [ 9415.081278] BTRFS: selftest: running extent buffer bitmap tests
        [ 9415.124192] BTRFS: selftest: running inode tests
        [ 9415.124195] BTRFS: selftest: running btrfs_get_extent tests
        [ 9415.127909] BTRFS: selftest: running hole first btrfs_get_extent test
        [ 9415.128343] BTRFS critical (device (efault)): regular/prealloc extent found for non-regular inode 256
        [ 9415.131428] BTRFS: selftest: fs/btrfs/tests/inode-tests.c:904 expected a real extent, got 0
      
      This happens because the test inodes are created without ever initializing
      the i_mode field of the inode, and neither VFS's new_inode() nor the btrfs
      callback btrfs_alloc_inode() initialize the i_mode. Initialization of the
      i_mode is done through the various callbacks used by the VFS to create
      new inodes (regular files, directories, symlinks, tmpfiles, etc), which
      all call btrfs_new_inode() which in turn calls inode_init_owner(), which
      sets the inode's i_mode. Since the tests only use new_inode() to create
      the test inodes, the i_mode was never initialized.
      
      This always happens on a VM I used with kasan, slub_debug and many other
      debug facilities enabled. It also happened to someone who reported this
      on bugzilla (on a 5.3-rc).
      
      Fix this by setting i_mode to S_IFREG at btrfs_new_test_inode().
      
      Fixes: 6bf9e4bd ("btrfs: inode: Verify inode mode to avoid NULL pointer dereference")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204397
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  16. 12 Sep, 2019 2 commits
    • F
      Btrfs: fix unwritten extent buffers and hangs on future writeback attempts · 18dfa711
      Filipe Manana committed
      lock_extent_buffer_for_io() returns 1 to the caller to tell it everything
      went fine and the caller needs to start writeback for the extent buffer
      (submit a bio, etc), 0 to tell the caller everything went fine but it does
      not need to start writeback for the extent buffer, and a negative value if
      some error happened.
      
      When it's about to return 1 it tries to lock all pages, and if a try lock
      on a page fails, and we didn't flush any existing bio in our "epd", it
      calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
      an error. The page might have been locked elsewhere, not with the goal
      of starting writeback of the extent buffer, and even by some code other
      than btrfs, like page migration for example, so it does not mean the
      writeback of the extent buffer was already started by some other task,
      so returning a 0 tells the caller (btree_write_cache_pages()) to not
      start writeback for the extent buffer. Note that epd might currently have
      either no bio, so flush_write_bio() returns 0 (success) or it might have
      a bio for another extent buffer with a lower index (logical address).
      
      Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
      extent buffer and writeback is never started for the extent buffer,
      future attempts to writeback the extent buffer will hang forever waiting
      on that bit to be cleared, since it can only be cleared after writeback
      completes. Such a hang is reported with a trace like the following:
      
        [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
        [49887.347059]       Not tainted 5.2.13-gentoo #2
        [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [49887.347062] btrfs-transacti D    0  1752      2 0x80004000
        [49887.347064] Call Trace:
        [49887.347069]  ? __schedule+0x265/0x830
        [49887.347071]  ? bit_wait+0x50/0x50
        [49887.347072]  ? bit_wait+0x50/0x50
        [49887.347074]  schedule+0x24/0x90
        [49887.347075]  io_schedule+0x3c/0x60
        [49887.347077]  bit_wait_io+0x8/0x50
        [49887.347079]  __wait_on_bit+0x6c/0x80
        [49887.347081]  ? __lock_release.isra.29+0x155/0x2d0
        [49887.347083]  out_of_line_wait_on_bit+0x7b/0x80
        [49887.347084]  ? var_wake_function+0x20/0x20
        [49887.347087]  lock_extent_buffer_for_io+0x28c/0x390
        [49887.347089]  btree_write_cache_pages+0x18e/0x340
        [49887.347091]  do_writepages+0x29/0xb0
        [49887.347093]  ? kmem_cache_free+0x132/0x160
        [49887.347095]  ? convert_extent_bit+0x544/0x680
        [49887.347097]  filemap_fdatawrite_range+0x70/0x90
        [49887.347099]  btrfs_write_marked_extents+0x53/0x120
        [49887.347100]  btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
        [49887.347102]  btrfs_commit_transaction+0x6bb/0x990
        [49887.347103]  ? start_transaction+0x33e/0x500
        [49887.347105]  transaction_kthread+0x139/0x15c
      
      So fix this by not overwriting the return value (ret) with the result
      from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
      bit in case flush_write_bio() returns an error, and undo all work done
      before (set back EXTENT_BUFFER_DIRTY, etc); otherwise any future attempt
      to write back the extent buffer will hang.
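
      A simplified user-space sketch of the fixed logic for the path that is
      about to return 1. The struct, the function name and the parameters are
      hypothetical; flush_ret models the result of flush_write_bio().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified extent buffer state. */
struct eb_state {
	bool dirty;
	bool writeback;
};

/*
 * page_trylock_fails models a page that is locked by someone else.
 * Returns 1 if the caller must start writeback, or a negative error.
 */
static int lock_for_io_sketch(struct eb_state *eb, bool page_trylock_fails,
			      int flush_ret)
{
	int ret = 1;

	/* The buffer moves from dirty to writeback before pages are locked. */
	eb->dirty = false;
	eb->writeback = true;

	if (page_trylock_fails) {
		int err = flush_ret;	/* keep the flush result out of ret */

		if (err < 0) {
			/* Undo, otherwise later writers wait on the bit forever. */
			eb->writeback = false;
			eb->dirty = true;
			return err;
		}
		/* A successful flush must not turn ret into 0. */
	}
	return ret;
}
```

      The key design point is the separate err variable: a successful flush
      leaves ret at 1, so the caller still starts writeback for the buffer.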
      
      This is a regression introduced in the 5.2 kernel.
      
      Fixes: 2e3c2513 ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      Reported-by: Zdenek Sojka <zsojka@seznam.cz>
      Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
      Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
      Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
      Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: fix assertion failure during fsync and use of stale transaction · 410f954c
      Filipe Manana committed
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fall back to a
         transaction commit, the btrfs fsync handler, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction is unpredictable: at the very
         least there is a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value. We can also end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meantime was allocated to
         another task for a new transaction, and so on.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode; that step is offloaded to a safe context
      (usually the cleaner kthread).
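
      The idea of deferring the final reference drop can be sketched in user
      space as follows. All names (sk_inode, sk_iput, sk_add_delayed_iput,
      sk_run_delayed_iputs) are hypothetical and the list handling is
      deliberately single-threaded; this is not the btrfs implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified inode with a reference count. */
struct sk_inode {
	int refs;
	int evicted;			/* set when the last reference is dropped */
	struct sk_inode *next_delayed;
};

/* Hypothetical list of inodes whose reference drop has been deferred. */
static struct sk_inode *delayed_list;

/* Direct drop: eviction may run in the caller's context. */
static void sk_iput(struct sk_inode *inode)
{
	if (--inode->refs == 0)
		inode->evicted = 1;
}

/* Deferred drop: only queue the inode; the reference is still held. */
static void sk_add_delayed_iput(struct sk_inode *inode)
{
	inode->next_delayed = delayed_list;
	delayed_list = inode;
}

/* Run later from a safe context (the cleaner kthread in the text above). */
static void sk_run_delayed_iputs(void)
{
	while (delayed_list) {
		struct sk_inode *inode = delayed_list;

		delayed_list = inode->next_delayed;
		inode->next_delayed = NULL;
		sk_iput(inode);
	}
}
```

      After sk_add_delayed_iput() the inode still holds its reference, so no
      eviction (and no nested transaction join) can happen in the logging
      path; eviction only runs once sk_run_delayed_iputs() is called.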
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which leads to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  17. 09 Sep, 2019 6 commits
    • N
      btrfs: Relinquish CPUs in btrfs_compare_trees · 6af112b1
      Nikolay Borisov committed
      When doing any form of incremental send the parent and the child trees
      need to be compared via btrfs_compare_trees. This can result in long
      loop chains without ever relinquishing the CPU. This causes the softlockup
      detector to trigger when comparing trees with a lot of items. Example
      report:
      
      watchdog: BUG: soft lockup - CPU#0 stuck for 24s! [snapperd:16153]
      CPU: 0 PID: 16153 Comm: snapperd Not tainted 5.2.9-1-default #1 openSUSE Tumbleweed (unreleased)
      Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      pstate: 40000005 (nZcv daif -PAN -UAO)
      pc : __ll_sc_arch_atomic_sub_return+0x14/0x20
      lr : btrfs_release_extent_buffer_pages+0xe0/0x1e8 [btrfs]
      sp : ffff00001273b7e0
      Call trace:
       __ll_sc_arch_atomic_sub_return+0x14/0x20
       release_extent_buffer+0xdc/0x120 [btrfs]
       free_extent_buffer.part.0+0xb0/0x118 [btrfs]
       free_extent_buffer+0x24/0x30 [btrfs]
       btrfs_release_path+0x4c/0xa0 [btrfs]
       btrfs_free_path.part.0+0x20/0x40 [btrfs]
       btrfs_free_path+0x24/0x30 [btrfs]
       get_inode_info+0xa8/0xf8 [btrfs]
       finish_inode_if_needed+0xe0/0x6d8 [btrfs]
       changed_cb+0x9c/0x410 [btrfs]
       btrfs_compare_trees+0x284/0x648 [btrfs]
       send_subvol+0x33c/0x520 [btrfs]
       btrfs_ioctl_send+0x8a0/0xaf0 [btrfs]
       btrfs_ioctl+0x199c/0x2288 [btrfs]
       do_vfs_ioctl+0x4b0/0x820
       ksys_ioctl+0x84/0xb8
       __arm64_sys_ioctl+0x28/0x38
       el0_svc_common.constprop.0+0x7c/0x188
       el0_svc_handler+0x34/0x90
       el0_svc+0x8/0xc
      
      Fix this by adding a call to cond_resched at the beginning of the main
      loop in btrfs_compare_trees.
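
      As a rough user-space analogue of that pattern, the sketch below yields
      at the top of every loop iteration. sched_yield() is only a stand-in for
      the kernel's cond_resched(): it yields unconditionally, whereas
      cond_resched() reschedules only when needed. The function name is
      hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Visits n items and returns how many were processed. */
static size_t compare_items_sketch(size_t n)
{
	size_t processed = 0;

	while (processed < n) {
		sched_yield();	/* give other runnable tasks a chance */
		processed++;	/* stand-in for comparing one pair of items */
	}
	return processed;
}
```

      Placing the yield at the start of the loop body means every path through
      the loop, including early continue paths, passes a scheduling point.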
      
      Fixes: 7069830a ("Btrfs: add btrfs_compare_trees function")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic · 65e99c43
      Nikolay Borisov committed
      Those functions are simple boolean predicates, so there is no need to
      assign their return values to interim variables. Use them directly as
      predicates. No functional changes.

      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: create structure to encode checksum type and length · af024ed2
      Johannes Thumshirn committed
      Create a structure to encode the type and length for the known on-disk
      checksums.  This makes it easier to add new checksums later.
      
      The structure and helpers are moved from ctree.h so they don't occupy
      space in all headers including ctree.h. This saves some space in the
      final object.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: add enospc debug messages for ticket failure · 84fe47a4
      Josef Bacik committed
      When debugging weird enospc problems it's handy to be able to dump the
      space info when we wake up all tickets, and see what the ticket values
      are.  This helped me figure out cases where we were enospc'ing when we
      shouldn't have been.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: do not account global reserve in can_overcommit · 0096420a
      Josef Bacik committed
      We ran into a problem in production where a box with plenty of space was
      getting wedged doing ENOSPC flushing.  These boxes only had 20% of the
      disk allocated, but their metadata space + global reserve was right at
      the size of their metadata chunk.
      
      In this case can_overcommit should be allowing allocations without
      problem, but there's logic in can_overcommit that doesn't allow us to
      overcommit if there's not enough real space to satisfy the global
      reserve.
      
      This is for historical reasons.  Before there were only certain places
      we could allocate chunks.  We could go to commit the transaction and not
      have enough space for our pending delayed refs and such and be unable to
      allocate a new chunk.  This would result in an abort because of ENOSPC.
      This code was added to solve this problem.
      
      However since then we've gained the ability to always be able to
      allocate a chunk.  So we can easily overcommit in these cases without
      risking a transaction abort because of ENOSPC.
      
      Also prior to now the global reserve really would be used because that's
      the space we relied on for delayed refs.  With delayed refs being
      tracked separately we no longer have to worry about running out of
      delayed refs space while committing.  We are much less likely to
      exhaust our global reserve space during transaction commit.
      
      Fix the can_overcommit code to simply see if our current usage + what we
      want is less than our current free space plus whatever slack space we
      have on the disk.  This solves the problem we were seeing in
      production and keeps us from flushing as aggressively as we approach our
      actual metadata size usage.
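
      The simplified check can be sketched as a single comparison. The
      function and parameter names are hypothetical and chosen to follow the
      wording above; note that the global reserve does not appear in the
      comparison at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * used:  our current usage, in bytes
 * want:  bytes requested by the new reservation
 * space: what the text above calls our current free space
 * slack: unallocated disk space we are willing to count on
 */
static bool can_overcommit_sketch(uint64_t used, uint64_t want,
				  uint64_t space, uint64_t slack)
{
	return used + want < space + slack;
}
```

      For example, with used = 98, want = 5 and space = 100, the request is
      refused when slack is 0 but allowed when slack is 10.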

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: use btrfs_try_granting_tickets in update_global_rsv · 426551f6
      Josef Bacik committed
      We have some annoying xfstests tests that will create a very small fs,
      fill it up, delete it, and repeat to make sure everything works right.
      This trips btrfs up sometimes because we may commit a transaction to
      free space, but most of the free metadata space was being reserved by
      the global reserve.  So we commit and update the global reserve, but the
      space is simply added to bytes_may_use directly, instead of trying to
      add it to existing tickets.  This results in ENOSPC when we really did
      have space.  Fix this by calling btrfs_try_granting_tickets once we add
      back our excess space to wake any pending tickets.
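
      A much simplified sketch of handing returned space to waiting tickets
      instead of leaving the waiters asleep. The ticket struct and the function
      name are hypothetical, and the real helper decides differently whether a
      ticket can be satisfied; here a ticket is granted only if the freed
      space covers it completely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending reservation request. */
struct ticket {
	uint64_t bytes;		/* bytes still needed; 0 once granted */
};

/*
 * Hand returned space to waiting tickets in FIFO order and stop at the
 * first one that cannot be fully satisfied. Returns the space left over.
 */
static uint64_t try_granting_tickets_sketch(struct ticket *tickets, size_t n,
					    uint64_t freed)
{
	for (size_t i = 0; i < n; i++) {
		if (tickets[i].bytes == 0)
			continue;	/* already granted */
		if (tickets[i].bytes > freed)
			break;		/* keep ordering: do not skip ahead */
		freed -= tickets[i].bytes;
		tickets[i].bytes = 0;	/* this waiter can now be woken */
	}
	return freed;
}
```

      With tickets of 10, 20 and 50 bytes and 35 bytes freed, the first two
      are granted and 5 bytes remain for later requests.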

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>