1. 18 11月, 2019 18 次提交
  2. 12 11月, 2019 1 次提交
    • F
      Btrfs: fix log context list corruption after rename exchange operation · e6c61710
      Filipe Manana 提交于
      During rename exchange we might have successfully log the new name in the
      source root's log tree, in which case we leave our log context (allocated
      on stack) in the root's list of log contextes. However we might fail to
      log the new name in the destination root, in which case we fallback to
      a transaction commit later and never sync the log of the source root,
      which causes the source root log context to remain in the list of log
      contextes. This later causes invalid memory accesses because the context
      was allocated on stack and after rename exchange finishes the stack gets
      reused and overwritten for other purposes.
      
      The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
      detect this and report something like the following:
      
        [  691.489929] ------------[ cut here ]------------
        [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
        [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
        (...)
        [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
        [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
        (...)
        [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
        [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
        [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
        [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
        [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
        [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
        [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
        [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
        [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  691.490029] Call Trace:
        [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
        [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
        [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
        [  691.490111]  ? dget_parent+0xb8/0x460
        [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
        [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
        [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
        [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
        [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
        [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
        [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
        [  691.490204]  ? __do_sys_newstat+0x88/0xd0
        [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
        [  691.490218]  ? do_fsync+0x38/0x60
        [  691.490220]  do_fsync+0x38/0x60
        [  691.490224]  __x64_sys_fdatasync+0x32/0x40
        [  691.490228]  do_syscall_64+0x9f/0x540
        [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [  691.490235] RIP: 0033:0x7f4b253ad5f0
        (...)
        [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
        [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
        [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
        [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
        [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0
      
      This started happening recently when running some test cases from fstests
      like btrfs/004 for example, because support for rename exchange was added
      last week to fsstress from fstests.
      
      So fix this by deleting the log context for the source root from the list
      if we have logged the new name in the source root.
      Reported-by: NSu Yue <Damenly_Su@gmx.com>
      Fixes: d4682ba0 ("Btrfs: sync log after logging new name")
      CC: stable@vger.kernel.org # 4.19+
      Tested-by: NSu Yue <Damenly_Su@gmx.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e6c61710
  3. 05 11月, 2019 2 次提交
    • D
      btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC · a5009d3a
      David Sterba 提交于
      The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
      deprecated and scheduled for removal but we actualy do use them for
      'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351
      should have been just the async flag for subvolume creation.
      
      The deprecation has been added in this development cycle, remove it
      until it's time.
      
      Fixes: ebc87351 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a5009d3a
    • J
      btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range · d98da499
      Josef Bacik 提交于
      We hit a regression while rolling out 5.2 internally where we were
      hitting the following panic
      
        kernel BUG at mm/page-writeback.c:2659!
        RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
        Call Trace:
         __process_pages_contig+0x25a/0x350
         ? extent_clear_unlock_delalloc+0x43/0x70
         submit_compressed_extents+0x359/0x4d0
         normal_work_helper+0x15a/0x330
         process_one_work+0x1f5/0x3f0
         worker_thread+0x2d/0x3d0
         ? rescuer_thread+0x340/0x340
         kthread+0x111/0x130
         ? kthread_create_on_node+0x60/0x60
         ret_from_fork+0x1f/0x30
      
      This is happening because the page is not locked when doing
      clear_page_dirty_for_io.  Looking at the core dump it was because our
      async_extent had a ram_size of 24576 but our async_chunk range only
      spanned 20480, so we had a whole extra page in our ram_size for our
      async_extent.
      
      This happened because we try not to compress pages outside of our
      i_size, however a cleanup patch changed us to do
      
      actual_end = min_t(u64, i_size_read(inode), end + 1);
      
      which is problematic because i_size_read() can evaluate to different
      values in between checking and assigning.  So either an expanding
      truncate or a fallocate could increase our i_size while we're doing
      writeout and actual_end would end up being past the range we have
      locked.
      
      I confirmed this was what was happening by installing a debug kernel
      that had
      
        actual_end = min_t(u64, i_size_read(inode), end + 1);
        if (actual_end > end + 1) {
      	  printk(KERN_ERR "KABOOM\n");
      	  actual_end = end + 1;
        }
      
      and installing it onto 500 boxes of the tier that had been seeing the
      problem regularly.  Last night I got my debug message and no panic,
      confirming what I expected.
      
      [ dsterba: the assembly confirms a tiny race window:
      
          mov    0x20(%rsp),%rax
          cmp    %rax,0x48(%r15)           # read
          movl   $0x0,0x18(%rsp)
          mov    %rax,%r12
          mov    %r14,%rax
          cmovbe 0x48(%r15),%r12           # eval
      
        Where r15 is inode and 0x48 is offset of i_size.
      
        The original fix was to revert 62b37622 that would do an
        intermediate assignment and this would also avoid the doulble
        evaluation but is not future-proof, should the compiler merge the
        stores and call i_size_read anyway.
      
        There's a patch adding READ_ONCE to i_size_read but that's not being
        applied at the moment and we need to fix the bug. Instead, emulate
        READ_ONCE by two barrier()s that's what effectively happens. The
        assembly confirms single evaluation:
      
          mov    0x48(%rbp),%rax          # read once
          mov    0x20(%rsp),%rcx
          mov    $0x20,%edx
          cmp    %rax,%rcx
          cmovbe %rcx,%rax
          mov    %rax,(%rsp)
          mov    %rax,%rcx
          mov    %r14,%rax
      
        Where 0x48(%rbp) is inode->i_size stored to %eax.
      ]
      
      Fixes: 62b37622 ("btrfs: Remove isize local variable in compress_file_range")
      CC: stable@vger.kernel.org # v5.1+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ changelog updated ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d98da499
  4. 26 10月, 2019 3 次提交
    • F
      Btrfs: fix race leading to metadata space leak after task received signal · 0cab7acc
      Filipe Manana 提交于
      When a task that is allocating metadata needs to wait for the async
      reclaim job to process its ticket and gets a signal (because it was killed
      for example) before doing the wait, the task ends up erroring out but
      with space reserved for its ticket, which never gets released, resulting
      in a metadata space leak (more specifically a leak in the bytes_may_use
      counter of the metadata space_info object).
      
      Here's the sequence of steps leading to the space leak:
      
      1) A task tries to create a file for example, so it ends up trying to
         start a transaction at btrfs_create();
      
      2) The filesystem is currently in a state where there is not enough
         metadata free space to satisfy the transaction's needs. So at
         space-info.c:__reserve_metadata_bytes() we create a ticket and
         add it to the list of tickets of the space info object. Also,
         because the metadata async reclaim job is not running, we queue
         a job ro run metadata reclaim;
      
      3) In the meanwhile the task receives a signal (like SIGTERM from
         a kill command for example);
      
      4) After queing the async reclaim job, at __reserve_metadata_bytes(),
         we unlock the metadata space info and call handle_reserve_ticket();
      
      5) That last function calls wait_reserve_ticket(), which acquires the
         lock from the metadata space info. Then in the first iteration of
         its while loop, it calls prepare_to_wait_event(), which returns
         -ERESTARTSYS because the task has a pending signal. As a result,
         we set the error field of the ticket to -EINTR and exit the while
         loop without deleting the ticket from the list of tickets (in the
         space info object). After exiting the loop we unlock the space info;
      
      6) The async reclaim job is able to release enough metadata, acquires
         the metadata space info's lock and then reserves space for the ticket,
         since the ticket is still in the list of (non-priority) tickets. The
         space reservation happens at btrfs_try_granting_tickets(), called from
         maybe_fail_all_tickets(). This increments the bytes_may_use counter
         from the metadata space info object, sets the ticket's bytes field to
         zero (meaning success, that space was reserved) and removes it from
         the list of tickets;
      
      7) wait_reserve_ticket() returns, with the error field of the ticket
         set to -EINTR. Then handle_reserve_ticket() just propagates that error
         to the caller. Because an error was returned, the caller does not
         release the reserved space, since the expectation is that any error
         means no space was reserved.
      
      Fix this by removing the ticket from the list, while holding the space
      info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
      an error.
      
      Also add some comments and an assertion to guarantee we never end up with
      a ticket that has an error set and a bytes counter field set to zero, to
      more easily detect regressions in the future.
      
      This issue could be triggered sporadically by some test cases from fstests
      such as generic/269 for example, which tries to fill a filesystem and then
      kills fsstress processes running in the background.
      
      When this issue happens, we get a warning in syslog/dmesg when unmounting
      the filesystem, like the following:
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
        (...)
        RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
        RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
        RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
        R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
        R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
        FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x1ad/0x390 [btrfs]
         generic_shutdown_super+0x6c/0x110
         kill_anon_super+0xe/0x30
         btrfs_kill_super+0x12/0xa0 [btrfs]
         deactivate_locked_super+0x3a/0x70
         cleanup_mnt+0xb4/0x160
         task_work_run+0x7e/0xc0
         exit_to_usermode_loop+0xfa/0x100
         do_syscall_64+0x1cb/0x220
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f4274d2cb37
        (...)
        RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
        RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
        RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
        R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
        R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace bcf4b235461b26f6 ]---
        BTRFS info (device sdb): space_info 4 has 19116032 free, is full
        BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
        BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0
      
      Fixes: 374bf9c5 ("btrfs: unify error handling for ticket flushing")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0cab7acc
    • Q
      btrfs: tree-checker: Fix wrong check on max devid · 8bb177d1
      Qu Wenruo 提交于
      [BUG]
      The following script will cause false alert on devid check.
        #!/bin/bash
      
        dev1=/dev/test/test
        dev2=/dev/test/scratch1
        mnt=/mnt/btrfs
      
        umount $dev1 &> /dev/null
        umount $dev2 &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev1
      
        mount $dev1 $mnt
      
        _fail()
        {
                echo "!!! FAILED !!!"
                exit 1
        }
      
        for ((i = 0; i < 4096; i++)); do
                btrfs dev add -f $dev2 $mnt || _fail
                btrfs dev del $dev1 $mnt || _fail
                dev_tmp=$dev1
                dev1=$dev2
                dev2=$dev_tmp
        done
      
      [CAUSE]
      Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
      upper limit for devid.  But we can have devid holes just like above
      script.
      
      So the check for devid is incorrect and could cause false alert.
      
      [FIX]
      Just remove the whole devid check.  We don't have any hard requirement
      for devid assignment.
      
      Furthermore, even devid could get corrupted by a bitflip, we still have
      dev extents verification at mount time, so corrupted data won't sneak
      in.
      
      This fixes fstests btrfs/194.
      Reported-by: NAnand Jain <anand.jain@oracle.com>
      Fixes: ab4ba2e1 ("btrfs: tree-checker: Verify dev item")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8bb177d1
    • Q
      btrfs: Consider system chunk array size for new SYSTEM chunks · c17add7a
      Qu Wenruo 提交于
      For SYSTEM chunks, despite the regular chunk item size limit, there is
      another limit due to system chunk array size.
      
      The extra limit was removed in a refactoring, so add it back.
      
      Fixes: e3ecdb3f ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c17add7a
  5. 18 10月, 2019 2 次提交
    • F
      Btrfs: check for the full sync flag while holding the inode lock during fsync · ba0b084a
      Filipe Manana 提交于
      We were checking for the full fsync flag in the inode before locking the
      inode, which is racy, since at that that time it might not be set but
      after we acquire the inode lock some other task set it. One case where
      this can happen is on a system low on memory and some concurrent task
      failed to allocate an extent map and therefore set the full sync flag on
      the inode, to force the next fsync to work in full mode.
      
      A consequence of missing the full fsync flag set is hitting the problems
      fixed by commit 0c713cba ("Btrfs: fix race between ranged fsync and
      writeback of adjacent ranges"), BUG_ON() when dropping extents from a log
      tree, hitting assertion failures at tree-log.c:copy_items() or all sorts
      of weird inconsistencies after replaying a log due to file extents items
      representing ranges that overlap.
      
      So just move the check such that it's done after locking the inode and
      before starting writeback again.
      
      Fixes: 0c713cba ("Btrfs: fix race between ranged fsync and writeback of adjacent ranges")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ba0b084a
    • F
      Btrfs: fix qgroup double free after failure to reserve metadata for delalloc · c7967fc1
      Filipe Manana 提交于
      If we fail to reserve metadata for delalloc operations we end up releasing
      the previously reserved qgroup amount twice, once explicitly under the
      'out_qgroup' label by calling btrfs_qgroup_free_meta_prealloc() and once
      again, under label 'out_fail', by calling btrfs_inode_rsv_release() with a
      value of 'true' for its 'qgroup_free' argument, which results in
      btrfs_qgroup_free_meta_prealloc() being called again, so we end up having
      a double free.
      
      Also if we fail to reserve the necessary qgroup amount, we jump to the
      label 'out_fail', which calls btrfs_inode_rsv_release() and that in turns
      calls btrfs_qgroup_free_meta_prealloc(), even though we weren't able to
      reserve any qgroup amount. So we freed some amount we never reserved.
      
      So fix this by removing the call to btrfs_inode_rsv_release() in the
      failure path, since it's not necessary at all as we haven't changed the
      inode's block reserve in any way at this point.
      
      Fixes: c8eaeac7 ("btrfs: reserve delalloc metadata differently")
      CC: stable@vger.kernel.org # 5.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7967fc1
  6. 17 10月, 2019 1 次提交
  7. 16 10月, 2019 1 次提交
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 8702ba93
      Qu Wenruo 提交于
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef459 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8702ba93
  8. 15 10月, 2019 1 次提交
  9. 12 10月, 2019 2 次提交
  10. 08 10月, 2019 1 次提交
    • A
      btrfs: silence maybe-uninitialized warning in clone_range · 431d3988
      Austin Kim 提交于
      GCC throws warning message as below:
      
      ‘clone_src_i_size’ may be used uninitialized in this function
      [-Wmaybe-uninitialized]
       #define IS_ALIGNED(x, a)  (((x) & ((typeof(x))(a) - 1)) == 0)
                             ^
      fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here
       u64 clone_src_i_size;
         ^
      The clone_src_i_size is only used as call-by-reference
      in a call to get_inode_info().
      
      Silence the warning by initializing clone_src_i_size to 0.
      
      Note that the warning is a false positive and reported by older versions
      of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the
      patch is applied. Setting clone_src_i_size to 0 does not otherwise make
      sense and would not do any action in case the code changes in the future.
      Signed-off-by: NAustin Kim <austindh.kim@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      431d3988
  11. 03 10月, 2019 1 次提交
  12. 02 10月, 2019 4 次提交
    • J
      btrfs: allocate new inode in NOFS context · 11a19a90
      Josef Bacik 提交于
      A user reported a lockdep splat
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.2.11-gentoo #2 Not tainted
       ------------------------------------------------------
       kswapd0/711 is trying to acquire lock:
       000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500
      
      but task is already holding lock:
       000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (fs_reclaim){+.+.}:
       kmem_cache_alloc+0x1f/0x1c0
       btrfs_alloc_inode+0x1f/0x260
       alloc_inode+0x16/0xa0
       new_inode+0xe/0xb0
       btrfs_new_inode+0x70/0x610
       btrfs_symlink+0xd0/0x420
       vfs_symlink+0x9c/0x100
       do_symlinkat+0x66/0xe0
       do_syscall_64+0x55/0x1c0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #0 (sb_internal){.+.+}:
       __sb_start_write+0xf6/0x150
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       __list_lru_walk_one.isra.4+0x5c/0x100
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       kthread+0x147/0x160
       ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
       CPU0 CPU1
       ---- ----
       lock(fs_reclaim);
       lock(sb_internal);
       lock(fs_reclaim);
       lock(sb_internal);
      *** DEADLOCK ***
      
       3 locks held by kswapd0/711:
       #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
       #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
       #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0
      
      stack backtrace:
       CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
       Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
       Call Trace:
       dump_stack+0x85/0xc7
       print_circular_bug.cold.40+0x1d9/0x235
       __lock_acquire+0x18b1/0x1f00
       lock_acquire+0xa6/0x170
       ? start_transaction+0x3a8/0x500
       __sb_start_write+0xf6/0x150
       ? start_transaction+0x3a8/0x500
       start_transaction+0x3a8/0x500
       btrfs_commit_inode_delayed_inode+0x59/0x110
       btrfs_evict_inode+0x19e/0x4c0
       ? var_wake_function+0x20/0x20
       evict+0xbc/0x1f0
       inode_lru_isolate+0x113/0x190
       ? discard_new_inode+0xc0/0xc0
       __list_lru_walk_one.isra.4+0x5c/0x100
       ? discard_new_inode+0xc0/0xc0
       list_lru_walk_one+0x32/0x50
       prune_icache_sb+0x36/0x80
       super_cache_scan+0x14a/0x1d0
       do_shrink_slab+0x131/0x320
       shrink_node+0xf7/0x380
       balance_pgdat+0x2d5/0x640
       kswapd+0x2ba/0x5e0
       ? __wake_up_common_lock+0x90/0x90
       kthread+0x147/0x160
       ? balance_pgdat+0x640/0x640
       ? __kthread_create_on_node+0x160/0x160
       ret_from_fork+0x24/0x30
      
      This is because btrfs_new_inode() calls new_inode() under the
      transaction.  We could probably move the new_inode() outside of this but
      for now just wrap it in memalloc_nofs_save().
      Reported-by: NZdenek Sojka <zsojka@seznam.cz>
      Fixes: 712e36c5 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11a19a90
    • Z
      btrfs: fix balance convert to single on 32-bit host CPUs · 7a547890
      Zygo Blaxell 提交于
      Currently, the command:
      
      	btrfs balance start -dconvert=single,soft .
      
      on a Raspberry Pi produces the following kernel message:
      
      	BTRFS error (device mmcblk0p2): balance: invalid convert data profile single
      
      This fails because we use is_power_of_2(unsigned long) to validate
      the new data profile, the constant for 'single' profile uses bit 48,
      and there are only 32 bits in a long on ARM.
      
      Fix by open-coding the check using u64 variables.
      
      Tested by completing the original balance command on several Raspberry
      Pis.
      
      Fixes: 818255fe ("btrfs: use common helper instead of open coding a bit test")
      CC: stable@vger.kernel.org # 4.20+
      Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a547890
    • J
      btrfs: fix incorrect updating of log root tree · 4203e968
      Josef Bacik 提交于
      We've historically had reports of being unable to mount file systems
      because the tree log root couldn't be read.  Usually this is the "parent
      transid failure", but could be any of the related errors, including
      "fsid mismatch" or "bad tree block", depending on which block got
      allocated.
      
      The modification of the individual log root items are serialized on the
      per-log root root_mutex.  This means that any modification to the
      per-subvol log root_item is completely protected.
      
      However we update the root item in the log root tree outside of the log
      root tree log_mutex.  We do this in order to allow multiple subvolumes
      to be updated in each log transaction.
      
      This is problematic however because when we are writing the log root
      tree out we update the super block with the _current_ log root node
      information.  Since these two operations happen independently of each
      other, you can end up updating the log root tree in between writing out
      the dirty blocks and setting the super block to point at the current
      root.
      
      This means we'll point at the new root node that hasn't been written
      out, instead of the one we should be pointing at.  Thus whatever garbage
      or old block we end up pointing at complains when we mount the file
      system later and try to replay the log.
      
      Fix this by copying the log's root item into a local root item copy.
      Then once we're safely under the log_root_tree->log_mutex we update the
      root item in the log_root_tree.  This way we do not modify the
      log_root_tree while we're committing it, fixing the problem.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NChris Mason <clm@fb.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4203e968
    • F
      Btrfs: fix memory leak due to concurrent append writes with fiemap · c67d970f
      Filipe Manana 提交于
      When we have a buffered write that starts at an offset greater than or
      equals to the file's size happening concurrently with a full ranged
      fiemap, we can end up leaking an extent state structure.
      
      Suppose we have a file with a size of 1Mb, and before the buffered write
      and fiemap are performed, it has a single extent state in its io tree
      representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.
      
      The following sequence diagram shows how the memory leak happens if a
      fiemap a buffered write, starting at offset 1Mb and with a length of
      4Kb, are performed concurrently.
      
                CPU 1                                                  CPU 2
      
        extent_fiemap()
          --> it's a full ranged fiemap
              range from 0 to LLONG_MAX - 1
              (9223372036854775807)
      
          --> locks range in the inode's
              io tree
            --> after this we have 2 extent
                states in the io tree:
                --> 1 for range [0, 1Mb[ with
                    the bits EXTENT_LOCKED and
                    EXTENT_DELALLOC_BITS set
                --> 1 for the range
                    [1Mb, LLONG_MAX[ with
                    the EXTENT_LOCKED bit set
      
                                                        --> start buffered write at offset
                                                            1Mb with a length of 4Kb
      
                                                        btrfs_file_write_iter()
      
                                                          btrfs_buffered_write()
                                                            --> cached_state is NULL
      
                                                            lock_and_cleanup_extent_if_need()
                                                              --> returns 0 and does not lock
                                                                  range because it starts
                                                                  at current i_size / eof
      
                                                            --> cached_state remains NULL
      
                                                            btrfs_dirty_pages()
                                                              btrfs_set_extent_delalloc()
                                                                (...)
                                                                __set_extent_bit()
      
                                                                  --> splits extent state for range
                                                                      [1Mb, LLONG_MAX[ and now we
                                                                      have 2 extent states:
      
                                                                      --> one for the range
                                                                          [1Mb, 1Mb + 4Kb[ with
                                                                          EXTENT_LOCKED set
                                                                      --> another one for the range
                                                                          [1Mb + 4Kb, LLONG_MAX[ with
                                                                          EXTENT_LOCKED set as well
      
                                                                  --> sets EXTENT_DELALLOC on the
                                                                      extent state for the range
                                                                      [1Mb, 1Mb + 4Kb[
                                                                  --> caches extent state
                                                                      [1Mb, 1Mb + 4Kb[ into
                                                                      @cached_state because it has
                                                                      the bit EXTENT_LOCKED set
      
                                                          --> btrfs_buffered_write() ends up
                                                              with a non-NULL cached_state and
                                                              never calls anything to release its
                                                              reference on it, resulting in a
                                                              memory leak
      
      Fix this by calling free_extent_state() on cached_state if the range was
      not locked by lock_and_cleanup_extent_if_need().
      
      The same issue can happen if anything else other than fiemap locks a range
      that covers eof and beyond.
      
      This could be triggered, sporadically, by test case generic/561 from the
      fstests suite, which makes duperemove run concurrently with fsstress, and
      duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
      the leak is reported in dmesg/syslog when removing the btrfs module with
      a message like the following:
      
        [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1
      
      Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.
      
      CC: stable@vger.kernel.org # 4.16+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c67d970f
  13. 27 9月, 2019 2 次提交
    • Q
      btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls · d4e20494
      Qu Wenruo 提交于
      [BUG]
      The following script can cause btrfs qgroup data space leak:
      
        mkfs.btrfs -f $dev
        mount $dev -o nospace_cache $mnt
      
        btrfs subv create $mnt/subv
        btrfs quota en $mnt
        btrfs quota rescan -w $mnt
        btrfs qgroup limit 128m $mnt/subv
      
        for (( i = 0; i < 3; i++)); do
                # Create 3 64M holes for latter fallocate to fail
                truncate -s 192m $mnt/subv/file
                xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
                xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
                sync
      
                # it's supposed to fail, and each failure will leak at least 64M
                # data space
                xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
                rm $mnt/subv/file
                sync
        done
      
        # Shouldn't fail after we removed the file
        xfs_io -f -c "falloc 0 64m" $mnt/subv/file
      
      [CAUSE]
      Btrfs qgroup data reserve code allow multiple reservations to happen on
      a single extent_changeset:
      E.g:
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
      	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);
      
      Btrfs qgroup code has its internal tracking to make sure we don't
      double-reserve in above example.
      
      The only pattern utilizing this feature is in the main while loop of
      btrfs_fallocate() function.
      
      However btrfs_qgroup_reserve_data()'s error handling has a bug in that
      on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
      flag but doesn't free previously reserved bytes.
      
      This bug has a two fold effect:
      - Clearing EXTENT_QGROUP_RESERVED ranges
        This is the correct behavior, but it prevents
        btrfs_qgroup_check_reserved_leak() to catch the leakage as the
        detector is purely EXTENT_QGROUP_RESERVED flag based.
      
      - Leak the previously reserved data bytes.
      
      The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
      the last one fails, leaking space reserved in the previous ones.
      
      [FIX]
      Also free previously reserved data bytes when btrfs_qgroup_reserve_data
      fails.
      
      Fixes: 52472553 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4e20494
    • Q
      btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space · bab32fc0
      Qu Wenruo 提交于
      [BUG]
      Under the following case with qgroup enabled, if some error happened
      after we have reserved delalloc space, then in error handling path, we
      could cause qgroup data space leakage:
      
      From btrfs_truncate_block() in inode.c:
      
      	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
      					   block_start, blocksize);
      	if (ret)
      		goto out;
      
       again:
      	page = find_or_create_page(mapping, index, mask);
      	if (!page) {
      		btrfs_delalloc_release_space(inode, data_reserved,
      					     block_start, blocksize, true);
      		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
      		ret = -ENOMEM;
      		goto out;
      	}
      
      [CAUSE]
      In the above case, btrfs_delalloc_reserve_space() will call
      btrfs_qgroup_reserve_data() and mark the io_tree range with
      EXTENT_QGROUP_RESERVED flag.
      
      In the error handling path, we have the following call stack:
      btrfs_delalloc_release_space()
      |- btrfs_free_reserved_data_space()
         |- btrsf_qgroup_free_data()
            |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
               |- qgroup_free_reserved_data(reserved=@reserved)
                  |- clear_record_extent_bits();
                  |- freed += changeset.bytes_changed;
      
      However due to a completion bug, qgroup_free_reserved_data() will clear
      EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
      than the correct BTRFS_I(inode)->io_tree.
      Since io_failure_tree is never marked with that flag,
      btrfs_qgroup_free_data() will not free any data reserved space at all,
      causing a leakage.
      
      This type of error handling can only be triggered by errors outside of
      qgroup code. So EDQUOT error from qgroup can't trigger it.
      
      [FIX]
      Fix the wrong target io_tree.
      Reported-by: NJosef Bacik <josef@toxicpanda.com>
      Fixes: bc42bda2 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bab32fc0
  14. 25 9月, 2019 1 次提交
    • Q
      btrfs: Fix a regression which we can't convert to SINGLE profile · fab27359
      Qu Wenruo 提交于
      [BUG]
      With v5.3 kernel, we can't convert to SINGLE profile:
      
        # btrfs balance start -f -dconvert=single $mnt
        ERROR: error during balancing '/mnt/btrfs': Invalid argument
        # dmesg -t | tail
        validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
        BTRFS error (device dm-3): balance: invalid convert data profile single
      
      [CAUSE]
      With the extra debug output added, it shows that the @allowed bit is
      lacking the special in-memory only SINGLE profile bit.
      
      Thus we fail at that (profile & ~allowed) check.
      
      This regression is caused by commit 081db89b ("btrfs: use raid_attr
      to get allowed profiles for balance conversion") and the fact that we
      don't use any bit to indicate SINGLE profile on-disk, but uses special
      in-memory only bit to help distinguish different profiles.
      
      [FIX]
      Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
      the same as it was and fix the regression.
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Fixes: 081db89b ("btrfs: use raid_attr to get allowed profiles for balance conversion")
      CC: stable@vger.kernel.org # 5.3+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fab27359