1. 22 7月, 2021 1 次提交
    • F
      btrfs: fix lock inversion problem when doing qgroup extent tracing · 8949b9a1
      Filipe Manana 提交于
      At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
      NULL value as the transaction handle argument, which makes that function
      take the commit_root_sem semaphore, which is necessary when we don't hold
      a transaction handle or any other mechanism to prevent a transaction
      commit from wiping out commit roots.
      
      However btrfs_qgroup_trace_extent_post() can be called in a context where
      we are holding a write lock on an extent buffer from a subvolume tree,
      namely from btrfs_truncate_inode_items(), called either during truncate
      or unlink operations. In this case we end up with a lock inversion problem
      because the commit_root_sem is a higher level lock, always supposed to be
      acquired before locking any extent buffer.
      
      Lockdep detects this lock inversion problem since we switched the extent
      buffer locks from custom locks to semaphores, and when running btrfs/158
      from fstests, it reported the following trace:
      
      [ 9057.626435] ======================================================
      [ 9057.627541] WARNING: possible circular locking dependency detected
      [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
      [ 9057.628961] ------------------------------------------------------
      [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
      [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.632542]
                     but task is already holding lock:
      [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.635255]
                     which lock already depends on the new lock.
      
      [ 9057.636292]
                     the existing dependency chain (in reverse order) is:
      [ 9057.637240]
                     -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
      [ 9057.638138]        down_read+0x46/0x140
      [ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
      [ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
      [ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
      [ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
      [ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
      [ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
      [ 9057.643418]        evict+0xcf/0x1d0
      [ 9057.643895]        do_unlinkat+0x1e9/0x300
      [ 9057.644525]        do_syscall_64+0x3b/0xc0
      [ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 9057.645835]
                     -> #0 (btrfs-tree-00){++++}-{3:3}:
      [ 9057.646600]        __lock_acquire+0x130e/0x2210
      [ 9057.647248]        lock_acquire+0xd7/0x310
      [ 9057.647773]        down_read_nested+0x4b/0x140
      [ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.655831]        process_one_work+0x247/0x5a0
      [ 9057.656425]        worker_thread+0x55/0x3c0
      [ 9057.656993]        kthread+0x155/0x180
      [ 9057.657494]        ret_from_fork+0x22/0x30
      [ 9057.658030]
                     other info that might help us debug this:
      
      [ 9057.659064]  Possible unsafe locking scenario:
      
      [ 9057.659824]        CPU0                    CPU1
      [ 9057.660402]        ----                    ----
      [ 9057.660988]   lock(&fs_info->commit_root_sem);
      [ 9057.661581]                                lock(btrfs-tree-00);
      [ 9057.662348]                                lock(&fs_info->commit_root_sem);
      [ 9057.663254]   lock(btrfs-tree-00);
      [ 9057.663690]
                      *** DEADLOCK ***
      
      [ 9057.664437] 4 locks held by kworker/u16:4/30781:
      [ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
      [ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.670408]
                     stack backtrace:
      [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
      [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
      [ 9057.674258] Call Trace:
      [ 9057.674588]  dump_stack_lvl+0x57/0x72
      [ 9057.675083]  check_noncircular+0xf3/0x110
      [ 9057.675611]  __lock_acquire+0x130e/0x2210
      [ 9057.676132]  lock_acquire+0xd7/0x310
      [ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.677313]  ? lock_is_held_type+0xe8/0x140
      [ 9057.677849]  down_read_nested+0x4b/0x140
      [ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
      [ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
      [ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.685977]  ? ___ratelimit+0xa4/0x110
      [ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.688649]  ? lock_is_held_type+0xe8/0x140
      [ 9057.689180]  process_one_work+0x247/0x5a0
      [ 9057.689696]  worker_thread+0x55/0x3c0
      [ 9057.690175]  ? process_one_work+0x5a0/0x5a0
      [ 9057.690731]  kthread+0x155/0x180
      [ 9057.691158]  ? set_kthread_struct+0x40/0x40
      [ 9057.691697]  ret_from_fork+0x22/0x30
      
      Fix this by making btrfs_find_all_roots() never attempt to lock the
      commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
      
      We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
      from btrfs_qgroup_trace_extent_post(), because that would make backref
      lookup not use commit roots and acquire read locks on extent buffers, and
      therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
      from the btrfs_truncate_inode_items() code path which has acquired a write
      lock on an extent buffer of the subvolume btree.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8949b9a1
  2. 02 3月, 2021 1 次提交
  3. 27 7月, 2020 6 次提交
  4. 19 2月, 2020 1 次提交
  5. 19 11月, 2019 1 次提交
  6. 25 2月, 2019 4 次提交
    • Q
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to... · 1418bae1
      Qu Wenruo 提交于
      btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
      
      [BUG]
      Btrfs/139 will fail with a high probability if the testing machine (VM)
      has only 2G RAM.
      
      Resulting the final write success while it should fail due to EDQUOT,
      and the fs will have quota exceeding the limit by 16K.
      
      The simplified reproducer will be: (needs a 2G ram VM)
      
        $ mkfs.btrfs -f $dev
        $ mount $dev $mnt
      
        $ btrfs subv create $mnt/subv
        $ btrfs quota enable $mnt
        $ btrfs quota rescan -w $mnt
        $ btrfs qgroup limit -e 1G $mnt/subv
      
        $ for i in $(seq -w  1 8); do
        	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
        	echo "file $i written" > /dev/kmsg
          done
        $ sync
        $ btrfs qgroup show -pcre --raw $mnt
      
      The last pwrite will not trigger EDQUOT and final 'qgroup show' will
      show something like:
      
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5             16384        16384         none         none ---     ---
        0/256      1073758208   1073758208         none   1073741824 ---     ---
      
      And 1073758208 is larger than
        > 1073741824.
      
      [CAUSE]
      It's a bug in btrfs qgroup data reserved space management.
      
      For quota limit, we must ensure that:
        reserved (data + metadata) + rfer/excl <= limit
      
      Since rfer/excl is only updated at transaction commmit time, reserved
      space needs to be taken special care.
      
      One important part of reserved space is data, and for a new data extent
      written to disk, we still need to take the reserved space until
      rfer/excl numbers get updated.
      
      Originally when an ordered extent finishes, we migrate the reserved
      qgroup data space from extent_io tree to delayed ref head of the data
      extent, expecting delayed ref will only be cleaned up at commit
      transaction time.
      
      However for small RAM machine, due to memory pressure dirty pages can be
      flushed back to disk without committing a transaction.
      
      The related events will be something like:
      
        file 1 written
        btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
        btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
        btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
        btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
        btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
        cleanup_ref_head: num_bytes=54947840
        cleanup_ref_head: num_bytes=5636096
        cleanup_ref_head: num_bytes=569344
        cleanup_ref_head: num_bytes=57344
        cleanup_ref_head: num_bytes=8192
        ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
        file 2 written
        ...
        file 8 written
        cleanup_ref_head: num_bytes=8192
        ...
        btrfs_commit_transaction  <<< the only transaction committed during
      				the test
      
      When file 2 is written, we have already freed 128M reserved qgroup data
      space for ino 258. Thus later write won't trigger EDQUOT.
      
      This allows us to write more data beyond qgroup limit.
      
      In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.
      
      [FIX]
      By moving reserved qgroup data space from btrfs_delayed_ref_head to
      btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
      space won't be freed half way before commit transaction, thus fix the
      problem.
      
      Fixes: f64d5ca8 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1418bae1
    • Q
      btrfs: qgroup: Cleanup old subtree swap code · 9627736b
      Qu Wenruo 提交于
      Since it's replaced by new delayed subtree swap code, remove the
      original code.
      
      The cleanup is small since most of its core function is still used by
      delayed subtree swap trace.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9627736b
    • Q
      btrfs: qgroup: Use delayed subtree rescan for balance · f616f5cd
      Qu Wenruo 提交于
      Before this patch, qgroup code traces the whole subtree of subvolume and
      reloc trees unconditionally.
      
      This makes qgroup numbers consistent, but it could cause tons of
      unnecessary extent tracing, which causes a lot of overhead.
      
      However for subtree swap of balance, just swap both subtrees because
      they contain the same contents and tree structure, so qgroup numbers
      won't change.
      
      It's the race window between subtree swap and transaction commit could
      cause qgroup number change.
      
      This patch will delay the qgroup subtree scan until COW happens for the
      subtree root.
      
      So if there is no other operations for the fs, balance won't cause extra
      qgroup overhead. (best case scenario)
      Depending on the workload, most of the subtree scan can still be
      avoided.
      
      Only for worst case scenario, it will fall back to old subtree swap
      overhead. (scan all swapped subtrees)
      
      [[Benchmark]]
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
      And after file system population, there is no other activity, so it
      should be the best case scenario.
      
                           | v4.20-rc1            | w/ patchset    | diff
      -----------------------------------------------------------------------
      relocated extents    | 22615                | 22457          | -0.1%
      qgroup dirty extents | 163457               | 121606         | -25.6%
      time (sys)           | 22.884s              | 18.842s        | -17.6%
      time (real)          | 27.724s              | 22.884s        | -17.5%
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f616f5cd
    • Q
      btrfs: qgroup: Introduce per-root swapped blocks infrastructure · 370a11b8
      Qu Wenruo 提交于
      To allow delayed subtree swap rescan, btrfs needs to record per-root
      information about which tree blocks get swapped.  This patch introduces
      the required infrastructure.
      
      The designed workflow will be:
      
      1) Record the subtree root block that gets swapped.
      
         During subtree swap:
         O = Old tree blocks
         N = New tree blocks
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               NA     OB                          OA      OB
             /  |     |  \                      /  |      |  \
           NC  ND     OE  OF                   OC  OD     OE  OF
      
        In this case, NA and OA are going to be swapped, record (NA, OA) into
        subvolume tree X.
      
      2) After subtree swap.
               reloc tree                         subvolume tree X
                  Root                               Root
                 /    \                             /    \
               OA     OB                          NA      OB
             /  |     |  \                      /  |      |  \
           OC  OD     OE  OF                   NC  ND     OE  OF
      
      3a) COW happens for OB
          If we are going to COW tree block OB, we check OB's bytenr against
          tree X's swapped_blocks structure.
          If it doesn't fit any, nothing will happen.
      
      3b) COW happens for NA
          Check NA's bytenr against tree X's swapped_blocks, and get a hit.
          Then we do subtree scan on both subtrees OA and NA.
          Resulting 6 tree blocks to be scanned (OA, OC, OD, NA, NC, ND).
      
          Then no matter what we do to subvolume tree X, qgroup numbers will
          still be correct.
          Then NA's record gets removed from X's swapped_blocks.
      
      4)  Transaction commit
          Any record in X's swapped_blocks gets removed, since there is no
          modification to swapped subtrees, no need to trigger heavy qgroup
          subtree rescan for them.
      
      This will introduce 128 bytes overhead for each btrfs_root even qgroup
      is not enabled. This is to reduce memory allocations and potential
      failures.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      370a11b8
  7. 17 12月, 2018 2 次提交
  8. 15 10月, 2018 3 次提交
    • Q
      btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled · 3628b4ca
      Qu Wenruo 提交于
      Some qgroup trace events like btrfs_qgroup_release_data() and
      btrfs_qgroup_free_delayed_ref() can still be triggered even if qgroup is
      not enabled.
      
      This is caused by the lack of qgroup status check before calling some
      qgroup functions.  Thankfully the functions can handle quota disabled
      case well and just do nothing for qgroup disabled case.
      
      This patch will do earlier check before triggering related trace events.
      
      And for enabled <-> disabled race case:
      
      1) For enabled->disabled case
         Disable will wipe out all qgroups data including reservation and
         excl/rfer. Even if we leak some reservation or numbers, it will
         still be cleared, so nothing will go wrong.
      
      2) For disabled -> enabled case
         Current btrfs_qgroup_release_data() will use extent_io tree to ensure
         we won't underflow reservation. And for delayed_ref we use
         head->qgroup_reserved to record the reserved space, so in that case
         head->qgroup_reserved should be 0 and we won't underflow.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtQau7DtuUUeycCkZ36qjbKuxNzsgqJ7+sJ6W0dK_NLE3w@mail.gmail.com/Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3628b4ca
    • Q
      btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo 提交于
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      both subtree of file tree and reloc tree contain the same contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d0174f7
    • Q
      btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo 提交于
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      It's doable for small file tree or new tree blocks are all located at
      lower level.
      
      But for large file tree or new tree blocks are all located at higher
      level, this will lead to mark the whole tree dirty, and be unbelievably
      slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counter
         parts in the file tree.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5f527822
  9. 06 8月, 2018 12 次提交
  10. 12 4月, 2018 1 次提交
  11. 31 3月, 2018 5 次提交
  12. 30 6月, 2017 3 次提交
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
    • Q
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo 提交于
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1b8b94a