1. 15 Oct, 2018 6 commits
    • btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo committed
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      This is tolerable for a small file tree, or when the new tree blocks
      are all located at lower levels.

      But for a large file tree, or when the new tree blocks are located at
      higher levels, it marks the whole tree dirty and is unbelievably
      slow.

      This patch changes how we handle balance with quota enabled.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their
         counterparts in the file tree.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
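      The generation-aware search from 5f527822 can be illustrated on the
      7-block reloc tree of the example. This is only a toy model, not the
      kernel code: the tree is flattened into an array, the generation
      numbers (10 for NN, 5 for OO) are made up, and the helper name
      count_traced_blocks() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reloc tree from the example, flattened in heap order:
 * index 0 = (a), 1 = (b), 2 = (c), 3..6 = leaves, 6 = (d).
 * Generation 10 stands for NN blocks, generation 5 for OO blocks.
 * Both values are invented for this illustration.
 */
const unsigned long long reloc_gen[7] = { 10, 10, 10, 5, 5, 5, 10 };

/*
 * Count the blocks that would be traced in the subtree rooted at idx.
 * A block older than last_snapshot is shared with the file tree, and so
 * is everything below it, so the whole branch is skipped.
 */
int count_traced_blocks(const unsigned long long *gen, size_t nr,
                        size_t idx, unsigned long long last_snapshot)
{
    assert(gen != NULL);
    if (idx >= nr || gen[idx] < last_snapshot)
        return 0;
    return 1 + count_traced_blocks(gen, nr, 2 * idx + 1, last_snapshot)
             + count_traced_blocks(gen, nr, 2 * idx + 2, last_snapshot);
}
```

      Starting the search at (b) and (c) with last_snapshot = 10 visits
      NN(b), NN(c) and NN(d) only, while the three OO leaves are never
      touched.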
    • btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree · ea49f3e7
      Qu Wenruo committed
      Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
      all new tree blocks in a reloc tree.
      So that qgroup can skip unrelated tree blocks during balance, which
      should hugely speed up balance when quota is enabled.
      
      The function qgroup_trace_new_subtree_blocks() itself only cares about
      new tree blocks in reloc tree.
      
      Its main work is:
      
      1) Read out tree blocks according to parent pointers
      
      2) Do recursive depth-first search
         Call the same function on all child tree blocks, with the search
         level set to current level - 1.
         Skip all children whose generation is smaller than
         @last_snapshot.
      
      3) Call qgroup_trace_extent_swap() to trace tree blocks
      
      So although the parameter list includes arguments related to the source
      file tree, they are not used here, only passed on to
      qgroup_trace_extent_swap().
      Thus, apart from the tree read code, the core should be pretty short and
      all about recursive depth-first search.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Introduce function to trace two swaped extents · 25982561
      Qu Wenruo committed
      Introduce a new function, qgroup_trace_extent_swap(), which will be used
      later for balance qgroup speedup.
      
      The basic idea of balance is swapping tree blocks between reloc tree and
      the real file tree.

      The swap happens at the highest tree block, but there may be a lot of
      tree blocks involved.
      
      For example:
       OO = Old tree blocks
       NN = New tree blocks allocated during balance
      
                File tree (257)                  Reloc tree for 257
      L2              OO                                NN
                    /    \                            /    \
      L1          OO      OO (a)                    OO      NN (a)
                 / \     / \                       / \     / \
      L0       OO   OO OO   OO                   OO   OO NN   NN
                       (b)  (c)                          (b)  (c)
      
      When calling qgroup_trace_extent_swap(), we will pass:
      @src_eb = OO(a)
      @dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
      @dst_level = 0
      @root_level = 1
      
      In that case, qgroup_trace_extent_swap() will search from OO(a) to
      reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.
      
      The main work of qgroup_trace_extent_swap() can be split into 3 parts:
      
      1) Tree search from @src_eb
         It should act as a simplified btrfs_search_slot().
         The key for search can be extracted from @dst_path->nodes[dst_level]
         (first key).

      2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
         NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
         They should be marked during previous (@dst_level = 1) iteration.

      3) Mark file extents in leaves dirty
         We don't have a good way to pick out new file extents only.
         So we still follow the old method by scanning all file extents in
         the leaf.

      This function frees us from keeping two paths, thus later we only need
      to care about how to iterate all new tree blocks in reloc tree.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ copy changelog to function comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
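      Step 1 of 25982561, the simplified tree search from @src_eb, can be
      sketched as below. This is a toy model under loud assumptions: keys
      are single integers instead of (objectid, type, offset) triples, and
      struct tree_block, find_counterpart() and the key values 100 and 200
      are all hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 4

/* Toy tree block; level 0 is a leaf. */
struct tree_block {
    int level;
    unsigned long long first_key;
    int nr_children;
    struct tree_block *child[MAX_SLOTS];
};

/*
 * Walk down from src_eb towards dst_level, at each node following the
 * last child whose first key is <= the search key. The search key is
 * meant to be the first key of the new block found in the reloc tree.
 */
struct tree_block *find_counterpart(struct tree_block *src_eb,
                                    unsigned long long key, int dst_level)
{
    struct tree_block *cur = src_eb;

    assert(src_eb != NULL && dst_level >= 0);
    while (cur->level > dst_level) {
        int slot = 0;

        if (cur->nr_children == 0)
            return NULL;    /* malformed node in this toy model */
        for (int i = 1; i < cur->nr_children; i++)
            if (cur->child[i]->first_key <= key)
                slot = i;
        cur = cur->child[slot];
    }
    return cur->level == dst_level ? cur : NULL;
}

/* File tree side of the example: OO(a) with leaves OO(b) and OO(c). */
struct tree_block oo_b = { .level = 0, .first_key = 100 };
struct tree_block oo_c = { .level = 0, .first_key = 200 };
struct tree_block oo_a = { .level = 1, .first_key = 100,
                           .nr_children = 2, .child = { &oo_b, &oo_c } };
```

      With NN(c) starting at key 200, find_counterpart(&oo_a, 200, 0)
      lands on OO(c), which is the block that would be marked qgroup dirty
      together with NN(c).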
    • btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted · c337e7b0
      Qu Wenruo committed
      The number of qgroup dirty extents is directly linked to the performance
      overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
      record how many dirty extents are processed in
      btrfs_qgroup_account_extents().

      This will be handy for analyzing the later balance performance
      improvement.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Dirty all qgroups before rescan · 9c7b0c2e
      Qu Wenruo committed
      [BUG]
      In the following case, rescan won't zero out the numbers of qgroup 1/0:
      
        $ mkfs.btrfs -fq $DEV
        $ mount $DEV /mnt
      
        $ btrfs quota enable /mnt
        $ btrfs qgroup create 1/0 /mnt
        $ btrfs sub create /mnt/sub
        $ btrfs qgroup assign 0/257 1/0 /mnt
      
        $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
        $ btrfs sub snap /mnt/sub /mnt/snap
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     0/257
      
      So far so good, but:
      
        $ btrfs qgroup remove 0/257 1/0 /mnt
        WARNING: quotas may be inconsistent, rescan needed
        $ btrfs quota rescan -w /mnt
        $ btrfs qgroup show -pcre  /mnt
        qgroupid         rfer         excl     max_rfer     max_excl parent  child
        --------         ----         ----     --------     -------- ------  -----
        0/5          16.00KiB     16.00KiB         none         none ---     ---
        0/257      1016.00KiB     16.00KiB         none         none ---     ---
        0/258      1016.00KiB     16.00KiB         none         none ---     ---
        1/0        1016.00KiB     16.00KiB         none         none ---     ---
                   ^^^^^^^^^^     ^^^^^^^^ not cleared
      
      [CAUSE]
      Before rescan we call qgroup_rescan_zero_tracking() to zero out all
      qgroups' accounting numbers.
      
      However we don't mark all qgroups dirty, but rely on rescan to do so.
      
      If we have any high level qgroup without children, it won't be marked
      dirty during rescan, since we cannot reach that qgroup.
      
      This causes the QGROUP_INFO items of childless qgroups to never get
      updated in the quota tree, thus their numbers stay the same in "btrfs
      qgroup show" output.
      
      [FIX]
      Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
      we have childless qgroups, their QGROUP_INFO items will still get
      updated during rescan.
      Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
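      The cause and fix described in 9c7b0c2e can be modelled in a few
      lines. This is a simplified sketch, not the kernel implementation:
      struct qgroup here has only the fields needed for the illustration,
      and rescan_and_flush() and childless_rfer_after_rescan() are
      hypothetical names. Only the rfer counter is modelled, using the
      1016.00KiB (1040384 bytes) value from the reproducer.

```c
#include <assert.h>
#include <stddef.h>

struct qgroup {
    unsigned long long rfer;       /* in-memory counter */
    unsigned long long disk_rfer;  /* value stored in the QGROUP_INFO item */
    unsigned long long actual;     /* what a full scan should report */
    int reachable;                 /* can rescan reach it from a subvolume? */
    int dirty;                     /* only dirty qgroups are written back */
};

/* Zero the in-memory numbers; the fix additionally marks every qgroup dirty. */
void zero_tracking(struct qgroup *qg, size_t nr, int mark_dirty)
{
    assert(qg != NULL);
    for (size_t i = 0; i < nr; i++) {
        qg[i].rfer = 0;
        if (mark_dirty)
            qg[i].dirty = 1;
    }
}

/* Rescan only touches reachable qgroups, then dirty ones are written back. */
void rescan_and_flush(struct qgroup *qg, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (qg[i].reachable) {
            qg[i].rfer = qg[i].actual;
            qg[i].dirty = 1;
        }
        if (qg[i].dirty) {
            qg[i].disk_rfer = qg[i].rfer;
            qg[i].dirty = 0;
        }
    }
}

/* On-disk rfer of the childless qgroup 1/0 after one rescan. */
unsigned long long childless_rfer_after_rescan(int mark_dirty)
{
    struct qgroup qg[2] = {
        /* 0/257: still reachable through its subvolume */
        { .rfer = 1040384, .disk_rfer = 1040384, .actual = 1040384, .reachable = 1 },
        /* 1/0: childless after "btrfs qgroup remove" */
        { .rfer = 1040384, .disk_rfer = 1040384, .actual = 0, .reachable = 0 },
    };

    zero_tracking(qg, 2, mark_dirty);
    rescan_and_flush(qg, 2);
    return qg[1].disk_rfer;
}
```

      Without marking all qgroups dirty the stale 1040384 bytes stay on
      disk for 1/0; with the fix the zeroed value is written back.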
    • btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro committed
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objectid;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objectid' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 23 Aug, 2018 1 commit
  3. 06 Aug, 2018 22 commits
  4. 28 Jun, 2018 2 commits
  5. 30 May, 2018 1 commit
    • btrfs: qgroup: show more meaningful qgroup_rescan_init error message · 9593bf49
      Qu Wenruo committed
      Error message from qgroup_rescan_init() mostly looks like:
      
        BTRFS info (device nvme0n1p1): qgroup_rescan_init failed with -115
      
      This is far from meaningful and sometimes confusing: the -EINPROGRESS
      above is mostly harmless (apart from the init race), but a return
      value of -EINVAL can indicate a real problem.
      
      Change it to some more meaningful messages like:
      
        BTRFS info (device nvme0n1p1): qgroup rescan is already in progress
      
      And
      
        BTRFS err(device nvme0n1p1): qgroup rescan init failed, qgroup is not enabled
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      [ update the messages and level ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 29 May, 2018 4 commits
    • btrfs: qgroup: Finish rescan when hit the last leaf of extent tree · ff3d27a0
      Qu Wenruo committed
      Under the following case, qgroup rescan can double account cowed tree
      blocks:
      
      In this case, extent tree only has one tree block.
      
      -
      | transid=5 last committed=4
      | btrfs_qgroup_rescan_worker()
      | |- btrfs_start_transaction()
      | |  transid = 5
      | |- qgroup_rescan_leaf()
      |    |- btrfs_search_slot_for_read() on extent tree
      |       Get the only extent tree block from commit root (transid = 4).
      |       Scan it, set qgroup_rescan_progress to the last
      |       EXTENT/META_ITEM + 1
      |       now qgroup_rescan_progress = A + 1.
      |
      | fs tree get CoWed, new tree block is at A + 16K
      | transid 5 get committed
      -
      | transid=6 last committed=5
      | btrfs_qgroup_rescan_worker()
      | |- btrfs_start_transaction()
      | |  transid = 6
      | |- qgroup_rescan_leaf()
      |    |- btrfs_search_slot_for_read() on extent tree
      |       Get the only extent tree block from commit root (transid = 5).
      |       scan it using qgroup_rescan_progress (A + 1).
      |       found new tree block beyond A, and it's fs tree block,
      |       account it to increase qgroup numbers.
      -
      
      In above case, tree block A, and tree block A + 16K get accounted twice,
      whereas qgroup rescan should stop once it has reached the last leaf,
      rather than continue using its qgroup_rescan_progress.

      This can be reproduced by looping btrfs/017, which with some
      probability hits the double qgroup accounting problem.

      Fix it by checking the path to determine if we should finish qgroup
      rescan, rather than relying on the next loop to exit.
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
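      The race in ff3d27a0 can be replayed with a toy model of a
      single-leaf extent tree. This is only an illustration under
      assumptions: struct commit_root, rescan_leaf() as written here and
      the value chosen for A are hypothetical, and only the rescan side is
      modelled. The block at A + 16K is already accounted once by the
      normal transaction path, so every hit from the rescan counts as a
      double account.

```c
#include <assert.h>

#define MAX_EXTENTS 4

/* The single extent-tree leaf as seen through one commit root. */
struct commit_root {
    unsigned long long extent[MAX_EXTENTS];   /* sorted bytenrs */
    int nr;
};

/*
 * Scan the leaf starting at *progress and count each accounted slot in
 * hits[]. Returns 1 when the rescan should finish.
 */
int rescan_leaf(const struct commit_root *cr, unsigned long long *progress,
                int *hits, int finish_at_last_leaf)
{
    int scanned = 0;

    assert(cr->nr <= MAX_EXTENTS);
    for (int i = 0; i < cr->nr; i++) {
        if (cr->extent[i] < *progress)
            continue;
        hits[i]++;
        *progress = cr->extent[i] + 1;
        scanned = 1;
    }
    if (!scanned)
        return 1;               /* old exit: the next loop finds nothing */
    /* The toy tree has one leaf, so the scanned leaf is always the last. */
    return finish_at_last_leaf;
}

/* How often the rescan accounts the block CoWed at A + 16K. */
int rescan_hits_on_new_block(int finish_at_last_leaf)
{
    const unsigned long long A = 1ULL << 20;
    struct commit_root roots[2] = {
        { .extent = { A }, .nr = 1 },             /* commit root, transid 4 */
        { .extent = { A, A + 16384 }, .nr = 2 },  /* commit root, transid 5 */
    };
    unsigned long long progress = 0;
    int hits[MAX_EXTENTS] = { 0 };
    int done = 0;

    for (int loop = 0; !done; loop++)
        done = rescan_leaf(&roots[loop ? 1 : 0], &progress, hits,
                           finish_at_last_leaf);
    return hits[1];
}
```

      With the old exit condition the second loop picks up the new block
      through qgroup_rescan_progress; finishing at the last leaf leaves it
      to the normal accounting path.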
    • btrfs: qgroup: Search commit root for rescan to avoid missing extent · b6debf15
      Qu Wenruo committed
      When doing qgroup rescan using the following script (modified from
      btrfs/017 test case), we can sometimes hit qgroup corruption.
      
      ------
      umount $dev &> /dev/null
      umount $mnt &> /dev/null
      
      mkfs.btrfs -f -n 64k $dev
      mount $dev $mnt
      
      extent_size=8192
      
      xfs_io -f -d -c "pwrite 0 $extent_size" $mnt/foo > /dev/null
      btrfs subvolume snapshot $mnt $mnt/snap
      
      xfs_io -f -c "reflink $mnt/foo" $mnt/foo-reflink > /dev/null
      xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink > /dev/null
      xfs_io -f -c "reflink $mnt/foo" $mnt/snap/foo-reflink2 > /dev/null
      btrfs quota enable $mnt
      
       # -W is the new option to only wait rescan while not starting new one
      btrfs quota rescan -W $mnt
      btrfs qgroup show -prce $mnt
      umount $mnt
      
       # Need to patch btrfs-progs to report qgroup mismatch as error
      btrfs check $dev || _fail
      ------
      
      On a fast machine, we can hit corruption where some tree blocks are
      not accounted:
      ------
      qgroupid         rfer         excl     max_rfer     max_excl parent  child
      --------         ----         ----     --------     -------- ------  -----
      0/5           8.00KiB        0.00B         none         none ---     ---
      0/257         8.00KiB        0.00B         none         none ---     ---
      ------
      
      This is due to the fact that we're always searching commit root for
      btrfs_find_all_roots() at qgroup_rescan_leaf(), but the leaf we get is
      from current transaction, not commit root.
      
      And if our tree blocks get modified in current transaction, we won't
      find any owner in commit root, thus causing the corruption.
      
      Fix it by searching commit root for extent tree for
      qgroup_rescan_leaf().
      Reported-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Allow trace_btrfs_qgroup_account_extent() to record its transid · c9f6f3cd
      Qu Wenruo committed
      When debugging a quota rescan race, sometimes btrfs rescan accounts
      an old (committed) leaf and then re-accounts the newly committed leaf
      in the next generation.

      Locating this race needs the transid, so add @transid to
      trace_btrfs_qgroup_account_extent() for such debugging.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: trace: Allow trace_qgroup_update_counters() to record old rfer/excl value · 8b317901
      Qu Wenruo committed
      The original trace_qgroup_update_counters() only records the qgroup id
      and its reference count change.

      That's good enough to debug qgroup accounting changes, but when a rescan
      race is involved, it's pretty hard to tell which modification belongs
      to which rescan.

      So add old_rfer and old_excl to the trace output to help distinguish
      different rescan instances.
      (Each rescan instance should reset its qgroup->rfer to 0)
      
      For trace event parameter, it just changes from u64 qgroup_id to struct
      btrfs_qgroup *qgroup, so number of parameters is not changed at all.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 18 Apr, 2018 1 commit
    • btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo committed
      Unlike the previous method, which tried to commit the transaction inside
      qgroup_reserve(), this time we commit the transaction using
      fs_info->transaction_kthread, which avoids nested transactions and
      removes the need to worry about locking context.

      Since it's an asynchronous call and we don't wait for the transaction
      commit, unlike the previous method, we must trigger it before we hit
      the qgroup limit.

      So this patch uses the ratio and size of the qgroup meta_pertrans
      reservation as the indicator for whether to trigger a transaction
      commit.  (meta_prealloc won't be cleaned at transaction commit, so it's
      useless here anyway.)
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
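      One plausible reading of the "ratio and size" indicator in a514d638
      is sketched below. Everything specific here is an assumption: the
      commit message does not state the thresholds, so PERTRANS_RATIO and
      PERTRANS_SIZE are placeholder values, and the function name and its
      parameters are hypothetical. The sketch takes the threshold as the
      smaller of a fixed size and a fraction of the limit that is left
      after data and meta_prealloc reservations.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder thresholds; the commit message does not give real values. */
#define PERTRANS_RATIO 32
#define PERTRANS_SIZE  (32ULL * 1024 * 1024)

/*
 * Returns 1 when the meta_pertrans reservation is large enough that an
 * early, asynchronous transaction commit looks worthwhile.
 */
int should_commit_in_advance(uint64_t limit, uint64_t data_rsv,
                             uint64_t prealloc_rsv, uint64_t pertrans_rsv)
{
    uint64_t used = data_rsv + prealloc_rsv;
    uint64_t remaining;
    uint64_t threshold;

    assert(limit > 0);   /* only meaningful for a qgroup with a limit set */
    remaining = limit > used ? limit - used : 0;
    threshold = remaining / PERTRANS_RATIO;
    if (threshold > PERTRANS_SIZE)
        threshold = PERTRANS_SIZE;
    return pertrans_rsv > threshold;
}
```

      Because the commit is requested before the limit is actually hit,
      the caller does not have to wait for it, which matches the
      asynchronous design described above.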
  8. 12 Apr, 2018 1 commit
  9. 31 Mar, 2018 2 commits
    • btrfs: use lockdep_assert_held for spinlocks · a4666e68
      David Sterba committed
      Using lockdep_assert_held is preferred, replace assert_spin_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo committed
      We have several reports of node pointers pointing to incorrect child
      tree blocks, which may even have the wrong owner and level while still
      having a valid generation and checksum.

      Although btrfs check can handle it and prints an error message like:
      leaf parent key incorrect 60670574592

      the kernel doesn't have enough checks for this type of corruption.
      At least add such a check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level are known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
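      The check described in 581c1760 boils down to comparing what the
      parent expects with what the child block actually contains. The
      following is a minimal sketch and not the kernel function: struct
      key, struct tree_block, verify_child() and the choice of -EIO as the
      error code are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct key {
    unsigned long long objectid;
    unsigned char type;
    unsigned long long offset;
};

/* Only the header fields needed for the check are modelled. */
struct tree_block {
    int level;
    struct key first_key;
};

/*
 * The level check is mandatory. The first key check is optional: pass
 * NULL for a root block or a direct backref, where no expected key is
 * available.
 */
int verify_child(const struct tree_block *eb, int expected_level,
                 const struct key *expected_first_key)
{
    assert(eb != NULL);
    if (eb->level != expected_level)
        return -EIO;
    if (expected_first_key == NULL)
        return 0;
    if (eb->first_key.objectid != expected_first_key->objectid ||
        eb->first_key.type != expected_first_key->type ||
        eb->first_key.offset != expected_first_key->offset)
        return -EIO;
    return 0;
}
```

      The expected level and key come from the parent's node pointer,
      which is why this kind of check cannot live in a checker that only
      sees one node or leaf at a time.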