1. 15 10月, 2018 7 次提交
    • Q
      btrfs: relocation: Cleanup while loop using rbtree_postorder_for_each_entry_safe · 98ff7b94
      Qu Wenruo 提交于
      And add one line comment explaining what we're doing for each loop.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98ff7b94
    • Q
      btrfs: qgroup: Only trace data extents in leaves if we're relocating data block group · 3d0174f7
      Qu Wenruo 提交于
      For qgroup_trace_extent_swap(), if we find one leaf that needs to be
      traced, we will also iterate all file extents and trace them.
      
      This is OK if we're relocating data block groups, but if we're
      relocating metadata block groups, balance code itself has ensured that
      both subtree of file tree and reloc tree contain the same contents.
      
      That's to say, if we're relocating metadata block groups, all file
      extents in reloc and file tree should match, thus no need to trace them.
      This should reduce the total number of dirty extents processed in metadata
      block group balance.
      
      [[Benchmark]] (with all previous enhancement)
      Hardware:
      	VM 4G vRAM, 8 vCPUs,
      	disk is using 'unsafe' cache mode,
      	backing device is SAMSUNG 850 evo SSD.
      	Host has 16G ram.
      
      Mkfs parameter:
      	--nodesize 4K (To bump up tree size)
      
      Initial subvolume contents:
      	4G data copied from /usr and /lib.
      	(With enough regular small files)
      
      Snapshots:
      	16 snapshots of the original subvolume.
      	each snapshot has 3 random files modified.
      
      balance parameter:
      	-m
      
      So the content should be pretty similar to a real world root fs layout.
      
                           | v4.19-rc1    | w/ patchset    | diff (*)
      ---------------------------------------------------------------
      relocated extents    | 22929        | 22851          | -0.3%
      qgroup dirty extents | 227757       | 140886         | -38.1%
      time (sys)           | 65.253s      | 37.464s        | -42.6%
      time (real)          | 74.032s      | 44.722s        | -39.6%
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d0174f7
    • Q
      btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents · 5f527822
      Qu Wenruo 提交于
      Before this patch, with quota enabled during balance, we need to mark
      the whole subtree dirty for quota.
      
      E.g.
      OO = Old tree blocks (from file tree)
      NN = New tree blocks (from reloc tree)
      
              File tree (src)		          Reloc tree (dst)
                  OO (a)                              NN (a)
                 /  \                                /  \
           (b) OO    OO (c)                    (b) NN    NN (c)
              /  \  /  \                          /  \  /  \
             OO  OO OO OO (d)                    OO  OO OO NN (d)
      
      For old balance + quota case, quota will mark the whole src and dst tree
      dirty, including all the 3 old tree blocks in reloc tree.
      
      It's doable for small file tree or new tree blocks are all located at
      lower level.
      
      But for large file tree or new tree blocks are all located at higher
      level, this will lead to mark the whole tree dirty, and be unbelievably
      slow.
      
      This patch will change how we handle such balance with quota enabled
      case.
      
      Now we will search from (b) and (c) for any new tree blocks whose
      generation is equal to @last_snapshot, and only mark them dirty.
      
      In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
      (NN(a) will be traced when COW happens for nodeptr modification).  And
      also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
      happens for nodeptr modification.)
      
      For above case, we could skip 3 tree blocks, but for larger tree, we can
      skip tons of unmodified tree blocks, and hugely speed up balance.
      
      This patch will introduce a new function,
      btrfs_qgroup_trace_subtree_swap(), which will do the following main
      work:
      
      1) Read out real root eb
         And setup basic dst_path for later calls
      2) Call qgroup_trace_new_subtree_blocks()
         To trace all new tree blocks in reloc tree and their counter
         parts in the file tree.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5f527822
    • Q
      btrfs: relocation: Add basic extent backref related comments for build_backref_tree · fa6ac715
      Qu Wenruo 提交于
      fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
      although it works pretty fine, it's not that easy to understand.
      Especially combined with the complex btrfs backref format.
      
      This patch adds some basic comment for the backref build part of the
      code, making it less hard to read, at least for backref searching part.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fa6ac715
    • Q
      btrfs: Handle owner mismatch gracefully when walking up tree · 65c6e82b
      Qu Wenruo 提交于
      [BUG]
      When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we needs to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: NXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      65c6e82b
    • M
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro 提交于
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objecitd;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objecitd' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fd786e6
    • L
      btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes · 3a584174
      Lu Fengqi 提交于
      Using true and false here is closer to the expected semantic than using
      0 and 1.  No functional change.
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3a584174
  2. 06 8月, 2018 9 次提交
  3. 29 5月, 2018 2 次提交
  4. 26 4月, 2018 1 次提交
    • Q
      btrfs: Fix wrong first_key parameter in replace_path · 17515f1b
      Qu Wenruo 提交于
      Commit 581c1760 ("btrfs: Validate child tree block's level and first
      key") introduced new @first_key parameter for read_tree_block(), however
      caller in replace_path() is parasing wrong key to read_tree_block().
      
      It should use parameter @first_key other than @key.
      
      Normally it won't expose problem as @key is normally initialzied to the
      same value of @first_key we expect.
      However in relocation recovery case, @key can be set to (0, 0, 0), and
      since no valid key in relocation tree can be (0, 0, 0), it will cause
      read_tree_block() to return -EUCLEAN and interrupt relocation recovery.
      
      Fix it by setting @first_key correctly.
      
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17515f1b
  5. 12 4月, 2018 1 次提交
  6. 31 3月, 2018 2 次提交
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo 提交于
      We have several reports about node pointer points to incorrect child
      tree blocks, which could have even wrong owner and level but still with
      valid generation and checksum.
      
      Although btrfs check could handle it and print error message like:
      leaf parent key incorrect 60670574592
      
      Kernel doesn't have enough check on this type of corruption correctly.
      At least add such check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      581c1760
    • Q
      btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595
      Qu Wenruo 提交于
      Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
      preallocated meta rsv, making it quite easy to underflow qgroup meta
      reservation.
      
      Since we have the new qgroup meta rsv types, apply it to delalloc
      reservation.
      
      Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
      rsv type.
      
      And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
      they will convert corresponding META_PREALLOC reservation to
      META_PERTRANS.
      
      This is mainly due to the fact that current qgroup numbers will only be
      updated in btrfs_commit_transaction(), that's to say if we don't keep
      such placeholder reservation, we can exceed qgroup limitation.
      
      And for callers freeing outstanding extent in error handler, we will
      just free META_PREALLOC bytes.
      
      This behavior makes callers of btrfs_qgroup_release_meta() or
      btrfs_qgroup_convert_meta() to be aware of which type they are.
      So in this patch, btrfs_delalloc_release_metadata() and its callers get
      an extra parameter to info qgroup to do correct meta convert/release.
      
      The good news is, even we use the wrong type (convert or free), it won't
      cause obvious bug, as prealloc type is always in good shape, and the
      type only affects how per-trans meta is increased or not.
      
      So the worst case will be at most metadata limitation can be sometimes
      exceeded (no convert at all) or metadata limitation is reached too soon
      (no free at all).
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      43b18595
  7. 01 3月, 2018 1 次提交
  8. 16 11月, 2017 1 次提交
    • F
      Btrfs: fix reported number of inode blocks after buffered append writes · e3b8a485
      Filipe Manana 提交于
      The patch from commit a7e3b975 ("Btrfs: fix reported number of inode
      blocks") introduced a regression where if we do a buffered write starting
      at position equal to or greater than the file's size and then stat(2) the
      file before writeback is triggered, the number of used blocks does not
      change (unless there's a prealloc/unwritten extent). Example:
      
        $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
        $ du -h foobar
        0	foobar
        $ sync
        $ du -h foobar
        64K	foobar
      
      The first version of that patch didn't had this regression and the second
      version, which was the one committed, was made only to address some
      performance regression detected by the intel test robots using fs_mark.
      
      This fixes the regression by setting the new delaloc bit in the range, and
      doing it at btrfs_dirty_pages() while setting the regular dealloc bit as
      well, so that this way we set both bits at once avoiding navigation of the
      inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most
      meaninful place, as we should set the new dellaloc bit when if we set the
      delalloc bit, which happens only if we copied bytes into the pages at
      __btrfs_buffered_write().
      
      This was making some of LTP's du tests fail, which can be quickly run
      using a command line like the following:
      
        $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
      
      Fixes: a7e3b975 ("Btrfs: fix reported number of inode blocks")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3b8a485
  9. 02 11月, 2017 1 次提交
    • J
      Btrfs: rework outstanding_extents · 8b62f87b
      Josef Bacik 提交于
      Right now we do a lot of weird hoops around outstanding_extents in order
      to keep the extent count consistent.  This is because we logically
      transfer the outstanding_extent count from the initial reservation
      through the set_delalloc_bits.  This makes it pretty difficult to get a
      handle on how and when we need to mess with outstanding_extents.
      
      Fix this by revamping the rules of how we deal with outstanding_extents.
      Now instead everybody that is holding on to a delalloc extent is
      required to increase the outstanding extents count for itself.  This
      means we'll have something like this
      
      btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
       btrfs_set_extent_delalloc	- outstanding_extents = 2
      btrfs_release_delalloc_extents	- outstanding_extents = 1
      
      for an initial file write.  Now take the append write where we extend an
      existing delalloc range but still under the maximum extent size
      
      btrfs_delalloc_reserve_metadata - outstanding_extents = 2
        btrfs_set_extent_delalloc
          btrfs_set_bit_hook		- outstanding_extents = 3
          btrfs_merge_extent_hook	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extnets = 1
      
      In order to make the ordered extent transition we of course must now
      make ordered extents carry their own outstanding_extent reservation, so
      for cow_file_range we end up with
      
      btrfs_add_ordered_extent	- outstanding_extents = 2
      clear_extent_bit		- outstanding_extents = 1
      btrfs_remove_ordered_extent	- outstanding_extents = 0
      
      This makes all manipulations of outstanding_extents much more explicit.
      Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
      combined with btrfs_release_delalloc_extents, even in the error case, as
      that is the only function that actually modifies the
      outstanding_extents counter.
      
      The drawback to this is now we are much more likely to have transient
      cases where outstanding_extents is much larger than it actually should
      be.  This could happen before as we manipulated the delalloc bits, but
      now it happens basically at every write.  This may put more pressure on
      the ENOSPC flushing code, but I think making this code simpler is worth
      the cost.  I have another change coming to mitigate this side-effect
      somewhat.
      
      I also added trace points for the counter manipulation.  These were used
      by a bpf script I wrote to help track down leak issues.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8b62f87b
  10. 30 10月, 2017 1 次提交
  11. 26 9月, 2017 1 次提交
  12. 21 8月, 2017 3 次提交
  13. 16 8月, 2017 1 次提交
  14. 30 6月, 2017 3 次提交
    • C
      btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason 提交于
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6374e57a
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
  15. 20 6月, 2017 1 次提交
  16. 28 2月, 2017 4 次提交
  17. 17 2月, 2017 1 次提交