1. 27 Jul 2020, 4 commits
    • btrfs: make __btrfs_add_ordered_extent take struct btrfs_inode · da69fea9
      Committed by Nikolay Borisov
      This is an internal btrfs function that really needs the vfs_inode only
      for igrab and a tracepoint.
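
      A stand-in sketch of why that is enough: the VFS inode is embedded in
      struct btrfs_inode, so it stays one member access away for the remaining
      two users (igrab and the tracepoint). The types and the helper name
      below are simplified illustrations, not the kernel's definitions:

        /* Simplified stand-in types; the real ones live in include/linux/fs.h
         * and fs/btrfs/btrfs_inode.h. */
        struct inode { unsigned long i_ino; };

        struct btrfs_inode {
                /* btrfs-specific fields would live here */
                struct inode vfs_inode;   /* embedded VFS inode */
        };

        /* With a struct btrfs_inode in hand, callers can still reach the VFS
         * inode, e.g. igrab(&inode->vfs_inode) or the tracepoint argument. */
        static struct inode *vfs_inode_of(struct btrfs_inode *inode)
        {
                return &inode->vfs_inode;
        }
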
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      da69fea9
    • btrfs: remove no longer used trans_list member of struct btrfs_ordered_extent · 3ef64143
      Committed by Filipe Manana
      The 'trans_list' member of an ordered extent was used to keep track of the
      ordered extents for which a transaction commit had to wait. These were
      ordered extents that were started and logged by an fsync. However we
      don't do that anymore, and before we stopped doing it we changed the
      approach to waiting for the ordered extents in commit 161c3549 ("Btrfs:
      change how we wait for pending ordered extents"), which stopped using
      that list, so the 'trans_list' member has not been used since that
      commit. Just remove it, since it does nothing and makes each ordered
      extent structure waste memory (2 pointers).
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3ef64143
    • btrfs: remove no longer used log_list member of struct btrfs_ordered_extent · cd8d39f4
      Committed by Filipe Manana
      The 'log_list' member of an ordered extent was used to keep track of
      which ordered extents we needed to wait for after logging metadata, but
      it has not been used since commit 5636cf7d ("btrfs: remove the logged
      extents infrastructure"), as we now always wait on ordered extent
      completion before logging metadata. Just remove it, since it does
      nothing and makes each ordered extent structure waste more memory
      (2 pointers).
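
      For scale, a list_head is just a pair of pointers (definition mirrored
      from include/linux/types.h), so dropping log_list here and trans_list in
      the patch above shrinks every btrfs_ordered_extent by four pointers, or
      32 bytes on a 64-bit kernel:

        /* Mirrors the kernel's definition in include/linux/types.h. */
        struct list_head {
                struct list_head *next, *prev;
        };
        /* sizeof(struct list_head) == 16 on 64-bit, so the two removed
         * members save 32 bytes per ordered extent. */
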
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd8d39f4
    • btrfs: change timing for qgroup reserved space for ordered extents to fix reserved space leak · 7dbeaad0
      Committed by Qu Wenruo
      [BUG]
      The following simple workload from fsstress can lead to a qgroup
      reserved data space leak:
        0/0: creat f0 x:0 0 0
        0/0: creat add id=0,parent=-1
        0/1: write f0[259 1 0 0 0 0] [600030,27288] 0
        0/4: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 64 627318] return 25, fallback to stat()
        0/4: dwrite f0[259 1 0 0 64 627318] [610304,106496] 0
      
      This would cause btrfs qgroup to leak 20480 bytes of data reserved
      space.  If the btrfs qgroup limit is enabled, such a leak can lead to
      unexpected early EDQUOT errors and unusable space.
      
      [CAUSE]
      When doing direct IO, the kernel will try to write back the existing
      buffered page cache and then invalidate it:
        generic_file_direct_write()
        |- filemap_write_and_wait_range();
        |- invalidate_inode_pages2_range();
      
      However for btrfs, the bi_end_io hook doesn't finish all its heavy work
      right after the bio ends.  In fact, it defers the work even further:
      
        submit_extent_page(end_io_func=end_bio_extent_writepage);
        end_bio_extent_writepage()
        |- btrfs_writepage_endio_finish_ordered()
           |- btrfs_init_work(finish_ordered_fn);
      
        <<< Work queue execution >>>
        finish_ordered_fn()
        |- btrfs_finish_ordered_io();
           |- Clear qgroup bits
      
      This means that, when filemap_write_and_wait_range() returns,
      btrfs_finish_ordered_io() is not guaranteed to have been executed, thus
      the qgroup bits for the related range are not yet cleared.
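
      A self-contained model of that deferral (stand-in types only; the real
      code queues a struct btrfs_work on the endio write workers):

        #include <stdbool.h>

        struct oe_stub {
                bool io_done;          /* what filemap_write_and_wait_range() waits for */
                bool qgroup_cleared;   /* only set by the deferred work function */
        };

        /* Runs later from a worker thread, models finish_ordered_fn(). */
        static void finish_ordered_work(struct oe_stub *oe)
        {
                oe->qgroup_cleared = true;
        }

        /* Models the bio end_io hook: mark IO done, defer the heavy work. */
        static void end_bio_hook(struct oe_stub *oe,
                                 void (*queue)(void (*fn)(struct oe_stub *),
                                               struct oe_stub *))
        {
                oe->io_done = true;
                queue(finish_ordered_work, oe);
                /* A waiter can observe io_done == true while qgroup_cleared
                 * is still false, which is exactly the window above. */
        }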
      
      Now on to how the leak happens; this will only focus on the part where
      the buffered and direct IO ranges overlap.
      
      1. After buffered write
         The inode had the following range with QGROUP_RESERVED bit:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      2. Writeback part for range [596K, 616K)
         Writeback finished, but btrfs_finish_ordered_io() has not been called
         yet.
         So we still have:
         	596K		616K
      	|///////////////|
         Qgroup reserved data space: 20K
      
      3. Pages for range [596K, 616K) get released
         This will clear all qgroup bits, but doesn't update the reserved data
         space.
         So we have:
         	596K		616K
      	|		|
         Qgroup reserved data space: 20K
         That number doesn't match the qgroup bit range anymore.
      
      4. Dio prepare space for range [596K, 700K)
         Qgroup reserves data space for that range, and we get:
         	596K		616K			700K
      	|///////////////|///////////////////////|
         Qgroup reserved data space: 20K + 104K = 124K
      
      5. btrfs_finish_ordered_range() gets executed for range [596K, 616K)
         Qgroup frees the reserved space for that range, and we get:
         	596K		616K			700K
      	|		|///////////////////////|
         We need to free that range of reserved space.
         Qgroup reserved data space: 124K - 20K = 104K
      
      6. btrfs_finish_ordered_range() gets executed for range [596K, 700K)
         However the qgroup bits for range [596K, 616K) were already cleared
         in the previous step, so we only free 84K of qgroup reserved space.
         	596K		616K			700K
      	|		|			|
         We need to free that range of reserved space.
         Qgroup reserved data space: 104K - 84K = 20K
      
         Now there is no way to release that 20K unless we disable qgroups or
         unmount the fs.
      
      [FIX]
      This patch changes the timing of the btrfs_qgroup_release/free_data()
      call.  Here a buffered COW write is used as an example.
      
                   The new timing             |             The old timing
      ----------------------------------------+---------------------------------------
       btrfs_buffered_write()                 | btrfs_buffered_write()
       |- btrfs_qgroup_reserve_data()         | |- btrfs_qgroup_reserve_data()
                                              |
       btrfs_run_delalloc_range()             | btrfs_run_delalloc_range()
       |- btrfs_add_ordered_extent()          |
          |- btrfs_qgroup_release_data()      |
             The reserved space is passed     |
             into the btrfs_ordered_extent    |
             structure                        |
                                              |
       btrfs_finish_ordered_io()              | btrfs_finish_ordered_io()
       |- The reserved space is passed to     | |- btrfs_qgroup_release_data()
          btrfs_qgroup_record                 |    The reserved space is passed
                                              |    to btrfs_qgroup_record
                                              |
       btrfs_qgroup_account_extents()         | btrfs_qgroup_account_extents()
       |- btrfs_qgroup_free_refroot()         | |- btrfs_qgroup_free_refroot()
      
      The point of this change is to ensure that, by the time ordered extents
      are submitted, the qgroup reserved space has already been released,
      which keeps the timing aligned with file_write_and_wait_range().

      That way all qgroup data reserved space is bound to a
      btrfs_ordered_extent, which solves the timing mismatch.
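
      A stand-in sketch of the new flow (names and types simplified; the point
      is that the released amount now travels inside the ordered extent):

        typedef unsigned long long u64_stub;   /* stand-in for the kernel's u64 */

        struct oe_stub {
                u64_stub num_bytes;
                u64_stub qgroup_rsv;   /* reserved bytes handed over to the OE */
        };

        /* Models btrfs_qgroup_release_data(): clears the QGROUP_RESERVED bits
         * for the range and reports how many bytes were released. */
        static u64_stub qgroup_release_data(u64_stub bytes)
        {
                return bytes;
        }

        /* New timing: the release happens when the ordered extent is created,
         * so it is done before writeback waiters return and before any page
         * of the range can be invalidated. */
        static void add_ordered_extent(struct oe_stub *oe, u64_stub bytes)
        {
                oe->num_bytes = bytes;
                oe->qgroup_rsv = qgroup_release_data(bytes);
        }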
      
      Fixes: f695fdce ("btrfs: qgroup: Introduce functions to release/free qgroup reserve data space")
      Suggested-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7dbeaad0
  2. 24 Mar 2020, 4 commits
  3. 19 Feb 2020, 1 commit
  4. 20 Jan 2020, 2 commits
  5. 19 Nov 2019, 1 commit
    • Btrfs: fix block group remaining RO forever after error during device replace · 042528f8
      Committed by Filipe Manana
      When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
      set the block group to RO mode and then wait for any ongoing writes into
      extents of the block group to complete. While doing that wait we overwrite
      the value of the variable 'ret' and can break out of the loop if an error
      happens without turning the block group back into RW mode. So what happens
      is the following:
      
      1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
         to RO mode (its ->ro field set to 1 or incremented to some value > 1);
      
      2) Then btrfs_wait_ordered_roots() returns a value > 0;
      
      3) Then if either joining or committing the transaction fails, we break
         out of the loop without calling btrfs_dec_block_group_ro(), leaving
         the block group in RO mode forever.
      
      To fix this, just remove the code that waits for ongoing writes to extents
      of the block group, since it's not needed because in the initial setup
      phase of a device replace operation, before starting to find all chunks
      and their extents, we set the target device for replace while holding
      fs_info->dev_replace->rwsem, which ensures that after releasing that
      semaphore, any writes into the source device are made to the target device
      as well (__btrfs_map_block() guarantees that). So while at
      scrub_enumerate_chunks() we only need to worry about finding and copying
      extents (from the source device to the target device) that were written
      before we started the device replace operation.
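
      The invariant being restored, as a stand-in sketch (error handling
      condensed; the names are simplified stand-ins for
      btrfs_inc_block_group_ro()/btrfs_dec_block_group_ro()):

        struct bg_stub { int ro; };

        static int inc_block_group_ro(struct bg_stub *bg)
        {
                bg->ro++;
                return 0;
        }

        static void dec_block_group_ro(struct bg_stub *bg)
        {
                bg->ro--;
        }

        static int scrub_one_chunk(struct bg_stub *bg)
        {
                int ret = inc_block_group_ro(bg);

                if (ret)
                        return ret;

                ret = 0;   /* copy this chunk's extents to the target device */

                /* The bug was an early break between inc and dec when joining
                 * or committing the transaction failed; every exit path after
                 * a successful inc must reach this: */
                dec_block_group_ro(bg);
                return ret;
        }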
      
      Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      042528f8
  6. 18 Nov 2019, 1 commit
    • btrfs: get rid of unique workqueue helper functions · a0cac0ec
      Committed by Omar Sandoval
      Commit 9e0af237 ("Btrfs: fix task hang under heavy compressed
      write") worked around the issue that a recycled work item could get a
      false dependency on the original work item due to how the workqueue code
      guarantees non-reentrancy. It did so by giving different work functions
      to different types of work.
      
      However, the fixes in the previous few patches are more complete, as
      they prevent a work item from being recycled at all (except for a tiny
      window that the kernel workqueue code handles for us). This obsoletes
      the previous fix, so we don't need the unique helpers for correctness.
      The only other reason to keep them would be so they show up in stack
      traces, but they always seem to be optimized to a tail call, so they
      don't show up anyway. So, let's just get rid of the extra indirection.
      
      While we're here, rename normal_work_helper() to the more informative
      btrfs_work_helper().
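
      The resulting shape, as a stand-in sketch (struct and helper names are
      simplified; the real entry point is btrfs_work_helper() operating on
      struct btrfs_work):

        struct work_stub;
        typedef void (*work_fn_t)(struct work_stub *);

        struct work_stub {
                work_fn_t func;   /* set when the work item is initialized */
        };

        /* One generic helper for every work type: dispatch through the stored
         * function pointer instead of one uniquely named helper per type. */
        static void work_helper(struct work_stub *work)
        {
                work->func(work);
        }
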
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0cac0ec
  7. 09 Sep 2019, 1 commit
  8. 26 Jul 2019, 1 commit
  9. 04 Jul 2019, 1 commit
  10. 01 Jul 2019, 3 commits
  11. 30 Apr 2019, 2 commits
  12. 25 Apr 2019, 1 commit
    • btrfs: Switch memory allocations in async csum calculation path to kvmalloc · a3d46aea
      Committed by Nikolay Borisov
      The recent multi-page biovec rework allowed creation of bios that can
      span large regions - up to 128 megabytes in the case of btrfs. OTOH the
      btrfs submission path currently allocates a contiguous array to store
      the checksums for every bio submitted. This means we can request up to
      (128MB / BTRFS_SECTOR_SIZE) * 4 bytes + 32 bytes of memory from kmalloc.
      On busy systems with possibly fragmented memory such a kmalloc can fail,
      which will trigger a BUG_ON due to improper error handling in the IO
      submission context in btrfs.
      
      Until error handling is improved or bios in btrfs are limited to a more
      manageable size (e.g. 1M), let's use kvmalloc to fall back to vmalloc for
      such large allocations. There is no hard requirement that the memory
      allocated for checksums during IO submission has to be contiguous, but
      this is a simple fix that does not require several non-contiguous
      allocations.
      
      For small writes this is unlikely to have any visible effect since
      kmalloc will still satisfy allocation requests as usual. For larger
      requests the code will just fall back to vmalloc.
      
      We've performed an evaluation on several workload types and there was no
      significant difference between kmalloc and kvmalloc.
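
      A condensed sketch of the switch; kvmalloc() and kvfree() are the real
      kernel helpers, while the two wrapper functions here are illustrative
      only:

        #include <linux/mm.h>
        #include <linux/slab.h>

        /* Up to (128MB / sectorsize) * csum_size bytes for a single bio. */
        static u8 *alloc_csum_array(unsigned int nr_sectors, unsigned int csum_size)
        {
                /* kvmalloc tries kmalloc first and transparently falls back
                 * to vmalloc if a contiguous allocation cannot be satisfied. */
                return kvmalloc((size_t)nr_sectors * csum_size, GFP_NOFS);
        }

        static void free_csum_array(u8 *csums)
        {
                kvfree(csums);   /* handles kmalloc'ed and vmalloc'ed memory */
        }
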
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a3d46aea
  13. 17 Dec 2018, 1 commit
  14. 06 Aug 2018, 3 commits
  15. 29 May 2018, 1 commit
  16. 12 Apr 2018, 1 commit
  17. 31 Mar 2018, 1 commit
    • btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595
      Committed by Qu Wenruo
      Before this patch, btrfs qgroup is mixing per-transaction meta rsv with
      preallocated meta rsv, making it quite easy to underflow the qgroup meta
      reservation.
      
      Since we have the new qgroup meta rsv types, apply them to the delalloc
      reservation.
      
      Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
      rsv type.
      
      And callers reducing outstanding extents, like btrfs_finish_ordered_io(),
      will convert the corresponding META_PREALLOC reservation to
      META_PERTRANS.
      
      This is mainly because the current qgroup numbers will only be updated
      in btrfs_commit_transaction(); that is to say, if we don't keep such a
      placeholder reservation, we can exceed the qgroup limit.
      
      And for callers freeing outstanding extents in an error handler, we will
      just free the META_PREALLOC bytes.
      
      This behavior requires callers of btrfs_qgroup_release_meta() or
      btrfs_qgroup_convert_meta() to be aware of which type they are handling.
      So in this patch, btrfs_delalloc_release_metadata() and its callers get
      an extra parameter to tell qgroup whether to convert or release the
      meta reservation.
      
      The good news is, even if we use the wrong type (convert or free), it
      won't cause an obvious bug, as the prealloc type is always in good
      shape, and the type only affects whether per-trans meta is increased
      or not.

      So the worst case is that the metadata limit can sometimes be exceeded
      (no convert at all) or be reached too soon (no free at all).
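
      A stand-in sketch of the two paths the extra parameter selects (function
      names are simplified, not the kernel's):

        /* Models btrfs_qgroup_convert_meta(): META_PREALLOC -> META_PERTRANS. */
        static void qgroup_convert_meta(long bytes) { (void)bytes; }

        /* Models freeing META_PREALLOC outright (error/cleanup path). */
        static void qgroup_free_meta_prealloc(long bytes) { (void)bytes; }

        static void delalloc_release_metadata(long bytes, int qgroup_free)
        {
                if (qgroup_free)
                        qgroup_free_meta_prealloc(bytes);  /* error handlers */
                else
                        qgroup_convert_meta(bytes);        /* e.g. finish ordered IO */
        }
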
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      43b18595
  18. 26 Mar 2018, 1 commit
    • btrfs: add more __cold annotations · e67c718b
      Committed by David Sterba
      The __cold functions are placed into a special section, as they're
      expected to be called rarely. This could help i-cache prefetches or help
      the compiler decide which branches are more/less likely to be taken
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (that's also added with __exit). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
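
      A minimal, userspace-compilable illustration of the attribute (in the
      kernel, __cold is provided by the compiler attribute headers):

        #define __cold __attribute__((__cold__))

        /* A rarely called error-message wrapper: __cold lets the compiler
         * keep it out of the hot text and treat branches reaching it as
         * unlikely. */
        static __cold void report_bad_metadata(const char *reason)
        {
                (void)reason;   /* a real wrapper would format and print */
        }
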
      Signed-off-by: David Sterba <dsterba@suse.com>
      e67c718b
  19. 02 Nov 2017, 1 commit
    • Btrfs: rework outstanding_extents · 8b62f87b
      Committed by Josef Bacik
      Right now we jump through a lot of weird hoops around outstanding_extents
      in order
      to keep the extent count consistent.  This is because we logically
      transfer the outstanding_extent count from the initial reservation
      through the set_delalloc_bits.  This makes it pretty difficult to get a
      handle on how and when we need to mess with outstanding_extents.
      
      Fix this by revamping the rules of how we deal with outstanding_extents.
      Now instead everybody that is holding on to a delalloc extent is
      required to increase the outstanding extents count for itself.  This
      means we'll have something like this
      
      btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
       btrfs_set_extent_delalloc	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      for an initial file write.  Now take the append write where we extend an
      existing delalloc range but still under the maximum extent size
      
      btrfs_delalloc_reserve_metadata - outstanding_extents = 2
        btrfs_set_extent_delalloc
          btrfs_set_bit_hook		- outstanding_extents = 3
          btrfs_merge_extent_hook	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      In order to make the ordered extent transition we of course must now
      make ordered extents carry their own outstanding_extent reservation, so
      for cow_file_range we end up with
      
      btrfs_add_ordered_extent	- outstanding_extents = 2
      clear_extent_bit		- outstanding_extents = 1
      btrfs_remove_ordered_extent	- outstanding_extents = 0
      
      This makes all manipulations of outstanding_extents much more explicit.
      Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
      combined with btrfs_delalloc_release_extents, even in the error case, as
      that is the only function that actually modifies the
      outstanding_extents counter.
      
      The drawback to this is now we are much more likely to have transient
      cases where outstanding_extents is much larger than it actually should
      be.  This could happen before as we manipulated the delalloc bits, but
      now it happens basically at every write.  This may put more pressure on
      the ENOSPC flushing code, but I think making this code simpler is worth
      the cost.  I have another change coming to mitigate this side-effect
      somewhat.
      
      I also added trace points for the counter manipulation.  These were used
      by a bpf script I wrote to help track down leak issues.
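
      A stand-in sketch of the pairing rule for the initial-write case above
      (names and counting simplified; the real counter lives in the btrfs
      inode):

        struct inode_stub { int outstanding_extents; };

        static int delalloc_reserve_metadata(struct inode_stub *inode)
        {
                inode->outstanding_extents++;   /* outstanding_extents = 1 */
                return 0;
        }

        static int set_extent_delalloc(struct inode_stub *inode)
        {
                inode->outstanding_extents++;   /* outstanding_extents = 2 */
                return 0;
        }

        static void delalloc_release_extents(struct inode_stub *inode)
        {
                inode->outstanding_extents--;   /* outstanding_extents = 1 */
        }

        static int write_one_range(struct inode_stub *inode)
        {
                int ret = delalloc_reserve_metadata(inode);

                if (ret)
                        return ret;

                ret = set_extent_delalloc(inode);

                /* Must run on both the success and the error path: it is the
                 * only call that drops the count taken by the reservation. */
                delalloc_release_extents(inode);
                return ret;
        }
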
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8b62f87b
  20. 30 Jun 2017, 1 commit
  21. 18 Apr 2017, 2 commits
  22. 28 Feb 2017, 1 commit
  23. 14 Feb 2017, 3 commits
  24. 06 Dec 2016, 2 commits