1. 23 Aug 2021, 5 commits
    • btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Committed by Josef Bacik
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_use.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of an 8-way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes, so our math made more sense; now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_next_leaf static inline · 809d6902
      Committed by David Sterba
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
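      A sketch of the moved wrapper, with the kernel types stubbed out so only
      the shape is shown (illustration, not the verbatim header change):

        struct btrfs_root;
        struct btrfs_path;

        int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
                                unsigned long long time_seq);

        /* One-line forwarder, now static inline in the header so callers
         * avoid a function call; time_seq == 0 means "current tree". */
        static inline int btrfs_next_leaf(struct btrfs_root *root,
                                          struct btrfs_path *path)
        {
                return btrfs_next_old_leaf(root, path, 0);
        }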
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      Committed by David Sterba
      The uptodate parameter should be bool, change the type.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Committed by Qu Wenruo
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removed the writepage_start_hook()
      and added the btrfs_writepage_cow_fixup() function, there is no need to
      follow the old hook parameters.
      
      Remove the @start and @end parameters, since the fixup check is
      currently a full page check and does not need them.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: remove max_zone_append_size logic · 5a80d1c6
      Committed by Johannes Thumshirn
      There used to be a patch in the original series for zoned support which
      limited the extent size to max_zone_append_size, but this patch has been
      dropped somewhere around v9.
      
      We've decided to go the opposite direction: instead of limiting extents
      in the first place, we split them before submission to comply with the
      device's limits.
      
      Remove the related code, btrfs_fs_info::max_zone_append_size and
      btrfs_zoned_device_info::max_zone_append_size.
      
      This also removes the workaround for dm-crypt introduced in
      1d68128c ("btrfs: zoned: fail mount if the device does not support
      zone append") because the fix has been merged as f34ee1dc ("dm
      crypt: Fix zoned block device support").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 23 Jun 2021, 1 commit
  3. 22 Jun 2021, 5 commits
    • btrfs: rip out may_commit_transaction · c416a30c
      Committed by Josef Bacik
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this: if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Committed by Filipe Manana
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also now have one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also add a spinlock used exclusively for the exclusion between send and
      relocation; before, fs_info->balance_mutex was used, which would make an
      attempt to run send block waiting for balance to finish, which can take a
      lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: shorten integrity checker extent data mount option · cbeaae4f
      Committed by David Sterba
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch mount option bits to enums and use wider type · ccd9395b
      Committed by David Sterba
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works; we don't have more than 31 bits, so
        the ill effects of not masking the higher bits don't happen. But once
        we have more, the problems will emerge.
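      A standalone demonstration of the promotion issue (illustration only;
      note that shifting 1 into the sign bit is formally undefined for signed
      int, but compilers in practice yield INT_MIN, which then sign-extends):

        #include <stdio.h>

        int main(void)
        {
                unsigned long from_int = (unsigned long)(1 << 31);
                unsigned long from_ulong = 1UL << 31;

                printf("(1 << 31)   -> 0x%lx\n", from_int);   /* 0xffffffff80000000 */
                printf("(1UL << 31) -> 0x%lx\n", from_ulong); /* 0x80000000 */
                return 0;
        }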
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 1a9fd417
      Committed by David Sterba
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 21 Jun 2021, 10 commits
    • btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Committed by Qu Wenruo
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helpers need btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Committed by Qu Wenruo
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch will rename it to Ordered.
      An extra comment about the bit is also added, so a reader who is still
      uncertain about the page Ordered status will find it easily.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Committed by Qu Wenruo
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to cooperate with such modification, replace @page parameter for
      trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes page_index info, the existing start/len should be
      enough for most usage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      Committed by Qu Wenruo
      There is a lot of code inside extent_io.c that needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with some
      parameters like "unsigned long bio_flags".
      
      Such strange parameters are here for bio assembly.
      
      For example, consider the following inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio to bio_ret,
        Add page [0, 4K) to *bio_ret.
      
      - Page [4K, 8K)
        *bio_ret != NULL
        We found this page is contiguous with *bio_ret,
        and if we're not at a stripe boundary, we
        add page [4K, 8K) to *bio_ret.
      
      - Page [8K, 12K)
        *bio_ret != NULL
        But we found this page is not contiguous, so
        we submit *bio_ret, then allocate a new bio,
        and add page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flags, but we
      record them manually using that strange parameter list, rather than
      encapsulating them into their own structure.
      
      So this patch will introduce a new structure, btrfs_bio_ctrl, to record
      both the bio, and its bio_flags.
      
      Also, in the above case, for all pages added to the bio we need to check
      if the new page crosses a stripe boundary.  This check itself can be time
      consuming, and we don't really need to do that for each page.
      
      This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
      When a new bio is allocated, the stripe and ordered extent boundary is
      also calculated, so no matter how large the bio will be, we only
      calculate the boundaries once, to save some CPU time.
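      Based on the description above, the new control structure bundles the
      bio, its flags and the two pre-computed limits; a rough sketch (the
      boundary field names here are illustrative, not necessarily the exact
      upstream ones):

        struct bio;   /* opaque here; the real type comes from the block layer */

        struct btrfs_bio_ctrl {
                struct bio *bio;                    /* bio under construction   */
                unsigned long bio_flags;            /* flags of that bio        */
                unsigned long long len_to_stripe_boundary; /* set once per bio  */
                unsigned long long len_to_oe_boundary;     /* ordered extent cap */
        };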
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how the
        bio is grabbed.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as now the ordered extent size limit is done at bio
        allocation time, no need to check for each page range.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce try-lock semantics for exclusive op start · 578bda9e
      Committed by David Sterba
      Add try-lock for exclusive operation start to allow callers to do more
      checks. The same operation must already be running. The try-lock and
      unlock must pair and are a substitute for btrfs_exclop_start, thus it
      must also pair with btrfs_exclop_finish to release the exclop context.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add cancellable chunk relocation support · 907d2710
      Committed by David Sterba
      Add support code that will allow canceling relocation on the chunk
      granularity. This is different from and independent of balance, which
      also uses relocation but is a higher level operation and manages its own
      state and pause/cancellation requests.
      
      Relocation is used for resize (shrink) and device deletion so this will
      be a common point to implement cancellation for both. The context is
      entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
      enclosing one chunk relocation. The status bit is set and unset between
      the chunks. As relocation can take long, the effects may not be
      immediate and the request and actual action can slightly race.
      
      The fs_info::reloc_cancel_req is only supposed to be increased and does
      not pair with decrement like fs_info::balance_cancel_req.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: protect exclusive_operation by super_lock · 0d7ed32c
      Committed by David Sterba
      The exclusive operation is now atomically checked and set using bit
      operations. Switch it to protection by spinlock. The super block lock is
      not frequently used and adding a new lock seems like overkill, so it
      should be safe to reuse it.
      
      The reason to use spinlock is to enhance the locking context so more
      checks can be done, eg. allowing the same exclusive operation enter
      the exclop section and cancel the running one. This will be used for
      resize and device delete.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't set the full sync flag when truncation does not touch extents · 0d7d3165
      Committed by Filipe Manana
      At btrfs_truncate() where we truncate the inode either to the same size
      or to a smaller size, we always set the full sync flag on the inode.
      
      This is needed in case the truncation drops or trims any file extent items
      that start beyond or cross the new inode size, so that the next fsync
      drops all inode items from the log and scans again the fs/subvolume tree
      to find all items that must be logged.
      
      However if the truncation does not drop or trim any file extent items, we
      do not need to set the full sync flag and force the next fsync to use the
      slow code path. So do not set the full sync flag in such cases.
      
      One use case where it is frequent to do truncations that do not change
      the inode size and do not drop any extents (no prealloc extents beyond
      i_size) is when running Microsoft's SQL Server inside a Docker container.
      One example workload is the one Philipp Fent reported recently, in the
      thread with a link below. In this workload a large number of fsyncs are
      preceded by such truncate operations.
      
      After this change I constantly get the runtime for that workload from
      Philipp to be reduced by about 12%, for example from 184 seconds down
      to 162 seconds.
      
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_verify_data_csum() to return a bitmap · 08508fea
      Committed by Qu Wenruo
      This will provide the basis for later per-sector repair for subpage,
      while still keeping the existing code happy.
      
      If all csums match, the return value will be 0, the same as now.
      Only when a csum mismatches is the return value different.

      The new return value will be a bitmap; for 4K sectorsize and 4K page
      size it will be 1, instead of the -EIO (which is not used
      directly by the callers, so there is no effective change).
      
      But for 4K sectorsize and 64K page size, aka subpage case, since the
      bvec can contain multiple sectors, knowing which sectors are corrupted
      will allow us to submit repair only for corrupted sectors.
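      A standalone illustration of the idea (not the kernel function itself):
      with 4K sectors in a 64K page, each set bit of the return value marks one
      corrupted sector, and 0 still means "all csums matched":

        #include <stdio.h>

        int main(void)
        {
                unsigned int sectors_per_page = (64 * 1024) / (4 * 1024); /* 16 */
                unsigned int error_bitmap = 0;
                unsigned int i;

                /* Pretend sectors 3 and 7 failed csum verification. */
                error_bitmap |= 1u << 3;
                error_bitmap |= 1u << 7;

                for (i = 0; i < sectors_per_page; i++)
                        if (error_bitmap & (1u << i))
                                printf("submit repair for sector %u\n", i);

                return error_bitmap ? 1 : 0;
        }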
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 29 Apr 2021, 1 commit
    • btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Committed by Filipe Manana
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
      
      The special cases for cloning inline extents were added in kernel 5.7
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 21 Apr 2021, 3 commits
    • btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Committed by Johannes Thumshirn
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then, on transaction commit, the reclaim
      process is triggered, but only after unused block groups have been
      deleted, which frees up space for the relocation process.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock · f3372065
      Committed by Johannes Thumshirn
      As a preparation for extending the block group deletion use case, rename
      the unused_bgs_mutex to reclaim_bgs_lock.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: more graceful errors/warnings on 32bit systems when reaching limits · e9306ad4
      Committed by Qu Wenruo
      Btrfs uses internally mapped u64 address space for all its metadata.
      Due to the page cache limit on 32bit systems, btrfs can't access
      metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
      how MAX_LFS_FILESIZE and page::index are defined.  This is 16T for 4K
      page size while 256T for 64K page size.
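      The limits follow directly from page::index being an unsigned long; a
      small standalone calculation (illustration only):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long npages = 1ULL << 32;    /* 32-bit ULONG_MAX + 1 */
                unsigned long long lim_4k = npages << 12;  /* PAGE_SHIFT for 4K    */
                unsigned long long lim_64k = npages << 16; /* PAGE_SHIFT for 64K   */

                printf("4K pages:  %llu TiB\n", lim_4k >> 40);   /* 16  */
                printf("64K pages: %llu TiB\n", lim_64k >> 40);  /* 256 */
                return 0;
        }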
      
      Users can have a filesystem which doesn't have metadata beyond the
      boundary at mount time, but later balance can cause it to create
      metadata beyond the boundary.
      
      Modifying the MM layer just for such a minor use case is unrealistic. We
      can't do more than prevent mounting such a filesystem, or warn early
      while the numbers are still within the limits.
      
      To address this problem, this patch introduces the following checks:
      
      - Mount time rejection
        This will reject any fs which has metadata chunk at or beyond the
        boundary.
      
      - Mount time early warning
        If there is any metadata chunk beyond 5/8th of the boundary, we do an
        early warning and hope the end user will see it.
      
      - Runtime extent buffer rejection
        If we're going to allocate an extent buffer at or beyond the boundary,
        reject such request with EOVERFLOW.
        This is definitely going to cause problems like transaction aborts, but
        we have no better way.
      
      - Runtime extent buffer early warning
        If an extent buffer beyond 5/8th of the max file size is allocated, do
        an early warning.
      
      Above error/warning message will only be printed once for each fs to
      reduce dmesg flood.
      
      If the mount is rejected, the filesystem will be mountable only on a
      64bit host.
      
      Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
      Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 19 Apr 2021, 8 commits
    • btrfs: improve btree readahead for full send operations · ace75066
      Committed by Filipe Manana
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree, which, despite bringing good
      performance benefits, could be improved in a few aspects for use cases
      such as full send operations, which are guaranteed to visit every node
      and leaf of a btree in ascending and sequential order. The limitations
      of that standard btree readahead implementation are the following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a bit to track the existence of tree mod log users · bc03f39e
      Committed by Filipe Manana
      The tree modification log functions are called very frequently, basically
      they are called every time a btree is modified (a pointer added or removed
      to a node, a new root for a btree is set, etc). Because of that, to avoid
      heavy lock contention on the lock that protects the list of tree mod log
      users, we have checks that test the emptiness of the list with a full
      memory barrier before the checks, so that when there are no tree mod log
      users we avoid taking the lock.
      
      Replace the memory barrier and list emptiness check with a test for a new
      bit set at fs_info->flags. This bit is used to indicate when there are
      tree mod log users, set whenever a user is added to the list and cleared
      when the last user is removed from the list. This makes the intention a
      bit more obvious and possibly more efficient (assuming test_bit() may be
      cheaper than a full memory barrier on some architectures).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the tree mod log code into its own file · f3a84ccd
      Committed by Filipe Manana
      The tree modification log, which records modifications done to btrees, is
      quite large and currently spread all over ctree.c, which is a huge file
      already.
      
      To make things better organized, move all that code into its own separate
      source and header files. Functions and definitions that are used outside
      of the module (mostly by ctree.c) are renamed so that they start with a
      "btrfs_" prefix. Everything else remains unchanged.
      
      This makes it easier to go over the tree modification log code every
      time I need to go read it to fix a bug.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove duplicated in_range() macro · cea62800
      Committed by Johannes Thumshirn
      The in_range() macro is defined twice in btrfs' source, once in ctree.h
      and once in misc.h.
      
      Remove the definition in ctree.h and include misc.h in the files depending
      on it.
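      The duplicated helper has the usual half-open range shape; shown here as
      a standalone sketch for illustration (the remaining kernel copy lives in
      misc.h):

        #include <stdio.h>

        #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))

        int main(void)
        {
                printf("%d\n", in_range(4096, 0, 8192));  /* 1: inside [0, 8192) */
                printf("%d\n", in_range(8192, 0, 8192));  /* 0: end is exclusive */
                return 0;
        }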
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a i_mmap_lock to our inode · 8318ba79
      Committed by Josef Bacik
      We need to be able to exclude page_mkwrite from happening concurrently
      with certain operations.  To facilitate this, add an i_mmap_lock to our
      inode, down_read() it in our mkwrite, and add a new ILOCK flag to
      indicate that we want to take the i_mmap_lock as well.  I used pahole to
      check the size of the btrfs_inode, the sizes are as follows
      
      no lockdep:
      before: 1120 (3 per 4k page)
      after: 1160 (3 per 4k page)
      
      lockdep:
      before: 2072 (1 per 4k page)
      after: 2224 (1 per 4k page)
      
      We're slightly larger but it doesn't change how many objects we can fit
      per page.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove mirror argument from btrfs_csum_verify_data() · 5e295768
      Committed by Goldwyn Rodrigues
      The parameter mirror is not used and does not make sense for checksum
      verification of the given bio.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport btrfs_extent_readonly() and make it static · 05947ae1
      Committed by Anand Jain
      btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
      move it from extent-tree.c to inode.c and declare it as static.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 12 Apr 2021, 1 commit
  9. 23 Feb 2021, 1 commit
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Committed by Filipe Manana
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 09 Feb 2021, 5 commits
    • btrfs: zoned: enable to mount ZONED incompat flag · 9d294a68
      Committed by Naohiro Aota
      This final patch adds the ZONED incompat flag to the supported flags
      and enables mounting of ZONED flagged filesystems.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Committed by Naohiro Aota
      This is the 1/3 patch to enable tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks for
      a tree-log tree are allocated mixed with other metadata blocks and btrfs
      writes and syncs the tree-log blocks to devices at the time of fsync(),
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes that zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams.  As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group and assigns
      "treelog_bg" on-demand on tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: serialize metadata IO · 0bc09ca1
      Committed by Naohiro Aota
      We cannot use zone append for writing metadata, because the B-tree nodes
      have references to each other using logical address. Without knowing
      the address in advance, we cannot construct the tree in the first place.
      So we need to serialize write IOs for metadata.
      
      We cannot add a mutex around allocation and submission because metadata
      blocks are allocated in an earlier stage to build up B-trees.
      
      Add a zoned_meta_io_lock and hold it during metadata IO submission in
      btree_write_cache_pages() to serialize IOs.
      
      Furthermore, this adds a per-block group metadata IO submission pointer
      "meta_write_pointer" to ensure sequential writing, which can break when
      attempting to write back blocks in an unfinished transaction. If the
      writing out failed because of a hole and the write out is for data
      integrity (WB_SYNC_ALL), it returns EAGAIN.
      
      A caller like fsync() code should handle this properly e.g. by falling
      back to a full transaction commit.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: check if bio spans across an ordered extent · cacb2cea
      Committed by Johannes Thumshirn
      To ensure that an ordered extent maps to a contiguous region on disk, we
      need to maintain a "one bio == one ordered extent" rule.
      
      Ensure that a bio being constructed does not span more than one ordered
      extent.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve preemptive background space flushing · 576fa348
      Committed by Josef Bacik
      Currently, if we ever have to flush space because we do not have enough,
      we allocate a ticket and attach it to the space_info, and then
      systematically flush things in the filesystem that hold space
      reservations until our space is reclaimed.
      
      However this has a latency cost, we must go to sleep and wait for the
      flushing to make progress before we are woken up and allowed to continue
      doing our work.
      
      In order to address that we used to kick off the async worker to flush
      space preemptively, so that we could be reclaiming space hopefully
      before any tasks needed to stop and wait for space to reclaim.
      
      When I introduced the ticketed ENOSPC stuff this broke slightly, in that
      we were using tickets to indicate if we were done flushing.
      No tickets, no more flushing.  However this meant that we essentially
      never preemptively flushed.  This caused a write performance regression
      that Nikolay noticed in an unrelated patch that removed the committing
      of the transaction during btrfs_end_transaction.
      
      The behavior that happened pre that patch was btrfs_end_transaction()
      would see that we were low on space, and it would commit the
      transaction.  This was bad because in this particular case you could end
      up with thousands and thousands of transactions being committed during
      the 5 minute reproducer.  With the patch to remove this behavior we got
      much more sane transaction commits, but we ended up slower because we
      would write for a while, flush, write for a while, flush again.
      
      To address this we need to reinstate a preemptive flushing mechanism.
      However it is distinctly different from our ticketing flushing in that
      it doesn't have tickets to base its decisions on.  Instead of bolting
      this logic into our existing flushing work, add another worker to handle
      this preemptive flushing.  Here we will attempt to be slightly
      intelligent about the things that we are flushing, attempting to balance
      between whichever pool is taking up the most space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>