1. 22 June 2023, 1 commit
  2. 17 May 2023, 1 commit
    • btrfs: scrub: try harder to mark RAID56 block groups read-only · 7561551e
      Committed by Qu Wenruo
      Currently we allow a block group to stay writable (not marked read-only)
      during scrub.
      
      But for RAID56 block groups, if we require the block group to be
      read-only, then we are allowed to use the cached content from the scrub
      stripes to reduce unnecessary RAID56 reads.
      
      So this patch does the following:
      
      - Make btrfs_inc_block_group_ro() try harder
        During my tests, for cases like btrfs/061 and btrfs/064, we can hit
        ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
      
        The reason is that if we only have a single data chunk and try to
        scrub it, we won't have any space left for newer data writes.
      
        But this check should be done by the caller, especially since for
        scrub we only temporarily mark the chunk read-only, and newer data
        writes will always try to allocate a new data chunk when needed.
      
      - Return an error from scrub if we fail to mark a RAID56 chunk
        read-only, as sketched below
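      
      Below is a rough sketch of the caller-side behaviour described in the
      last point. It is illustrative only: the flow is simplified and the
      exact helpers and error handling in the real patch may differ.
      
        ret = btrfs_inc_block_group_ro(cache, false);
        if (ret && (cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK)) {
                /*
                 * RAID56 scrub relies on the cached stripe contents, which
                 * is only safe when the block group is read-only, so treat
                 * the failure as fatal instead of scrubbing the block group
                 * read-write.
                 */
                btrfs_put_block_group(cache);
                break;
        }
        /* Other profiles can still be scrubbed without read-only. */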
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7561551e
  3. 18 April 2023, 4 commits
    • btrfs: scrub: remove the old scrub recheck code · e9255d6c
      Committed by Qu Wenruo
      The old scrub code has a different entry point to verify the content,
      and since we have removed the writeback path, we can now start removing
      the re-check part, including:
      
      - scrub_recover structure
      - scrub_sector::recover member
      - function scrub_setup_recheck_block()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_group_good_copy()
      - function scrub_repair_sector_from_good_copy()
      - function scrub_is_page_on_raid56()
      
      - function full_stripe_lock()
      - function search_full_stripe_lock()
      - function get_full_stripe_logical()
      - function insert_full_stripe_lock()
      - function lock_full_stripe()
      - function unlock_full_stripe()
      - btrfs_block_group::full_stripe_locks_root member
      - btrfs_full_stripe_locks_tree structure
        This infrastructure was there to ensure RAID56 scrub properly handled
        recovery and P/Q scrub.
      
        This is no longer needed: before P/Q scrub we will wait for all the
        involved data stripes to be scrubbed first, and the RAID56 code has an
        internal lock to ensure there is no race within the same full stripe.
      
      - function scrub_print_warning()
      - function scrub_get_recover()
      - function scrub_put_recover()
      - function scrub_handle_errored_block()
      - function scrub_setup_recheck_block()
      - function scrub_bio_wait_endio()
      - function scrub_submit_raid56_bio_wait()
      - function scrub_recheck_block_on_raid56()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_from_good_copy()
      - function scrub_repair_sector_from_good_copy()
      
      And two more functions are exported temporarily for later cleanup:
      
      - alloc_scrub_sector()
      - alloc_scrub_block()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e9255d6c
    • btrfs: remove bytes_used argument from btrfs_make_block_group() · 5758d1bd
      Committed by Filipe Manana
      The only caller of btrfs_make_block_group() always passes 0 as the value
      for the bytes_used argument, so remove it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5758d1bd
    • btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32 · 6ded22c1
      Committed by Qu Wenruo
      There are quite a few div64 calls inside btrfs_map_block() and its
      variants.
      
      Such calls are for @stripe_nr, which is the number of stripes before our
      logical bytenr inside a chunk.
      
      However we can eliminate such div64 calls by just reducing the width of
      @stripe_nr from 64 to 32 bits.
      
      This can be done because our chunk size limit is already 10G, with a
      fixed stripe length of 64K, so a u32 is definitely enough to contain the
      number of stripes.
      
      With such a width reduction we can get rid of the slower div64 calls,
      and of the extra warnings on certain 32-bit architectures.
      
      This patch does the following:
      
      - Add a new tree-checker validation on chunk length
        Make sure no chunk can reach 256G, which can also act as a bitflip
        checker.
      
      - Reduce the width of @stripe_nr variables from u64 to u32
      
      - Replace unnecessary div64 calls with regular division and modulo
        32-bit division and modulo are much faster than the 64-bit operations,
        and we are finally free of the div64 fear, at least in the involved
        functions.  A small sketch of the resulting stripe math is shown
        below.
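      
      The following user-space sketch models the stripe math described above.
      The constants mirror the limits mentioned in this patch, but the code
      itself is only an illustration, not the kernel implementation.
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define STRIPE_LEN     (64ULL * 1024)    /* fixed 64KiB stripe */
        #define MAX_CHUNK_LEN  (10ULL << 30)     /* 10GiB chunk size limit */
        
        int main(void)
        {
                uint64_t chunk_start = 1ULL << 40;
                uint64_t logical = chunk_start + (3ULL << 30) + 4096;
                uint64_t offset = logical - chunk_start;   /* always < 10GiB */
        
                /*
                 * With a 10GiB chunk limit and 64KiB stripes the stripe
                 * number is at most 163840, so it easily fits in a u32 and
                 * plain 32-bit division/modulo can replace the div64 helpers.
                 */
                uint32_t stripe_nr = (uint32_t)(offset / STRIPE_LEN);
                uint32_t stripe_offset = (uint32_t)(offset % STRIPE_LEN);
        
                printf("stripe_nr=%u stripe_offset=%u\n",
                       stripe_nr, stripe_offset);
                return 0;
        }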
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6ded22c1
    • btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN · a97699d1
      Committed by Qu Wenruo
      Currently btrfs doesn't support stripe lengths other than 64KiB.
      This is already enforced by the tree-checker.
      
      There is really no point in recording that fixed value in map_lookup
      for now, it can all be replaced with BTRFS_STRIPE_LEN.
      
      Furthermore we can use the fixed stripe length to do the following
      optimizations:
      
      - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64-bit divisions
        Now we only need to do a right shift.
      
        The value of BTRFS_STRIPE_LEN itself is already too large to be used
        as a shift amount, thus if we accidentally use BTRFS_STRIPE_LEN to do
        a bit shift, a compiler warning would be triggered.
      
        Thus this bit shift optimization is safe.
      
      - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe,
        as sketched below
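      
      A minimal user-space sketch of the shift/mask arithmetic described
      above. The three macros mirror the fixed 64KiB stripe length from the
      commit message, but are defined here only for illustration.
      
        #include <stdint.h>
        #include <stdio.h>
        
        #define BTRFS_STRIPE_LEN        (1U << 16)              /* 64KiB */
        #define BTRFS_STRIPE_LEN_SHIFT  16
        #define BTRFS_STRIPE_LEN_MASK   (BTRFS_STRIPE_LEN - 1)
        
        int main(void)
        {
                uint64_t offset = (5ULL << 20) + 12288;  /* offset in a chunk */
        
                /* The 64-bit division becomes a right shift ... */
                uint64_t stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;
                /* ... and the in-stripe offset becomes a mask. */
                uint32_t stripe_off = (uint32_t)(offset & BTRFS_STRIPE_LEN_MASK);
        
                printf("stripe_nr=%llu stripe_off=%u\n",
                       (unsigned long long)stripe_nr, stripe_off);
                return 0;
        }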
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a97699d1
  4. 16 March 2023, 2 commits
  5. 08 March 2023, 1 commit
    • btrfs: fix block group item corruption after inserting new block group · 675dfe12
      Committed by Filipe Manana
      We can often end up inserting a block group item, for a new block group,
      with a wrong value for the used bytes field.
      
      This happens if, in the same transaction that created the new block
      group, we have tasks allocating extents from it as well as tasks
      removing extents from it.
      
      For example:
      
      1) Task A creates a metadata block group X;
      
      2) Two extents are allocated from block group X, so its "used" field is
         updated to 32K, and its "commit_used" field remains as 0;
      
      3) Transaction commit starts, by some task B, and it enters
         btrfs_start_dirty_block_groups(). There it tries to update the block
         group item for block group X, which currently has its "used" field with
         a value of 32K. But that fails since the block group item was not yet
         inserted, and so on failure update_block_group_item() sets the
         "commit_used" field of the block group back to 0;
      
      4) The block group item is inserted by task A, when for example
         btrfs_create_pending_block_groups() is called when releasing its
         transaction handle. This results in insert_block_group_item() inserting
         the block group item in the extent tree (or block group tree), with a
         "used" field having a value of 32K, but without updating the
         "commit_used" field in the block group, which remains with value of 0;
      
      5) The two extents are freed from block group X, so its "used" field
         changes from 32K to 0;
      
      6) The transaction commit by task B continues, it enters
         btrfs_write_dirty_block_groups() which calls update_block_group_item()
         for block group X, and there it decides to skip the block group item
         update, because "used" has a value of 0 and "commit_used" has a value
         of 0 too.
      
         As a result, we end up with a block group item having a 32K "used"
         field but no extents allocated from it.
      
      When this issue happens, a btrfs check reports an error like this:
      
         [1/7] checking root items
         [2/7] checking extents
         block group [1104150528 1073741824] used 39796736 but extent items used 0
         ERROR: errors found in extent allocation tree or chunk allocation
         (...)
      
      Fix this by making insert_block_group_item() update the block group's
      "commit_used" field.
      
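      The sequence above can be condensed into a small user-space model, shown
      below. It is only an illustration of the bookkeeping (the struct, the
      helpers and the simplified logic are made up for this sketch), not the
      kernel code.
      
        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>
        
        struct block_group {
                uint64_t used;          /* in-memory used bytes */
                uint64_t commit_used;   /* used bytes last written to the item */
                uint64_t item_used;     /* value stored in the on-disk item */
        };
        
        /* Step 4: the item is first inserted. The fix is the marked line. */
        static void insert_block_group_item(struct block_group *bg, bool fixed)
        {
                bg->item_used = bg->used;
                if (fixed)
                        bg->commit_used = bg->used;     /* <-- the fix */
        }
        
        /* Step 6: the later update, skipped when nothing changed since the
         * last commit (the optimization added by commit 7248e0ce). */
        static void update_block_group_item(struct block_group *bg)
        {
                if (bg->used == bg->commit_used)
                        return;
                bg->item_used = bg->used;
                bg->commit_used = bg->used;
        }
        
        int main(void)
        {
                for (int fixed = 0; fixed <= 1; fixed++) {
                        /* Steps 1-3: 32K allocated, commit_used reset to 0. */
                        struct block_group bg = { .used = 32768 };
        
                        insert_block_group_item(&bg, fixed);    /* step 4 */
                        bg.used = 0;                            /* step 5 */
                        update_block_group_item(&bg);           /* step 6 */
        
                        printf("%s: item says %llu bytes used, real usage %llu\n",
                               fixed ? "fixed" : "buggy",
                               (unsigned long long)bg.item_used,
                               (unsigned long long)bg.used);
                }
                return 0;
        }
      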
      Fixes: 7248e0ce ("btrfs: skip update of block group item if used bytes are the same")
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      675dfe12
  6. 07 March 2023, 2 commits
    • btrfs: fix percent calculation for bg reclaim message · 95cd356c
      Committed by Johannes Thumshirn
      We have a report that the info message for block group reclaim is
      crossing the 100% used mark.
      
      This is happening because we were truncating the divisor of the division
      (the block_group->length) to a 32-bit value.
      
      Fix this by using div64_u64() to not truncate the divisor.
      
      In the worst case it can lead to a division by zero, which should be
      possible to trigger on a 4 disk RAID0 where each device is large enough:
      
        $ mkfs.btrfs  -f /dev/test/scratch[1234] -m raid1 -d raid0
        btrfs-progs v6.1
        [...]
        Filesystem size:    40.00GiB
        Block group profiles:
          Data:             RAID0             4.00GiB <<<
          Metadata:         RAID1           256.00MiB
          System:           RAID1             8.00MiB
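      
      A small user-space model of the truncation described above. The kernel's
      div_u64() takes a 32-bit divisor, which is modelled here by an explicit
      cast; the numbers are made up to match the 4GiB RAID0 data profile from
      the mkfs output.
      
        #include <stdint.h>
        #include <stdio.h>
        
        /* Models div_u64(): the divisor is only 32 bits wide. */
        static uint64_t div_u64_model(uint64_t dividend, uint32_t divisor)
        {
                return dividend / divisor;
        }
        
        int main(void)
        {
                uint64_t length = (4ULL << 30) + (64ULL << 10); /* just over 4GiB */
                uint64_t used   = 3ULL << 30;                   /* 3GiB used */
        
                /* (uint32_t)length wraps to 64KiB, so the percentage explodes;
                 * with a length of exactly 4GiB it would wrap to 0 and divide
                 * by zero. */
                printf("truncated divisor: %llu%%\n",
                       (unsigned long long)div_u64_model(used * 100,
                                                         (uint32_t)length));
                /* div64_u64() keeps the full 64-bit divisor. */
                printf("full divisor:      %llu%%\n",
                       (unsigned long long)(used * 100 / length));
                return 0;
        }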
      Reported-by: Forza <forza@tnonline.net>
      Link: https://lore.kernel.org/linux-btrfs/e99483.c11a58d.1863591ca52@tnonline.net/
      Fixes: 5f93e776 ("btrfs: zoned: print unusable percentage when reclaiming block groups")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add Qu's note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      95cd356c
    • btrfs: fix potential dead lock in size class loading logic · 12148367
      Committed by Boris Burkov
      As reported by Filipe, there's a potential deadlock caused by using
      btrfs_search_forward() on a commit root. The locking there is
      unconditional, even if ->skip_locking and ->search_commit_root are set;
      the function is not meant to be used for commit roots, so it always
      needs to do locking.
      
      So if another task is COWing a child node of the same root node and
      then needs to wait for block group caching to complete when trying to
      allocate a metadata extent, it deadlocks.
      
      For example:
      
      [539604.239315] sysrq: Show Blocked State
      [539604.240133] task:kworker/u16:6   state:D stack:0     pid:2119594 ppid:2      flags:0x00004000
      [539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
      [539604.242673] Call Trace:
      [539604.243129]  <TASK>
      [539604.243925]  __schedule+0x41d/0xee0
      [539604.244797]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.245399]  ? rwsem_down_read_slowpath+0x185/0x490
      [539604.246111]  schedule+0x5d/0xf0
      [539604.246593]  rwsem_down_read_slowpath+0x2da/0x490
      [539604.247290]  ? rcu_barrier_tasks_trace+0x10/0x20
      [539604.248090]  __down_read_common+0x3d/0x150
      [539604.248702]  down_read_nested+0xc3/0x140
      [539604.249280]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
      [539604.250097]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
      [539604.250915]  btrfs_search_forward+0x59/0x460 [btrfs]
      [539604.251781]  ? btrfs_global_root+0x50/0x70 [btrfs]
      [539604.252476]  caching_thread+0x1be/0x920 [btrfs]
      [539604.253167]  btrfs_work_helper+0xf6/0x400 [btrfs]
      [539604.253848]  process_one_work+0x24f/0x5a0
      [539604.254476]  worker_thread+0x52/0x3b0
      [539604.255166]  ? __pfx_worker_thread+0x10/0x10
      [539604.256047]  kthread+0xf0/0x120
      [539604.256591]  ? __pfx_kthread+0x10/0x10
      [539604.257212]  ret_from_fork+0x29/0x50
      [539604.257822]  </TASK>
      [539604.258233] task:btrfs-transacti state:D stack:0     pid:2236474 ppid:2      flags:0x00004000
      [539604.259802] Call Trace:
      [539604.260243]  <TASK>
      [539604.260615]  __schedule+0x41d/0xee0
      [539604.261205]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.262000]  ? rwsem_down_read_slowpath+0x185/0x490
      [539604.262822]  schedule+0x5d/0xf0
      [539604.263374]  rwsem_down_read_slowpath+0x2da/0x490
      [539604.266228]  ? lock_acquire+0x160/0x310
      [539604.266917]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.267996]  ? lock_contended+0x19e/0x500
      [539604.268720]  __down_read_common+0x3d/0x150
      [539604.269400]  down_read_nested+0xc3/0x140
      [539604.270057]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
      [539604.271129]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
      [539604.272372]  btrfs_search_slot+0x143/0xf70 [btrfs]
      [539604.273295]  update_block_group_item+0x9e/0x190 [btrfs]
      [539604.274282]  btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
      [539604.275381]  ? __mutex_unlock_slowpath+0x45/0x280
      [539604.276390]  btrfs_commit_transaction+0xee/0xed0 [btrfs]
      [539604.277391]  ? lock_acquire+0x1a4/0x310
      [539604.278080]  ? start_transaction+0xcb/0x6c0 [btrfs]
      [539604.279099]  transaction_kthread+0x142/0x1c0 [btrfs]
      [539604.279996]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
      [539604.280673]  kthread+0xf0/0x120
      [539604.281050]  ? __pfx_kthread+0x10/0x10
      [539604.281496]  ret_from_fork+0x29/0x50
      [539604.281966]  </TASK>
      [539604.282255] task:fsstress        state:D stack:0     pid:2236483 ppid:1      flags:0x00004006
      [539604.283897] Call Trace:
      [539604.284700]  <TASK>
      [539604.285088]  __schedule+0x41d/0xee0
      [539604.285660]  schedule+0x5d/0xf0
      [539604.286175]  btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
      [539604.287342]  ? __pfx_autoremove_wake_function+0x10/0x10
      [539604.288450]  find_free_extent+0xd93/0x1750 [btrfs]
      [539604.289256]  ? _raw_spin_unlock+0x29/0x50
      [539604.289911]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
      [539604.290843]  btrfs_reserve_extent+0x147/0x290 [btrfs]
      [539604.291943]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
      [539604.292903]  __btrfs_cow_block+0x138/0x580 [btrfs]
      [539604.293773]  btrfs_cow_block+0x10e/0x240 [btrfs]
      [539604.294595]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
      [539604.295585]  btrfs_update_device+0x71/0x1b0 [btrfs]
      [539604.296459]  btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
      [539604.297489]  btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
      [539604.298335]  find_free_extent+0x6fa/0x1750 [btrfs]
      [539604.299174]  ? _raw_spin_unlock+0x29/0x50
      [539604.299950]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
      [539604.300918]  btrfs_reserve_extent+0x147/0x290 [btrfs]
      [539604.301797]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
      [539604.303017]  ? lock_release+0x224/0x4a0
      [539604.303855]  __btrfs_cow_block+0x138/0x580 [btrfs]
      [539604.304789]  btrfs_cow_block+0x10e/0x240 [btrfs]
      [539604.305611]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
      [539604.306682]  ? btrfs_global_root+0x50/0x70 [btrfs]
      [539604.308198]  lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
      [539604.309254]  lookup_extent_backref+0x43/0xd0 [btrfs]
      [539604.310122]  __btrfs_free_extent+0xf8/0x810 [btrfs]
      [539604.310874]  ? lock_release+0x224/0x4a0
      [539604.311724]  ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
      [539604.313023]  __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
      [539604.314271]  btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
      [539604.315445]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.316706]  btrfs_commit_transaction+0xa2/0xed0 [btrfs]
      [539604.317855]  ? do_raw_spin_unlock+0x4b/0xa0
      [539604.318544]  ? _raw_spin_unlock+0x29/0x50
      [539604.319240]  create_subvol+0x53d/0x6e0 [btrfs]
      [539604.320283]  btrfs_mksubvol+0x4f5/0x590 [btrfs]
      [539604.321220]  __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
      [539604.322307]  btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
      [539604.323295]  btrfs_ioctl+0x9f7/0x33e0 [btrfs]
      [539604.324331]  ? rcu_read_lock_sched_held+0x12/0x70
      [539604.325137]  ? lock_release+0x224/0x4a0
      [539604.325808]  ? __x64_sys_ioctl+0x87/0xc0
      [539604.326467]  __x64_sys_ioctl+0x87/0xc0
      [539604.327109]  do_syscall_64+0x38/0x90
      [539604.327875]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
      [539604.328792] RIP: 0033:0x7f05a7babaeb
      
      This needs to use regular btrfs_search_slot() with some skip and stop
      logic, as sketched below.
      
      Since we only consider five samples (five search slots), don't bother
      with the complexity of looking for commit_root_sem contention. If
      necessary, it can be added to the load function in between samples.
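      
      The following fragment sketches the search pattern the fix moves to: a
      plain btrfs_search_slot() on the commit root with locking skipped.
      Setup of the key, the sampling loop and error handling are omitted, so
      this is only an outline, not the actual patch.
      
        struct btrfs_path *path = btrfs_alloc_path();
        
        if (!path)
                return -ENOMEM;
        
        /* The commit root is never modified here, so we can skip locking and
         * avoid the unconditional locking done by btrfs_search_forward(). */
        path->skip_locking = 1;
        path->search_commit_root = 1;
        path->reada = READA_FORWARD;
        
        ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);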
      Reported-by: Filipe Manana <fdmanana@kernel.org>
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/
      Fixes: c7eec3d9 ("btrfs: load block group size class when caching")
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      12148367
  7. 16 February 2023, 1 commit
  8. 14 February 2023, 4 commits
    • btrfs: don't use size classes for zoned file systems · cb0922f2
      Committed by Boris Burkov
      When a file system has ZNS devices, which are constrained by a maximum
      number of active block groups, not being able to use all the block
      groups for every allocation is not ideal, and could cause us to loop a
      lot with mixed size allocations.
      
      In general, since zoned doesn't write into gaps behind where block
      groups are writing, it is not susceptible to the same sort of
      fragmentation that size classes are designed to solve, so we can skip
      size classes for zoned file systems in general, even though there would
      probably be no harm for SMR devices.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cb0922f2
    • btrfs: load block group size class when caching · c7eec3d9
      Committed by Boris Burkov
      Since the size class is an artifact of an arbitrary anti fragmentation
      strategy, it doesn't really make sense to persist it. Furthermore, most
      of the size class logic assumes fresh block groups. That is of course
      not a reasonable assumption -- we will be upgrading kernels with
      existing filesystems whose block groups are not classified.
      
      To work around those issues, implement logic to compute the size class
      of the block groups as we cache them in. To perfectly assess the state
      of a block group, we would have to read the entire extent tree (since
      the free space cache mashes together contiguous extent items) which
      would be prohibitively expensive for larger file systems with more
      extents.
      
      We can do it relatively cheaply by implementing a simple heuristic of
      sampling a handful of extents and picking the smallest one we see. In
      the happy case where the block group was classified, we will only see
      extents of the correct size. In the unhappy case, we will hopefully find
      one of the smaller extents, but there is no perfect answer anyway.
      Autorelocation will eventually churn up the block group if there is
      significant freeing anyway.
      
      There was no regression in mount performance at end state of the fsperf
      test suite, and the delay until the block group is marked cached is
      minimized by the constant number of extent samples.
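      
      A toy model of the sampling heuristic described above: look at a handful
      of extent sizes from the block group and classify it by the smallest one
      seen. The sample count of five matches the related deadlock-fix commit;
      everything else here is made up for illustration.
      
        #include <stddef.h>
        #include <stdint.h>
        
        #define SIZE_CLASS_SAMPLES 5
        
        /* Return the smallest sampled extent size, or 0 if nothing was found,
         * mirroring the "pick the smallest one we see" heuristic. */
        static uint64_t smallest_sampled_extent(const uint64_t *samples,
                                                size_t nr)
        {
                uint64_t min = 0;
        
                for (size_t i = 0; i < nr && i < SIZE_CLASS_SAMPLES; i++) {
                        if (samples[i] == 0)
                                continue;       /* empty sample slot */
                        if (min == 0 || samples[i] < min)
                                min = samples[i];
                }
                return min;
        }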
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7eec3d9
    • btrfs: introduce size class to block group allocator · 52bb7a21
      Committed by Boris Burkov
      The aim of this patch is to reduce the fragmentation of block groups
      under certain unhappy workloads. It is particularly effective when the
      size of extents correlates with their lifetime, which is something we
      have observed causing fragmentation in the fleet at Meta.
      
      This patch categorizes extents into size classes:
      
      - x < 128KiB: "small"
      - 128KiB < x < 8MiB: "medium"
      - x > 8MiB: "large"
      
      and as much as possible reduces allocations of extents into block groups
      that don't match the size class. This takes advantage of any (possible)
      correlation between size and lifetime and also leaves behind predictable
      re-usable gaps when extents are freed; small writes don't gum up bigger
      holes.
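      
      A compact sketch of the classification above. The enum and helper names
      loosely follow the btrfs naming style but are written from the
      boundaries listed in this message, with the exact handling of the
      boundary values assumed.
      
        #include <stdint.h>
        
        enum bg_size_class { BG_SZ_NONE, BG_SZ_SMALL, BG_SZ_MEDIUM, BG_SZ_LARGE };
        
        static enum bg_size_class calc_size_class(uint64_t size)
        {
                if (size <= 128 * 1024)                 /* "small":  <= 128KiB */
                        return BG_SZ_SMALL;
                if (size <= 8 * 1024 * 1024)            /* "medium": <= 8MiB */
                        return BG_SZ_MEDIUM;
                return BG_SZ_LARGE;                     /* "large":  >  8MiB */
        }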
      
      Size classes are implemented in the following way:
      
      - Mark each new block group with a size class of the first allocation
        that goes into it.
      
      - Add two new passes to ffe: "unset size class" and "wrong size class".
        First, try only matching block groups, then try unset ones, then allow
        allocation of new ones, and finally allow mismatched block groups.
      
      - Filtering is done just by skipping inappropriate ones, there is no
        special size class indexing.
      
      Other solutions I considered were:
      
      - A best fit allocator with an rb-tree. This worked well, as small
        writes didn't leak big holes from large freed extents, but led to
        regressions in ffe and write performance due to lock contention on
        the rb-tree with every allocation possibly updating it in parallel.
        Perhaps something clever could be done to do the updates in the
        background while being "right enough".
      
      - A fixed size "working set". This prevents freeing an extent
        drastically changing where writes currently land, and seems like a
        good option too. Doesn't take advantage of size in any way.
      
      - The same size class idea, but implemented with xarray marks. This
        turned out to be slower than looping the linked list and skipping
        wrong block groups, and is also less flexible since we must have only
        3 size classes (max #marks). With the current approach we can have as
        many as we like.
      
      Performance testing was done via: https://github.com/josefbacik/fsperf
      Of particular relevance are the new fragmentation specific tests.
      
      A brief summary of the testing results:
      
      - Neutral results on existing tests. There are some minor regressions
        and improvements here and there, but nothing that truly stands out as
        notable.
      - Improvement on new tests where size class and extent lifetime are
        correlated. Fragmentation in these cases is completely eliminated
        and write performance is generally a little better. There is also
        significant improvement where extent sizes are just a bit larger than
        the size class boundaries.
      - Regression on one new test where the allocations are sized
        intentionally a hair under the borders of the size classes. Results
        are neutral on the test that intentionally attacks this new scheme by
        mixing extent size and lifetime.
      
      The full dump of the performance results can be found here:
      https://bur.io/fsperf/size-class-2022-11-15.txt
      (there are ANSI escape codes, so best to curl and view in terminal)
      
      Here is a snippet from the full results for a new test which mixes
      buffered writes appending to a long lived set of files and large short
      lived fallocates:
      
      bufferedappendvsfallocate results
               metric             baseline       current        stdev            diff
      ======================================================================================
      avg_commit_ms                    31.13         29.20          2.67     -6.22%
      bg_count                            14         15.60             0     11.43%
      commits                          11.10         12.20          0.32      9.91%
      elapsed                          27.30         26.40          2.98     -3.30%
      end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
      end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
      find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
      find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
      find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
      find_free_extent_ns_min            500           512          6.67      2.40%
      find_free_extent_ns_p50           2848          2876         37.65      0.98%
      find_free_extent_ns_p95           4916          5000         75.45      1.71%
      find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
      frag_pct_max                     61.67             0          8.05   -100.00%
      frag_pct_mean                    43.59             0          6.10   -100.00%
      frag_pct_min                     25.91             0         16.60   -100.00%
      frag_pct_p50                     42.53             0          7.25   -100.00%
      frag_pct_p95                     61.67             0          8.05   -100.00%
      frag_pct_p99                     61.67             0          8.05   -100.00%
      fragmented_bg_count               6.10             0          1.45   -100.00%
      max_commit_ms                    49.80            46          5.37     -7.63%
      sys_cpu                           2.59          2.62          0.29      1.39%
      write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
      write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
      write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
      write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
      write_io_kbytes                4194304       4194304             0      0.00%
      write_iops                     2476.15       2556.10        274.29      3.23%
      write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
      write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
      write_lat_ns_min              17353.10         16250       1646.08     -6.36%
      
      There are some mixed improvements/regressions in most metrics along with
      an elimination of fragmentation in this workload.
      
      On the balance, the drastic 1->0 improvement in the happy cases seems
      worth the mix of regressions and improvements we do observe.
      
      Some considerations for future work:
      
      - Experimenting with more size classes
      - More hinting/search ordering work to approximate a best-fit allocator
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      52bb7a21
    • btrfs: fix uninitialized variable warning in btrfs_update_block_group · efbf35a1
      Committed by Josef Bacik
      The reclaim variable isn't set in the alloc case, however we only care
      about reclaim in the !alloc case. This isn't an actual problem, but
      -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the
      compiler.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      efbf35a1
  9. 06 December 2022, 10 commits
    • btrfs: convert btrfs_block_group::needs_free_space to runtime flag · 0d7764ff
      Committed by David Sterba
      We already have flags in the block group to track various status bits;
      convert needs_free_space as well and reduce the size of
      btrfs_block_group.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0d7764ff
    • btrfs: simplify percent calculation helpers, rename div_factor · 428c8e03
      Committed by David Sterba
      The div_factor* helpers calculate a fraction or percentage. The name is
      a bit confusing: we use them only for percentage calculations and there
      are two helpers.
      
      There's a helper mult_frac that's for general fractions and tries to be
      accurate, but we multiply and divide by small numbers, so we can use the
      div_u64 helper.
      
      Rename the div_factor* helpers and use a 1..100 percentage range; also
      drop the check for percentage == 100, it's never hit.
      
      The conversions:
      
      * div_factor calculates tenths and the numbers need to be adjusted
      * div_factor_fine is a direct replacement (see the sketch below)
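      
      As a rough illustration, a percentage helper built this way could look
      like the snippet below. The name mult_perc and its placement are an
      assumption made for this sketch, not taken from the commit message.
      
        /* Multiply by a 1..100 percentage; the factors are small, so
         * div_u64() is accurate enough here (name assumed for illustration). */
        static inline u64 mult_perc(u64 num, u32 percent)
        {
                return div_u64(num * percent, 100);
        }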
      Signed-off-by: David Sterba <dsterba@suse.com>
      428c8e03
    • btrfs: update function comments · 43dd529a
      Committed by David Sterba
      Update, reformat or reword function comments. This also removes the kdoc
      marker so we don't get reports when the function name is missing.
      
      Changes made:
      
      - remove kdoc markers
      - reformat the brief description to be a proper sentence
      - reword to imperative voice
      - align parameter list
      - fix typos
      Signed-off-by: David Sterba <dsterba@suse.com>
      43dd529a
    • btrfs: move extent-tree helpers into their own header file · a0231804
      Committed by Josef Bacik
      Move all the extent tree related prototypes out of ctree.h into
      extent-tree.h, and then include it everywhere needed so everything
      compiles.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a0231804
    • btrfs: move accessor helpers into accessors.h · 07e81dc9
      Committed by Josef Bacik
      This is a large patch, but because they're all macros it's impossible to
      split up.  Simply copy all of the item accessors in ctree.h and paste
      them in accessors.h, and then update any files to include the header so
      everything compiles.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      07e81dc9
    • btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Committed by Josef Bacik
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7f13d42
    • btrfs: skip update of block group item if used bytes are the same · 7248e0ce
      Committed by Qu Wenruo
      [BACKGROUND]
      
      When committing a transaction, we will update block group items for all
      dirty block groups.
      
      But in fact, dirty block groups don't always need their block group
      items updated.
      It's pretty common to have a metadata block group which experienced
      several COW operations but still has the same amount of used bytes.
      
      In that case, we may unnecessarily COW a tree block while changing
      nothing.
      
      [ENHANCEMENT]
      
      This patch introduces the btrfs_block_group::commit_used member to
      remember the last committed used bytes, and uses that new member to skip
      unnecessary block group item updates, as sketched below.
      
      This is more common on large filesystems, where a metadata block group
      can be as large as 1GiB, containing at most 64K metadata items.
      
      In that case, if COW added and then deleted one metadata item near the
      end of the block group, it's completely possible we don't need to touch
      the block group item at all.
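      
      A minimal sketch of the skip described above, as it could look inside
      update_block_group_item(); the locking style follows the btrfs sources,
      but the surrounding function, error handling and the rest of the update
      are omitted, so treat it as an outline only.
      
        spin_lock(&cache->lock);
        used = cache->used;
        if (used == cache->commit_used) {
                /* Nothing changed since the last commit, skip the item
                 * update and avoid an unnecessary tree block COW. */
                spin_unlock(&cache->lock);
                return 0;
        }
        cache->commit_used = used;
        spin_unlock(&cache->lock);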
      
      [BENCHMARK]
      
      The change itself has quite a high chance (20~80%) to skip block group
      item updates in a lot of workloads.
      
      As a result, it leads to less time spent on
      btrfs_write_dirty_block_groups(), and overall reduces the execution time
      of the critical section of btrfs_commit_transaction().
      
      Here is a fio command which does random writes with a 4K block size,
      causing very heavy metadata updates.
      
      fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
          --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
          --name=random_write --fallocate=none --time_based --fsync_on_close=1
      
      The file size (512M) and number of threads (4) mean a 2GiB file size in
      total, but during the full 300s run time, my dedicated SATA SSD is able
      to write around 20~25GiB, which is over 10 times the file size.
      
      Thus after we fill the initial 2G, we should not cause many block group
      item updates.
      
      Please note that the fio numbers by themselves don't change much, but if
      we look deeper, there is some reduction in execution time, especially
      for the critical section of btrfs_commit_transaction().
      
      I added extra trace_printk() to measure the following per-transaction
      execution time:
      
      - Critical section of btrfs_commit_transaction()
        By re-using the existing update_commit_stats() function, which
        has already calculated the interval correctly.
      
      - The while() loop for btrfs_write_dirty_block_groups()
        Although this includes the execution time of btrfs_run_delayed_refs(),
        it should still be representative overall.
      
      Both results involve transid 7~30, i.e. the same number of committed
      transactions.
      
      The result looks like this:
      
                            |      Before       |     After      |  Diff
      ----------------------+-------------------+----------------+--------
      Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
      Block group interval  | 23133.33333       | 18970.83333    | -18.0%
      
      The change in block group item updates is more obvious, as skipped block
      group item updates also mean fewer delayed refs.
      
      And the overall execution time for that block group update loop is
      pretty small, thus we can assume the extent tree is already mostly
      cached.  If we could skip an uncached tree block, the change would be
      more obvious.
      
      Unfortunately the overall reduction in the commit transaction critical
      section is much smaller, as the block group item update loop is not
      really the major part, at least not for the above fio script.
      
      But we still have an observable reduction in the critical section.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7248e0ce
    • btrfs: re-check reclaim condition in reclaim worker · 81531225
      Committed by Boris Burkov
      I have observed the following case play out and lead to unnecessary
      relocations:
      
      1. write a file across multiple block groups
      2. delete the file
      3. several block groups fall below the reclaim threshold
      4. reclaim the first, moving extents into the others
      5. reclaim the others which are now actually very full, leading to poor
         reclaim behavior with lots of writing, allocating new block groups,
         etc.
      
      I believe the risk of missing some reasonable reclaims is worth it
      when traded off against the savings of avoiding overfull reclaims.
      
      Going forward, it could be interesting to make the check more advanced
      (zoned aware, fragmentation aware, etc...) so that it can be a really
      strong signal both at extent delete and reclaim time.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81531225
    • btrfs: skip reclaim if block_group is empty · cc4804bf
      Committed by Boris Burkov
      As we delete extents from a block group, at some point we cross below
      the reclaim threshold. It is possible we are still in the middle of
      deleting more extents and might soon hit 0. If the block group is empty
      by the time the reclaim worker runs, we will still relocate it.
      
      This works just fine, as relocating an empty block group ultimately
      results in properly deleting it. However, we have more direct ways of
      removing empty block groups in the cleaner thread. Those are either
      async discard or the unused_bgs list. In fact, when we decide whether to
      relocate a block group during extent deletion, we do check for emptiness
      and prefer the discard/unused_bgs mechanisms when possible.
      
      Not using relocation for this case reduces some modest overhead from
      empty bg relocation:
      
      - extra transactions
      - extra metadata use/churn for creating relocation metadata
      - trying to read the extent tree to look for extents (and in this case
        finding none)
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc4804bf
    • btrfs: move btrfs_should_fragment_free_space into block-group.c · 06d61cb1
      Committed by Josef Bacik
      This function uses functions that are not defined in block-group.h, move
      it into block-group.c in order to keep the header clean.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      06d61cb1
  10. 11 October 2022, 1 commit
    • btrfs: delete stale comments after merge conflict resolution · 295a53cc
      Committed by David Sterba
      There are two comments in btrfs_cache_block_group that I left when
      resolving a conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group, while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished; the correct code is the
      one that waits and returns the error. Thus the conflict resolution was
      OK.
      Signed-off-by: David Sterba <dsterba@suse.com>
      295a53cc
  11. 29 September 2022, 1 commit
  12. 26 September 2022, 11 commits
  13. 24 August 2022, 1 commit
    • btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Committed by Omar Sandoval
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this change. Namely, this fix does the
      following, with step 2 being the crucial fix:
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning (see the
         sketch after this list).
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
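      
      A tiny sketch of the new calling convention from step 4: callers that
      previously also had a separate call to btrfs_wait_block_group_cache_done()
      can now just pass wait = true (simplified, error handling shortened).
      
        ret = btrfs_cache_block_group(block_group, true);   /* wait = true */
        if (ret)
                goto out;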
      
      Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ced8ecf0