1. 23 Feb 2021, 2 commits
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Filipe Manana authored
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents from being changed by operations such as balance and device
      replace/resize/remove. There we also call can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO mode before it processes
      it and then back to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups with extents used by active
      swap files can not be turned into RO mode, therefore making it
      impossible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned into RO due to the existence of
      extents used by swap files, it proceeds to the next block group and logs
      a warning message mentioning the block group was skipped due to active
      swap files - this is the same approach we currently use for balance.
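
      A minimal sketch of the idea as a standalone model - the type, field
      and helper names below are illustrative, not the exact kernel symbols:

        #include <stdbool.h>

        struct block_group {
                bool ro;               /* read-only flag */
                int swap_extents;      /* extents pinned by active swap files */
        };

        /* Scrub may only flip a block group to RO when no active swap
         * file uses extents from it; otherwise it skips the block
         * group and logs a warning, the same way balance does. */
        static bool try_set_block_group_ro(struct block_group *bg)
        {
                if (bg->swap_extents > 0)
                        return false;
                bg->ro = true;
                return true;
        }

      Swap file activation would then increment swap_extents for every block
      group backing a swap file extent, and deactivation would decrement it.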
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      195a49ea
    • btrfs: avoid checking for RO block group twice during nocow writeback · 20903032
      Filipe Manana authored
      During the nocow writeback path, we currently iterate the rbtree of block
      groups twice: once for checking if the target block group is RO with the
      call to btrfs_extent_readonly(), and once again for getting a nocow
      reference on the block group with a call to btrfs_inc_nocow_writers().
      
      Since btrfs_inc_nocow_writers() already returns false when the target
      block group is RO, remove the call to btrfs_extent_readonly(). Not only
      do we avoid searching the block group rbtree twice, it also helps reduce
      contention on the lock that protects it (especially since it is a spin
      lock and not a read-write lock). That may make a noticeable difference
      on very large filesystems, with thousands of allocated block groups.
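
      In sketch form, assuming simplified surroundings (not the exact kernel
      code), the nocow path now relies on a single block group lookup:

        /* btrfs_inc_nocow_writers() both finds the block group and
         * fails if it is RO, so the separate btrfs_extent_readonly()
         * lookup (a second rbtree search) can be dropped. */
        if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
                goto fallback_to_cow;   /* RO or missing block group */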
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      20903032
  2. 09 Feb 2021, 20 commits
  3. 08 Jan 2021, 1 commit
    • btrfs: shrink delalloc pages instead of full inodes · e076ab2a
      Josef Bacik authored
      Commit 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
      shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
      some infrastructure we have in place to flush inodes that we use for
      device replace and snapshot.  However this introduced a pretty serious
      performance regression.  To reproduce it, the user untarred the source
      tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
      see it take anywhere from 5 to 20 times as long to untar in 5.10
      compared to 5.9. This was observed on fast devices (SSD and better) and
      not on HDD.
      
      The root cause is that before, we would generally use the normal
      writeback path to reclaim delalloc space, and for this we would provide
      it with the number of pages we wanted to flush.  The referenced commit
      changed this to flush that many inodes instead, which drastically
      increased the amount of space we were flushing in certain cases and
      severely affected performance.
      
      We cannot revert this patch unfortunately because of 3d45f221
      ("btrfs: fix deadlock when cloning inline extent and low on free
      metadata space") which requires the ability to skip flushing inodes that
      are being cloned in certain scenarios, which means we need to keep using
      our flushing infrastructure or risk re-introducing the deadlock.
      
      Instead, to fix this problem, we can go back to providing
      btrfs_start_delalloc_roots with a number of pages to flush, and then set
      up a writeback_control and utilize sync_inode() to handle the flushing
      for us (see the sketch after the test results below).  This gives us the
      same behavior we had prior to the fix, while still allowing us to avoid
      the deadlock that was fixed by Filipe.  I redid the user's original test
      and got the following results on one of our test machines (256GiB of
      RAM, 56 cores, 2TiB Intel NVMe drive):
      
        5.9		0m54.258s
        5.10		1m26.212s
        5.10+patch	0m38.800s
      
      5.10+patch is significantly faster than plain 5.9 because of my patch
      series "Change data reservations to use the ticketing infra" which
      contained the patch that introduced the regression, but generally
      improved the overall ENOSPC flushing mechanisms.
      
      Additional testing on a consumer-grade SSD (8GiB RAM, 8 CPUs) confirms
      the results:

        5.10.5            4m00s
        5.10.5+patch      1m08s
        5.11-rc2          5m14s
        5.11-rc2+patch    1m30s
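
      A minimal sketch of the flushing approach described above, assuming
      simplified surroundings (the real patch plumbs the page count through
      btrfs_start_delalloc_roots()):

        struct writeback_control wbc = {
                .nr_to_write    = nr_pages,     /* cap what we flush */
                .sync_mode      = WB_SYNC_NONE,
                .range_start    = 0,
                .range_end      = LLONG_MAX,
        };

        /* Flush at most nr_to_write dirty pages of this inode. */
        sync_inode(inode, &wbc);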
      Reported-by: René Rebe <rene@exactcode.de>
      Fixes: 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
      CC: stable@vger.kernel.org # 5.10
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add my test results ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      e076ab2a
  4. 18 Dec 2020, 1 commit
    • btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Filipe Manana authored
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
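
      In sketch form (simplified; the actual patch differs in details such as
      how the skip is plumbed into the reclaim path):

        /* Clone path, around dirtying the destination page: */
        set_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);
        /* ... copy the inline data into the page and dirty it ... */
        clear_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags);

        /* start_delalloc_inodes(), while walking delalloc inodes: */
        if (test_bit(BTRFS_INODE_NO_DELALLOC_FLUSH, &inode->runtime_flags))
                continue;       /* clone task holds its pages locked */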
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d45f221
  5. 10 Dec 2020, 7 commits
    • btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs · 6275193e
      Qu Wenruo authored
      Refactor btrfs_lookup_bio_sums() by:
      
      - Remove the @file_offset parameter
        There are two factors making the @file_offset parameter useless:
      
        * For csum lookup in the csum tree, file offset makes no sense.
          We only need disk_bytenr, which is unrelated to file_offset.
      
        * page_offset (file offset) of each bvec is not contiguous.
          Pages can be added to the same bio as long as their on-disk bytenr
          is contiguous, meaning we could have pages at different file offsets
          in the same bio.
      
        Thus passing file_offset makes no sense any more.
        The only user of file_offset is the data reloc inode; we will use
        a new function, search_file_offset_in_bio(), to handle it.
      
      - Extract the csum tree lookup into search_csum_tree()
        The new function will handle the csum search in the csum tree.
        The return value is the same as btrfs_find_ordered_sum(), returning
        the number of found sectors which have a checksum.
      
      - Change how we do the main loop
        The only info needed from the bio is:
        * the on-disk bytenr
        * the length

        After extracting the above info, we can do the search without the bio
        at all, which makes the main loop much simpler:
      
      	for (cur_disk_bytenr = orig_disk_bytenr;
      	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
      	     cur_disk_bytenr += count * sectorsize) {
      
      		/* Lookup csum tree */
      		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
      					 search_len, csum_dst);
      		if (!count) {
      			/* Csum hole handling */
      		}
      	}
      
      - Use a single variable as the source to calculate all other offsets
        Instead of many variables of different types, we use only one main
        variable, cur_disk_bytenr, which represents the current disk bytenr.

        All involved values can be calculated from that variable, and
        all those variables will only be visible in the inner loop.
      
      The above refactoring makes btrfs_lookup_bio_sums() way more robust than
      it used to be, especially related to the file offset lookup.  Now the
      file_offset lookup is only related to the data reloc inode; otherwise we
      don't need to bother with file_offset at all.
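
      For illustration, the values derived inside the loop could look like
      this (variable names are illustrative, not the exact kernel code):

        /* Everything derives from the single cursor cur_disk_bytenr. */
        u64 done = cur_disk_bytenr - orig_disk_bytenr;
        u32 sector_nr = done >> fs_info->sectorsize_bits; /* sectors handled */
        u8 *csum_dst = all_csums + sector_nr * csum_size; /* output slot */
        u64 search_len = orig_disk_bytenr + orig_len - cur_disk_bytenr;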
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6275193e
    • btrfs: make btrfs_verify_data_csum follow sector size · f44cf410
      Qu Wenruo authored
      Currently btrfs_verify_data_csum() just passes the whole page to
      check_data_csum(), which is fine since we only support sectorsize ==
      PAGE_SIZE.
      
      To support subpage, we need to properly honor per-sector
      checksum verification, just like what we did in the dio read path.

      This patch does the csum verification in a for loop, starting with
      pg_off == start - page_offset(page) and increasing pg_off by sectorsize
      on each iteration.

      For the sectorsize == PAGE_SIZE case, pg_off will always be 0, and we
      will loop only once.

      For the subpage case, we iterate over each sector, and if we find any
      error, we return an error.
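
      A sketch of the loop, assuming a simplified check_data_csum()
      signature:

        /* Verify one sector at a time; pg_off walks the page. */
        for (pg_off = start - page_offset(page);
             pg_off < end_off;
             pg_off += sectorsize, bio_offset += sectorsize) {
                int ret = check_data_csum(inode, io_bio, bio_offset,
                                          page, pg_off);
                if (ret < 0)
                        return -EIO;    /* fail on the first bad sector */
        }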
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f44cf410
    • btrfs: pass bio_offset to check_data_csum() directly · 7ffd27e3
      Qu Wenruo authored
      Parameter icsum for check_data_csum() is a little hard to understand.
      So is the phy_offset for btrfs_verify_data_csum().
      
      Both parameters are calculated values for csum lookup.
      
      Instead of some calculated value, just pass bio_offset and let the
      final and only user, check_data_csum(), calculate whatever it needs.
      
      While we are at it, also make the bio_offset parameter and some related
      variables u32 (unsigned int).
      Since bio size is limited by its bi_size member, which is an unsigned
      int, and there are extra size limit checks during various bio
      operations, we are ensured that bio_offset won't overflow u32.

      Thus for all involved functions, not only rename the parameter from
      @phy_offset to @bio_offset, but also reduce its width to u32, so we
      won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
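
      Inside check_data_csum(), the lookup index can then derive directly
      from bio_offset (a sketch; names illustrative):

        /* bio_offset is bounded by bi_size (unsigned int), so u32 fits. */
        u32 offset_sectors = bio_offset >> fs_info->sectorsize_bits;
        u8 *csum_expected = io_bio->csum + offset_sectors * csum_size;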
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ffd27e3
    • btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset · 1941b64b
      Qu Wenruo authored
      The parameter bio_offset of extent_submit_bio_start_t is very confusing.
      If it were really a bio_offset (an offset into the bio), it should be
      u32.  But in fact it's only utilized by dio read, and that member is
      used as a file offset, which must be u64.

      Rename it to dio_file_offset, since the only user uses it as a file
      offset, and add a comment about who is using it.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1941b64b
    • btrfs: remove inode number cache feature · 5297199a
      Nikolay Borisov authored
      It's been deprecated since commit b547a88e ("btrfs: start
      deprecation of mount option inode_cache"), which enumerates the reasons.

      A filesystem that uses the feature (mount -o inode_cache) tracks the
      inode numbers in bitmaps; that data stays on the filesystem after this
      patch. The size is roughly 5MiB for 1M inodes [1], which is considered
      small enough to be left there. Removal of that data can be implemented
      in btrfs-progs if needed.
      
      [1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/

      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      5297199a
    • btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid · abadc1fc
      Nikolay Borisov authored
      The former is going away as part of the inode map removal, so switch
      callers to btrfs_find_free_objectid. No functional changes, since with
      INODE_MAP disabled (the default) find_free_objectid was called anyway.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      abadc1fc
    • btrfs: drop casts of bio bi_sector · 1201b58b
      David Sterba authored
      Since commit 72deb455 ("block: remove CONFIG_LBDAF") (5.2) the
      sector_t type is u64 on all arches and configs, so we don't need to
      typecast it.  It used to be unsigned long, and the result of sector
      size shifts was not guaranteed to fit in the type.
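
      The change is of this shape (illustrative):

        -       u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT;
        +       u64 start = bio->bi_iter.bi_sector << SECTOR_SHIFT;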
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1201b58b
  6. 08 Dec 2020, 9 commits
    • btrfs: remove err variable from btrfs_delete_subvolume · ee0d904f
      Nikolay Borisov authored
      Use only a single 'ret' to control whether we should abort the
      transaction or not. That's fine, because if we abort a transaction then
      btrfs_end_transaction will return the same value as passed to
      btrfs_abort_transaction. No semantic changes.
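
      The pattern, in sketch form (the callee shown is illustrative):

        ret = btrfs_unlink_subvol(trans, dir, dentry);
        if (ret) {
                btrfs_abort_transaction(trans, ret);
                goto out_end_trans;
        }
        /* more steps follow, each assigning ret on failure */
      out_end_trans:
        /* after an abort, btrfs_end_transaction() returns the same
         * error, so the single ret covers both paths */
        ret = btrfs_end_transaction(trans);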
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ee0d904f
    • btrfs: unlock path before checking if extent is shared during nocow writeback · c65ca98f
      Filipe Manana authored
      When we are attempting to start writeback for an existing extent in
      NOCOW mode, at run_delalloc_nocow(), we must check if the extent is
      shared, and if it is, fall back to a COW write. However we do such a
      check while still holding a read lock on the leaf that contains the file
      extent item, and that check, the call to btrfs_cross_ref_exist(), can
      take some time because:

      1) It needs to do a search on the extent tree, which obviously takes some
         time, especially if delayed references are being run at the moment, as
         we can block when trying to lock currently write-locked btree nodes;

      2) It needs to check the delayed references for any existing reference
         to our data extent; this requires acquiring the delayed references'
         spinlock and maybe blocking on the mutex of a delayed reference head
         in case there is a delayed reference for our data extent; in the
         worst case it makes us release the path on the extent tree and retry
         the whole process again (going back to step 1).
      
      There are other operations we do while holding the leaf locked that can
      take some significant time as well (especially all together):

      * btrfs_extent_readonly() - to check if the block group containing the
        extent is currently in RO mode. This requires taking a spinlock and
        searching for the block group in an rbtree that can be big on large
        filesystems;

      * csum_exist_in_range() - to search if there are any checksums in the
        csum tree for the extent. Like before, this can take some time if we
        are in a filesystem that has both COW and NOCOW files, in which case
        the csum tree is not empty;

      * btrfs_inc_nocow_writers() - to increment the number of nocow writers
        in the block group that contains the data extent. It needs to acquire
        a spinlock and search for the block group in an rbtree that can be big
        on large filesystems.
      
      So just unlock the leaf (release the path) before doing all those checks,
      since we do not need it anymore. In case we can not do a NOCOW write for
      the extent, due to any of those checks failing, and the writeback range
      goes beyond that extent's length, we will do another btree search for the
      next file extent item.
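
      In sketch form (arguments simplified, not the exact kernel code), the
      checks move past the path release:

        /* Drop the leaf lock first; none of these checks needs it. */
        btrfs_release_path(path);
        if (btrfs_cross_ref_exist(root, ino, file_offset, disk_bytenr))
                goto must_cow;          /* extent is shared */
        if (csum_exist_in_range(fs_info, disk_bytenr, num_bytes))
                goto must_cow;          /* extent has checksums */
        if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr))
                goto must_cow;          /* block group is RO */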
      
      The following script that calls dbench was used to measure the impact of
      this change on a VM with 8 CPUs, 16GiB of RAM, using a raw NVMe device
      directly (no intermediary filesystem on the host) and using a non-debug
      kernel (default configuration on Debian):
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-m single -d single"
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      Before this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9326331     0.317   399.957
       Close        6851198     0.002     6.402
       Rename        394894     2.621   402.819
       Unlink       1883131     0.931   398.082
       Deltree          256    19.160   303.580
       Mkdir            128     0.003     0.016
       Qpathinfo    8452314     0.068   116.133
       Qfileinfo    1481921     0.001     5.081
       Qfsinfo      1549963     0.002     4.444
       Sfileinfo     759679     0.084    17.079
       Find         3268168     0.396   118.196
       WriteX       4653310     0.056   110.993
       ReadX        14618818     0.005    23.314
       LockX          30364     0.003     0.497
       UnlockX        30364     0.002     1.720
       Flush         653619    16.954   569.299
      
      Throughput 966.651 MB/sec  64 clients  64 procs  max_latency=569.377 ms
      
      After this change:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    9710433     0.302   232.449
       Close        7132948     0.002    11.496
       Rename        411144     2.452   131.805
       Unlink       1960961     0.893   230.383
       Deltree          256    14.858   198.646
       Mkdir            128     0.002     0.005
       Qpathinfo    8800890     0.066   111.588
       Qfileinfo    1542556     0.001     3.852
       Qfsinfo      1613835     0.002     5.483
       Sfileinfo     790871     0.081    19.492
       Find         3402743     0.386   120.185
       WriteX       4842918     0.054   179.312
       ReadX        15220407     0.005    32.435
       LockX          31612     0.003     1.533
       UnlockX        31612     0.002     1.047
       Flush         680567    16.320   463.323
      
      Throughput 1016.59 MB/sec  64 clients  64 procs  max_latency=463.327 ms
      
      +5.0% throughput, -20.5% max latency
      
      Also, the following test using fio was run:
      
        $ cat test-fio.sh
        #!/bin/bash
      
        DEV=/dev/sdk
        MNT=/mnt/sdk
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-d single -m single"
      
        if [ $# -ne 4 ]; then
            echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ BLOCK_SIZE"
            exit 1
        fi
      
        NUM_JOBS=$1
        FILE_SIZE=$2
        FSYNC_FREQ=$3
        BLOCK_SIZE=$4
      
        cat <<EOF > /tmp/fio-job.ini
        [writers]
        rw=randwrite
        fsync=$FSYNC_FREQ
        fallocate=none
        group_reporting=1
        direct=0
        bs=$BLOCK_SIZE
        ioengine=sync
        size=$FILE_SIZE
        directory=$MNT
        numjobs=$NUM_JOBS
        EOF
      
        echo
        echo "Using fio config:"
        echo
        cat /tmp/fio-job.ini
        echo
        echo "mount options: $MOUNT_OPTIONS"
        echo
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo "Creating nodatacow files before fio runs..."
        for ((i = 0; i < $NUM_JOBS; i++)); do
            xfs_io -f -c "pwrite -b 128M 0 $FILE_SIZE" "$MNT/writers.$i.0"
        done
        sync
      
        fio /tmp/fio-job.ini
        umount $MNT
      
      Before this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=28.3MiB/s (29.6MB/s), 28.3MiB/s-28.3MiB/s (29.6MB/s-29.6MB/s), io=8192MiB (8590MB), run=289800-289800msec
      
      After this change:
      
      $ ./test-fio.sh 16 512M 2 4K
      (...)
      WRITE: bw=31.2MiB/s (32.7MB/s), 31.2MiB/s-31.2MiB/s (32.7MB/s-32.7MB/s), io=8192MiB (8590MB), run=262845-262845msec
      
      +9.7% throughput, -9.8% runtime
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c65ca98f
    • btrfs: remove unnecessary attempt to drop extent maps after adding inline extent · f30bed83
      Filipe Manana authored
      At inode.c:cow_file_range_inline(), after we insert the inline extent
      into the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
      all extent maps in the file range. However, that is not necessary,
      because we have already done it in the call to btrfs_drop_extents(),
      which calls btrfs_drop_extent_cache() for us. And since at this point
      we have the file range locked in the inode's iotree (we are in the
      writeback path), we know no other task can come in and read stale file
      extent items, or find none, and therefore create either stale extent
      maps or an extent map that represents a hole.
      
      So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
      doing nothing and only wasting time. This call has been around since 2008,
      introduced in commit c8b97818 ("Btrfs: Add zlib compression support"),
      but even back then it seems it was not necessary, since we had the range
      locked in the inode's iotree and the call to btrfs_drop_extents() already
      used to always call btrfs_drop_extent_cache().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f30bed83
    • btrfs: merge __set_extent_bit and set_extent_bit · 1cab5e72
      Nikolay Borisov authored
      There are only 2 direct calls to set_extent_bit outside of extent-io -
      in btrfs_find_new_delalloc_bytes and btrfs_truncate_block; the rest are
      thin wrappers around __set_extent_bit. This adds unnecessary indirection
      and just makes it more annoying when looking at the various extent bit
      manipulation functions.  This patch renames __set_extent_bit to
      set_extent_bit, effectively removing a level of indirection. No
      functional changes.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat and remove __must_check ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cab5e72