1. 23 Aug 2021, 5 commits
    • btrfs: use delalloc_bytes to determine flush amount for shrink_delalloc · 03fe78cc
      Committed by Josef Bacik
      We have been hitting some early ENOSPC issues in production with more
      recent kernels, and I tracked it down to us simply not flushing delalloc
      as aggressively as we should be.  With tracing I was seeing us failing
      all tickets with all of the block rsvs at or around 0, with very little
      pinned space, but still around 120MiB of outstanding bytes_may_use.
      Upon further investigation I saw that we were flushing around 14 pages
      per shrink call for delalloc, despite having around 2GiB of delalloc
      outstanding.
      
      Consider the example of an 8-way machine, all CPUs trying to create a
      file in parallel, which at the time of this commit requires 5 items to
      do.  Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
      size waiting on reservations.  Now assume we have 128MiB of delalloc
      outstanding.  With our current math we would set items to 20, and then
      set to_reclaim to 20 * 256k, or 5MiB.
      
      Assuming that we went through this loop all 3 times, for both
      FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
      twice, we'd only flush 60MiB of the 128MiB delalloc space.  This could
      leave a fair bit of delalloc reservations still hanging around by the
      time we go to ENOSPC out all the remaining tickets.
      
      Fix this two ways.  First, change the calculations to be a fraction of
      the total delalloc bytes on the system.  Prior to this change we were
      calculating based on dirty inodes, so our math made more sense; now it's
      just completely unrelated to what we're actually doing.
      
      Second add a FLUSH_DELALLOC_FULL state, that we hold off until we've
      gone through the flush states at least once.  This will empty the system
      of all delalloc so we're sure to be truly out of space when we start
      failing tickets.
      
      I'm tagging stable 5.10 and forward, because this is where we started
      using the page stuff heavily again.  This affects earlier kernel
      versions as well, but would be a pain to backport to them as the
      flushing mechanisms aren't the same.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_next_leaf static inline · 809d6902
      Committed by David Sterba
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
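      A sketch of the moved wrapper, with the kernel types stubbed out so only
      the shape is shown (illustration, not the verbatim header change):

        struct btrfs_root;
        struct btrfs_path;

        int btrfs_next_old_leaf(struct btrfs_root *root, struct btrfs_path *path,
                                unsigned long long time_seq);

        /* One-line forwarder, now static inline in the header so callers
         * avoid a function call; time_seq == 0 means "current tree". */
        static inline int btrfs_next_leaf(struct btrfs_root *root,
                                          struct btrfs_path *path)
        {
                return btrfs_next_old_leaf(root, path, 0);
        }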
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      Committed by David Sterba
      The uptodate parameter should be bool, change the type.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Committed by Qu Wenruo
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removed the writepage_start_hook()
      and added the btrfs_writepage_cow_fixup() function, there is no need to
      follow the old hook parameters.
      
      Remove the @start and @end parameters, since the fixup check is
      currently a full page check and does not need them.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: remove max_zone_append_size logic · 5a80d1c6
      Committed by Johannes Thumshirn
      There used to be a patch in the original series for zoned support which
      limited the extent size to max_zone_append_size, but this patch has been
      dropped somewhere around v9.
      
      We've decided to go the opposite direction: instead of limiting extents
      in the first place, we split them before submission to comply with the
      device's limits.
      
      Remove the related code, btrfs_fs_info::max_zone_append_size and
      btrfs_zoned_device_info::max_zone_append_size.
      
      This also removes the workaround for dm-crypt introduced in
      1d68128c ("btrfs: zoned: fail mount if the device does not support
      zone append") because the fix has been merged as f34ee1dc ("dm
      crypt: Fix zoned block device support").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 23 Jun 2021, 1 commit
  3. 22 Jun 2021, 5 commits
    • btrfs: rip out may_commit_transaction · c416a30c
      Committed by Josef Bacik
      may_commit_transaction was introduced before the ticketing
      infrastructure existed.  There was a problem where we'd legitimately be
      out of space, but every reservation would trigger a transaction commit
      and then fail.  Thus if you had 1000 things trying to make a
      reservation, they'd all do the flushing loop and thus commit the
      transaction 1000 times before they'd get their ENOSPC.
      
      This helper was introduced to short circuit this: if there wasn't space
      that could be reclaimed by committing the transaction then simply ENOSPC
      out.  This made true ENOSPC tests much faster as we didn't waste a bunch
      of time.
      
      However many of our bugs over the years have been from cases where we
      didn't account for some space that would be reclaimed by committing a
      transaction.  The delayed refs rsv space, delayed rsv, many pinned bytes
      miscalculations, etc.  And in the meantime the original problem has been
      solved with ticketing.  We no longer will commit the transaction 1000
      times.  Instead we'll get 1000 waiters, we will go through the flushing
      mechanisms, and if there's no progress after 2 loops we ENOSPC everybody
      out.  The ticketing infrastructure gives us a deterministic way to see
      if we're making progress or not, thus we avoid a lot of extra work.
      
      So simplify this step by simply unconditionally committing the
      transaction.  This removes what is arguably our most common source of
      early ENOSPC bugs and will allow us to drastically simplify many of the
      things we track because we simply won't need them with this stuff gone.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Committed by Filipe Manana
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also now have one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also add a spinlock used exclusively for the exclusion between send and
      relocation; before, fs_info->balance_mutex was used, which would make an
      attempt to run send block waiting for balance to finish, which can take a
      lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: shorten integrity checker extent data mount option · cbeaae4f
      Committed by David Sterba
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch mount option bits to enums and use wider type · ccd9395b
      Committed by David Sterba
      Switch defines of BTRFS_MOUNT_* to an enum (the symbolic names are
      recorded in the debugging information for convenience).
      
      There are two more things done but separating them would not make much
      sense as it's touching the same lines:
      
      - Renumber shifts 18..31 to 17..30 to get rid of the hole in the
        sequence.
      
      - Use 1UL as the value that gets shifted because we're approaching the
        32bit limit and due to integer promotions the value of (1 << 31)
        becomes 0xffffffff80000000 when cast to unsigned long (eg. the option
        manipulating helpers).
      
        This is not causing any problems yet as the operations are in-memory
        and masking the 31st bit works; we don't have more than 31 bits, so
        the ill effects of not masking the higher bits don't happen. But once
        we have more, the problems will emerge.
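      A standalone demonstration of the promotion issue (illustration only;
      note that shifting 1 into the sign bit is formally undefined for signed
      int, but compilers in practice yield INT_MIN, which then sign-extends):

        #include <stdio.h>

        int main(void)
        {
                unsigned long from_int = (unsigned long)(1 << 31);
                unsigned long from_ulong = 1UL << 31;

                printf("(1 << 31)   -> 0x%lx\n", from_int);   /* 0xffffffff80000000 */
                printf("(1UL << 31) -> 0x%lx\n", from_ulong); /* 0x80000000 */
                return 0;
        }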
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 1a9fd417
      Committed by David Sterba
      Fix typos that have snuck in since the last round. Found by codespell.
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 21 Jun 2021, 10 commits
    • btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Committed by Qu Wenruo
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helpers need btrfs_fs_info, also change the parameter
      to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Committed by Qu Wenruo
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch will rename it to Ordered.
      An extra comment about the bit is also added, so a reader who is still
      uncertain about the page Ordered status will find it easily.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Committed by Qu Wenruo
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to cooperate with such modification, replace @page parameter for
      trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes page_index info, the existing start/len should be
      enough for most usage.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      Committed by Qu Wenruo
      There is a lot of code inside extent_io.c that needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with some
      parameters like "unsigned long bio_flags".
      
      Such strange parameters are here for bio assembly.
      
      For example, consider the following inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio to bio_ret,
        Add page [0, 4K) to *bio_ret.
      
      - Page [4K, 8K)
        *bio_ret != NULL
        We found this page is contiguous with *bio_ret,
        and if we're not at a stripe boundary, we
        add page [4K, 8K) to *bio_ret.
      
      - Page [8K, 12K)
        *bio_ret != NULL
        But we found this page is not contiguous, so
        we submit *bio_ret, then allocate a new bio,
        and add page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flags, but we
      record them manually using that strange parameter list, rather than
      encapsulating them into their own structure.
      
      So this patch will introduce a new structure, btrfs_bio_ctrl, to record
      both the bio, and its bio_flags.
      
      Also, in the above case, for all pages added to the bio we need to check
      if the new page crosses a stripe boundary.  This check itself can be time
      consuming, and we don't really need to do that for each page.
      
      This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
      When a new bio is allocated, the stripe and ordered extent boundary is
      also calculated, so no matter how large the bio will be, we only
      calculate the boundaries once, to save some CPU time.
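      Based on the description above, the new control structure bundles the
      bio, its flags and the two pre-computed limits; a rough sketch (the
      boundary field names here are illustrative, not necessarily the exact
      upstream ones):

        struct bio;   /* opaque here; the real type comes from the block layer */

        struct btrfs_bio_ctrl {
                struct bio *bio;                    /* bio under construction   */
                unsigned long bio_flags;            /* flags of that bio        */
                unsigned long long len_to_stripe_boundary; /* set once per bio  */
                unsigned long long len_to_oe_boundary;     /* ordered extent cap */
        };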
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how the
        bio is grabbed.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as now the ordered extent size limit is done at bio
        allocation time, no need to check for each page range.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce try-lock semantics for exclusive op start · 578bda9e
      Committed by David Sterba
      Add try-lock for exclusive operation start to allow callers to do more
      checks. The same operation must already be running. The try-lock and
      unlock must pair and are a substitute for btrfs_exclop_start, thus it
      must also pair with btrfs_exclop_finish to release the exclop context.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add cancellable chunk relocation support · 907d2710
      Committed by David Sterba
      Add support code that will allow canceling relocation on the chunk
      granularity. This is different from and independent of balance, which
      also uses relocation but is a higher level operation and manages its own
      state and pause/cancellation requests.
      
      Relocation is used for resize (shrink) and device deletion so this will
      be a common point to implement cancellation for both. The context is
      entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
      enclosing one chunk relocation. The status bit is set and unset between
      the chunks. As relocation can take long, the effects may not be
      immediate and the request and actual action can slightly race.
      
      The fs_info::reloc_cancel_req is only supposed to be increased and does
      not pair with decrement like fs_info::balance_cancel_req.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: protect exclusive_operation by super_lock · 0d7ed32c
      Committed by David Sterba
      The exclusive operation is now atomically checked and set using bit
      operations. Switch it to protection by spinlock. The super block lock is
      not frequently used and adding a new lock seems like overkill, so it
      should be safe to reuse it.
      
      The reason to use spinlock is to enhance the locking context so more
      checks can be done, eg. allowing the same exclusive operation enter
      the exclop section and cancel the running one. This will be used for
      resize and device delete.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't set the full sync flag when truncation does not touch extents · 0d7d3165
      Committed by Filipe Manana
      At btrfs_truncate() where we truncate the inode either to the same size
      or to a smaller size, we always set the full sync flag on the inode.
      
      This is needed in case the truncation drops or trims any file extent items
      that start beyond or cross the new inode size, so that the next fsync
      drops all inode items from the log and scans again the fs/subvolume tree
      to find all items that must be logged.
      
      However if the truncation does not drop or trim any file extent items, we
      do not need to set the full sync flag and force the next fsync to use the
      slow code path. So do not set the full sync flag in such cases.
      
      One use case where it is frequent to do truncations that do not change
      the inode size and do not drop any extents (no prealloc extents beyond
      i_size) is when running Microsoft's SQL Server inside a Docker container.
      One example workload is the one Philipp Fent reported recently, in the
      thread with a link below. In this workload a large number of fsyncs are
      preceded by such truncate operations.
      
      After this change I constantly get the runtime for that workload from
      Philipp to be reduced by about 12%, for example from 184 seconds down
      to 162 seconds.
      
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_verify_data_csum() to return a bitmap · 08508fea
      Committed by Qu Wenruo
      This will provide the basis for later per-sector repair for subpage,
      while still keeping the existing code happy.
      
      If all csums match, the return value will be 0, the same as now.
      Only when a csum mismatches is the return value different.

      The new return value will be a bitmap; for 4K sectorsize and 4K page
      size it will be 1, instead of the -EIO (which is not used
      directly by the callers, so there is no effective change).
      
      But for 4K sectorsize and 64K page size, aka subpage case, since the
      bvec can contain multiple sectors, knowing which sectors are corrupted
      will allow us to submit repair only for corrupted sectors.
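      A standalone illustration of the idea (not the kernel function itself):
      with 4K sectors in a 64K page, each set bit of the return value marks one
      corrupted sector, and 0 still means "all csums matched":

        #include <stdio.h>

        int main(void)
        {
                unsigned int sectors_per_page = (64 * 1024) / (4 * 1024); /* 16 */
                unsigned int error_bitmap = 0;
                unsigned int i;

                /* Pretend sectors 3 and 7 failed csum verification. */
                error_bitmap |= 1u << 3;
                error_bitmap |= 1u << 7;

                for (i = 0; i < sectors_per_page; i++)
                        if (error_bitmap & (1u << i))
                                printf("submit repair for sector %u\n", i);

                return error_bitmap ? 1 : 0;
        }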
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 29 Apr 2021, 1 commit
    • btrfs: fix deadlock when cloning inline extents and using qgroups · f9baa501
      Committed by Filipe Manana
      There are a few exceptional cases where cloning an inline extent needs to
      copy the inline extent data into a page of the destination inode.
      
      When this happens, we end up starting a transaction while having a dirty
      page for the destination inode and while having the range locked in the
      destination's inode iotree too. Because when reserving metadata space
      for a transaction we may need to flush existing delalloc in case there is
      not enough free space, we have a mechanism in place to prevent a deadlock,
      which was introduced in commit 3d45f221 ("btrfs: fix deadlock when
      cloning inline extent and low on free metadata space").
      
      However when using qgroups, a transaction also reserves metadata qgroup
      space, which can also result in flushing delalloc in case there is not
      enough available space at the moment. When this happens we deadlock, since
      flushing delalloc requires locking the file range in the inode's iotree
      and the range was already locked at the very beginning of the clone
      operation, before attempting to start the transaction.
      
      When this issue happens, stack traces like the following are reported:
      
        [72747.556262] task:kworker/u81:9   state:D stack:    0 pid:  225 ppid:     2 flags:0x00004000
        [72747.556268] Workqueue: writeback wb_workfn (flush-btrfs-1142)
        [72747.556271] Call Trace:
        [72747.556273]  __schedule+0x296/0x760
        [72747.556277]  schedule+0x3c/0xa0
        [72747.556279]  io_schedule+0x12/0x40
        [72747.556284]  __lock_page+0x13c/0x280
        [72747.556287]  ? generic_file_readonly_mmap+0x70/0x70
        [72747.556325]  extent_write_cache_pages+0x22a/0x440 [btrfs]
        [72747.556331]  ? __set_page_dirty_nobuffers+0xe7/0x160
        [72747.556358]  ? set_extent_buffer_dirty+0x5e/0x80 [btrfs]
        [72747.556362]  ? update_group_capacity+0x25/0x210
        [72747.556366]  ? cpumask_next_and+0x1a/0x20
        [72747.556391]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.556394]  do_writepages+0x41/0xd0
        [72747.556398]  __writeback_single_inode+0x39/0x2a0
        [72747.556403]  writeback_sb_inodes+0x1ea/0x440
        [72747.556407]  __writeback_inodes_wb+0x5f/0xc0
        [72747.556410]  wb_writeback+0x235/0x2b0
        [72747.556414]  ? get_nr_inodes+0x35/0x50
        [72747.556417]  wb_workfn+0x354/0x490
        [72747.556420]  ? newidle_balance+0x2c5/0x3e0
        [72747.556424]  process_one_work+0x1aa/0x340
        [72747.556426]  worker_thread+0x30/0x390
        [72747.556429]  ? create_worker+0x1a0/0x1a0
        [72747.556432]  kthread+0x116/0x130
        [72747.556435]  ? kthread_park+0x80/0x80
        [72747.556438]  ret_from_fork+0x1f/0x30
      
        [72747.566958] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [72747.566961] Call Trace:
        [72747.566964]  __schedule+0x296/0x760
        [72747.566968]  ? finish_wait+0x80/0x80
        [72747.566970]  schedule+0x3c/0xa0
        [72747.566995]  wait_extent_bit.constprop.68+0x13b/0x1c0 [btrfs]
        [72747.566999]  ? finish_wait+0x80/0x80
        [72747.567024]  lock_extent_bits+0x37/0x90 [btrfs]
        [72747.567047]  btrfs_invalidatepage+0x299/0x2c0 [btrfs]
        [72747.567051]  ? find_get_pages_range_tag+0x2cd/0x380
        [72747.567076]  __extent_writepage+0x203/0x320 [btrfs]
        [72747.567102]  extent_write_cache_pages+0x2bb/0x440 [btrfs]
        [72747.567106]  ? update_load_avg+0x7e/0x5f0
        [72747.567109]  ? enqueue_entity+0xf4/0x6f0
        [72747.567134]  extent_writepages+0x44/0xa0 [btrfs]
        [72747.567137]  ? enqueue_task_fair+0x93/0x6f0
        [72747.567140]  do_writepages+0x41/0xd0
        [72747.567144]  __filemap_fdatawrite_range+0xc7/0x100
        [72747.567167]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [72747.567195]  btrfs_work_helper+0xc2/0x300 [btrfs]
        [72747.567200]  process_one_work+0x1aa/0x340
        [72747.567202]  worker_thread+0x30/0x390
        [72747.567205]  ? create_worker+0x1a0/0x1a0
        [72747.567208]  kthread+0x116/0x130
        [72747.567211]  ? kthread_park+0x80/0x80
        [72747.567214]  ret_from_fork+0x1f/0x30
      
        [72747.569686] task:fsstress        state:D stack:    0 pid:841421 ppid:841417 flags:0x00000000
        [72747.569689] Call Trace:
        [72747.569691]  __schedule+0x296/0x760
        [72747.569694]  schedule+0x3c/0xa0
        [72747.569721]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569725]  ? finish_wait+0x80/0x80
        [72747.569753]  btrfs_qgroup_reserve_data+0x34/0x50 [btrfs]
        [72747.569781]  btrfs_check_data_free_space+0x5f/0xa0 [btrfs]
        [72747.569804]  btrfs_buffered_write+0x1f7/0x7f0 [btrfs]
        [72747.569810]  ? path_lookupat.isra.48+0x97/0x140
        [72747.569833]  btrfs_file_write_iter+0x81/0x410 [btrfs]
        [72747.569836]  ? __kmalloc+0x16a/0x2c0
        [72747.569839]  do_iter_readv_writev+0x160/0x1c0
        [72747.569843]  do_iter_write+0x80/0x1b0
        [72747.569847]  vfs_writev+0x84/0x140
        [72747.569869]  ? btrfs_file_llseek+0x38/0x270 [btrfs]
        [72747.569873]  do_writev+0x65/0x100
        [72747.569876]  do_syscall_64+0x33/0x40
        [72747.569879]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        [72747.569899] task:fsstress        state:D stack:    0 pid:841424 ppid:841417 flags:0x00004000
        [72747.569903] Call Trace:
        [72747.569906]  __schedule+0x296/0x760
        [72747.569909]  schedule+0x3c/0xa0
        [72747.569936]  try_flush_qgroup+0x95/0x140 [btrfs]
        [72747.569940]  ? finish_wait+0x80/0x80
        [72747.569967]  __btrfs_qgroup_reserve_meta+0x36/0x50 [btrfs]
        [72747.569989]  start_transaction+0x279/0x580 [btrfs]
        [72747.570014]  clone_copy_inline_extent+0x332/0x490 [btrfs]
        [72747.570041]  btrfs_clone+0x5b7/0x7a0 [btrfs]
        [72747.570068]  ? lock_extent_bits+0x64/0x90 [btrfs]
        [72747.570095]  btrfs_clone_files+0xfc/0x150 [btrfs]
        [72747.570122]  btrfs_remap_file_range+0x3d8/0x4a0 [btrfs]
        [72747.570126]  do_clone_file_range+0xed/0x200
        [72747.570131]  vfs_clone_file_range+0x37/0x110
        [72747.570134]  ioctl_file_clone+0x7d/0xb0
        [72747.570137]  do_vfs_ioctl+0x138/0x630
        [72747.570140]  __x64_sys_ioctl+0x62/0xc0
        [72747.570143]  do_syscall_64+0x33/0x40
        [72747.570146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      So fix this by skipping the flush of delalloc for an inode that is
      flagged with BTRFS_INODE_NO_DELALLOC_FLUSH, meaning it is currently under
      such a special case of cloning an inline extent, when flushing delalloc
      during qgroup metadata reservation.
      
      The special cases for cloning inline extents were added in kernel 5.7
      by commit 05a5a762 ("Btrfs: implement full reflink support for
      inline extents"), while having qgroup metadata space reservation flushing
      delalloc when low on space was added in kernel 5.9 by commit
      c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get
      -EDQUOT"). So use a "Fixes:" tag for the later commit to ease stable
      kernel backports.
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20210421083137.31E3.409509F4@e16-tech.com/
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 21 Apr 2021, 3 commits
    • btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Committed by Johannes Thumshirn
      When a file gets deleted on a zoned file system, the space freed is not
      returned back into the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then, on transaction commit, the reclaim
      process is triggered, but only after unused block groups have been
      deleted, which frees up space for the relocation process.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock · f3372065
      Committed by Johannes Thumshirn
      As a preparation for extending the block group deletion use case, rename
      the unused_bgs_mutex to reclaim_bgs_lock.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: more graceful errors/warnings on 32bit systems when reaching limits · e9306ad4
      Committed by Qu Wenruo
      Btrfs uses internally mapped u64 address space for all its metadata.
      Due to the page cache limit on 32bit systems, btrfs can't access
      metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See
      how MAX_LFS_FILESIZE and page::index are defined.  This is 16T for 4K
      page size while 256T for 64K page size.
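      The limits follow directly from page::index being an unsigned long; a
      small standalone calculation (illustration only):

        #include <stdio.h>

        int main(void)
        {
                unsigned long long npages = 1ULL << 32;    /* 32-bit ULONG_MAX + 1 */
                unsigned long long lim_4k = npages << 12;  /* PAGE_SHIFT for 4K    */
                unsigned long long lim_64k = npages << 16; /* PAGE_SHIFT for 64K   */

                printf("4K pages:  %llu TiB\n", lim_4k >> 40);   /* 16  */
                printf("64K pages: %llu TiB\n", lim_64k >> 40);  /* 256 */
                return 0;
        }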
      
      Users can have a filesystem which doesn't have metadata beyond the
      boundary at mount time, but later balance can cause it to create
      metadata beyond the boundary.
      
      Modifying the MM layer just for such a minor use case is unrealistic. We
      can't do more than prevent mounting such a filesystem, or warn early
      while the numbers are still within the limits.
      
      To address this problem, this patch introduces the following checks:
      
      - Mount time rejection
        This will reject any fs which has metadata chunk at or beyond the
        boundary.
      
      - Mount time early warning
        If there is any metadata chunk beyond 5/8th of the boundary, we do an
        early warning and hope the end user will see it.
      
      - Runtime extent buffer rejection
        If we're going to allocate an extent buffer at or beyond the boundary,
        reject such request with EOVERFLOW.
        This is definitely going to cause problems like transaction aborts, but
        we have no better way.
      
      - Runtime extent buffer early warning
        If an extent buffer beyond 5/8th of the max file size is allocated, do
        an early warning.
      
      Above error/warning message will only be printed once for each fs to
      reduce dmesg flood.
      
      If the mount is rejected, the filesystem will be mountable only on a
      64bit host.
      
      Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/
      Reported-by: Erik Jensen <erikjensen@rkjnsn.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 19 Apr 2021, 8 commits
    • btrfs: improve btree readahead for full send operations · ace75066
      Committed by Filipe Manana
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree, which, despite bringing good
      performance benefits, could be improved in a few aspects for use cases
      such as full send operations, which are guaranteed to visit every node
      and leaf of a btree in ascending and sequential order. The limitations
      of that standard btree readahead implementation are the following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a bit to track the existence of tree mod log users · bc03f39e
      Committed by Filipe Manana
      The tree modification log functions are called very frequently, basically
      they are called every time a btree is modified (a pointer added or removed
      to a node, a new root for a btree is set, etc). Because of that, to avoid
      heavy lock contention on the lock that protects the list of tree mod log
      users, we have checks that test the emptiness of the list with a full
      memory barrier before the checks, so that when there are no tree mod log
      users we avoid taking the lock.
      
      Replace the memory barrier and list emptiness check with a test for a new
      bit set at fs_info->flags. This bit is used to indicate when there are
      tree mod log users, set whenever a user is added to the list and cleared
      when the last user is removed from the list. This makes the intention a
      bit more obvious and possibly more efficient (assuming test_bit() may be
      cheaper than a full memory barrier on some architectures).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move the tree mod log code into its own file · f3a84ccd
      Committed by Filipe Manana
      The tree modification log, which records modifications done to btrees, is
      quite large and currently spread all over ctree.c, which is a huge file
      already.
      
      To make things better organized, move all that code into its own separate
      source and header files. Functions and definitions that are used outside
      of the module (mostly by ctree.c) are renamed so that they start with a
      "btrfs_" prefix. Everything else remains unchanged.
      
      This makes it easier to go over the tree modification log code every
      time I need to go read it to fix a bug.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove duplicated in_range() macro · cea62800
      Committed by Johannes Thumshirn
      The in_range() macro is defined twice in btrfs' source, once in ctree.h
      and once in misc.h.
      
      Remove the definition in ctree.h and include misc.h in the files depending
      on it.
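      The duplicated helper has the usual half-open range shape; shown here as
      a standalone sketch for illustration (the remaining kernel copy lives in
      misc.h):

        #include <stdio.h>

        #define in_range(b, first, len) ((b) >= (first) && (b) < (first) + (len))

        int main(void)
        {
                printf("%d\n", in_range(4096, 0, 8192));  /* 1: inside [0, 8192) */
                printf("%d\n", in_range(8192, 0, 8192));  /* 0: end is exclusive */
                return 0;
        }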
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add a i_mmap_lock to our inode · 8318ba79
      Committed by Josef Bacik
      We need to be able to exclude page_mkwrite from happening concurrently
      with certain operations.  To facilitate this, add an i_mmap_lock to our
      inode, down_read() it in our mkwrite, and add a new ILOCK flag to
      indicate that we want to take the i_mmap_lock as well.  I used pahole to
      check the size of the btrfs_inode, the sizes are as follows
      
      no lockdep:
      before: 1120 (3 per 4k page)
      after: 1160 (3 per 4k page)
      
      lockdep:
      before: 2072 (1 per 4k page)
      after: 2224 (1 per 4k page)
      
      We're slightly larger but it doesn't change how many objects we can fit
      per page.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove mirror argument from btrfs_csum_verify_data() · 5e295768
      Committed by Goldwyn Rodrigues
      The parameter mirror is not used and does not make sense for checksum
      verification of the given bio.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport btrfs_extent_readonly() and make it static · 05947ae1
      Committed by Anand Jain
      btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
      move it from extent-tree.c to inode.c and declare it as static.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 12 Apr 2021, 1 commit
  9. 23 Feb 2021, 1 commit
    • btrfs: fix race between writes to swap files and scrub · 195a49ea
      Committed by Filipe Manana
      When we activate a swap file, at btrfs_swap_activate(), we acquire the
      exclusive operation lock to prevent the physical location of the swap
      file extents to be changed by operations such as balance and device
      replace/resize/remove. We also call there can_nocow_extent() which,
      among other things, checks if the block group of a swap file extent is
      currently RO, and if it is we can not use the extent, since a write
      into it would result in COWing the extent.
      
      However we have no protection against a scrub operation running after we
      activate the swap file, which can result in the swap file extents being
      COWed while the scrub is running and operating on the respective block
      group, because scrub turns a block group into RO before it processes it
      and then back again to RW mode after processing it. That means an attempt
      to write into a swap file extent while scrub is processing the respective
      block group, will result in COWing the extent, changing its physical
      location on disk.
      
      Fix this by making sure that block groups that have extents that are used
      by active swap files can not be turned into RO mode, therefore making it
      not possible for a scrub to turn them into RO mode. When a scrub finds a
      block group that can not be turned to RO due to the existence of extents
      used by swap files, it proceeds to the next block group and logs a warning
      message that mentions the block group was skipped due to active swap
      files - this is the same approach we currently use for balance.
      
      Fixes: ed46ff3d ("Btrfs: support swap files")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 09 Feb 2021, 5 commits
    • btrfs: zoned: enable to mount ZONED incompat flag · 9d294a68
      Committed by Naohiro Aota
      This final patch adds the ZONED incompat flag to the supported flags
      and enables mounting of ZONED flagged filesystems.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Committed by Naohiro Aota
      This is the 1/3 patch to enable tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks for
      a tree-log tree are allocated mixed with other metadata blocks and btrfs
      writes and syncs the tree-log blocks to devices at the time of fsync(),
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes that zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams.  As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group and assigns
      "treelog_bg" on-demand on tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: serialize metadata IO · 0bc09ca1
      Committed by Naohiro Aota
      We cannot use zone append for writing metadata, because the B-tree nodes
      have references to each other using logical address. Without knowing
      the address in advance, we cannot construct the tree in the first place.
      So we need to serialize write IOs for metadata.
      
      We cannot add a mutex around allocation and submission because metadata
      blocks are allocated in an earlier stage to build up B-trees.
      
      Add a zoned_meta_io_lock and hold it during metadata IO submission in
      btree_write_cache_pages() to serialize IOs.
      
      Furthermore, this adds a per-block group metadata IO submission pointer
      "meta_write_pointer" to ensure sequential writing, which can break when
      attempting to write back blocks in an unfinished transaction. If the
      writing out failed because of a hole and the write out is for data
      integrity (WB_SYNC_ALL), it returns EAGAIN.
      
      A caller like fsync() code should handle this properly e.g. by falling
      back to a full transaction commit.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: check if bio spans across an ordered extent · cacb2cea
      Committed by Johannes Thumshirn
      To ensure that an ordered extent maps to a contiguous region on disk, we
      need to maintain a "one bio == one ordered extent" rule.
      
      Ensure that a bio being constructed does not span more than one ordered
      extent.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve preemptive background space flushing · 576fa348
      Committed by Josef Bacik
      Currently, if we ever have to flush space because we do not have enough,
      we allocate a ticket and attach it to the space_info, and then
      systematically flush things in the filesystem that hold space
      reservations until our space is reclaimed.
      
      However this has a latency cost, we must go to sleep and wait for the
      flushing to make progress before we are woken up and allowed to continue
      doing our work.
      
      In order to address that we used to kick off the async worker to flush
      space preemptively, so that we could be reclaiming space hopefully
      before any tasks needed to stop and wait for space to reclaim.
      
      When I introduced the ticketed ENOSPC stuff this broke slightly, in that
      we were using tickets to indicate if we were done flushing.
      No tickets, no more flushing.  However this meant that we essentially
      never preemptively flushed.  This caused a write performance regression
      that Nikolay noticed in an unrelated patch that removed the committing
      of the transaction during btrfs_end_transaction.
      
      The behavior that happened pre that patch was btrfs_end_transaction()
      would see that we were low on space, and it would commit the
      transaction.  This was bad because in this particular case you could end
      up with thousands and thousands of transactions being committed during
      the 5 minute reproducer.  With the patch to remove this behavior we got
      much more sane transaction commits, but we ended up slower because we
      would write for a while, flush, write for a while, flush again.
      
      To address this we need to reinstate a preemptive flushing mechanism.
      However it is distinctly different from our ticketing flushing in that
      it doesn't have tickets to base its decisions on.  Instead of bolting
      this logic into our existing flushing work, add another worker to handle
      this preemptive flushing.  Here we will attempt to be slightly
      intelligent about the things that we are flushing, attempting to balance
      between whichever pool is taking up the most space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>