提交 · c6730a0e5778bf2d83189f97ab5e3efe61f6aad1 · openeuler / Kernel

24 1月, 2020 11 次提交

btrfs: Factor out metadata_uuid code from find_fsid. · c6730a0e

由 Su Yue 提交于 1月 10, 2020

find_fsid became rather hairy with the introduction of metadata uuid
changing feature. Alleviate this by factoring out the metadata uuid
specific code in a dedicated function which deals with finding
correct fsid for a device with changed uuid.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NSu Yue <Damenly_Su@gmx.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c6730a0e

btrfs: Call find_fsid from find_fsid_inprogress · c0d81c7c

由 Su Yue 提交于 1月 10, 2020

Since find_fsid_inprogress should also handle the case in which an fs
didn't change its FSID make it call find_fsid directly. This makes the
code in device_list_add simpler by eliminating a conditional call of
find_fsid. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NSu Yue <Damenly_Su@gmx.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c0d81c7c

Btrfs: fix infinite loop during fsync after rename operations · b5e4ff9d

由 Filipe Manana 提交于 1月 15, 2020

Recently fsstress (from fstests) sporadically started to trigger an
infinite loop during fsync operations. This turned out to be because
support for the rename exchange and whiteout operations was added to
fsstress in fstests. These operations, unlike any others in fsstress,
cause file names to be reused, whence triggering this issue. However
it's not necessary to use rename exchange and rename whiteout operations
trigger this issue, simple rename operations and file creations are
enough to trigger the issue.

The issue boils down to when we are logging inodes that conflict (that
had the name of any inode we need to log during the fsync operation), we
keep logging them even if they were already logged before, and after
that we check if there's any other inode that conflicts with them and
then add it again to the list of inodes to log. Skipping already logged
inodes fixes the issue.

Consider the following example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir                           # inode 257

  $ touch /mnt/testdir/zz                        # inode 258
  $ ln /mnt/testdir/zz /mnt/testdir/zz_link

  $ touch /mnt/testdir/a                         # inode 259

  $ sync

  # The following 3 renames achieve the same result as a rename exchange
  # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a).

  $ mv /mnt/testdir/a /mnt/testdir/a/tmp
  $ mv /mnt/testdir/zz_link /mnt/testdir/a
  $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link

  # The following rename and file creation give the same result as a
  # rename whiteout operation (<rename_whiteout> zz to a2).

  $ mv /mnt/testdir/zz /mnt/testdir/a2
  $ touch /mnt/testdir/zz                        # inode 260

  $ xfs_io -c fsync /mnt/testdir/zz
    --> results in the infinite loop

The following steps happen:

1) When logging inode 260, we find that its reference named "zz" was
   used by inode 258 in the previous transaction (through the commit
   root), so inode 258 is added to the list of conflicting indoes that
   need to be logged;

2) After logging inode 258, we find that its reference named "a" was
   used by inode 259 in the previous transaction, and therefore we add
   inode 259 to the list of conflicting inodes to be logged;

3) After logging inode 259, we find that its reference named "zz_link"
   was used by inode 258 in the previous transaction - we add inode 258
   to the list of conflicting inodes to log, again - we had already
   logged it before at step 3. After logging it again, we find again
   that inode 259 conflicts with him, and we add again 259 to the list,
   etc - we end up repeating all the previous steps.

So fix this by skipping logging of conflicting inodes that were already
logged.

Fixes: 6b5fc433 ("Btrfs: fix fsync after succession of renames of different files")
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b5e4ff9d

btrfs: set trans->drity in btrfs_commit_transaction · d62b23c9

由 Josef Bacik 提交于 1月 17, 2020

If we abort a transaction we have the following sequence

if (!trans->dirty && list_empty(&trans->new_bgs))
	return;
WRITE_ONCE(trans->transaction->aborted, err);

The idea being if we didn't modify anything with our trans handle then
we don't really need to abort the whole transaction, maybe the other
trans handles are fine and we can carry on.

However in the case of create_snapshot we add a pending_snapshot object
to our transaction and then commit the transaction.  We don't actually
modify anything.  sync() behaves the same way, attach to an existing
transaction and commit it.  This means that if we have an IO error in
the right places we could abort the committing transaction with our
trans->dirty being not set and thus not set transaction->aborted.

This is a problem because in the create_snapshot() case we depend on
pending->error being set to something, or btrfs_commit_transaction
returning an error.

If we are not the trans handle that gets to commit the transaction, and
we're waiting on the commit to happen we get our return value from
cur_trans->aborted.  If this was not set to anything because sync() hit
an error in the transaction commit before it could modify anything then
cur_trans->aborted would be 0.  Thus we'd return 0 from
btrfs_commit_transaction() in create_snapshot.

This is a problem because we then try to do things with
pending_snapshot->snap, which will be NULL because we didn't create the
snapshot, and then we'll get a NULL pointer dereference like the
following

"BUG: kernel NULL pointer dereference, address: 00000000000001f0"
RIP: 0010:btrfs_orphan_cleanup+0x2d/0x330
Call Trace:
 ? btrfs_mksubvol.isra.31+0x3f2/0x510
 btrfs_mksubvol.isra.31+0x4bc/0x510
 ? __sb_start_write+0xfa/0x200
 ? mnt_want_write_file+0x24/0x50
 btrfs_ioctl_snap_create_transid+0x16c/0x1a0
 btrfs_ioctl_snap_create_v2+0x11e/0x1a0
 btrfs_ioctl+0x1534/0x2c10
 ? free_debug_processing+0x262/0x2a3
 do_vfs_ioctl+0xa6/0x6b0
 ? do_sys_open+0x188/0x220
 ? syscall_trace_enter+0x1f8/0x330
 ksys_ioctl+0x60/0x90
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x4a/0x1b0

In order to fix this we need to make sure anybody who calls
commit_transaction has trans->dirty set so that they properly set the
trans->transaction->aborted value properly so any waiters know bad
things happened.

This was found while I was running generic/475 with my modified
fsstress, it reproduced within a few runs.  I ran with this patch all
night and didn't see the problem again.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d62b23c9

btrfs: drop log root for dropped roots · 889bfa39

由 Josef Bacik 提交于 1月 17, 2020

If we fsync on a subvolume and create a log root for that volume, and
then later delete that subvolume we'll never clean up its log root.  Fix
this by making switch_commit_roots free the log for any dropped roots we
encounter.  The extra churn is because we need a btrfs_trans_handle, not
the btrfs_transaction.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

889bfa39

btrfs: sysfs, add devid/dev_state kobject and device attributes · 668e48af

由 Anand Jain 提交于 1月 06, 2020

New sysfs attributes that track the filesystem status of devices, stored
in the per-filesystem directory in /sys/fs/btrfs/FSID/devinfo . There's
a directory for each device, with name corresponding to the numerical
device id.

  in_fs_metadata    - device is in the list of fs metadata
  missing           - device is missing (no device node or block device)
  replace_target    - device is target of replace
  writeable         - writes from fs are allowed

These attributes reflect the state of the device::dev_state and created
at mount time.

Sample output:
  $ pwd
   /sys/fs/btrfs/6e1961f1-5918-4ecc-a22f-948897b409f7/devinfo/1/
  $ ls
    in_fs_metadata  missing  replace_target  writeable
  $ cat missing
    0

The output from these attributes are 0 or 1. 0 indicates unset and 1
indicates set.  These attributes are readonly.

It is observed that the device delete thread and sysfs read thread will
not race because the delete thread calls sysfs kobject_put() which in
turn waits for existing sysfs read to complete.

Note for device replace devid swap:

During the replace the target device temporarily assumes devid 0 before
assigning the devid of the soruce device.

In btrfs_dev_replace_finishing() we remove source sysfs devid using the
function btrfs_sysfs_remove_devices_attr(), so after that call
kobject_rename() to update the devid in the sysfs.  This adds and calls
btrfs_sysfs_update_devid() helper function to update the device id.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

668e48af

btrfs: Refactor btrfs_rmap_block to improve readability · 1776ad17

由 Nikolay Borisov 提交于 11月 19, 2019

Move variables to appropriate scope. Remove last BUG_ON in the function
and rework error handling accordingly. Make the duplicate detection code
more straightforward. Use in_range macro. And give variables more
descriptive name by explicitly distinguishing between IO stripe size
(size recorded in the chunk item) and data stripe size (the size of
an actual stripe, constituting a logical chunk/block group).
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1776ad17

btrfs: Add self-tests for btrfs_rmap_block · bf2e2eb0

由 Nikolay Borisov 提交于 12月 10, 2019

Add RAID1 and single testcases to verify that data stripes are excluded
from super block locations and that the address mapping is valid.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

bf2e2eb0

btrfs: selftests: Add support for dummy devices · b3ad2c17

由 Nikolay Borisov 提交于 11月 19, 2019

Add basic infrastructure to create and link dummy btrfs_devices. This
will be used in the pending btrfs_rmap_block test which deals with
the block groups.

Calling btrfs_alloc_dummy_device will link the newly created device to
the passed fs_info and the test framework will free them once the test
is finished.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b3ad2c17

btrfs: Move and unexport btrfs_rmap_block · 96a14336

由 Nikolay Borisov 提交于 12月 10, 2019

It's used only during initial block group reading to map physical
address of super block to a list of logical ones. Make it private to
block-group.c, add proper kernel doc and ensure it's exported only for
tests.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

96a14336

btrfs: separate definition of assertion failure handlers · 68c467cb

由 David Sterba 提交于 12月 16, 2019

There's a report where objtool detects unreachable instructions, eg.:

fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction

This seems to be a false positive due to compiler version. The cause is
in the ASSERT macro implementation that does the conditional check as
IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.

To avoid that, use the ifdefs directly.

There are still 2 reports that aren't fixed:

fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction
Co-developed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Reported-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

68c467cb

20 1月, 2020 29 次提交

btrfs: device stats, log when stats are zeroed · a69976bc

由 Anand Jain 提交于 1月 10, 2020

We had a report indicating that some read errors aren't reported by the
device stats in the userland. It is important to have the errors
reported in the device stat as user land scripts might depend on it to
take the reasonable corrective actions. But to debug these issue we need
to be really sure that request to reset the device stat did not come
from the userland itself. So log an info message when device error reset
happens.

For example:
 BTRFS info (device sdc): device stats zeroed by btrfs(9223)

Reported-by: philip@philip-seeger.de
Link: https://www.spinics.net/lists/linux-btrfs/msg96528.htmlReviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a69976bc

btrfs: fix improper setting of scanned for range cyclic write cache pages · 556755a8

由 Josef Bacik 提交于 1月 03, 2020

We noticed that we were having regular CG OOM kills in cases where there
was still enough dirty pages to avoid OOM'ing.  It turned out there's
this corner case in btrfs's handling of range_cyclic where files that
were being redirtied were not getting fully written out because of how
we do range_cyclic writeback.

We unconditionally were setting scanned = 1; the first time we found any
pages in the inode.  This isn't actually what we want, we want it to be
set if we've scanned the entire file.  For range_cyclic we could be
starting in the middle or towards the end of the file, so we could write
one page and then not write any of the other dirty pages in the file
because we set scanned = 1.

Fix this by not setting scanned = 1 if we find pages.  The rules for
setting scanned should be

1) !range_cyclic.  In this case we have a specified range to write out.
2) range_cyclic && index == 0.  In this case we've started at the
   beginning and there is no need to loop around a second time.
3) range_cyclic && we started at index > 0 and we've reached the end of
   the file without satisfying our nr_to_write.

This patch fixes both of our writepages implementations to make sure
these rules hold true.  This fixed our over zealous CG OOMs in
production.

Fixes: d1310b2e ("Btrfs: Split the extent_map code into two parts")
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

556755a8

btrfs: safely advance counter when looking up bio csums · 4babad10

由 David Sterba 提交于 1月 08, 2020

Dan's smatch tool reports

  fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums()
  warn: should this be 'count == -1'

which points to the while (count--) loop. With count == 0 the check
itself could decrement it to -1. There's a WARN_ON a few lines below
that has never been seen in practice though.

It turns out that the value of page_bytes_left matches the count (by
sectorsize multiples). The loop never reaches the state where count
would go to -1, because page_bytes_left == 0 is found first and this
breaks out.

For clarity, use only plain check on count (and only for positive
value), decrement safely inside the loop. Any other discrepancy after
the whole bio list processing should be reported by the exising
WARN_ON_ONCE as well.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

4babad10

btrfs: remove unused member btrfs_device::work · 94f8c465

由 David Sterba 提交于 1月 06, 2020

This is a leftover from recently removed bio scheduling framework.

Fixes: ba8a9d07 ("Btrfs: delete the entire async bio submission framework")
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

94f8c465

btrfs: remove unnecessary wrapper get_alloc_profile · ef0a82da

由 Johannes Thumshirn 提交于 1月 02, 2020

btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
The only difference is btrfs_get_alloc_profile() is visible to other
functions in btrfs while get_alloc_profile() is static and thus only
visible to functions in block-group.c.

Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
get rid of the unnecessary second function.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NJohannes Thumshirn <jth@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ef0a82da

btrfs: add correction to handle -1 edge case in async discard · 81b29a3b

由 Dennis Zhou 提交于 1月 02, 2020

From Dave's testing described below, it's possible to drive a file
system to have bogus values of discardable_extents and _bytes.  As
btrfs_discard_calc_delay() is the only user of discardable_extents, we
can correct here for any negative discardable_extents/discardable_bytes.

The problem is not reliably reproducible. The workload that created it
was based on linux git tree, switching between release tags, then
everytihng deleted followed by a full rebalance. At this state the
values of discardable_bytes was 16K and discardable_extents was -1,
expected values 0 and 0.

Repeating the workload again did not correct the bogus values so the
offset seems to be stable once it happens.
Reported-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

81b29a3b

btrfs: ensure removal of discardable_* in free_bitmap() · 27f0afc7

由 Dennis Zhou 提交于 1月 02, 2020

Most callers of free_bitmap() only call it if bitmap_info->bytes is 0.
However, there are certain cases where we may free the free space cache
via __btrfs_remove_free_space_cache(). This exposes a path where
free_bitmap() is called regardless. This may result in a bad accounting
situation for discardable_bytes and discardable_extents. So, remove the
stats and call btrfs_discard_update_discardable().
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

27f0afc7

btrfs: make smaller extents more likely to go into bitmaps · f9bb615a

由 Dennis Zhou 提交于 1月 02, 2020

It's less than ideal for small extents to eat into our extent budget, so
force extents <= 32KB into the bitmaps save for the first handful.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f9bb615a

btrfs: increase the metadata allowance for the free_space_cache · 5d90c5c7

由 Dennis Zhou 提交于 1月 02, 2020

Currently, there is no way for the free space cache to recover from
being serviced by purely bitmaps because the extent threshold is set to
0 in recalculate_thresholds() when we surpass the metadata allowance.

This adds a recovery mechanism by keeping large extents out of the
bitmaps and increases the metadata upper bound to 64KB. The recovery
mechanism bypasses this upper bound, thus making it a soft upper bound.
But, with the bypass being 1MB or greater, it shouldn't add unbounded
overhead.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5d90c5c7

btrfs: add async discard implementation overview · dbc2a8c9

由 Dennis Zhou 提交于 1月 02, 2020

Give a brief overview for how async discard is implemented.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

dbc2a8c9

btrfs: keep track of discard reuse stats · 9ddf648f

由 Dennis Zhou 提交于 1月 02, 2020

Keep track of how much we are discarding and how often we are reusing
with async discard. The discard_*_bytes values don't need any special
protection because the work item provides the single threaded access.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9ddf648f

btrfs: only keep track of data extents for async discard · 5cb0724e

由 Dennis Zhou 提交于 1月 02, 2020

As mentioned earlier, discarding data can be done either by issuing an
explicit discard or implicitly by reusing the LBA. Metadata block_groups
see much more frequent reuse due to well it being metadata. So instead
of explicitly discarding metadata block_groups, just leave them be and
let the latter implicit discarding be done for them.

For mixed block_groups, block_groups which contain both metadata and
data, we let them be as higher fragmentation is expected.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5cb0724e

btrfs: have multiple discard lists · 7fe6d45e

由 Dennis Zhou 提交于 1月 02, 2020

Non-block group destruction discarding currently only had a single list
with no minimum discard length. This can lead to caravaning more
meaningful discards behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to
prevent the caravan effect. We promote block groups back up when we
exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support
only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

7fe6d45e

btrfs: make max async discard size tunable · 19b2a2c7

由 Dennis Zhou 提交于 1月 02, 2020

Expose max_discard_size as a tunable via sysfs and switch the current
fixed maximum to the default value.
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

19b2a2c7

btrfs: limit max discard size for async discard · 4aa9ad52

由 Dennis Zhou 提交于 1月 02, 2020

Throttle the maximum size of a discard so that we can provide an upper
bound for the rate of async discard. While the block layer is able to
split discards into the appropriate sized discards, we want to be able
to account more accurately the rate at which we are consuming NCQ slots
as well as limit the upper bound of work for a discard.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

4aa9ad52

btrfs: add kbps discard rate limit for async discard · e93591bb

由 Dennis Zhou 提交于 1月 02, 2020

Provide the ability to rate limit based on kbps in addition to iops as
additional guides for the target discard rate. The delay used ends up
being max(kbps_delay, iops_delay).
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e93591bb

btrfs: calculate discard delay based on number of extents · a2309300

由 Dennis Zhou 提交于 1月 02, 2020

An earlier patch keeps track of discardable_extents. These are
undiscarded extents managed by the free space cache. Here, we will use
this to dynamically calculate the discard delay interval.

There are 3 rate to consider. The first is the target convergence rate,
the rate to discard all discardable_extents over the
BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower
limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper
limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every
transaction commit.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a2309300

btrfs: keep track of discardable_bytes for async discard · 5dc7c10b

由 Dennis Zhou 提交于 12月 13, 2019

Keep track of this metric so that we can understand how ahead or behind
we are in discarding rate. This uses the same accounting method as
discardable_extents, deltas between previous/current values and
propagating them up.
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5dc7c10b

btrfs: track discardable extents for async discard · dfb79ddb

由 Dennis Zhou 提交于 12月 13, 2019

The number of discardable extents will serve as the rate limiting metric
for how often we should discard. This keeps track of discardable extents
in the free space caches by maintaining deltas and propagating them to
the global count.

The deltas are calculated from 2 values stored in PREV and CURR entries,
then propagated up to the global discard ctl.  The current counter value
becomes the previous counter value after update.
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

dfb79ddb

btrfs: sysfs: add UUID/debug/discard directory · e4faab84

由 Dennis Zhou 提交于 12月 13, 2019

Setup base sysfs directory for discard stats + tunables.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e4faab84

btrfs: sysfs: make UUID/debug have its own kobject · 93945cb4

由 Dennis Zhou 提交于 12月 13, 2019

Btrfs only allowed attributes to be exposed in debug/. Let's let other
groups be created by making debug its own kobject.

This also makes the per-fs debug options separate from the global
features mount attributes. This seems to be needed as
sysfs_create_files() requires const struct attribute * while
sysfs_create_group() can take struct attribute *. This seems nicer as
per file system, you'll probably use to_fs_info().
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

93945cb4

btrfs: sysfs: add removal calls for debug/ · 71e8978e

由 Dennis Zhou 提交于 12月 13, 2019

We probably should call sysfs_remove_group() on debug/.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

71e8978e

btrfs: discard one region at a time in async discard · 2bee7eb8

由 Dennis Zhou 提交于 12月 13, 2019

The prior two patches added discarding via a background workqueue. This
just piggybacked off of the fstrim code to trim the whole block at once.
Well inevitably this is worse performance wise and will aggressively
overtrim. But it was nice to plumb the other infrastructure to keep the
patches easier to review.

This adds the real goal of this series which is discarding slowly (ie. a
slow long running fstrim). The discarding is split into two phases,
extents and then bitmaps. The reason for this is two fold. First, the
bitmap regions overlap the extent regions. Second, discarding the
extents first will let the newly trimmed bitmaps have the highest chance
of coalescing when being readded to the free space cache.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2bee7eb8

btrfs: handle empty block_group removal for async discard · 6e80d4f8

由 Dennis Zhou 提交于 12月 13, 2019

block_group removal is a little tricky. It can race with the extent
allocator, the cleaner thread, and balancing. The current path is for a
block_group to be added to the unused_bgs list. Then, when the cleaner
thread comes around, it starts a transaction and then proceeds with
removing the block_group. Extents that are pinned are subsequently
removed from the pinned trees and then eventually a discard is issued
for the entire block_group.

Async discard introduces another player into the game, the discard
workqueue. While it has none of the racing issues, the new problem is
ensuring we don't leave free space untrimmed prior to forgetting the
block_group. This is handled by placing fully free block_groups on a
separate discard queue. This is necessary to maintain discarding order
as in the future we will slowly trim even fully free block_groups. The
ordering helps us make progress on the same block_group rather than say
the last fully freed block_group or needing to search through the fully
freed block groups at the beginning of a list and insert after.

The new order of events is a fully freed block group gets placed on the
unused discard queue first. Once it's processed, it will be placed on
the unusued_bgs list and then the original sequence of events will
happen, just without the final whole block_group discard.

The mount flags can change when processing unused_bgs, so when flipping
from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
free block groups on the discard_list to the unused_bg queue which will
do the final discard for us.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

6e80d4f8

btrfs: add the beginning of async discard, discard workqueue · b0643e59

由 Dennis Zhou 提交于 12月 13, 2019

When discard is enabled, everytime a pinned extent is released back to
the block_group's free space cache, a discard is issued for the extent.
This is an overeager approach when it comes to discarding and helping
the SSD maintain enough free space to prevent severe garbage collection
situations.

This adds the beginning of async discard. Instead of issuing a discard
prior to returning it to the free space, it is just marked as untrimmed.
The block_group is then added to a LRU which then feeds into a workqueue
to issue discards at a much slower rate. Full discarding of unused block
groups is still done and will be addressed in a future patch of the
series.

For now, we don't persist the discard state of extents and bitmaps.
Therefore, our failure recovery mode will be to consider extents
untrimmed. This lets us handle failure and unmounting as one in the
same.

On a number of Facebook webservers, I collected data every minute
accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
is where we discard extents synchronously before returning them to the
free space cache.

discard=sync:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |           434           |          1170
     Drive B  |           880           |          2330
     Drive C  |          2943           |          3920
     Drive D  |          4763           |          5701

discard=async:
                 p99 total per minute       p99 total per minute
      Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |           134           |           956
     Drive B  |            64           |          1972
     Drive C  |            59           |          1032
     Drive D  |            62           |          1200

While it's not great that the stats are cumulative over 1m, all of these
servers are running the same workload and and the delta between the two
are substantial. We are spending significantly less time in
btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b0643e59

btrfs: keep track of free space bitmap trim status cleanliness · da080fe1

由 Dennis Zhou 提交于 12月 13, 2019

There is a cap in btrfs in the amount of free extents that a block group
can have. When it surpasses that threshold, future extents are placed
into bitmaps. Instead of keeping track of if a certain bit is trimmed or
not in a second bitmap, keep track of the relative state of the bitmap.

With async discard, trimming bitmaps becomes a more frequent operation.
As a trade off with simplicity, we keep track of if discarding a bitmap
is in progress. If we fully scan a bitmap and trim as necessary, the
bitmap is marked clean. This has some caveats as the min block size may
skip over regions deemed too small. But this should be a reasonable
trade off rather than keeping a second bitmap and making allocation
paths more complex. The downside is we may overtrim, but ideally the min
block size should prevent us from doing that too often and getting stuck
trimming pathological cases.

BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the
process of being trimmed. If additional free space is added to that
bitmap, the bit is cleared. A bitmap will be marked
BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end
of it and the former is still set.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

da080fe1

btrfs: keep track of which extents have been discarded · a7ccb255

由 Dennis Zhou 提交于 12月 13, 2019

Async discard will use the free space cache as backing knowledge for
which extents to discard. This patch plumbs knowledge about which
extents need to be discarded into the free space cache from
unpin_extent_range().

An untrimmed extent can merge with everything as this is a new region.
Absorbing trimmed extents is a tradeoff to for greater coalescing which
makes life better for find_free_extent(). Additionally, it seems the
size of a trim isn't as problematic as the trim io itself.

When reading in the free space cache from disk, if sync is set, mark all
extents as trimmed. The current code ensures at transaction commit that
all free space is trimmed when sync is set, so this reflects that.
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a7ccb255

btrfs: rename DISCARD mount option to to DISCARD_SYNC · 46b27f50

由 Dennis Zhou 提交于 12月 13, 2019

This series introduces async discard which will use the flag
DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is
synchronously done in transaction commit.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

46b27f50

bitmap: genericize percpu bitmap region iterators · e837dfde

由 Dennis Zhou 提交于 12月 13, 2019

Bitmaps are fairly popular for their space efficiency, but we don't have
generic iterators available. Make percpu's bitmap region iterators
available to everyone.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDennis Zhou <dennis@kernel.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e837dfde

openeuler / Kernel 12 个月 前同步成功

openeuler / Kernel
12 个月前同步成功