1. 07 Jun, 2019, 1 commit
    • btrfs: Always trim all unallocated space in btrfs_trim_free_extents · 8103d10b
      Authored by Nikolay Borisov
      This patch removes support for the range parameters of the FITRIM ioctl
      when trimming unallocated space on devices. This is necessary since
      ranges passed from user space are generally interpreted as logical
      addresses, whereas btrfs_trim_free_extents used to interpret them as
      device physical extents. This could result in counter-intuitive
      behavior for users, so it's best to remove that support altogether.
      
      Additionally, the existing range support had a bug: if an offset that
      overflows u64, e.g. -1 (parsed as the u64 value 18446744073709551615),
      was passed to FITRIM, then wrong data was fed into btrfs_issue_discard.
      This in turn led to a wrap-around when aligning the passed range,
      resulting in the wrong regions being discarded, which leads to data
      corruption (a minimal sketch of the wrap-around follows this commit
      message).
      
      Fixes: c2d1b3aa ("btrfs: Honour FITRIM range constraints during free space trim")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
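      A minimal user-space sketch of the wrap-around described above, not the
      actual btrfs code: align_up and the 1MiB alignment are illustrative
      assumptions. ALIGN-style arithmetic on a u64 offset near U64_MAX
      silently overflows, so the "aligned" start of the range lands far below
      the offset the user asked for.

      #include <stdint.h>
      #include <stdio.h>

      #define SZ_1M (1024ULL * 1024ULL)

      /* same idea as the kernel's ALIGN() macro for power-of-two alignment */
      static uint64_t align_up(uint64_t x, uint64_t a)
      {
          return (x + a - 1) & ~(a - 1);
      }

      int main(void)
      {
          uint64_t start = UINT64_MAX;   /* a FITRIM offset of -1 */
          uint64_t aligned = align_up(start, SZ_1M);

          /* prints 0: the addition overflowed, so the trimmed range would
           * now begin at device offset 0 instead of being rejected */
          printf("aligned start = %llu\n", (unsigned long long)aligned);
          return 0;
      }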
  2. 16 May, 2019, 2 commits
  3. 02 May, 2019, 1 commit
    • btrfs: reserve delalloc metadata differently · c8eaeac7
      Authored by Josef Bacik
      With the per-inode block reserves we started refilling the reserve based
      on the calculated size of the outstanding csum bytes and extents for the
      inode, including the amount we were adding with the new operation.
      
      However, generic/224 exposed a problem with this approach.  With 1000
      files all writing at the same time we ended up with a bunch of bytes
      being reserved but unusable.
      
      When you write to a file we reserve space for the csum leaves for those
      bytes, the number of extent items required to cover those bytes, and a
      single transaction item for updating the inode at ordered extent finish
      for that range of bytes.  This is held until the ordered extent finishes
      and we release all of the reserved space.
      
      If a second write comes in at this point we would add a single
      reservation for the new outstanding extent and however many
      reservations are needed for its csum leaves. We then compute the delta
      between how much we have already reserved and how much the total
      outstanding size now requires, and attempt to reserve that delta. If
      the first write finishes it will not release any space, because the
      space it had reserved for the initial write is still needed for the
      second write. However, some space would have been used, as we have
      added csums and extent items and dirtied the inode. Our reserved space
      would be > 0 but less than the total needed reserved space.
      
      This is just for a single inode; now consider generic/224. This test
      has 1000 inodes writing in parallel to a very small file system, 1GiB.
      In my testing this usually means we get about a 120MiB metadata area
      to work with, more than enough to allow the writes to continue, but
      not enough if all of the inodes are stuck trying to reserve the slack
      space while continuing to hold their leftovers from their initial
      writes.
      
      Fix this by pre-reserving _only_ the space we are currently trying to
      add. Then, once that reservation succeeds, modify the inode's csum
      count and outstanding extents, and add the newly reserved space to the
      inode's block_rsv (see the sketch after this commit message). This
      allows us to actually pass generic/224 without running out of metadata
      space.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
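      A hedged sketch of the reservation flow this commit describes, using
      hypothetical names: inode_rsv, meta_needed_for, reserve_metadata_bytes
      and the size constants are illustrative, not the real btrfs API. The
      point it shows is the ordering: reserve only what the current write
      adds, and fold it into the inode's counters and block_rsv only after
      that reservation succeeds.

      #include <stdint.h>
      #include <stdio.h>

      typedef uint64_t u64;

      struct inode_rsv {
          u64 reserved;             /* bytes currently held in the inode's block_rsv */
          u64 outstanding_extents;  /* extents whose ordered extents have not finished */
          u64 csum_bytes;           /* bytes that still need csum items */
      };

      /* hypothetical: metadata a write of 'len' bytes needs (extent items,
       * csum leaves, one inode update); the constants are made up */
      static u64 meta_needed_for(u64 len)
      {
          return 3 * 16384 + (len / 4096) * 32;
      }

      /* hypothetical stand-in for carving space out of the global metadata
       * pool; pretend it always succeeds here */
      static int reserve_metadata_bytes(u64 bytes)
      {
          (void)bytes;
          return 0;
      }

      /* the flow from the commit message: pre-reserve only what this write
       * adds, then update the inode's accounting on success */
      static int reserve_for_write(struct inode_rsv *rsv, u64 len)
      {
          u64 to_reserve = meta_needed_for(len);

          if (reserve_metadata_bytes(to_reserve))
              return -1;   /* would be -ENOSPC in the real code */

          rsv->outstanding_extents += 1;
          rsv->csum_bytes += len;
          rsv->reserved += to_reserve;
          return 0;
      }

      int main(void)
      {
          struct inode_rsv rsv = { 0, 0, 0 };

          /* two overlapping writes to the same inode, as in the
           * generic/224 scenario described above */
          reserve_for_write(&rsv, 1 << 20);
          reserve_for_write(&rsv, 1 << 20);

          printf("reserved=%llu extents=%llu csum_bytes=%llu\n",
                 (unsigned long long)rsv.reserved,
                 (unsigned long long)rsv.outstanding_extents,
                 (unsigned long long)rsv.csum_bytes);
          return 0;
      }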
  4. 30 Apr, 2019, 36 commits