Commits · 2556fbb0bead7929ddf67f8b4184f434cee4e7d7 · openeuler / Kernel

06 Aug, 2018 11 commits

btrfs: use customized batch size for total_bytes_pinned · dec59fa3

Committed by Ethan Lien on Jul 13, 2018

In commit b150a4f1 ("Btrfs: use a percpu to keep track of possibly
pinned bytes") we use total_bytes_pinned to track how many bytes we are
going to free in this transaction. When we are close to ENOSPC, we check it
to know whether we can make the allocation by committing the current
transaction. For every data/metadata extent we are going to free, we add
to total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and
release it in unpin_extent_range() when we finish the transaction. So this
is a variable we frequently update but rarely read, which is exactly the
use case percpu_counter is meant for. But in the previous commit we update
total_bytes_pinned with the default batch size of 32. Each update is a byte
count far larger than 32, so every update is essentially a spin lock
protected update. Since every spin lock/unlock operation involves syncing
a globally used variable and some kind of barrier in an SMP system, this is
more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read
total_bytes_pinned when we are close to ENOSPC and fail to allocate new
chunk, we can use a really large batch size and have nearly no penalty
in most cases.
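A minimal userspace sketch of the batching idea, assuming a simplified, single-threaded model of percpu_counter: each CPU accumulates a local delta and touches the shared count only when that delta reaches the batch. The names pcpu_counter, pcpu_counter_add_batch, pcpu_counter_sum and count_lock_acquisitions are hypothetical, the 4096-byte update size and the 128 MiB batch are illustrative values rather than figures taken from this message, and the spin lock is modelled as a plain acquisition counter.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

/* Simplified, single-threaded stand-in for the kernel's percpu_counter. */
struct pcpu_counter {
	int64_t count;                   /* shared value; lock-protected in the kernel */
	int64_t local[NR_CPUS];          /* per-CPU deltas not yet folded into count */
	unsigned long lock_acquisitions; /* how often the shared lock would be taken */
};

/* Add amount on behalf of cpu; fold into the shared count once |delta| >= batch. */
static void pcpu_counter_add_batch(struct pcpu_counter *c, int cpu,
				   int64_t amount, int64_t batch)
{
	int64_t delta;

	assert(cpu >= 0 && cpu < NR_CPUS);
	delta = c->local[cpu] + amount;
	if (delta >= batch || delta <= -batch) {
		c->lock_acquisitions++;  /* slow path: shared cacheline is touched */
		c->count += delta;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = delta;   /* fast path: CPU-local update only */
	}
}

/* Precise read: shared count plus every CPU's pending delta. */
static int64_t pcpu_counter_sum(const struct pcpu_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* Spread `updates` additions of 4096 bytes over the CPUs; report lock use. */
static unsigned long count_lock_acquisitions(int64_t batch, int updates)
{
	struct pcpu_counter c = { 0 };

	for (int i = 0; i < updates; i++)
		pcpu_counter_add_batch(&c, i % NR_CPUS, 4096, batch);
	assert(pcpu_counter_sum(&c) == (int64_t)updates * 4096);
	return c.lock_acquisitions;
}
```

In this model a batch of 32 sends every 4096-byte update down the locked path, while a 128 MiB batch keeps 1000 such updates entirely CPU-local, and the precise sum still returns the exact total when it is eventually needed.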

[Test]
We tested the patch on a 4-core x86 machine:

1. fallocate a 16GiB test file
2. take a snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on the test file

We also added a temporary lockdep class on percpu_counter's spin lock
used by total_bytes_pinned to track it by lock_stat.

[Results]
lock_stat version 0.4, class name total_bytes_pinned_percpu:

metric             unpatched       patched
con-bounces               82             1
contentions               82             1
waittime-min            0.21          0.62
waittime-max            0.61          0.62
waittime-total         29.46          0.62
waittime-avg            0.36          0.62
acq-bounces           298340         13601
acquisitions          635973         31542
holdtime-min            0.09          0.14
holdtime-max           11.01          9.61
holdtime-total     173476.25      11016.90
holdtime-avg            0.27          0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the
contentions (number of lock acquisitions that had to wait) in both
unpatched and patched version are low. But when we see acquisitions and
acq-bounces, we get much lower counts in patched version. Here the most
important metric is acq-bounces. It means how many times the lock gets
transferred between different cpus, so the patch can really reduce
cacheline bouncing of spin lock (also the global counter of percpu_counter)
in a SMP system.

Fixes: b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Add graceful handling of V0 extents · ba3c2b19

Committed by Nikolay Borisov on Jun 26, 2018

Following the removal of the v0 handling code let's be courteous and
print an error message when such extents are handled. In the cases
where we have a transaction just abort it, otherwise just call
btrfs_handle_fs_error. Both cases result in the FS being re-mounted RO.

In case the error handling would be too intrusive, leave the BUG_ON in
place, as in extent_data_ref_count; other proper handling would catch
that earlier.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove V0 extent support · a79865c6

Committed by Nikolay Borisov on Jun 21, 2018

The v0 compat code was introduced in commit 5d4f98a2
("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)") 9
years ago, which was merged in 2.6.31. This means that the code is
there to support filesystems which are _VERY_ old and if you are using
btrfs on such an old kernel, you have much bigger problems. This, coupled
with the fact that no one is likely testing/maintaining this code, likely
means it has bugs lurking. All things considered I think 43 kernel
releases later it's high time this remnant of the past got removed.

This patch removes all code wrapped in #ifdefs but leaves the BUG_ONs in case
we have a v0 with no support intact as a sort of safety-net.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Get rid of the confusing btrfs_file_extent_inline_len · e41ca589

Committed by Qu Wenruo on Jun 06, 2018

We used to call btrfs_file_extent_inline_len() to get the uncompressed
data size of an inlined extent.

However, this function hides evil: for a compressed extent, it has no
choice but to read ram_bytes directly from btrfs_file_extent_item,
while for an uncompressed extent, it uses the item size to calculate the
real data size and ignores ram_bytes completely.

In fact, for corrupted ram_bytes, due to the above behavior the kernel
btrfs_print_leaf() can't even print the correct ram_bytes to expose the bug.

Since we have the tree-checker to verify all EXTENT_DATA, such a mismatch
can be detected pretty easily, thus we can trust ram_bytes without the
evil btrfs_file_extent_inline_len().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
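The mismatch described in this message can be sketched as follows, assuming a reduced model of an inline file extent item. The function names and the 21-byte header size are assumptions made for this illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define INLINE_HEADER_SIZE 21u /* assumed item header size before the inline data */

/* Old-style helper: the length comes from two different sources. */
static uint64_t inline_len_old(uint64_t ram_bytes, uint8_t compression,
			       uint32_t item_size)
{
	if (compression)
		return ram_bytes;                  /* compressed: trust ram_bytes */
	assert(item_size >= INLINE_HEADER_SIZE);
	return item_size - INLINE_HEADER_SIZE;     /* uncompressed: ram_bytes ignored */
}

/* Checker-style test: for uncompressed data both sources must agree. */
static int ram_bytes_consistent(uint64_t ram_bytes, uint8_t compression,
				uint32_t item_size)
{
	if (compression)
		return 1; /* nothing to cross-check against the item size */
	return item_size >= INLINE_HEADER_SIZE &&
	       ram_bytes == item_size - INLINE_HEADER_SIZE;
}
```

For an uncompressed item holding 100 bytes of data but carrying a corrupted ram_bytes of 4096, inline_len_old() still reports 100 and so masks the corruption, while the consistency check flags it.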

btrfs: Remove fs_info from btrfs_force_chunk_alloc · 43a7e99d

Committed by Nikolay Borisov on Jun 20, 2018

It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
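The cleanup pattern behind this commit, and behind the other fs_info removals in this log, can be sketched as below. The struct and function names are hypothetical, heavily reduced stand-ins for the btrfs types; the only assumption carried over from the message is that the transaction handle holds a pointer to fs_info.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the btrfs structures. */
struct fs_info { int id; };
struct trans_handle { struct fs_info *fs_info; };

/* Before: fs_info is passed explicitly next to the transaction handle. */
static int chunk_alloc_before(struct trans_handle *trans,
			      struct fs_info *fs_info)
{
	(void)trans;
	return fs_info->id;
}

/* After: fs_info is derived from the handle, so the argument is dropped. */
static int chunk_alloc_after(struct trans_handle *trans)
{
	struct fs_info *fs_info = trans->fs_info;

	assert(fs_info != NULL);
	return fs_info->id;
}

/* Both call styles reach the same fs_info. */
static int both_styles_agree(void)
{
	struct fs_info info = { .id = 7 };
	struct trans_handle trans = { .fs_info = &info };

	return chunk_alloc_before(&trans, &info) == chunk_alloc_after(&trans);
}
```

Dropping the redundant argument also removes the possibility of a caller passing an fs_info that does not match the handle.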

btrfs: Remove fs_info from btrfs_inc_block_group_ro · c83488af

Committed by Nikolay Borisov on Jun 20, 2018

It can be referenced from the passed bg cache.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info from btrfs_alloc_logged_file_extent · 61da2abf

Committed by Nikolay Borisov on Jun 20, 2018

It can be referenced from trans since the function is always called
within a valid transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info from check_system_chunk · 451a2c13

Committed by Nikolay Borisov on Jun 20, 2018

It can be referenced from trans since the function is always called
within a transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info from btrfs_remove_block_group · 5a98ec01

Committed by Nikolay Borisov on Jun 20, 2018

This function is always called with a valid transaction handle from
which we can reference fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info from btrfs_make_block_group · e7e02096

Committed by Nikolay Borisov on Jun 20, 2018

This function is always called with a valid transaction handle from
which we can reference the fs_info. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: replace get_seconds with new 64bit time API · a944442c

Committed by Allen Pais on Jun 12, 2018

The get_seconds() function is deprecated as it truncates the timestamp
to 32 bits. Change it to ktime_get_real_seconds().

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
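The truncation this commit avoids can be illustrated with a small userspace sketch. The helper names are hypothetical and no kernel time API is called; the code only shows what a 32-bit seconds value keeps of a 64-bit timestamp.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit seconds value retains of a 64-bit timestamp. */
static uint32_t seconds_as_32bit(int64_t seconds)
{
	return (uint32_t)seconds; /* keeps only the low 32 bits */
}

/* Nonzero if the timestamp round-trips through a 32-bit value unchanged. */
static int fits_in_32bit(int64_t seconds)
{
	return (int64_t)seconds_as_32bit(seconds) == seconds;
}
```

A timestamp from 2018 still fits, but once the seconds count reaches 2^32 the 32-bit value wraps around and no longer matches the real time, which is why a 64-bit seconds value is preferred.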

07 Jun, 2018 1 commit

btrfs: change return type of btrfs_page_mkwrite to vm_fault_t · a528a241

Committed by Souptick Joarder on Jun 06, 2018

Use the new return type vm_fault_t for the fault handler. For now, this is
just documenting that the function returns a VM_FAULT value rather than
an errno. Once all instances are converted, vm_fault_t will become a
distinct type.

Reference commit 1c8f4220 ("mm: change return type to vm_fault_t")

vmf_error() is the newly introduced inline function in 4.17-rc6.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
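A minimal sketch of the distinction this commit documents: a fault handler reports VM_FAULT codes, so an errno has to be translated rather than returned as is. The typedef, the constant values and errno_to_fault() are userspace stand-ins assumed for illustration. The mapping shown (out-of-memory to VM_FAULT_OOM, anything else to VM_FAULT_SIGBUS) is how vmf_error() is understood to behave, not a copy of the kernel source.

```c
#include <assert.h>
#include <errno.h>

/* Userspace stand-ins; the kernel defines its own vm_fault_t and VM_FAULT_* codes. */
typedef unsigned int vm_fault_t;
#define VM_FAULT_OOM    0x0001u
#define VM_FAULT_SIGBUS 0x0002u

/* Translate a negative errno into a fault code instead of returning it raw. */
static vm_fault_t errno_to_fault(int err)
{
	assert(err < 0);
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	return VM_FAULT_SIGBUS;
}
```

Giving the handler its own return type lets the compiler or a static checker catch places where an errno would otherwise leak out as a fault code.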

30 May, 2018 3 commits

btrfs: drop unused parameter qgroup_reserved · c4c129db

Committed by Gu JinXiang on May 30, 2018

Since commit 7775c818 ("btrfs: remove unused parameter from
btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
by callers of function btrfs_subvolume_reserve_metadata. So remove it.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791

Committed by Lu Fengqi on May 29, 2018

This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ rename the function ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info argument from btrfs_uuid_tree_add · cdb345a8

Committed by Lu Fengqi on May 29, 2018

This function always takes a transaction handle which contains a
reference to the fs_info. Use that and remove the extra argument.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

29 May, 2018 15 commits

Btrfs: get rid of unused orphan infrastructure · a575ceeb

Committed by Omar Sandoval on May 11, 2018

Now that we don't keep long-standing reservations for orphan items,
root->orphan_block_rsv isn't used. We can get rid of it, along with:

- root->orphan_lock, which was used to protect root->orphan_block_rsv
- root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
- BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
  in root->orphan_block_rsv
- btrfs_orphan_commit_root(), which was the last user of any of these
  and does nothing else

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_update_iflags to reflect which flags it touches · 7b6a221e

Committed by David Sterba on Mar 26, 2018

The btrfs inode flag flavour is now simply called 'inode flags' and the
vfs inode flags are i_flags.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Unexport and rename btrfs_invalidate_inodes · 20a68004

Committed by Nikolay Borisov on Apr 27, 2018

This function is no longer used outside of inode.c so just make it
static. At the same time give it a more fitting name, since it's not
really invalidating the inodes but just calling d_prune_aliases. Last,
but not least - move the function above the sole caller to avoid
introducing yet-another-pointless forward declaration.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: introduce conditional wakeup helpers · 110a21fe

Committed by David Sterba on Feb 26, 2018

Add convenience wrappers for the waitqueue management that involves
memory barriers to prevent deadlocks. The helpers will let us remove
barriers and the necessary comments in several places.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove fs_info argument from add_new_free_space · 4457c1c7

Committed by Nikolay Borisov on May 10, 2018

This function also takes a btrfs_block_group_cache which contains a
reference to the fs_info. So use that and remove the extra argument.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Unexport btrfs_alloc_delalloc_work · 3a2f8c07

Committed by Nikolay Borisov on Apr 24, 2018

It's used only in inode.c so it makes no sense to have it exported. Also
move the definition of btrfs_delalloc_work to inode.c since it's used
only in this file.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove delayed_iput member from btrfs_delalloc_work · 076da91c

Committed by Nikolay Borisov on Apr 23, 2018

When allocating a delalloc work we are always setting the delayed_iput
to 0. So remove the delay_iput member of btrfs_delalloc_work, as a
result also remove it as a parameter from btrfs_alloc_delalloc_work
since it's not used anymore.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove delayed_iput parameter from btrfs_start_delalloc_inodes · 76f32e24

Committed by Nikolay Borisov on Apr 23, 2018

It's always set to 0, so just remove it and collapse the constant value
into the only function we pass it to.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove delayed_iput parameter of btrfs_start_delalloc_roots · 82b3e53b

Committed by Nikolay Borisov on Apr 23, 2018

This parameter was introduced alongside the function in
eb73c1b7 ("Btrfs: introduce per-subvolume delalloc inode list") to
avoid deadlocks since this function was used in the transaction commit
path. However, commit 8d875f95 ("btrfs: disable strict file flushes
for renames and truncates") removed that usage, rendering the parameter
obsolete.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: drop lock parameter from update_ioctl_balance_args and rename · 008ef096

Committed by David Sterba on Mar 21, 2018

The parameter controls locking of the stats part but we can lock it
unconditionally, as this only happens once when balance starts. This is
not performance critical.

Add the prefix for an exported function.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: track running balance in a simpler way · 3009a62f

Committed by David Sterba on Mar 21, 2018

Currently fs_info::balance_running is 0 or 1 and does not use the
semantics of atomics. The pause and cancel check for 0, which can happen
only after __btrfs_balance exits for whatever reason.

Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
times but will block on the balance_mutex that protects the
fs_info::flags bit.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: kill btrfs_fs_info::volume_mutex · dccdb07b

Committed by David Sterba on Mar 21, 2018

Mutual exclusion of device add/rm and balance was done by the volume
mutex up to version 3.7. The commit 5ac00add ("Btrfs: disallow
mutually exclusive admin operations from user mode") added a bit that
essentially tracked the same information.

The status bit has an advantage over a mutex that it can be set without
restrictions of function context, so it started to be used in the
mount-time resuming of balance or device replace.

But we don't really need to track the same information in two ways.

1) After the previous cleanups, the main ioctl handlers for
   add/del/resize copy the EXCL_OP bit next to the volume mutex, here
   it's clearly safe.

2) Resuming balance during mount or after rw remount will set only the
   EXCL_OP bit and the volume_mutex is held in the kernel thread that
   calls btrfs_balance.

3) Resuming device replace during mount or after rw remount is done
   after balance and is excluded by the EXCL_OP bit. It does not take
   the volume_mutex at all and completely relies on the EXCL_OP bit.

4) The resuming of balance and dev-replace cannot happen at the same time
   as the ioctls cannot be started in parallel. Nevertheless, a crafted
   image could trigger that and a warning is printed.

5) Balance is normally excluded by EXCL_OP and also uses its own mutex to
   protect against concurrent access to its status data. There's some
   trickery to maintain the right lock nesting in case we need to
   reexamine the status in btrfs_ioctl_balance. The volume_mutex is
   removed and the unlock/lock sequence is left in place as we might
   expect other waiters to proceed.

6) Similar to 5, the unlock/lock sequence is kept in
   btrfs_cancel_balance to allow waiters to continue.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Remove btrfs_wait_and_free_delalloc_work · 40012f96

Committed by Nikolay Borisov on Apr 19, 2018

This function is called from only 1 place and is effectively a wrapper
over wait_for_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() · f60a2364

Committed by Misono Tomohiro on Apr 18, 2018

Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:

1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes

and the actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.

Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.

As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Move may_destroy_subvol() from ioctl.c to inode.c · ec42f167

Committed by Misono Tomohiro on Apr 18, 2018

This is preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>

28 May, 2018 1 commit

btrfs: rename btrfs_get_block_group_info and make it static · c065f5b1

Committed by Su Yue on Apr 02, 2018

The function btrfs_get_block_group_info() was introduced by
commit 5af3e8cc ("Btrfs: make filesystem read-only when submitting
barrier fails"), which used it in disk-io.c.

However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.

So, make it static and rename it back to its original name,
get_block_group_info.

No functional change.

Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

17 May, 2018 1 commit

btrfs: Split btrfs_del_delalloc_inode into 2 functions · 2b877331

Committed by Nikolay Borisov on Apr 27, 2018

This is in preparation for fixing delalloc inode leakage on transaction
abort. Also export the new function.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

18 Apr, 2018 2 commits

btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e

Committed by Qu Wenruo on Dec 21, 2017

Unlike the reservation calculation used in inode rsv for metadata, qgroup
doesn't really need to care about things like csum size or extent usage
for the whole tree COW.

Qgroups care more about the net change in extent usage.
That is to say, if we're going to insert one file extent, it will mostly
find its place in an already COWed tree block, leaving no change in extent
usage. Or it causes a leaf split, resulting in one new net extent and
increasing the qgroup number by nodesize.
Or, in an even rarer case, it increases the tree level, increasing the
qgroup number by 2 * nodesize.

So here, instead of using the complicated calculation of the extent
allocator, which cares more about accuracy and never hitting an error,
qgroup doesn't need that over-estimated reservation.

This patch maintains 2 new members in the btrfs_block_rsv structure for
qgroup, using a much smaller calculation for the qgroup rsv and reducing
false EDQUOT.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
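The three cases listed in this message reduce to a small piece of arithmetic, sketched below. The enum and the function name are hypothetical; only the 0, nodesize and 2 * nodesize amounts come from the message.

```c
#include <assert.h>
#include <stdint.h>

/* The three outcomes the message lists for inserting one file extent. */
enum insert_outcome {
	FITS_IN_COWED_BLOCK, /* no new tree block */
	LEAF_SPLIT,          /* one new tree block */
	TREE_LEVEL_INCREASE  /* two new tree blocks */
};

/* Net change in extent usage, in bytes, that qgroup has to account for. */
static uint64_t qgroup_net_change(enum insert_outcome outcome, uint64_t nodesize)
{
	switch (outcome) {
	case FITS_IN_COWED_BLOCK:
		return 0;
	case LEAF_SPLIT:
		return nodesize;
	case TREE_LEVEL_INCREASE:
		return 2 * nodesize;
	}
	return 0; /* not reached for valid enum values */
}
```

With a 16 KiB nodesize, for example, the net change is 0, 16384 or 32768 bytes, far less than a reservation sized for a whole-tree COW.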

btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638

Committed by Qu Wenruo on Dec 22, 2017

Unlike the previous method that tries to commit the transaction inside
qgroup_reserve(), this time we will try to commit the transaction using
fs_info->transaction_kthread to avoid nested transactions, with no need to
worry about locking context.

Since it's an asynchronous function call and we won't wait for the
transaction commit, unlike the previous method, we must call it before we
hit the qgroup limit.

So this patch will use the ratio and size of the qgroup meta_pertrans
reservation as an indicator to check if we should trigger a transaction
commit. (meta_prealloc won't be cleaned in transaction commit, it's
useless anyway)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

12 Apr, 2018 1 commit

btrfs: replace GPL boilerplate by SPDX -- headers · 9888c340

Committed by David Sterba on Apr 03, 2018

Remove GPL boilerplate text (long, short, one-line) and keep the rest,
i.e. personal, company or original source copyright statements. Add the
SPDX header.

Unify the include protection macros to match the file names.

Signed-off-by: David Sterba <dsterba@suse.com>

31 Mar, 2018 5 commits

btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space · 8287475a

Committed by Qu Wenruo on Dec 12, 2017

For the quota disabled->enabled case, it's possible that at reservation time
quota was not enabled so no bytes were really reserved, while at release
time, quota was enabled so we will try to release some bytes we didn't
really own.

Such a situation can cause metadata reservation underflow for both types,
although it is less likely for the per-trans type since quota enable will
commit the transaction.

To address this, record qgroup meta reserved bytes into
root::qgroup_meta_rsv_pertrans and ::prealloc.
So at release time we won't free any bytes we didn't reserve.

For DATA, it's already handled by io_tree, so nothing needs to be done
there.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
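The underflow protection described in this message can be sketched as a recorded counter that caps every release. The struct and function names are hypothetical, and a single counter stands in for what the message describes as one record per reservation type (per-trans and prealloc).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-root record of bytes that were actually reserved. */
struct root_meta_rsv {
	uint64_t reserved;
};

static void record_reserve(struct root_meta_rsv *r, uint64_t bytes)
{
	r->reserved += bytes;
}

/* Release at most what was recorded; returns the bytes really freed. */
static uint64_t record_release(struct root_meta_rsv *r, uint64_t bytes)
{
	uint64_t freed = bytes < r->reserved ? bytes : r->reserved;

	r->reserved -= freed;
	return freed;
}

/* Quota off at reserve time (nothing recorded), quota on at release time. */
static uint64_t release_without_reserve(uint64_t bytes)
{
	struct root_meta_rsv r = { 0 };

	return record_release(&r, bytes);
}

/* Normal case: reserve first, then release the same amount. */
static uint64_t reserve_then_release(uint64_t bytes)
{
	struct root_meta_rsv r = { 0 };

	record_reserve(&r, bytes);
	return record_release(&r, bytes);
}
```

A release that was never matched by a recorded reservation frees nothing, so the qgroup counters cannot be driven below zero by the disabled-then-enabled sequence.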

btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595

Committed by Qu Wenruo on Dec 12, 2017

Before this patch, btrfs qgroup is mixing per-transaction meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.

Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.

Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.

And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.

This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.

And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.

This behavior requires callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to inform qgroup to do the correct meta convert/release.

The good news is, even if we use the wrong type (convert or free), it won't
cause an obvious bug, as the prealloc type is always in good shape, and the
type only affects whether per-trans meta is increased or not.

So in the worst case the metadata limit can sometimes be
exceeded (no convert at all) or is reached too soon
(no free at all).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup · e1211d0e

Committed by Qu Wenruo on Dec 12, 2017

Since qgroup has separate metadata reservation types now, we can
completely get rid of the old root->qgroup_meta_rsv, which mostly acts
as the current META_PERTRANS reservation type.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: ctree.h: Fix wrong comment position about csum size · 4408ea7c

Committed by Misono, Tomohiro on Mar 20, 2018

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: defer adding raid type kobject until after chunk relocation · 75cb379d

Committed by Jeff Mahoney on Mar 20, 2018

Any time the first block group of a new type is created, we add a new
kobject to sysfs to hold the attributes for that type.  Kobject-internal
allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
While it appears as if this can occur any time a block group is created,
the only times the first block group of a new type can be created in
memory is at mount and when we create the first new block group during
raid conversion.

This patch adds a new list to track pending kobject additions and then
handles them after we do chunk relocation.  Between relocating the
target chunk (or forcing allocation of a new chunk in the case of data)
and removing the old chunk, we're in a safe place for fs-reclaim to
occur.  We're holding the volume mutex, which is already held across
page faults, and the delete_unused_bgs_mutex, which will only stall
the cleaner thread.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

openeuler / Kernel synced successfully 12 months ago