提交 · 22a5b2abb7340cb05a34eb47d0271a6714cf5e3f · openeuler / Kernel

25 7月, 2022 11 次提交

btrfs: send: add OTIME as utimes attribute for proto 2+ by default · 22a5b2ab

由 David Sterba 提交于 5月 17, 2022

When send v1 was introduced the otime (inode creation time) was not
available, however the attribute in btrfs send protocol exists. Though
it would be possible to add it for v1 too as the attribute would be
ignored by v1 receive, let's not change the layout of v1 and only add
that to v2+. The otime cannot be changed and is only informative.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

22a5b2ab

btrfs: send: use boolean types for current inode status · 9555e1f1

由 David Sterba 提交于 6月 02, 2022

The new, new_gen and deleted indicate a status, use boolean type instead
of int.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9555e1f1

btrfs: send: remove old TODO regarding ERESTARTSYS · cec3dad9

由 David Sterba 提交于 6月 02, 2022

The whole send operation is restartable and handling properly a buffer
write may not be easy. We can't know what caused that and if a short
delay and retry will fix it or how many retries should be performed in
case it's a temporary condition.

The error value is returned to the ioctl caller so in case it's
transient problem, the user would be notified about the reason. Remove
the TODO note as there's no plan to handle ERESTARTSYS.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

cec3dad9

btrfs: send: simplify includes · 8234d3f6

由 David Sterba 提交于 6月 02, 2022

We don't need the whole ctree.h in send.h, none of the data types
defined there are used.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

8234d3f6

btrfs: send: enable support for stream v2 and compressed writes · d6815592

由 Omar Sandoval 提交于 3月 17, 2022

Now that the new support is implemented, allow the ioctl to accept v2
and the compressed flag, and update the version in sysfs.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d6815592

btrfs: send: send compressed extents with encoded writes · 3ea4dc5b

由 Omar Sandoval 提交于 3月 17, 2022

Now that all of the pieces are in place, we can use the ENCODED_WRITE
command to send compressed extents when appropriate.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3ea4dc5b

btrfs: send: get send buffer pages for protocol v2 · a4b333f2

由 Omar Sandoval 提交于 4月 04, 2022

For encoded writes in send v2, we will get the encoded data with
btrfs_encoded_read_regular_fill_pages(), which expects a list of raw
pages. To avoid extra buffers and copies, we should read directly into
the send buffer. Therefore, we need the raw pages for the send buffer.

We currently allocate the send buffer with kvmalloc(), which may return
a kmalloc'd buffer or a vmalloc'd buffer. For vmalloc, we can get the
pages with vmalloc_to_page(). For kmalloc, we could use virt_to_page().
However, the buffer size we use (144K) is not a power of two, which in
theory is not guaranteed to return a page-aligned buffer, and in
practice would waste a lot of memory due to rounding up to the next
power of two. 144K is large enough that it usually gets allocated with
vmalloc(), anyways. So, for send v2, replace kvmalloc() with vmalloc()
and save the pages in an array.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a4b333f2

btrfs: send: write larger chunks when using stream v2 · 356bbbb6

由 Omar Sandoval 提交于 3月 17, 2022

The length field of the send stream TLV header is 16 bits. This means
that the maximum amount of data that can be sent for one write is 64K
minus one. However, encoded writes must be able to send the maximum
compressed extent (128K) in one command, or more. To support this, send
stream version 2 encodes the DATA attribute differently: it has no
length field, and the length is implicitly up to the end of containing
command (which has a 32bit length field). Although this is necessary
for encoded writes, normal writes can benefit from it, too.

Also add a check to enforce that the DATA attribute is last. It is only
strictly necessary for v2, but we might as well make v1 consistent with
it.

For v2, let's bump up the send buffer to the maximum compressed extent
size plus 16K for the other metadata (144K total). Since this will most
likely be vmalloc'd (and always will be after the next commit), we round
it up to the next page since we might as well use the rest of the page
on systems with >16K pages.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

356bbbb6

btrfs: send: add stream v2 definitions · b7c14f23

由 Omar Sandoval 提交于 3月 17, 2022

This adds the definitions of the new commands for send stream version 2
and their respective attributes: fallocate, FS_IOC_SETFLAGS (a.k.a.
chattr), and encoded writes. It also documents two changes to the send
stream format in v2: the receiver shouldn't assume a maximum command
size, and the DATA attribute is encoded differently to allow for writes
larger than 64k. These will be implemented in subsequent changes, and
then the ioctl will accept the new version and flag.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b7c14f23

btrfs: send: explicitly number commands and attributes · 54cab6af

由 Omar Sandoval 提交于 3月 17, 2022

Commit e77fbf99 ("btrfs: send: prepare for v2 protocol") added
_BTRFS_SEND_C_MAX_V* macros equal to the maximum command number for the
version plus 1, but as written this creates gaps in the number space.

The maximum command number is currently 22, and __BTRFS_SEND_C_MAX_V1 is
accordingly 23. But then __BTRFS_SEND_C_MAX_V2 is 24, suggesting that v2
has a command numbered 23, and __BTRFS_SEND_C_MAX is 25, suggesting that
23 and 24 are valid commands.

Instead, let's explicitly number all of the commands, attributes, and
sentinel MAX constants.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

54cab6af

btrfs: send: remove unused send_ctx::{total,cmd}_send_size · ca182acc

由 Omar Sandoval 提交于 3月 17, 2022

We collect these statistics but have never exposed them in any way. I
also didn't find any patches that ever attempted to make use of them.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ca182acc

16 7月, 2022 1 次提交

Revert "btrfs: turn name_cache radix tree into XArray in send_ctx" · 5b8418b8

由 David Sterba 提交于 7月 15, 2022

This reverts commit 40769420.

Revert the xarray conversion, there's a problem with potential
sleep-inside-spinlock [1] when calling xa_insert that triggers GFP_NOFS
allocation. The radix tree used the preloading mechanism to avoid
sleeping but this is not available in xarray.

Conversion from spin lock to mutex is possible but at time of rc6 is
riskier than a clean revert.

[1] https://lore.kernel.org/linux-btrfs/cover.1657097693.git.fdmanana@suse.com/Reported-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5b8418b8

18 5月, 2022 1 次提交

btrfs: send: avoid trashing the page cache · 152555b3

由 Filipe Manana 提交于 5月 17, 2022

A send operation reads extent data using the buffered IO path for getting
extent data to send in write commands and this is both because it's simple
and to make use of the generic readahead infrastructure, which results in
a massive speedup.

However this fills the page cache with data that, most of the time, is
really only used by the send operation - once the write commands are sent,
it's not useful to have the data in the page cache anymore. For large
snapshots, bringing all data into the page cache eventually leads to the
need to evict other data from the page cache that may be more useful for
applications (and kernel subsystems).

Even if extents are shared with the subvolume on which a snapshot is based
on and the data is currently on the page cache due to being read through
the subvolume, attempting to read the data through the snapshot will
always result in bringing a new copy of the data into another location in
the page cache (there's currently no shared memory for shared extents).

So make send evict the data it has read before if when it first opened
the inode, its mapping had no pages currently loaded: when
inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
based on the return value of filemap_range_has_page() before reading an
extent because the generic readahead mechanism may read pages beyond the
range we request (and it very often does it), which means a call to
filemap_range_has_page() will return true due to the readahead that was
triggered when processing a previous extent - we don't have a simple way
to distinguish this case from the case where the data was brought into
the page cache through someone else. So checking for the mapping number
of pages being 0 when we first open the inode is simple, cheap and it
generally accomplishes the goal of not trashing the page cache - the
only exception is if part of data was previously loaded into the page
cache through the snapshot by some other process, in that case we end
up not evicting any data send brings into the page cache, just like
before this change - but that however is not the common case.

Example scenario, on a box with 32G of RAM:

  $ btrfs subvolume create /mnt/sv1
  $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1

  $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:           31937         186       26866           0        4883       31297
  Swap:           8188           0        8188

  # After this we get less 4G of free memory.
  $ btrfs send /mnt/snap1 >/dev/null

  $ free -m
                 total        used        free      shared  buff/cache   available
  Mem:           31937         186       22814           0        8935       31297
  Swap:           8188           0        8188

The same, obviously, applies to an incremental send.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

152555b3

16 5月, 2022 9 次提交

btrfs: send: keep the current inode open while processing it · 521b6803

由 Filipe Manana 提交于 5月 05, 2022

Every time we send a write command, we open the inode, read some data to
a buffer and then close the inode. The amount of data we read for each
write command is at most 48K, returned by max_send_read_size(), and that
corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does
not add any significant overhead, because the time elapsed between every
close (iput()) and open (btrfs_iget()) is very short, so the inode is kept
in the VFS's cache after the iput() and it's still there by the time we
do the next btrfs_iget().

As between processing extents of the current inode we don't do anything
else, it makes sense to keep the inode open after we process its first
extent that needs to be sent and keep it open until we start processing
the next inode. This serves to facilitate the next change, which aims
to avoid having send operations trash the page cache with data extents.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

521b6803

btrfs: turn name_cache radix tree into XArray in send_ctx · 40769420

由 Gabriel Niebler 提交于 4月 26, 2022

… and adjust all usages of this object to use the XArray API for the sake
of consistency.

XArray API provides array semantics, so it is notionally easier to use and
understand, and it also takes care of locking for us.

None of this makes a real difference in this particular patch, but it does
in other places where similar replacements are or have been made and we
want to be consistent in our usage of data structures in btrfs.
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

40769420

btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths · 3d64f060

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3d64f060

btrfs: use btrfs_for_each_slot in process_all_extents · 9930e9d4

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9930e9d4

btrfs: use btrfs_for_each_slot in process_all_new_xattrs · 69e43177

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

69e43177

btrfs: use btrfs_for_each_slot in process_all_refs · 649b9635

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

649b9635

btrfs: use btrfs_for_each_slot in is_ancestor · 35a68080

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

35a68080

btrfs: use btrfs_for_each_slot in can_rmdir · 18f80f1f

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

18f80f1f

btrfs: use btrfs_for_each_slot in did_create_dir · 6dcee260

由 Gabriel Niebler 提交于 3月 09, 2022

This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: NGabriel Niebler <gniebler@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

6dcee260

10 5月, 2022 1 次提交

btrfs: Convert btrfs to read_folio · fb12489b

由 Matthew Wilcox (Oracle) 提交于 4月 29, 2022

This is a "weak" conversion which converts straight back to using pages.
A full conversion should be performed at some point, hopefully by
someone familiar with the filesystem.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>

fb12489b

09 5月, 2022 1 次提交

mm/readahead: Convert page_cache_async_readahead to take a folio · 2ebdd1df

由 Matthew Wilcox (Oracle) 提交于 3月 17, 2021

Removes a couple of calls to compound_head and saves a few bytes.
Also convert verity's read_file_data_page() to be folio-based.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

2ebdd1df

14 3月, 2022 2 次提交

btrfs: send: remove redundant ret variable in fs_path_copy · 0292ecf1

由 Minghao Chi 提交于 1月 11, 2022

Return value from fs_path_add_path() directly instead of taking this in
another redundant variable.
Reported-by: NZeal Robot <zealci@zte.com.cn>
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NMinghao Chi <chi.minghao@zte.com.cn>
Signed-off-by: NCGEL ZTE <cgel.zte@gmail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0292ecf1

btrfs: reuse existing inode from btrfs_ioctl · 9ad12305

由 Sahil Kang 提交于 1月 15, 2022

btrfs_ioctl extracts inode from file so we can pass that into the
callbacks.
Signed-off-by: NSahil Kang <sahil.kang@asilaycomputing.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9ad12305

10 2月, 2022 1 次提交

btrfs: send: in case of IO error log it · 2e7be9db

由 Dāvis Mosāns 提交于 2月 05, 2022

Currently if we get IO error while doing send then we abort without
logging information about which file caused issue.  So log it to help
with debugging.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: NDāvis Mosāns <davispuh@gmail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2e7be9db

07 1月, 2022 1 次提交

btrfs: make send work with concurrent block group relocation · d96b3424

由 Filipe Manana 提交于 11月 22, 2021

We don't allow send and balance/relocation to run in parallel in order
to prevent send failing or silently producing some bad stream. This is
because while send is using an extent (specially metadata) or about to
read a metadata extent and expecting it belongs to a specific parent
node, relocation can run, the transaction used for the relocation is
committed and the extent gets reallocated while send is still using the
extent, so it ends up with a different content than expected. This can
result in just failing to read a metadata extent due to failure of the
validation checks (parent transid, level, etc), failure to find a
backreference for a data extent, and other unexpected failures. Besides
reallocation, there's also a similar problem of an extent getting
discarded when it's unpinned after the transaction used for block group
relocation is committed.

The restriction between balance and send was added in commit 9e967495
("Btrfs: prevent send failures and crashes due to concurrent relocation"),
kernel 5.3, while the more general restriction between send and relocation
was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
while we have send operations running"), kernel 5.14.

Both send and relocation can be very long running operations. Relocation
because it has to do a lot of IO and expensive backreference lookups in
case there are many snapshots, and send due to read IO when operating on
very large trees. This makes it inconvenient for users and tools to deal
with scheduling both operations.

For zoned filesystem we also have automatic block group relocation, so
send can fail with -EAGAIN when users least expect it or send can end up
delaying the block group relocation for too long. In the future we might
also get the automatic block group relocation for non zoned filesystems.

This change makes it possible for send and relocation to run in parallel.
This is achieved the following way:

1) For all tree searches, send acquires a read lock on the commit root
semaphore;

2) After each tree search, and before releasing the commit root semaphore,
the leaf is cloned and placed in the search path (struct btrfs_path);

3) After releasing the commit root semaphore, the changed_cb() callback
is invoked, which operates on the leaf and writes commands to the pipe
(or file in case send/receive is not used with a pipe). It's important
here to not hold a lock on the commit root semaphore, because if we did
we could deadlock when sending and receiving to the same filesystem
using a pipe - the send task blocks on the pipe because it's full, the
receive task, which is the only consumer of the pipe, triggers a
transaction commit when attempting to create a subvolume or reserve
space for a write operation for example, but the transaction commit
blocks trying to write lock the commit root semaphore, resulting in a
deadlock;

4) Before moving to the next key, or advancing to the next change in case
of an incremental send, check if a transaction used for relocation was
committed (or is about to finish its commit). If so, release the search
path(s) and restart the search, to where we were before, so that we
don't operate on stale extent buffers. The search restarts are always
possible because both the send and parent roots are RO, and no one can
add, remove of update keys (change their offset) in RO trees - the
only exception is deduplication, but that is still not allowed to run
in parallel with send;

5) Periodically check if there is contention on the commit root semaphore,
which means there is a transaction commit trying to write lock it, and
release the semaphore and reschedule if there is contention, so as to
avoid causing any significant delays to transaction commits.

This leaves some room for optimizations for send to have less path
releases and re searching the trees when there's relocation running, but
for now it's kept simple as it performs quite well (on very large trees
with resulting send streams in the order of a few hundred gigabytes).

Test case btrfs/187, from fstests, stresses relocation, send and
deduplication attempting to run in parallel, but without verifying if send
succeeds and if it produces correct streams. A new test case will be added
that exercises relocation happening in parallel with send and then checks
that send succeeds and the resulting streams are correct.

A final note is that for now this still leaves the mutual exclusion
between send operations and deduplication on files belonging to a root
used by send operations. A solution for that will be slightly more complex
but it will eventually be built on top of this change.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d96b3424

03 1月, 2022 4 次提交

btrfs: send: remove unused type parameter to iterate_inode_ref_t · b1dea4e7

由 Omar Sandoval 提交于 11月 04, 2021

Again, I don't think this was ever used since iterate_dir_item() is only
used for xattrs. No functional change.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b1dea4e7

btrfs: send: remove unused found_type parameter to lookup_dir_item_inode() · eab67c06

由 Omar Sandoval 提交于 11月 04, 2021

As far as I can tell, this was never used. No functional change.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

eab67c06

btrfs: drop the _nr from the item helpers · 3212fa14

由 Josef Bacik 提交于 10月 21, 2021

Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3212fa14

btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhere · 227f3cd0

由 Josef Bacik 提交于 10月 21, 2021

We have this pattern in a lot of places

	item = btrfs_item_nr(slot);
	btrfs_item_size(leaf, item);

when we could simply use

	btrfs_item_size(leaf, slot);

Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
_nr variation of the helpers.
Reviewed-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

227f3cd0

29 10月, 2021 1 次提交

btrfs: send: prepare for v2 protocol · e77fbf99

由 David Sterba 提交于 10月 22, 2021

This is preparatory work for send protocol update to version 2 and
higher.

We have many pending protocol update requests but still don't have the
basic protocol rev in place, the first thing that must happen is to do
the actual versioning support.

The protocol version is u32 and is a new member in the send ioctl
struct. Validity of the version field is backed by a new flag bit. Old
kernels would fail when a higher version is requested. Version protocol
0 will pick the highest supported version, BTRFS_SEND_STREAM_VERSION,
  that's also exported in sysfs.

The version is still unchanged and will be increased once we have new
incompatible commands or stream updates.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e77fbf99

26 10月, 2021 1 次提交

btrfs: send: simplify send_create_inode_if_needed · 0e3dd5bc

由 Marcos Paulo de Souza 提交于 8月 01, 2021

The out label is being overused, we can simply return if the condition
permits.

No functional changes.
Reviewed-by: NSu Yue <l@damenly.su>
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0e3dd5bc

23 8月, 2021 2 次提交

btrfs: allocate backref_ctx on stack in find_extent_clone · dce28150

由 Goldwyn Rodrigues 提交于 7月 27, 2021

Instead of using kmalloc() to allocate backref_ctx, allocate backref_ctx
on stack. The size is reasonably small.

sizeof(backref_ctx) = 48
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

dce28150

btrfs: constify and cleanup variables in comparators · 214cc184

由 David Sterba 提交于 7月 26, 2021

Comparators just read the data and thus get const parameters. This
should be also preserved by the local variables, update all comparators
passed to sort or bsearch.

Cleanups:

- unnecessary casts are dropped
- btrfs_cmp_device_free_bytes is cleaned up to follow the common pattern
  and 'inline' is dropped as the function address is taken
Signed-off-by: NDavid Sterba <dsterba@suse.com>

214cc184

22 6月, 2021 4 次提交

btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19

由 Filipe Manana 提交于 6月 21, 2021

When doing a send we don't expect the task to ever start a transaction
after the initial check that verifies if commit roots match the regular
roots. This is because after that we set current->journal_info with a
stub (special value) that signals we are in send context, so that we take
a read lock on an extent buffer when reading it from disk and verifying
it is valid (its generation matches the generation stored in the parent).
This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
clear uptodate if the eb is under IO") in order to fix a concurrency issue
between send and balance.

However there is one particular exception where we end up needing to start
a transaction and when this happens it results in a crash with a stack
trace like the following:

[60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
[60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
[60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
[60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
[60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
[60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
[60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
[60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
[60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
[60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
[60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
[60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
[60015.902414] kernel: Call Trace:
[60015.902419] kernel:  kfence_unprotect+0x13/0x30
[60015.902423] kernel:  page_fault_oops+0x89/0x270
[60015.902427] kernel:  ? search_module_extables+0xf/0x40
[60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
[60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
[60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
[60015.902440] kernel:  exc_page_fault+0x67/0x150
[60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
[60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
[60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
[60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
[60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
[60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
[60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
[60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
[60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
[60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
[60015.902476] kernel:  evict+0xd1/0x180
[60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
[60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
[60015.902487] kernel:  ? iput+0x1a0/0x1a0
[60015.902489] kernel:  ? iput+0x1a0/0x1a0
[60015.902491] kernel:  list_lru_walk_one+0x47/0x70
[60015.902495] kernel:  prune_icache_sb+0x39/0x50
[60015.902497] kernel:  super_cache_scan+0x161/0x1f0
[60015.902501] kernel:  do_shrink_slab+0x142/0x240
[60015.902505] kernel:  shrink_slab+0x164/0x280
[60015.902509] kernel:  shrink_node+0x2c8/0x6e0
[60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
[60015.902514] kernel:  try_to_free_pages+0xda/0x190
[60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
[60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
[60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
[60015.902528] kernel:  pipe_write+0x30b/0x5c0
[60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
[60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
[60015.902538] kernel:  __kernel_write+0x13a/0x2b0
[60015.902541] kernel:  kernel_write+0x73/0x150
[60015.902543] kernel:  send_cmd+0x7b/0xd0
[60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
[60015.902549] kernel:  process_extent+0x19b/0xed0
[60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
[60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
[60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
[60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
[60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
[60015.902564] kernel:  ? psi_task_change+0x84/0xc0
[60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
[60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
[60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
[60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
[60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
[60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
[60015.902588] kernel:  do_syscall_64+0x33/0x40
[60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
[60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
[60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
[60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
[60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
[60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
[60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
[60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
[60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---

This happens because when writing to the pipe, by calling kernel_write(),
we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
gfp flags, which allow reclaim to happen if there is memory pressure. This
allocation happens at fs/pipe.c:pipe_write().

If the reclaim is triggered, inode eviction can be triggered and that in
turn can result in starting a transaction if the inode has a link count
of 0. The transaction start happens early on during eviction, when we call
btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
there is currently an open file descriptor for an inode with a link count
of 0 and the reclaim task gets a reference on the inode before that
descriptor is closed, in which case the reclaim task ends up doing the
final iput that triggers the inode eviction.

When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
the following assertion at transaction.c:start_transaction():

    /* Send isn't supposed to start transactions. */
    ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);

And when assertions are not enabled, it triggers a crash since after that
assertion we cast current->journal_info into a transaction handle pointer
and then dereference it:

   if (current->journal_info) {
       WARN_ON(type & TRANS_EXTWRITERS);
       h = current->journal_info;
       refcount_inc(&h->use_count);
       (...)

Which obviously results in a crash due to an invalid memory access.

The same type of issue can happen during other memory allocations we
do directly in the send code with kmalloc (and friends) as they use
GFP_KERNEL and therefore may trigger reclaim too, which started to
happen since 2016 after commit e780b0d1 ("btrfs: send: use
GFP_KERNEL everywhere").

The issue could be solved by setting up a NOFS context for the entire
send operation so that reclaim could not be triggered when allocating
memory or pages through kernel_write(). However that is not very friendly
and we can in fact get rid of the send stub because:

1) The stub was introduced way back in 2014 by commit a26e8c9f
   ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
   issue exclusive to when send and balance are running in parallel,
   however there were other problems between balance and send and we do
   not allow anymore to have balance and send run concurrently since
   commit 9e967495 ("Btrfs: prevent send failures and crashes due
   to concurrent relocation"). More generically the issues are between
   send and relocation, and that last commit eliminated only the
   possibility of having send and balance run concurrently, but shrinking
   a device also can trigger relocation, and on zoned filesystems we have
   relocation of partially used block groups triggered automatically as
   well. The previous patch that has a subject of:

   "btrfs: ensure relocation never runs while we have send operations running"

   Addresses all the remaining cases that can trigger relocation.

2) We can actually allow starting and even committing transactions while
   in a send context if needed because send is not holding any locks that
   would block the start or the commit of a transaction.

So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't
clear uptodate if the eb is under IO"). We can now always call
clear_extent_buffer_uptodate() at verify_parent_transid() since send is
the only case that uses commit roots without having a transaction open or
without holding the commit_root_sem.
Reported-by: NChris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

35b22c19

btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0

由 Filipe Manana 提交于 6月 21, 2021

Relocation and send do not play well together because while send is
running a block group can be relocated, a transaction committed and
the respective disk extents get re-allocated and written to or discarded
while send is about to do something with the extents.

This was explained in commit 9e967495 ("Btrfs: prevent send failures
and crashes due to concurrent relocation"), which prevented balance and
send from running in parallel but it did not address one remaining case
where chunk relocation can happen: shrinking a device (and device deletion
which shrinks a device's size to 0 before deleting the device).

We also have now one more case where relocation is triggered: on zoned
filesystems partially used block groups get relocated by a background
thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
reclaim zones").

So make sure that instead of preventing balance from running when there
are ongoing send operations, we prevent relocation from happening.
This uses the infrastructure recently added by a patch that has the
subject: "btrfs: add cancellable chunk relocation support".

Also it adds a spinlock used exclusively for the exclusivity between
send and relocation, as before fs_info->balance_mutex was used, which
would make an attempt to run send to block waiting for balance to
finish, which can take a lot of time on large filesystems.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1cea5cf0

btrfs: fix typos in comments · 1a9fd417

由 David Sterba 提交于 5月 21, 2021

Fix typos that have snuck in since the last round. Found by codespell.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1a9fd417

btrfs: send: use list_move_tail instead of list_del/list_add_tail · bb930007

由 Baokun Li 提交于 6月 11, 2021

Use list_move_tail() instead of list_del() + list_add_tail() as it's
doing the same thing and allows further cleanups.  Open code
name_cache_used() as there is only one user.
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NBaokun Li <libaokun1@huawei.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

bb930007

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功