提交 · eb065d301e8c83643367bdb0898becc364046bda · openeuler / Kernel

26 11月, 2019 18 次提交

io_uring: io_allocate_scq_urings() should return a sane state · eb065d30

由 Jens Axboe 提交于 11月 20, 2019

We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eb065d30

io_uring: Always REQ_F_FREE_SQE for allocated sqe · bbad27b2

由 Pavel Begunkov 提交于 11月 19, 2019

Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bbad27b2

io_uring: io_fail_links() should only consider first linked timeout · 5d960724

由 Jens Axboe 提交于 11月 19, 2019

We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5d960724

io_uring: Fix leaking linked timeouts · 09fbb0a8

由 Pavel Begunkov 提交于 11月 19, 2019

let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

09fbb0a8

io_uring: remove redundant check · f70193d6

由 Pavel Begunkov 提交于 11月 19, 2019

Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f70193d6

io_uring: break links for failed defer · d3b35796

由 Pavel Begunkov 提交于 11月 19, 2019

If io_req_defer() failed, it needs to cancel a dependant link.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d3b35796

io-wq: remove extra space characters · b2e9c7d6

由 Dan Carpenter 提交于 11月 19, 2019

These lines are indented an extra space character.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b2e9c7d6

io-wq: wait for io_wq_create() to setup necessary workers · b60fda60

由 Jens Axboe 提交于 11月 19, 2019

We currently have a race where if setup is really slow, we can be
calling io_wq_destroy() before we're done setting up. This will cause
the caller to get stuck waiting for the manager to set things up, but
the manager already exited.

Fix this by doing a sync setup of the manager. This also fixes the case
where if we failed creating workers, we'd also get stuck.

In practice this race window was really small, as we already wait for
the manager to start. Hence someone would have to call io_wq_destroy()
after the task has started, but before it started the first loop. The
reported test case forked tons of these, which is why it became an
issue.

Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com
Fixes: 771b53d0 ("io-wq: small threadpool implementation for io_uring")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b60fda60

io_uring: request cancellations should break links · fba38c27

由 Jens Axboe 提交于 11月 18, 2019

We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fba38c27

io_uring: correct poll cancel and linked timeout expiration completion · b0dd8a41

由 Jens Axboe 提交于 11月 18, 2019

Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b0dd8a41

io_uring: remove dead REQ_F_SEQ_PREV flag · e0e328c4

由 Jens Axboe 提交于 11月 15, 2019

With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e0e328c4

io_uring: fix sequencing issues with linked timeouts · 94ae5e77

由 Jens Axboe 提交于 11月 14, 2019

We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts")
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

94ae5e77

io_uring: make req->timeout be dynamically allocated · ad8a48ac

由 Jens Axboe 提交于 11月 15, 2019

There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ad8a48ac

io_uring: make io_double_put_req() use normal completion path · 978db57e

由 Jens Axboe 提交于 11月 14, 2019

If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().
Signed-off-by: NJens Axboe <axboe@kernel.dk>

978db57e

io_uring: cleanup return values from the queueing functions · 0e0702da

由 Jens Axboe 提交于 11月 14, 2019

__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0e0702da

io_uring: io_async_cancel() should pass in 'nxt' request pointer · 95a5bbae

由 Jens Axboe 提交于 11月 14, 2019

If we have a linked request, this enables us to pass it back directly
without having to go through async context.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

95a5bbae

vfs: properly and reliably lock f_pos in fdget_pos() · 0be0ee71

由 Linus Torvalds 提交于 11月 11, 2019

fdget_pos() is used by file operations that will read and update f_pos:
things like "read()", "write()" and "lseek()" (but not, for example,
"pread()/pwrite" that get their file positions elsewhere).

However, it had two separate escape clauses for this, because not
everybody wants or needs serialization of the file position.

The first and most obvious case is the "file descriptor doesn't have a
position at all", ie a stream-like file.  Except we didn't actually use
FMODE_STREAM, but instead used FMODE_ATOMIC_POS.  The reason for that
was that FMODE_STREAM didn't exist back in the days, but also that we
didn't want to mark all the special cases, so we only marked the ones
that _required_ position atomicity according to POSIX - regular files
and directories.

The case one was intentionally lazy, but now that we _do_ have
FMODE_STREAM we could and should just use it.  With the change to use
FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all
the code to set it is deleted.

Any cases where we don't want the serialization because the driver (or
subsystem) doesn't use the file position should just be updated to do
"stream_open()".  We've done that for all the obvious and common
situations, we may need a few more.  Quoting Kirill Smelkov in the
original FMODE_STREAM thread (see link below for full email):

 "And I appreciate if people could help at least somehow with "getting
  rid of mixed case entirely" (i.e. always lock f_pos_lock on
  !FMODE_STREAM), because this transition starts to diverge from my
  particular use-case too far. To me it makes sense to do that
  transition as follows:

   - convert nonseekable_open -> stream_open via stream_open.cocci;
   - audit other nonseekable_open calls and convert left users that
     truly don't depend on position to stream_open;
   - extend stream_open.cocci to analyze alloc_file_pseudo as well (this
     will cover pipes and sockets), or maybe convert pipes and sockets
     to FMODE_STREAM manually;
   - extend stream_open.cocci to analyze file_operations that use
     no_llseek or noop_llseek, but do not use nonseekable_open or
     alloc_file_pseudo. This might find files that have stream semantic
     but are opened differently;
   - extend stream_open.cocci to analyze file_operations whose
     .read/.write do not use ppos at all (independently of how file was
     opened);
   - ...
   - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if
     !FMODE_STREAM;
   - gather bug reports for deadlocked read/write and convert missed
     cases to FMODE_STREAM, probably extending stream_open.cocci along
     the road to catch similar cases

  i.e. always take f_pos_lock unless a file is explicitly marked as
  being stream, and try to find and cover all files that are streams"

We have not done the "extend stream_open.cocci to analyze
alloc_file_pseudo" as well, but the previous commit did manually handle
the case of pipes and sockets.

The other case where we can avoid locking f_pos is the "this file
descriptor only has a single user and it is us, and thus there is no
need to lock it".

The second test was correct, although a bit subtle and worth just
re-iterating here.  There are two kinds of other sources of references
to the same file descriptor: file descriptors that have been explicitly
shared across fork() or with dup(), and file tables having elevated
reference counts due to threading (or explicit file sharing with
clone()).

The first case would have incremented the file count explicitly, and in
the second case the previous __fdget() would have incremented it for us
and set the FDPUT_FPUT flag.

But in both cases the file count would be greater than one, so the
"file_count(file) > 1" test catches both situations.  Also note that if
file_count is 1, that also means that no other thread can have access to
the file table, so there also cannot be races with concurrent calls to
dup()/fork()/clone() that would increment the file count any other way.

Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru
Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0be0ee71

vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec

由 Linus Torvalds 提交于 11月 17, 2019

In commit 3975b097 ("convert stream-like files -> stream_open, even
if they use noop_llseek") Kirill used a coccinelle script to change
"nonseekable_open()" to "stream_open()", which changed the trivial cases
of stream-like file descriptors to the new model with FMODE_STREAM.

However, the two big cases - sockets and pipes - don't actually have
that trivial pattern at all, and were thus never converted to
FMODE_STREAM even though it makes lots of sense to do so.

That's particularly true when looking forward to the next change:
getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
decide whether f_pos updates are needed or not.  And if they are, we'll
always do them atomically.

This came up because KCSAN (correctly) noted that the non-locked f_pos
updates are data races: they are clearly benign for the case where we
don't care, but it would be good to just not have that issue exist at
all.

Note that the reason we used FMODE_ATOMIC_POS originally is that only
doing it for the minimal required case is "safer" in that it's possible
that the f_pos locking can cause unnecessary serialization across the
whole write() call.  And in the worst case, that kind of serialization
can cause deadlock issues: think writers that need readers to empty the
state using the same file descriptor.

[ Note that the locking is per-file descriptor - because it protects
  "f_pos", which is obviously per-file descriptor - so it only affects
  cases where you literally use the same file descriptor to both read
  and write.

  So a regular pipe that has separate reading and writing file
  descriptors doesn't really have this situation even though it's the
  obvious case of "reader empties what a bit writer concurrently fills"

  But we want to make pipes as being stream-line anyway, because we
  don't want the unnecessary overhead of locking, and because a named
  pipe can be (ab-)used by reading and writing to the same file
  descriptor. ]

There are likely a lot of other cases that might want FMODE_STREAM, and
looking for ".llseek = no_llseek" users and other cases that don't have
an lseek file operation at all and making them use "stream_open()" might
be a good idea.  But pipes and sockets are likely to be the two main
cases.

Cc: Kirill Smelkov <kirr@nexedi.com>
Cc: Eic Dumazet <edumazet@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Marco Elver <elver@google.com>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Paul McKenney <paulmck@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8e464ec

24 11月, 2019 1 次提交

cramfs: fix usage on non-MTD device · 3e5aeec0

由 Maxime Bizon 提交于 10月 19, 2019

When both CONFIG_CRAMFS_MTD and CONFIG_CRAMFS_BLOCKDEV are enabled, if
we fail to mount on MTD, we don't try on block device.

Note: this relies upon cramfs_mtd_fill_super() leaving no side
effects on fc state in case of failure; in general, failing
get_tree_...() does *not* mean "fine to try again"; e.g. parsed
options might've been consumed by fill_super callback and freed
on failure.

Fixes: 74f78fc5 ("vfs: Convert cramfs to use the new mount API")
Signed-off-by: NMaxime Bizon <mbizon@freebox.fr>
Signed-off-by: NNicolas Pitre <nico@fluxnic.net>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3e5aeec0

23 11月, 2019 3 次提交

afs: Fix large file support · b485275f

由 Marc Dionne 提交于 11月 21, 2019

By default s_maxbytes is set to MAX_NON_LFS, which limits the usable
file size to 2GB, enforced by the vfs.

Commit b9b1f8d5 ("AFS: write support fixes") added support for the
64-bit fetch and store server operations, but did not change this value.
As a result, attempts to write past the 2G mark result in EFBIG errors:

 $ dd if=/dev/zero of=foo bs=1M count=1 seek=2048
 dd: error writing 'foo': File too large

Set s_maxbytes to MAX_LFS_FILESIZE.

Fixes: b9b1f8d5 ("AFS: write support fixes")
Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b485275f

afs: Fix possible assert with callbacks from yfs servers · cd340703

由 Marc Dionne 提交于 11月 21, 2019

Servers sending callback breaks to the YFS_CM_SERVICE service may
send up to YFSCBMAX (1024) fids in a single RPC.  Anything over
AFSCBMAX (50) will cause the assert in afs_break_callbacks to trigger.

Remove the assert, as the count has already been checked against
the appropriate max values in afs_deliver_cb_callback and
afs_deliver_yfs_cb_callback.

Fixes: 35dbfba3 ("afs: Implement the YFS cache manager service")
Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cd340703

Revert "fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()" · 94b07b6f

由 Joseph Qi 提交于 11月 21, 2019

This reverts commit 56e94ea1.

Commit 56e94ea1 ("fs: ocfs2: fix possible null-pointer dereferences
in ocfs2_xa_prepare_entry()") introduces a regression that fail to
create directory with mount option user_xattr and acl.  Actually the
reported NULL pointer dereference case can be correctly handled by
loc->xl_ops->xlo_add_entry(), so revert it.

Link: http://lkml.kernel.org/r/1573624916-83825-1-git-send-email-joseph.qi@linux.alibaba.com
Fixes: 56e94ea1 ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()")
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: NThomas Voegtle <tv@lio96.de>
Acked-by: NChangwei Ge <gechangwei@live.cn>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

94b07b6f

20 11月, 2019 1 次提交

afs: Fix missing timeout reset · c74386d5

由 David Howells 提交于 11月 19, 2019

In afs_wait_for_call_to_complete(), rather than immediately aborting an
operation if a signal occurs, the code attempts to wait for it to
complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a
check that we're still receiving relevant packets from the server before
we consider aborting the call. We may even ping the server to check on
the status of the call.

However, there's a missing timeout reset in the event that we do
actually get a packet to process, such that if we then get a couple of
short stalls, we then time out when progress is actually being made.

Fix this by resetting the timeout any time we get something to process.
If it's the failure of the call then the call state will get changed and
we'll exit the loop shortly thereafter.

A symptom of this is data fetches and stores failing with EINTR when
they really shouldn't.

Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Reviewed-by: NMarc Dionne <marc.dionne@auristor.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c74386d5

19 11月, 2019 17 次提交

btrfs: drop bdev argument from submit_extent_page · fa17ed06

由 David Sterba 提交于 10月 03, 2019

After previous patches removing bdev being passed around to set it to
bio, it has become unused in submit_extent_page. So it now has "only" 13
parameters.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

fa17ed06

btrfs: remove extent_map::bdev · a019e9e1

由 David Sterba 提交于 8月 30, 2019

We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a019e9e1

btrfs: drop bio_set_dev where not needed · 1a418027

由 David Sterba 提交于 8月 30, 2019

bio_set_dev sets a bdev to a bio and is not only setting a pointer bug
also changing some state bits if there was a different bdev set before.
This is one thing that's not needed.

Another thing is that setting a bdev at bio allocation time is too early
and actually does not work with plain redundancy profiles, where each
time we submit a bio to a device, the bdev is set correctly.

In many places the bio bdev is set to latest_bdev that seems to serve as
a stub pointer "just to put something to bio". But we don't have to do
that.

Where do we know which bdev to set:

* for regular IO: submit_stripe_bio that's called by btrfs_map_bio

* repair IO: repair_io_failure, read or write from specific device

* super block write (using buffer_heads but uses raw bdev) and barriers

* scrub: this does not use all regular IO paths as it needs to reach all
  copies, verify and fixup eventually, and for that all bdev management
  is independent

* raid56: rbio_add_io_page, for the RMW write

* integrity-checker: does it's own low-level block tracking
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1a418027

btrfs: get bdev directly from fs_devices in submit_extent_page · 429aebc0

由 David Sterba 提交于 11月 18, 2019

This is preparatory patch to remove @bdev parameter from
submit_extent_page. It can't be removed completely, because the cgroups
need it for wbc when initializing the bio

wbc_init_bio
  bio_associate_blkg_from_css
    dereference bdev->bi_disk->queue

The bdev pointer is the same as latest_bdev, thus no functional change.
We can retrieve it from fs_devices that's reachable through several
dereferences. The local variable shadows the parameter, but that's only
temporary.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

429aebc0

btrfs: record all roots for rename exchange on a subvol · 3e174099

由 Josef Bacik 提交于 11月 15, 2019

Testing with the new fsstress support for subvolumes uncovered a pretty
bad problem with rename exchange on subvolumes. We're modifying two
different subvolumes, but we only start the transaction on one of them,
so the other one is not added to the dirty root list. This is caught by
btrfs_cow_block() with a warning because the root has not been updated,
however if we do not modify this root again we'll end up pointing at an
invalid root because the root item is never updated.

Fix this by making sure we add the destination root to the trans list,
the same as we do with normal renames. This fixes the corruption.

Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3e174099

Btrfs: fix block group remaining RO forever after error during device replace · 042528f8

由 Filipe Manana 提交于 11月 14, 2019

When doing a device replace, while at scrub.c:scrub_enumerate_chunks(), we
set the block group to RO mode and then wait for any ongoing writes into
extents of the block group to complete. While doing that wait we overwrite
the value of the variable 'ret' and can break out of the loop if an error
happens without turning the block group back into RW mode. So what happens
is the following:

1) btrfs_inc_block_group_ro() returns 0, meaning it set the block group
to RO mode (its ->ro field set to 1 or incremented to some value > 1);

2) Then btrfs_wait_ordered_roots() returns a value > 0;

3) Then if either joining or committing the transaction fails, we break
out of the loop wihtout calling btrfs_dec_block_group_ro(), leaving
the block group in RO mode forever.

To fix this, just remove the code that waits for ongoing writes to extents
of the block group, since it's not needed because in the initial setup
phase of a device replace operation, before starting to find all chunks
and their extents, we set the target device for replace while holding
fs_info->dev_replace->rwsem, which ensures that after releasing that
semaphore, any writes into the source device are made to the target device
as well (__btrfs_map_block() guarantees that). So while at
scrub_enumerate_chunks() we only need to worry about finding and copying
extents (from the source device to the target device) that were written
before we started the device replace operation.

Fixes: f0e9b7d6 ("Btrfs: fix race setting block group readonly during device replace")
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

042528f8

btrfs: scrub: Don't check free space before marking a block group RO · b12de528

由 Qu Wenruo 提交于 11月 15, 2019

[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:

  btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
  - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
      --- tests/btrfs/072.out     2019-10-22 15:18:14.008965340 +0800
      +++ /xfstests-dev/results//btrfs/072.out.bad      2019-11-14 15:56:45.877152240 +0800
      @@ -1,2 +1,3 @@
       QA output created by 072
       Silence is golden
      +Scrub find errors in "-m dup -d single" test
      ...

And with the following call trace:

  BTRFS info (device dm-5): scrub: started on devid 1
  ------------[ cut here ]------------
  BTRFS: Transaction aborted (error -27)
  WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
  CPU: 0 PID: 55087 Comm: btrfs Tainted: G        W  O      5.4.0-rc1-custom+ #13
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
  Call Trace:
   __btrfs_end_transaction+0xdb/0x310 [btrfs]
   btrfs_end_transaction+0x10/0x20 [btrfs]
   btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
   scrub_enumerate_chunks+0x264/0x940 [btrfs]
   btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
   btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
   do_vfs_ioctl+0x636/0xaa0
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  ---[ end trace 166c865cec7688e7 ]---

[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
   |- btrfs_create_pending_block_groups()
      |- btrfs_finish_chunk_alloc()
         |- btrfs_add_system_chunk()

This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.

The root cause is, we have the following bad loop of creating tons of
system chunks:

1. The only SYSTEM chunk is being scrubbed
   It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
   As btrfs_inc_block_group_ro() will check if we have enough space
   after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
   During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
   We go back to step 2, as the bg is already mark RO but still not
   cleaned up yet.

If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.

[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.

To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.

For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.

Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b12de528

btrfs: change btrfs_fs_devices::rotating to bool · 7f0432d0

由 Johannes Thumshirn 提交于 11月 13, 2019

struct btrfs_fs_devices::rotating currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

7f0432d0

btrfs: change btrfs_fs_devices::seeding to bool · 0395d84f

由 Johannes Thumshirn 提交于 11月 13, 2019

struct btrfs_fs_devices::seeding currently is declared as an integer
variable but only used as a boolean.

Change the variable definition to bool and update to code touching it to
set 'true' and 'false'.
Reviewed-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0395d84f

btrfs: rename btrfs_block_group_cache · 32da5386

由 David Sterba 提交于 10月 29, 2019

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.
Suggested-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

32da5386

btrfs: block-group: Reuse the item key from caller of read_one_block_group() · d49a2ddb

由 Qu Wenruo 提交于 11月 05, 2019

For read_one_block_group(), its only caller has already got the item key
to search next block group item.

So we can use that key directly without doing our own convertion on
stack.

Also, since that key used in btrfs_read_block_groups() is vital for
block group item search, add 'const' keyword for that parameter to
prevent read_one_block_group() to modify it.
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d49a2ddb

btrfs: block-group: Refactor btrfs_read_block_groups() · ffb9e0f0

由 Qu Wenruo 提交于 10月 10, 2019

Refactor the work inside the loop of btrfs_read_block_groups() into one
separate function, read_one_block_group().

This allows read_one_block_group to be reused for later BG_TREE feature.

The refactor does the following extra fix:
- Use btrfs_fs_incompat() to replace open-coded feature check
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ffb9e0f0

D
btrfs: document extent buffer locking · d4e253bb
由 David Sterba 提交于 10月 16, 2019
```
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
d4e253bb

btrfs: access eb::blocking_writers according to ACCESS_ONCE policies · a4477988

由 David Sterba 提交于 10月 11, 2019

A nice writeup of the LKMM (Linux Kernel Memory Model) rules for access
once policies can be found here
https://lwn.net/Articles/799218/#Access-Marking%20Policies .

The locked and unlocked access to eb::blocking_writers should be
annotated accordingly, following this:

Writes:

- locked write must use ONCE, may use plain read
- unlocked write must use ONCE

Reads:

- unlocked read must use ONCE
- locked read may use plain read iff not mixed with unlocked read
- unlocked read then locked must use ONCE

There's one difference on the assembly level, where
btrfs_tree_read_lock_atomic and btrfs_try_tree_read_lock used the cached
value and did not reevaluate it after taking the lock. This could have
missed some opportunities to take the lock in case blocking writers
changed between the calls, but the window is just a few instructions
long. As this is in try-lock, the callers handle that.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a4477988

btrfs: set blocking_writers directly, no increment or decrement · 40d38f53

由 David Sterba 提交于 10月 10, 2019

The increment and decrement was inherited from previous version that
used atomics, switched in commit 06297d8c ("btrfs: switch
extent_buffer blocking_writers from atomic to int"). The only possible
values are 0 and 1 so we can set them directly.

The generated assembly (gcc 9.x) did the direct value assignment in
btrfs_set_lock_blocking_write (asm diff after change in 06297d8c):

     5d:   test   %eax,%eax
     5f:   je     62 <btrfs_set_lock_blocking_write+0x22>
     61:   retq

  -  62:   lock incl 0x44(%rdi)
  -  66:   add    $0x50,%rdi
  -  6a:   jmpq   6f <btrfs_set_lock_blocking_write+0x2f>

  +  62:   movl   $0x1,0x44(%rdi)
  +  69:   add    $0x50,%rdi
  +  6d:   jmpq   72 <btrfs_set_lock_blocking_write+0x32>

The part in btrfs_tree_unlock did a decrement because
BUG_ON(blockers > 1) is probably not a strong hint for the compiler, but
otherwise the output looks safe:

  - lock decl 0x44(%rdi)

  + sub    $0x1,%eax
  + mov    %eax,0x44(%rdi)
Signed-off-by: NDavid Sterba <dsterba@suse.com>

40d38f53

btrfs: merge blocking_writers branches in btrfs_tree_read_lock · f5c2a525

由 David Sterba 提交于 10月 10, 2019

There are two ifs that use eb::blocking_writers. As this is a variable
modified inside and outside of locks, we could minimize number of
accesses to avoid problems with getting different results at different
times.

The access here is locked so this can only race with btrfs_tree_unlock
that sets blocking_writers to 0 without lock and unsets the lock owner.

The first branch is taken only if the same thread already holds the
lock, the second if checks for blocking writers. Here we'd either unlock
and wait, or proceed. Both are valid states of the locking protocol.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f5c2a525

btrfs: drop incompat bit for raid1c34 after last block group is gone · 9c907446

由 David Sterba 提交于 10月 31, 2019

When there are no raid1c3 or raid1c4 block groups left after balance
(either convert or with other filters applied), remove the incompat bit.
This is already done for RAID56, do the same for RAID1C34.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9c907446

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功