1. 10 Sep 2020, 2 commits
    • virtiofs: provide a helper function for virtqueue initialization · b43b7e81
      Vivek Goyal authored
      This reduces code duplication and makes the code a little easier to read.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
    • dax: Create a range version of dax_layout_busy_page() · 6bbdd563
      Vivek Goyal authored
      The virtiofs device has a range of memory which is mapped into file
      inodes using dax. This memory is mapped in qemu on the host and maps
      different sections of the real file on the host. The size of this
      memory is limited (determined by the administrator) and, depending on
      filesystem size, we will soon reach a situation where all the memory
      is in use and we need to reclaim some.
      
      As part of reclaim process, we will need to make sure that there are
      no active references to pages (taken by get_user_pages()) on the memory
      range we are trying to reclaim. I am planning to use
      dax_layout_busy_page() for this. But in current form this is per inode
      and scans through all the pages of the inode.
      
      We want to reclaim only a portion of memory (say a 2MB page). So we
      want to make sure that only that 2MB range of pages has no references
      (and we don't want to unmap all the pages of the inode).
      
      Hence, create a range version of this function, named
      dax_layout_busy_page_range(), which takes the range that needs to be
      unmapped.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: Jan Kara <jack@suse.cz>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Cc: "Weiny, Ira" <ira.weiny@intel.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  2. 04 Sep 2020, 1 commit
  3. 29 Aug 2020, 1 commit
  4. 28 Aug 2020, 3 commits
  5. 27 Aug 2020, 2 commits
  6. 26 Aug 2020, 3 commits
  7. 25 Aug 2020, 2 commits
  8. 24 Aug 2020, 6 commits
    • ceph: fix inode number handling on arches with 32-bit ino_t · ebce3eb2
      Jeff Layton authored
      Tuan and Ulrich mentioned that they were hitting a problem on s390x,
      which has a 32-bit ino_t value, even though it's a 64-bit arch (for
      historical reasons).
      
      I think the current handling of inode numbers in the ceph driver is
      wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's
      actually not a problem. 32-bit arches can deal with 64-bit inode numbers
      just fine when userland code is compiled with LFS support (the common
      case these days).
      
      What we really want to do is just use 64-bit numbers everywhere, unless
      someone has mounted with the ino32 mount option. In that case, we want
      to ensure that we hash the inode number down to something that will fit
      in 32 bits before presenting the value to userland.
      
      Add new helper functions that do this, and only do the conversion before
      presenting these values to userland in getattr and readdir.
      
      The inode table hashvalue is changed to just cast the inode number to
      unsigned long, as low-order bits are the most likely to vary anyway.
      
      While it's not strictly required, we do want to put something in
      inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on
      the size of the ino_t type.
      
      NOTE: This is a user-visible change on 32-bit arches:
      
      1/ inode numbers will be seen to have changed between kernel versions.
         32-bit arches will see large inode numbers now instead of the hashed
         ones they saw before.
      
      2/ any really old software not built with LFS support may start failing
         stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we
         can do about these, but hopefully the intersection of people running
         such code on ceph will be very small.
      
      The workaround for both problems is to mount with "-o ino32".
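      The hashing idea can be sketched as follows; this is an illustrative
      userspace fold, not the exact kernel helper:

```python
def ino_to_ino32(ino64: int) -> int:
    """Hypothetical sketch of hashing a 64-bit inode number down to 32
    bits for the ino32 case; the real kernel helper may hash differently."""
    h = (ino64 & 0xffffffff) ^ (ino64 >> 32)
    return h or 1  # never hand userland inode number 0

# Small inode numbers pass through unchanged; large ones get folded.
```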
      
      [ idryomov: changelog tweak ]
      
      URL: https://tracker.ceph.com/issues/46828
      Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com>
      Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    • gfs2: add some much needed cleanup for log flushes that fail · 462582b9
      Bob Peterson authored
      When a log flush fails due to io errors, it signals the failure but does
      not clean up after itself very well. This is because buffers are added to
      the transaction tr_buf and tr_databuf queue, but the io error causes
      gfs2_log_flush to bypass the "after_commit" functions responsible for
      dequeueing the bd elements. If the bd elements are added to the ail list
      before the error, function ail_drain takes care of dequeueing them.
      But if they haven't gotten that far, the elements are forgotten,
      leaving the transactions unable to be freed.
      
      This patch introduces new function trans_drain which drains the bd
      elements from the transaction so they can be freed properly.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • binfmt_flat: revert "binfmt_flat: don't offset the data start" · 2217b982
      Max Filippov authored
      binfmt_flat loader uses the gap between text and data to store data
      segment pointers for the libraries. Even in the absence of shared
      libraries it stores at least one pointer to the executable's own data
      segment. Text and data can go back to back in the flat binary image,
      and without offsetting the data segment, the last few instructions in
      the text segment may get corrupted by the data segment pointer.
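      The layout constraint can be sketched numerically (a hypothetical
      illustration, assuming 4-byte target pointers):

```python
PTR_SIZE = 4  # assume 32-bit target pointers

def data_start(text_end: int, nlibs: int) -> int:
    """Hypothetical sketch: offset the data segment past the pointer
    table so the slots (at least one, for the executable's own data
    segment) don't overlap the last bytes of text."""
    slots = max(1, nlibs)
    return text_end + slots * PTR_SIZE
```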
      
      Fix it by reverting commit a2357223 ("binfmt_flat: don't offset the
      data start").
      
      Cc: stable@vger.kernel.org
      Fixes: a2357223 ("binfmt_flat: don't offset the data start")
      Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
    • treewide: Use fallthrough pseudo-keyword · df561f66
      Gustavo A. R. Silva authored
      Replace the existing /* fall through */ comments and their variants
      with the new pseudo-keyword macro fallthrough [1]. Also remove
      unnecessary fall-through markings where applicable.
      
      [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
    • io-wq: fix hang after cancelling pending hashed work · 204361a7
      Pavel Begunkov authored
      Don't forget to update wqe->hash_tail after cancelling a pending work
      item, if it was hashed.
      
      Cc: stable@vger.kernel.org # 5.7+
      Reported-by: Dmitry Shulyak <yashulyak@gmail.com>
      Fixes: 86f3cd1b ("io-wq: handle hashed writes in chains")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't recurse on tsk->sighand->siglock with signalfd · fd7d6de2
      Jens Axboe authored
      If an application is doing reads on signalfd, and we arm the poll
      handler because there's no data available, then the wakeup can recurse
      on the task's sighand->siglock, as the signal delivery from
      task_work_add() will use TWA_SIGNAL, which attempts to lock it again.
      
      We can detect the signalfd case pretty easily by comparing the
      poll->head wait_queue_head_t with the target task's signalfd wait
      queue. Just use normal task wakeup for this case.
      
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 23 Aug 2020, 2 commits
  10. 22 Aug 2020, 3 commits
    • afs: Fix NULL deref in afs_dynroot_depopulate() · 5e0b17b0
      David Howells authored
      During construction of an afs superblock, an error can occur after the
      superblock has been created but before we've created the root dentry.
      If the superblock has a dynamic root (i.e. what's normally mounted on
      /afs), afs_kill_super() will call afs_dynroot_depopulate() to unpin
      any created dentries - but this will oops if the root hasn't been
      created yet.
      
      Fix this by skipping that bit of code if there is no root dentry.
      
      The bug leads to an oops looking like:
      
      	general protection fault, ...
      	KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
      	...
      	RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
      	...
      	Call Trace:
      	 afs_kill_super+0x13b/0x180 fs/afs/super.c:535
      	 deactivate_locked_super+0x94/0x160 fs/super.c:335
      	 afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
      	 vfs_get_tree+0x89/0x2f0 fs/super.c:1547
      	 do_new_mount fs/namespace.c:2875 [inline]
      	 path_mount+0x1387/0x2070 fs/namespace.c:3192
      	 do_mount fs/namespace.c:3205 [inline]
      	 __do_sys_mount fs/namespace.c:3413 [inline]
      	 __se_sys_mount fs/namespace.c:3390 [inline]
      	 __x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
      	 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      which is oopsing on this line:
      
      	inode_lock(root->d_inode);
      
      presumably because sb->s_root was NULL.
      
      Fixes: 0da0b7fd ("afs: Display manually added cells in dynamic root mount")
      Reported-by: syzbot+c1eff8205244ae7e11a6@syzkaller.appspotmail.com
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • squashfs: avoid bio_alloc() failure with 1Mbyte blocks · f26044c8
      Phillip Lougher authored
      This is a regression introduced by the patch "migrate from ll_rw_block
      usage to BIO".
      
      bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a
      failure when reading 1 Mbyte block filesystems. The problem is that a
      datablock can be fully (or almost fully) uncompressed, requiring 256
      pages, but, because blocks are not aligned to page boundaries, it may
      require 257 pages to read.
      
      bio_kmalloc() can handle 1024 pages, so use it for this edge
      condition.
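      The page arithmetic behind the edge case can be sketched as:

```python
PAGE_SIZE = 4096

def pages_spanned(offset_in_page: int, length: int) -> int:
    """Number of pages touched by `length` bytes that start at byte
    `offset_in_page` within their first page."""
    return (offset_in_page + length + PAGE_SIZE - 1) // PAGE_SIZE

# A page-aligned 1 Mbyte block needs 256 pages, but the same block
# starting one byte into a page spills into a 257th page.
print(pages_spanned(0, 1 << 20), pages_spanned(1, 1 << 20))
```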
      
      Fixes: 93e72b3c ("squashfs: migrate from ll_rw_block usage to BIO")
      Reported-by: Nicolas Prochazka <nicolas.prochazka@gmail.com>
      Reported-by: Tomoatsu Shimada <shimada@walbrix.com>
      Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Guenter Roeck <groeck@chromium.org>
      Cc: Philippe Liard <pliard@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Adrien Schildknecht <adrien+dev@schischi.me>
      Cc: Daniel Rosenberg <drosen@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200815035637.15319-1-phillip@squashfs.org.uk
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • romfs: fix uninitialized memory leak in romfs_dev_read() · bcf85fce
      Jann Horn authored
      romfs has a superblock field that limits the size of the filesystem; data
      beyond that limit is never accessed.
      
      romfs_dev_read() fetches a caller-supplied number of bytes from the
      backing device.  It returns 0 on success or an error code on failure;
      therefore, its API can't represent short reads, it's all-or-nothing.
      
      However, when romfs_dev_read() detects that the requested operation
      would cross the filesystem size limit, it currently silently truncates
      the requested number of bytes. This means, for example, that when the
      content of a
      file with size 0x1000 starts one byte before the filesystem size limit,
      ->readpage() will only fill a single byte of the supplied page while
      leaving the rest uninitialized, leaking that uninitialized memory to
      userspace.
      
      Fix it by returning an error code instead of truncating the read when the
      requested read operation would go beyond the end of the filesystem.
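      The all-or-nothing behaviour can be sketched like this (an
      illustrative userspace model, not the kernel function):

```python
EIO = 5  # errno value for an I/O error

def dev_read(backing: bytes, pos: int, count: int, fs_limit: int):
    """Hypothetical sketch: return the requested bytes, or a negative
    errno if the read would cross the filesystem size limit - never a
    silently truncated result."""
    if pos >= fs_limit or count > fs_limit - pos:
        return -EIO
    return backing[pos:pos + count]
```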
      
      Fixes: da4458bd ("NOMMU: Make it possible for RomFS to use MTD devices directly")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200818013202.2246365-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 21 Aug 2020, 3 commits
    • btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov authored
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic
      for detecting a must-cow condition which is not exactly accurate, but
      saves unnecessary tree traversal. The incorrect assumption is that if
      the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is completely cleaned up after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must wait until 'btrfs subvolume sync' has
      finished before activating the swapfile with swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: check the right error variable in btrfs_del_dir_entries_in_log · fb2fecba
      Josef Bacik authored
      With my new locking code dbench is so much faster that I tripped over a
      transaction abort from ENOSPC.  This turned out to be because
      btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
      function sets err on error, and returns err.  So instead of properly
      marking the inode as needing a full commit, we were returning -ENOSPC
      and aborting in __btrfs_unlink_inode.  Fix this by checking the proper
      variable so that we return the correct thing in the case of ENOSPC.
      
      The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
      can return -ENOENT if the dir item isn't in the tree log (which would
      happen if we hadn't fsync'ed this guy).  We actually handle that case in
      __btrfs_unlink_inode, so it's an expected error to get back.
      
      Fixes: 4a500fd1 ("Btrfs: Metadata ENOSPC handling for tree log")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note and comment about ENOENT ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • afs: Fix key ref leak in afs_put_operation() · ba8e4207
      David Howells authored
      The afs_put_operation() function needs to put the reference to the key
      that's authenticating the operation.
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 20 Aug 2020, 12 commits
    • io_uring: kill extra iovec=NULL in import_iovec() · 867a23ea
      Pavel Begunkov authored
      If io_import_iovec() returns an error, the returned iovec is undefined
      and must not be used, so don't set it to NULL when failing.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: comment on kfree(iovec) checks · f261c168
      Pavel Begunkov authored
      kfree() handles NULL pointers well, but io_{read,write}() checks for
      NULL for performance reasons. Leave a comment there for those who are
      tempted to patch it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix racy req->flags modification · bb175342
      Pavel Begunkov authored
      Setting and clearing REQ_F_OVERFLOW in io_uring_cancel_files() and
      io_cqring_overflow_flush() are racy, because they might be called
      asynchronously.
      
      The REQ_F_OVERFLOW flag is only needed for file cancellation, so if it
      can be guaranteed that requests _currently_ marked inflight can't be
      overflown, the problem can be solved by removing the flag altogether.
      
      That's how the patch works: it removes the inflight status of a
      request in io_cqring_fill_event() whenever it should be thrown into
      the CQ-overflow list. That's OK to do, because no opcode-specific
      handling can be done after io_cqring_fill_event() - the same
      assumption as with the "struct io_completion" patches. And there is
      already a good place for such cleanups, which is io_clean_op(). A nice
      side effect of this is removing the inflight check from the hot path.
      
      Note on synchronisation: __io_cqring_fill_event() may now be taking
      two spinlocks simultaneously, completion_lock and inflight_lock. That
      is fine, because we never take them in the reverse order, and
      CQ-overflow of inflight requests shouldn't happen often.
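      The lock-ordering rule the note relies on can be illustrated with a
      tiny sketch: two locks cannot deadlock as long as every path acquires
      them in the same order.

```python
import threading

completion_lock = threading.Lock()
inflight_lock = threading.Lock()

def fill_event():
    # Every caller takes completion_lock first, then inflight_lock;
    # since no path takes them in the reverse order, this pair of locks
    # cannot deadlock against each other.
    with completion_lock:
        with inflight_lock:
            return "ok"
```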
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: use system_unbound_wq for ring exit work · fc666777
      Jens Axboe authored
      We currently use system_wq, which is unbounded in terms of number of
      workers. This means that if we're exiting tons of rings at the same
      time, then we'll briefly spawn tons of event kworkers just for a very
      short blocking time as the rings exit.
      
      Use system_unbound_wq instead, which has a sane cap on the concurrency
      level.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • btrfs: fix space cache memory leak after transaction abort · bbc37d6e
      Filipe Manana authored
      If a transaction aborts it can cause a memory leak of the pages array of
      a block group's io_ctl structure. The following steps explain how that can
      happen:
      
      1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
         and it's about to start writing out dirty extent buffers;
      
      2) Transaction N + 1 already started and another task, task A, just called
         btrfs_commit_transaction() on it;
      
      3) Block group B was dirtied (extents allocated from it) by transaction
         N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
         very beginning of the transaction commit, it starts writeback for the
         block group's space cache by calling btrfs_write_out_cache(), which
         allocates the pages array for the block group's io_ctl with a call to
         io_ctl_init(). Block group B is added to the io_list of transaction
         N + 1 by btrfs_start_dirty_block_groups();
      
      4) While transaction N's commit is writing out the extent buffers, it gets
         an IO error and aborts transaction N, also setting the file system to
         RO mode;
      
      5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
         btrfs_commit_transaction() and has set transaction N + 1 state to
         TRANS_STATE_COMMIT_START. Immediately after that it checks that the
         filesystem was turned to RO mode, due to transaction N's abort, and
         jumps to the "cleanup_transaction" label. After that we end up at
         btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
         That helper finds block group B in the transaction's io_list but it
         never releases the pages array of the block group's io_ctl, resulting in
         a memory leak.
      
      In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
      array points to pages that were already released by us at
      __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
      up freeing the pages array only after waiting for the ordered extent to
      complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
      that. But in the transaction abort case we don't wait for the space cache's
      ordered extent to complete through a call to btrfs_wait_cache_io(), so
      that's why we end up with a memory leak - we wait for the ordered extent
      to complete indirectly by shutting down the work queues and waiting for
      any jobs in them to complete before returning from close_ctree().
      
      We can solve the leak simply by freeing the pages array right after
      releasing the pages (with the call to io_ctl_drop_pages()) at
      __btrfs_write_out_cache(), since we will never use it again after
      that. Leaving the array pointing at already released pages is not a
      problem today, since no one uses it after that point, but it is bad
      practice anyway, as it can easily lead to use-after-free issues.
      
      This issue can often be reproduced with test case generic/475 from fstests
      and kmemleak can detect it and reports it with the following trace:
      
      unreferenced object 0xffff9bbf009fa600 (size 512):
        comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
        hex dump (first 32 bytes):
          00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff  ..|M=...@.|M=...
          80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff  ..|M=.....|M=...
        backtrace:
          [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
          [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
          [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
          [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
          [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
          [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
          [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
          [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
          [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
          [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
          [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
          [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
          [<0000000043ace2c9>] do_syscall_64+0x33/0x80
          [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use the correct const function attribute for btrfs_get_num_csums · 604997b4
      David Sterba authored
      The build robot reports
      
      compiler: h8300-linux-gcc (GCC) 9.3.0
         In file included from fs/btrfs/tests/extent-map-tests.c:8:
      >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
          2166 | size_t __const btrfs_get_num_csums(void);
               |        ^~~~~~~
      
      The function attribute for const does not follow the expected scheme and
      in this case is confused with a const type qualifier.
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reset compression level for lzo on remount · 282dd7d7
      Marcos Paulo de Souza authored
      Currently a user can mount with "-o compress", which will set the
      compression algorithm to zlib and use the default compression level
      for zlib (3):
      
        relatime,compress=zlib:3,space_cache
      
      If the user remounts the fs using "-o compress=lzo", then the old
      compress_level is used:
      
        relatime,compress=lzo:3,space_cache
      
      But lzo does not expose any tunable compression level. The same
      happens if we switch to any other compression algorithm with a
      different level, e.g. zstd.
      
      Fix this by resetting the compress_level when compress=lzo is
      specified.  With the fix applied, lzo is shown without compress level:
      
        relatime,compress=lzo,space_cache
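      The remount fix can be sketched as follows (hypothetical names; the
      real option handling lives in btrfs's mount-option parser):

```python
LEVELED_ALGOS = {"zlib", "zstd"}  # algorithms with a tunable level

class CompressOpts:
    """Hypothetical sketch of the remount behaviour: picking an
    algorithm without a tunable level resets the stored level."""

    def __init__(self) -> None:
        self.algo = None
        self.level = 0

    def set_compress(self, algo: str, level: int = 0) -> None:
        self.algo = algo
        # lzo has no levels, so don't let an old level leak through
        self.level = level if algo in LEVELED_ALGOS else 0

    def show(self) -> str:
        if self.level:
            return f"compress={self.algo}:{self.level}"
        return f"compress={self.algo}"
```

      Mounting with zlib level 3 and then remounting with lzo now shows
      plain "compress=lzo" instead of "compress=lzo:3".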
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle errors from async submission · c965d640
      Johannes Thumshirn authored
      Btrfs' async submit mechanism is able to handle errors in the submission
      path and the meta-data async submit function correctly passes the error
      code to the caller.
      
      In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io(),
      though, we're not handling the errors returned by btrfs_csum_one_bio()
      correctly and simply call BUG_ON(). This is unnecessary, as the caller
      of these two functions - run_one_async_start - correctly checks the
      return values
      and sets the status of the async_submit_bio. The actual bio submission
      will be handled later on by run_one_async_done only if
      async_submit_bio::status is 0, so the data won't be written if we
      encountered an error in the checksum process.
      
      Simply return the error from btrfs_csum_one_bio() to the async submitters,
      like it's done in btree_submit_bio_start().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • ext4: limit the length of per-inode prealloc list · 27bc446e
      brookxu authored
      In the scenario of writing sparse files, the per-inode prealloc list may
      be very long, resulting in high overhead for ext4_mb_use_preallocated().
      To circumvent this problem, we limit the maximum length of per-inode
      prealloc list to 512 and allow users to modify it.
      
      After patching, we observed that the CPU sys ratio dropped and system
      throughput increased significantly. We created a process to write a
      sparse file; its running time on the fixed kernel was significantly
      reduced, as follows:
      
      Running time on unfixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m2.051s
      user    0m0.008s
      sys     0m2.026s
      
      Running time on fixed kernel:
      [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat
      real    0m0.471s
      user    0m0.004s
      sys     0m0.395s
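      The capping idea can be sketched as a bounded list (illustrative
      only; the class name and discard policy here are assumptions, not the
      ext4 implementation):

```python
from collections import deque

MAX_INODE_PREALLOCS = 512  # default cap described above

class InodePreallocList:
    """Hypothetical sketch: keep the per-inode prealloc list bounded so
    scanning it stays cheap; oldest entries are discarded first."""

    def __init__(self, limit: int = MAX_INODE_PREALLOCS) -> None:
        self.limit = limit
        self.entries = deque()

    def add(self, extent) -> None:
        self.entries.append(extent)
        while len(self.entries) > self.limit:
            self.entries.popleft()  # discard the oldest preallocation
```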
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: reorganize if statement of ext4_mb_release_context() · 66d5e027
      brookxu authored
      Reorganize the if statement of ext4_mb_release_context() to make it
      easier to read.
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/5439ac6f-db79-ad68-76c1-a4dda9aa0cc3@gmail.com
      Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add mb_debug logging when there are lost chunks · c55ee7d2
      brookxu authored
      Lost chunks occur when some other process races with the current
      thread to grab a particular block allocation. Add mb_debug logging for
      developers who want to see how often this happens for a particular
      workload.
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Link: https://lore.kernel.org/r/0a165ac0-1912-aebd-8a0d-b42e7cd1aea1@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: Fix comment typo "the the". · 7ca4fcba
      kyoungho koo authored
      I found doubled words ("the the") in comments, so I reduced them to a
      single "the".
      Signed-off-by: kyoungho koo <rnrudgh@gmail.com>
      Link: https://lore.kernel.org/r/20200424171620.GA11943@koo-Z370-HD3
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>