1. 07 Dec 2020, 1 commit
  2. 01 Dec 2020, 1 commit
  3. 26 Nov 2020, 1 commit
    • io_uring: fix files grab/cancel race · af604703
      Authored by Pavel Begunkov
      When one task is in io_uring_cancel_files() and another is doing
      io_prep_async_work(), a race may happen. That's because after accounting
      a request as inflight in the first call to io_grab_identity(), the
      request may still fail and go to io_identity_cow(), which might briefly
      keep a dangling work.identity pointer, among other problems.
      
      Grab files last, so io_prep_async_work() won't fail if it did get into
      ->inflight_list.
      
      Note: the bug shouldn't exist after making io_uring_cancel_files() not
      poke into other tasks' requests.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      af604703
  4. 24 Nov 2020, 2 commits
    • io_uring: fix ITER_BVEC check · 9c3a205c
      Authored by Pavel Begunkov
      iov_iter::type is a bitmask that also keeps the direction etc., so it
      shouldn't be compared directly against ITER_*. Use the proper helper.
      
      Fixes: ff6165b2 ("io_uring: retain iov_iter state over io_read/io_write calls")
      Reported-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: <stable@vger.kernel.org> # 5.9
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9c3a205c
    • io_uring: fix shift-out-of-bounds when round up cq size · eb2667b3
      Authored by Joseph Qi
      Abaci Fuzz reported a shift-out-of-bounds BUG in io_uring_create():
      
      [ 59.598207] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
      [ 59.599665] shift exponent 64 is too large for 64-bit type 'long unsigned int'
      [ 59.601230] CPU: 0 PID: 963 Comm: a.out Not tainted 5.10.0-rc4+ #3
      [ 59.602502] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [ 59.603673] Call Trace:
      [ 59.604286] dump_stack+0x107/0x163
      [ 59.605237] ubsan_epilogue+0xb/0x5a
      [ 59.606094] __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e
      [ 59.607335] ? lock_downgrade+0x6c0/0x6c0
      [ 59.608182] ? rcu_read_lock_sched_held+0xaf/0xe0
      [ 59.609166] io_uring_create.cold+0x99/0x149
      [ 59.610114] io_uring_setup+0xd6/0x140
      [ 59.610975] ? io_uring_create+0x2510/0x2510
      [ 59.611945] ? lockdep_hardirqs_on_prepare+0x286/0x400
      [ 59.613007] ? syscall_enter_from_user_mode+0x27/0x80
      [ 59.614038] ? trace_hardirqs_on+0x5b/0x180
      [ 59.615056] do_syscall_64+0x2d/0x40
      [ 59.615940] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 59.617007] RIP: 0033:0x7f2bb8a0b239
      
      This is caused by roundup_pow_of_two() when the input entries value is
      large enough, e.g. 2^32-1. For sq_entries, we check the bound first and
      allow at most IORING_MAX_ENTRIES, so it is okay. But for cq_entries, we
      round up first, which may overflow and truncate the value to 0, which is
      not the expected behavior. So check the cq size first and then round up.
      
      Fixes: 88ec3211 ("io_uring: round-up cq size before comparing with rounded sq size")
      Reported-by: Abaci Fuzz <abaci@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb2667b3
  5. 23 Nov 2020, 2 commits
    • afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      Authored by David Howells
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Authored by Yicong Yang
      The attr->set() callback receives a u64 value, but simple_strtoll() is
      used for the conversion.  This leads to an erroneous cast if the user
      inputs a negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string from the
      user to an unsigned value.  The former returns -EINVAL if it gets a
      negative value, but the latter can't handle that situation correctly.
      Make 'val' unsigned long long, the type kstrtoull() takes; this also
      eliminates the compile warning on non-64-bit architectures.
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
  6. 20 Nov 2020, 5 commits
    • ext4: fix bogus warning in ext4_update_dx_flag() · f902b216
      Authored by Jan Kara
      The idea of the warning in ext4_update_dx_flag() is that we should warn
      when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
      checksums enabled since after clearing the flag, checksums for internal
      htree nodes will become invalid. So there's no need to warn (or actually
      do anything) when EXT4_INODE_INDEX is not set.
      
      Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
      Fixes: 48a34311 ("ext4: fix checksum errors with indexed dirs")
      Reported-by: Eric Biggers <ebiggers@kernel.org>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      f902b216
    • jbd2: fix kernel-doc markups · 2bf31d94
      Authored by Mauro Carvalho Chehab
      Kernel-doc markup should use this format:
              identifier - description
      
      There should not be any type before the identifier, as otherwise
      the parser won't do the right thing.
      
      Also, some identifiers have different names in their prototypes
      and in the kernel-doc markup.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      2bf31d94
    • xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Authored by Darrick J. Wong
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      eb840907
    • ext4: drop fast_commit from /proc/mounts · 704c2317
      Authored by Theodore Ts'o
      The options in /proc/mounts must be valid mount options --- and
      fast_commit is not a mount option.  Otherwise, command sequences like
      this will fail:
      
          # mount /dev/vdc /vdc
          # mkdir -p /vdc/phoronix_test_suite /pts
          # mount --bind /vdc/phoronix_test_suite /pts
          # mount -o remount,nodioread_nolock /pts
          mount: /pts: mount point not mounted or bad option.
      
      And in the system logs, you'll find:
      
          EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value
      
      Fixes: 995a3ed6 ("ext4: add fast_commit feature and handling for extended mount options")
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      704c2317
    • xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Authored by Dave Chinner
      Jens has reported a situation where partial direct IOs can be issued
      and completed yet still return -EAGAIN. We don't want this to report
      a short IO as we want XFS to complete user DIO entirely or not at
      all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN out because we hit the allocate case trying
      to make the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappings under NOWAIT
      conditions, e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      883a790a
  7. 19 Nov 2020, 6 commits
  8. 18 Nov 2020, 4 commits
    • gfs2: Fix regression in freeze_go_sync · 20b32912
      Authored by Bob Peterson
      Patch 541656d3 ("gfs2: freeze should work on read-only mounts") changed
      the check for glock state in function freeze_go_sync() from "gl->gl_state
      == LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE".  That's wrong and it
      regressed gfs2's freeze/thaw mechanism because it caused only the freezing
      node (which requests the glock in EX) to queue freeze work.
      
      All nodes go through this go_sync code path during the freeze to drop their
      SHared hold on the freeze glock, allowing the freezing node to acquire it
      in EXclusive mode. But all the nodes must freeze access to the file system
      locally, so they ALL must queue freeze work. The freeze_work calls
      freeze_func, which makes a request to reacquire the freeze glock in SH,
      effectively blocking until the thaw from the EX holder. Once thawed, the
      freezing node drops its EX hold on the freeze glock, then the (blocked)
      freeze_func reacquires the freeze glock in SH again (on all nodes, including
      the freezer) so all nodes go back to a thawed state.
      
      This patch changes the check back to gl_state == LM_ST_SHARED like it was
      prior to 541656d3.
      
      Fixes: 541656d3 ("gfs2: freeze should work on read-only mounts")
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      20b32912
    • io_uring: order refnode recycling · e297822b
      Authored by Pavel Begunkov
      Don't recycle a refnode until we're done with all requests of the
      nodes ejected before it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e297822b
    • io_uring: get an active ref_node from files_data · 1e5d770b
      Authored by Pavel Begunkov
      An active ref_node can always be found in ctx->files_data; it's much
      safer to get it this way instead of poking into files_data->ref_list.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e5d770b
    • io_uring: don't double complete failed reissue request · c993df5a
      Authored by Jens Axboe
      Zorro reports that an xfstest test case is failing, and it turns out that
      for the reissue path we can potentially issue a double completion on the
      request for the failure path. There's an issue around the retry as well,
      but for now, at least just make sure that we handle the error path
      correctly.
      
      Cc: stable@vger.kernel.org
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c993df5a
  9. 15 Nov 2020, 3 commits
    • afs: Fix afs_write_end() when called with copied == 0 [ver #3] · 3ad216ee
      Authored by David Howells
      When afs_write_end() is called with copied == 0, it tries to set the
      dirty region, but there's no way to actually encode a 0-length region in
      the encoding in page->private.
      
      "0,0", for example, indicates a 1-byte region at offset 0.  The maths
      miscalculates this and sets it incorrectly.
      
      Fix it to just do nothing but unlock and put the page in this case.  We
      don't actually need to mark the page dirty as nothing presumably
      changed.
      
      Fixes: 65dd2d60 ("afs: Alter dirty range encoding in page->private")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad216ee
    • ocfs2: initialize ip_next_orphan · f5785283
      Authored by Wengang Wang
      Though the problem was found on an older 4.1.12 kernel, I think upstream
      has the same issue.
      
      On one node in the cluster, there is the following call trace:
      
         # cat /proc/21473/stack
         __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
         ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
         ocfs2_evict_inode+0x152/0x820 [ocfs2]
         evict+0xae/0x1a0
         iput+0x1c6/0x230
         ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
         ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
         ocfs2_dir_foreach+0x29/0x30 [ocfs2]
         ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
         ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
         process_one_work+0x169/0x4a0
         worker_thread+0x5b/0x560
         kthread+0xcb/0xf0
         ret_from_fork+0x61/0x90
      
      The above stack is not reasonable, the final iput shouldn't happen in
      ocfs2_orphan_filldir() function.  Looking at the code,
      
        2067         /* Skip inodes which are already added to recover list, since dio may
        2068          * happen concurrently with unlink/rename */
        2069         if (OCFS2_I(iter)->ip_next_orphan) {
        2070                 iput(iter);
        2071                 return 0;
        2072         }
        2073
      
      The logic thinks the inode is already in the recover list on seeing
      ip_next_orphan is non-NULL, so it skips this inode after dropping the
      reference that was incremented in ocfs2_iget().
      
      However, if the inode were already in the recover list, it would have
      another reference, and the iput() at line 2070 would not be the final
      iput (dropping the last reference).  So I don't think the inode is
      really in the recover list (no vmcore to confirm).
      
      Note that ocfs2_queue_orphans(), though not shown in the call trace, is
      holding the cluster lock on the orphan directory when looking up
      unlinked inodes.  The on-disk inode eviction could involve a lot of IO,
      which may take a long time to finish.  That means this node could hold
      the cluster lock for a very long time, which can cause lock requests
      (from other nodes) for the orphan directory to hang for a long time.
      
      Looking more closely at ip_next_orphan, I found it is not initialized
      when allocating a new ocfs2_inode_info structure.
      
      This causes the reflink operations from some nodes to hang for a very
      long time waiting for the cluster lock on the orphan directory.
      
      Fix: initialize ip_next_orphan to NULL.
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5785283
    • io_uring: handle -EOPNOTSUPP on path resolution · 944d1444
      Authored by Jens Axboe
      Any attempt to do path resolution on /proc/self from an async worker will
      yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
      and without blocking, so retry it from there.
      
      Ideally io_uring would know this upfront and not have to go through the
      worker thread to find out, but that doesn't currently seem feasible.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      944d1444
  10. 14 Nov 2020, 1 commit
  11. 13 Nov 2020, 2 commits
    • gfs2: Fix case in which ail writes are done to jdata holes · 4e79e3f0
      Authored by Bob Peterson
      Patch b2a846db ("gfs2: Ignore journal log writes for jdata holes")
      tried (unsuccessfully) to fix a case in which writes were done to jdata
      blocks, the blocks are sent to the ail list, then a punch_hole or truncate
      operation caused the blocks to be freed. In other words, the ail items
      are for jdata holes. Before b2a846db, the jdata hole caused function
      gfs2_block_map to return -EIO, which was eventually interpreted as an
      IO error to the journal, and then withdraw.
      
      This patch changes function gfs2_get_block_noalloc, which is only used
      for jdata writes, so it returns -ENODATA rather than -EIO, and when
      -ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
      We can safely ignore it because gfs2_ail1_start_one is only called
      when the jdata pages have already been written and truncated, so the
      ail1 content no longer applies.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      4e79e3f0
    • Revert "gfs2: Ignore journal log writes for jdata holes" · d3039c06
      Authored by Bob Peterson
      This reverts commit b2a846db.
      
      That commit changed the behavior of function gfs2_block_map to return
      -ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
      false.  While that fixed the intended problem for jdata, it also broke
      other callers of gfs2_block_map such as some jdata block reads.  Before
      the patch, an encountered hole would be skipped and the buffer seen as
      unmapped by the caller.  The patch changed the behavior to return
      -ENODATA, which is interpreted as an error by the caller.
      
      The -ENODATA return code should be restricted to the specific case where
      jdata holes are encountered during ail1 writes.  That will be done in a
      later patch.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      d3039c06
  12. 12 11月, 2020 10 次提交
  13. 11 11月, 2020 2 次提交