Commits · f5d92749191402c50e32ac83dd9da3b910f5680f · openeuler / Kernel

23 Jan, 2021 5 commits

xfs: Check for extent overflow when adding dir entries · f5d92749

Committed by Chandan Babu R on Jan 22, 2021

Directory entry addition can cause the following,
1. Data block can be added/removed.
   A new extent can cause extent count to increase by 1.
2. Free disk block can be added/removed.
   Same behaviour as described above for Data block.
3. Dabtree blocks.
   XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
   can be new extents. Hence extent count can increase by
   XFS_DA_NODE_MAXDEPTH.
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
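
The three cases listed in the commit message above add up to a worst-case bound. A minimal sketch of that tally, assuming each case contributes exactly what the message states; the function name is hypothetical, and the dabtree depth is passed in because the message does not give a value for XFS_DA_NODE_MAXDEPTH:

```c
/* Hypothetical helper: worst-case number of new extents caused by
 * adding one directory entry, per the three cases in the message. */
static unsigned int dir_entry_max_new_extents(unsigned int da_node_maxdepth)
{
    return 1u                   /* case 1: one new data block extent */
         + 1u                   /* case 2: one new free block extent */
         + da_node_maxdepth;    /* case 3: up to MAXDEPTH dabtree blocks */
}
```

A caller would compare the current extent count plus this bound against the counter's maximum before starting the operation.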

xfs: Check for extent overflow when punching a hole · 85ef08b5

Committed by Chandan Babu R on Jan 22, 2021

The extent mapping the file offset at which a hole has to be
inserted will be split into two extents causing extent count to
increase by 1.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: Check for extent overflow when trivally adding a new extent · 727e1acd

Committed by Chandan Babu R on Jan 22, 2021

When adding a new data extent (without modifying an inode's existing
extents) the extent count increases only by 1. This commit checks for
extent count overflow in such cases.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

xfs: Add helper for checking per-inode extent count overflow · b9b7e1dc

Committed by Chandan Babu R on Jan 22, 2021

XFS does not check for possible overflow of per-inode extent counter
fields when adding extents to either data or attr fork.

For example:
1. Insert 5 million xattrs (each having a value size of 255 bytes) and
   then delete 50% of them in an alternating manner.

2. On a 4k block sized XFS filesystem instance, the above causes 98511
   extents to be created in the attr fork of the inode.

   xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131

3. The incore inode fork extent counter is a signed 32-bit
   quantity. However the on-disk extent counter is an unsigned 16-bit
   quantity and hence cannot hold 98511 extents.

4. The following incorrect value is stored in the attr extent counter,
   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561

This commit adds a new helper function (i.e.
xfs_iext_count_may_overflow()) to check for overflow of the per-inode
data and xattr extent counters. Future patches will use this function to
make sure that an FS operation won't cause the extent counter to
overflow.
Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
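
The arithmetic behind the -32561 in the message above can be checked by hand: 98511 keeps only its low 16 bits (98511 - 65536 = 32975), and 32975 read back as a signed 16-bit value is 32975 - 65536 = -32561. The sketch below reproduces that and adds a pre-check in the spirit of the new helper. Both function names and the `max_extents` parameter are hypothetical; this is not the signature of xfs_iext_count_may_overflow().

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep only the low 16 bits of an in-core count and read them back as a
 * signed 16-bit value, the way the xfs_db output displays the field. */
static int32_t stored_as_16bit(int32_t incore_count)
{
    uint32_t low = (uint32_t)incore_count & 0xFFFFu;   /* 98511 -> 32975 */

    return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}

/* Hypothetical check: would adding nr_to_add extents exceed max_extents? */
static bool extent_count_would_overflow(uint32_t current, uint32_t nr_to_add,
                                        uint32_t max_extents)
{
    /* Widen to 64 bits first so the addition itself cannot wrap. */
    return (uint64_t)current + nr_to_add > max_extents;
}
```

Running the check before the operation starts lets the caller fail cleanly instead of writing a truncated count to disk.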

xfs: fix an ABBA deadlock in xfs_rename · 6da1b4b1

Committed by Darrick J. Wong on Jan 22, 2021

When overlayfs is running on top of xfs and the user unlinks a file in
the overlay, overlayfs will create a whiteout inode and ask xfs to
"rename" the whiteout file atop the one being unlinked. If the file
being unlinked loses its one nlink, we then have to put the inode on the
unlinked list.

This requires us to grab the AGI buffer of the whiteout inode to take it
off the unlinked list (which is where whiteouts are created) and to grab
the AGI buffer of the file being deleted. If the whiteout was created
in a higher numbered AG than the file being deleted, we'll lock the AGIs
in the wrong order and deadlock.

Therefore, grab all the AGI locks we think we'll need ahead of time, and
in order of increasing AG number per the locking rules.
Reported-by: wenli xie <wlxie7296@gmail.com>
Fixes: 93597ae8 ("xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
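
The rule the fix relies on, always taking locks in increasing index order, is the standard way to rule out ABBA deadlocks. A minimal userspace sketch of the idea with POSIX mutexes follows; the lock array, `NR_GROUPS`, and both function names are hypothetical stand-ins, not XFS code.

```c
#include <pthread.h>
#include <stddef.h>

#define NR_GROUPS 4   /* hypothetical number of lockable groups */

static pthread_mutex_t group_lock[NR_GROUPS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};

/* Lock groups a and b in increasing index order, whatever order the
 * caller names them in. Reports the order actually taken. */
static void lock_two_groups(size_t a, size_t b, size_t *first, size_t *second)
{
    size_t lo = a < b ? a : b;
    size_t hi = a < b ? b : a;

    pthread_mutex_lock(&group_lock[lo]);
    if (hi != lo)                       /* same group: lock it only once */
        pthread_mutex_lock(&group_lock[hi]);
    *first = lo;
    *second = hi;
}

/* Release in the reverse order. */
static void unlock_two_groups(size_t a, size_t b)
{
    size_t lo = a < b ? a : b;
    size_t hi = a < b ? b : a;

    if (hi != lo)
        pthread_mutex_unlock(&group_lock[hi]);
    pthread_mutex_unlock(&group_lock[lo]);
}
```

Because every thread acquires the lower index first, two threads naming the same pair in opposite orders can no longer each hold one lock while waiting for the other.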

17 Jan, 2021 2 commits

mm: don't play games with pinned pages in clear_page_refs · 9348b73c

Committed by Linus Torvalds on Jan 09, 2021

Turning a pinned page read-only breaks the pinning after COW. Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix clear_refs_write locking · 29a951df

Committed by Linus Torvalds on Jan 08, 2021

Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back.  Just get the write lock.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

16 Jan, 2021 6 commits

io_uring: ensure finish_wait() is always called in __io_uring_task_cancel() · a8d13dbc

Committed by Jens Axboe on Jan 15, 2021

If we enter with requests pending and perform cancelations, we'll have
a different inflight count before and after calling prepare_to_wait().
This causes the loop to restart. If we actually ended up canceling
everything, or everything completed in-between, then we'll break out
of the loop without calling finish_wait() on the waitqueue. This can
trigger a warning on exit_signals(), as we leave the task state in
TASK_UNINTERRUPTIBLE.

Put a finish_wait() after the loop to catch that case.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ext4: remove expensive flush on fast commit · e9f53353

Committed by Daejun Park on Jan 06, 2021

In the fast commit, it adds REQ_FUA and REQ_PREFLUSH on each fast
commit block when barrier is enabled. However, in recovery phase,
ext4 compares CRC value in the tail. So it is sufficient to add
REQ_FUA and REQ_PREFLUSH on the block that has tail.
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix bug for rename with RENAME_WHITEOUT · 6b4b8e6b

Committed by yangerkun on Jan 05, 2021

We got a "deleted inode referenced" warning during our fsstress test. The
bug can be reproduced easily with the following steps:

  cd /dev/shm
  mkdir test/
  fallocate -l 128M img
  mkfs.ext4 -b 1024 img
  mount img test/
  dd if=/dev/zero of=test/foo bs=1M count=128
  mkdir test/dir/ && cd test/dir/
  for ((i=0;i<1000;i++)); do touch file$i; done # consume all block
  cd ~ && renameat2(AT_FDCWD, /dev/shm/test/dir/file1, AT_FDCWD,
    /dev/shm/test/dir/dst_file, RENAME_WHITEOUT) # ext4_add_entry in
    ext4_rename will return ENOSPC!!
  cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1
  We will get the output:
  "ls: cannot access 'test/dir/file1': Structure needs cleaning"
  and the dmesg show:
  "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
  deleted inode referenced: 139"

ext4_rename will create a special inode for the whiteout and use this 'ino'
to replace the source file's dir entry 'ino'. Once an error happens
later (the error above was the ENOSPC returned from ext4_add_entry in
ext4_rename since all space has been consumed), the cleanup does drop the
nlink for the whiteout, but forgets to restore 'ino' to that of the source
file. This will trigger the bug described above.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Fixes: cd808dec ("ext4: support RENAME_WHITEOUT")
Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
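
The bug class described above is a missing rollback: a directory entry is repointed early, and a later failure leaves it pointing at an inode that cleanup has already dropped. The toy model below shows only the shape of the fix, saving the old value and restoring it on the error path. The struct, function, and `later_step_fails` flag are hypothetical and this is not ext4 code.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy directory entry: just the inode number it points at. */
struct toy_dirent {
    unsigned long ino;
};

/* Repoint src at the whiteout inode, then run a later step that may
 * fail. On failure the original inode number is put back. */
static int toy_rename_whiteout(struct toy_dirent *src,
                               unsigned long whiteout_ino,
                               bool later_step_fails)
{
    unsigned long old_ino = src->ino;   /* remember what to restore */

    src->ino = whiteout_ino;
    if (later_step_fails) {
        src->ino = old_ino;             /* the step the buggy cleanup skipped */
        return -ENOSPC;
    }
    return 0;
}
```

Without the restore line, the failing call would return an error yet leave the entry referencing the whiteout inode, which matches the "deleted inode referenced" symptom.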

ext4: fix wrong list_splice in ext4_fc_cleanup · 31e203e0

Committed by Daejun Park on Dec 30, 2020

After a full/fast commit, entries in the staging queue are promoted to the
main queue. However, the ext4_fc_cleanup function splices the staging queue
onto the staging queue itself.

Fixes: aa75f4d3 ("ext4: main fast-commit commit path")
Signed-off-by: Daejun Park <daejun7.park@samsung.com>
Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR · 23dd561a

Committed by Yi Li on Dec 30, 2020

1: ext4_iget/ext4_find_extent never return NULL, so use IS_ERR
instead of IS_ERR_OR_NULL.

2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR is true,
so that the later call to iput is handled properly.

Fixes: 8016e29f ("ext4: fast commit recovery path")
Signed-off-by: Yi Li <yili@winhong.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org

io_uring: flush timeouts that should already have expired · f010505b

Committed by Marcelo Diop-Gonzalez on Jan 15, 2021

Right now io_flush_timeouts() checks if the current number of events
is equal to ->timeout.target_seq, but this will miss some timeouts if
there have been more than 1 event added since the last time they were
flushed (possible in io_submit_flush_completions(), for example). Fix
it by recording the last sequence at which timeouts were flushed so
that the number of events seen can be compared to the number of events
needed without overflow.
Signed-off-by: Marcelo Diop-Gonzalez <marcelo827@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
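
The comparison described above can be sketched with unsigned 32-bit arithmetic: measuring both the events needed and the events seen as distances from the last flush point keeps the comparison correct even when the sequence counter wraps. The function and parameter names below are hypothetical, chosen to mirror the wording of the message rather than the io_uring source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a timeout that targets target_seq is due, given the
 * current sequence and the sequence at which timeouts were last flushed. */
static bool timeout_is_due(uint32_t cur_seq, uint32_t target_seq,
                           uint32_t last_flush)
{
    /* Both distances are taken modulo 2^32 from the same origin. */
    uint32_t events_needed = target_seq - last_flush;
    uint32_t events_got = cur_seq - last_flush;

    return events_got >= events_needed;
}
```

With an equality test, a jump from sequence 8 to 12 would skip a timeout targeting 10; the distance comparison catches it because 4 events were seen and only 2 were needed.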

14 Jan, 2021 5 commits

cifs: style: replace one-element array with flexible-array · e54fd071

Committed by YANG LI on Dec 30, 2020

There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use "flexible array members"[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arrays
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
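
For readers unfamiliar with the construct, here is a minimal userspace sketch of a C99 flexible array member. The `struct packet` type and `packet_new()` helper are hypothetical examples, not taken from the cifs code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct packet {
    size_t len;
    unsigned char data[];   /* flexible array member: must be last */
};

/* Allocate a packet with room for len trailing bytes and copy them in.
 * Returns NULL on allocation failure or size overflow. */
static struct packet *packet_new(const void *src, size_t len)
{
    struct packet *p;

    if (len > SIZE_MAX - sizeof(*p))    /* header + payload would wrap */
        return NULL;
    p = malloc(sizeof(*p) + len);       /* sizeof excludes the array */
    if (!p)
        return NULL;
    p->len = len;
    memcpy(p->data, src, len);
    return p;
}
```

Unlike a one-element array, `sizeof(struct packet)` here does not include any payload element, so the allocation size is simply the header plus the payload length.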

cifs: connect: style: Simplify bool comparison · ed6b1920

Committed by YANG LI on Jan 11, 2021

Fix the following coccicheck warning:
./fs/cifs/connect.c:3740:6-21: WARNING: Comparison of 0/1 to bool
variable
Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

fs: cifs: remove unneeded variable in smb3_fs_context_dup · c13e7af0

Committed by Menglong Dong on Jan 12, 2021

'rc' in smb3_fs_context_dup is not used and can be removed.
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: fix interrupted close commands · 2659d3bf

Committed by Paulo Alcantara on Jan 13, 2021

Retry close command if it gets interrupted to not leak open handles on
the server.
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reported-by: Duncan Findlay <duncf@duncf.ca>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Fixes: 6988a619 ("cifs: allow syscalls to be restarted in __smb_send_rqst()")
Cc: stable@vger.kernel.org
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: check pointer before freeing · 77b6ec01

Committed by Tom Rix on Jan 05, 2021

clang static analysis reports this problem

dfs_cache.c:591:2: warning: Argument to kfree() is a constant address
  (18446744073709551614), which is not memory allocated by malloc()
        kfree(vi);
        ^~~~~~~~~

In dfs_cache_del_vol() the volume info pointer 'vi' being freed
is the return of a call to find_vol().  The large constant address
is find_vol() returning an error.

Add an error check to dfs_cache_del_vol() similar to the one done
in dfs_cache_update_vol().

Fixes: 54be1f6c ("cifs: Add DFS cache routines")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
CC: <stable@vger.kernel.org> # v5.0+
Signed-off-by: Steve French <stfrench@microsoft.com>
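
The constant in the warning above, 18446744073709551614, is 2^64 - 2, which is what an error code of -2 (ENOENT on Linux) looks like when carried in a 64-bit pointer. The sketch below is a userspace re-creation of that error-pointer convention and of the guard the commit adds; `err_ptr()`, `is_err()`, `ptr_err()`, `find_thing()` and `delete_thing()` are hypothetical stand-ins written for this example, not the kernel macros or the cifs functions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True when the pointer falls in the top MAX_ERRNO addresses. */
static bool is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Decode the errno value back out of an error pointer. */
static long ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Hypothetical lookup that reports "not found" as an error pointer. */
static void *find_thing(bool present)
{
    return present ? malloc(16) : err_ptr(-ENOENT);
}

/* Check the lookup result before freeing, so an encoded error is never
 * handed to free(). Returns 0 or a negative errno. */
static int delete_thing(bool present)
{
    void *t = find_thing(present);

    if (is_err(t))
        return (int)ptr_err(t);
    free(t);
    return 0;
}
```

Dropping the `is_err()` check would pass the encoded error straight to `free()`, the same mistake the static analyzer flagged for `kfree(vi)`.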

13 Jan, 2021 2 commits

io_uring: do sqo disable on install_fd error · 06585c49

Committed by Pavel Begunkov on Jan 13, 2021

WARNING: CPU: 0 PID: 8494 at fs/io_uring.c:8717
	io_ring_ctx_wait_and_kill+0x4f2/0x600 fs/io_uring.c:8717
Call Trace:
 io_uring_release+0x3e/0x50 fs/io_uring.c:8759
 __fput+0x283/0x920 fs/file_table.c:280
 task_work_run+0xdd/0x190 kernel/task_work.c:140
 tracehook_notify_resume include/linux/tracehook.h:189 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:174 [inline]
 exit_to_user_mode_prepare+0x249/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

failed io_uring_install_fd() is a special case, we don't do
io_ring_ctx_wait_and_kill() directly but defer it to fput, though still
need to io_disable_sqo_submit() before.

note: it doesn't fix any real problem, just a warning. That's because
sqring won't be available to the userspace in this case and so SQPOLL
won't submit anything.

Reported-by: syzbot+9c9c35374c0ecac06516@syzkaller.appspotmail.com
Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix null-deref in io_disable_sqo_submit · b4411616

Committed by Pavel Begunkov on Jan 13, 2021

general protection fault, probably for non-canonical address
	0xdffffc0000000022: 0000 [#1] KASAN: null-ptr-deref
	in range [0x0000000000000110-0x0000000000000117]
RIP: 0010:io_ring_set_wakeup_flag fs/io_uring.c:6929 [inline]
RIP: 0010:io_disable_sqo_submit+0xdb/0x130 fs/io_uring.c:8891
Call Trace:
 io_uring_create fs/io_uring.c:9711 [inline]
 io_uring_setup+0x12b1/0x38e0 fs/io_uring.c:9739
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_disable_sqo_submit() might be called before user rings were
allocated, don't do io_ring_set_wakeup_flag() in those cases.

Reported-by: syzbot+ab412638aeb652ded540@syzkaller.appspotmail.com
Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

11 Jan, 2021 12 commits

io_uring: don't take files/mm for a dead task · 621fadc2

Committed by Pavel Begunkov on Jan 11, 2021

In rare cases a task may be exiting while io_ring_exit_work() is trying to
cancel/wait its requests. It's ok for __io_sq_thread_acquire_mm()
because of SQPOLL check, but is not for __io_sq_thread_acquire_files().
Play safe and fail for both of them.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: drop mm and files after task_work_run · d434ab6d

Committed by Pavel Begunkov on Jan 11, 2021

__io_req_task_submit() run by task_work can set mm and files, but
io_sq_thread() in some cases, and because __io_sq_thread_acquire_mm()
and __io_sq_thread_acquire_files() do a simple current->mm/files check
it may end up submitting IO with mm/files of another task.

We also need to drop it after in the end to drop potentially grabbed
references to them.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

NFS: nfs_igrab_and_active must first reference the superblock · 896567ee

Committed by Trond Myklebust on Jan 10, 2021

Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: ea7c38fe ("NFSv4: Ensure we reference the inode for return-on-close in delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: nfs_delegation_find_inode_server must first reference the superblock · 113aac6d

Committed by Trond Myklebust on Jan 10, 2021

Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: e39d8a18 ("NFSv4: Fix an Oops during delegation callbacks")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter · cb2856c5

Committed by Trond Myklebust on Jan 06, 2021

If we exit _lgopen_prepare_attached() without setting a layout, we will
currently leak the plh_outstanding counter.

Fixes: 411ae722 ("pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit() · 46c9ea1d

Committed by Trond Myklebust on Jan 06, 2021

We must ensure that we pass a layout segment to nfs_retry_commit() when
we're cleaning up after pnfs_bucket_alloc_ds_commits(). Otherwise,
requests that should be committed to the DS will get committed to the
MDS.
Do so by ensuring that pnfs_bucket_get_committing() always tries to
return a layout segment when it returns a non-empty page list.

Fixes: c84bea59 ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request · 1757655d

Committed by Trond Myklebust on Jan 06, 2021

In pnfs_generic_clear_request_commit(), we try calling
pnfs_free_bucket_lseg() before we remove the request from the DS bucket.
That will always fail, since the point is to test for whether or not
that bucket is empty.

Fixes: c84bea59 ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Stricter ordering of layoutget and layoutreturn · 2c8d5fc3

Committed by Trond Myklebust on Jan 05, 2021

If a layout return is in progress, we should wait for it to complete,
in case the layout segment we are picking up gets returned too.

Fixes: 30cb3ee2 ("pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Clean up pnfs_layoutreturn_free_lsegs() · c18d1e17

Committed by Trond Myklebust on Jan 04, 2021

Remove the check for whether or not the stateid is NULL, and fix up the
callers.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: We want return-on-close to complete when evicting the inode · 078000d0

Committed by Trond Myklebust on Jan 04, 2021

If the inode is being evicted, it should be safe to run return-on-close,
so we should do it to ensure we don't inadvertently leak layout segments.

Fixes: 1c5bd76d ("pNFS: Enable layoutreturn operation for return-on-close")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Mark layout for return if return-on-close was not sent · 67bbceed

Committed by Trond Myklebust on Jan 04, 2021

If the layout return-on-close failed because the layoutreturn was never
sent, then we should mark the layout for return again.

Fixes: 9c47b18c ("pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Adjust fs_context error logging · c98e9daa

Committed by Scott Mayhew on Jan 05, 2021

Several existing dprintk()/dfprintk() calls were converted to use the new
mount API logging macros by commit ce8866f0 ("NFS: Attach
supplementary error information to fs_context"). If the fs_context was
not created using fsopen() then it will not have had a log buffer
allocated for it, and the new mount API logging macros will wind up
calling printk().

This can result in syslog messages being logged where previously there
were none... most notably "NFS4: Couldn't follow remote path", which can
happen if the client is auto-negotiating a protocol version with an NFS
server that doesn't support the higher v4.x versions.

Convert the nfs_errorf(), nfs_invalf(), and nfs_warnf() macros to check
for the existence of the fs_context's log buffer and call dprintk() if
it doesn't exist. Add nfs_ferrorf(), nfs_finvalf(), and nfs_fwarnf(),
which do the same thing but take an NFS debug flag as an argument and
call dfprintk(). Finally, modify the "NFS4: Couldn't follow remote
path" message to use nfs_ferrorf().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207385
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Fixes: ce8866f0 ("NFS: Attach supplementary error information to fs_context.")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

10 Jan, 2021 4 commits

io_uring: stop SQPOLL submit on creator's death · d9d05217

Committed by Pavel Begunkov on Jan 08, 2021

When the creator of an SQPOLL io_uring dies (i.e. sqo_task), we don't want
its internals like ->files and ->mm to be poked by the SQPOLL task; it
has never been nice and recently got racy. That can happen when the
owner undergoes destruction and the SQPOLL task tries to submit new
requests in parallel, and so calls io_sq_thread_acquire*().

That patch halts SQPOLL submissions when sqo_task dies by introducing
sqo_dead flag. Once set, the SQPOLL task must not do any submission,
which is synchronised by uring_lock as well as the new flag.

The tricky part is to make sure that disabling always happens, that
means either the ring is discovered by creator's do_exit() -> cancel,
or if the final close() happens before it's done by the creator. The
last is guaranteed by the fact that for SQPOLL the creator task and only
it holds exactly one file note, so either it pins up to do_exit() or
removed by the creator on the final put in flush. (see comments in
uring_flush() around file->f_count == 2).

One more place that can trigger io_sq_thread_acquire_*() is
__io_req_task_submit(). Shoot off requests on sqo_dead there, even
though actually we don't need to. That's because cancellation of
sqo_task should wait for the request before going any further.

note 1: io_disable_sqo_submit() does io_ring_set_wakeup_flag() so the
caller would enter the ring to get an error, but it still doesn't
guarantee that the flag won't be cleared.

note 2: if final __userspace__ close happens not from the creator
task, the file note will pin the ring until the task dies.

Fixes: b1b6b5a3 ("kernel/io_uring: cancel io_uring before task works")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add warn_once for io_uring_flush() · 6b5733eb

Committed by Pavel Begunkov on Jan 08, 2021

files_cancel() should cancel all relevant requests and drop file notes,
so we should never have file notes after that, including on-exit fput
and flush. Add a WARN_ONCE to be sure.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: inline io_uring_attempt_task_drop() · 4f793dc4

Committed by Pavel Begunkov on Jan 08, 2021

A simple preparation change inlining io_uring_attempt_task_drop() into
io_uring_flush().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: io_rw_reissue lockdep annotations · 55e6ac1e

Committed by Pavel Begunkov on Jan 08, 2021

We expect io_rw_reissue() to take place only during submission with
uring_lock held. Add a lockdep annotation to check that invariant.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09 Jan, 2021 1 commit

poll: fix performance regression due to out-of-line __put_user() · ef0ba055

Committed by Linus Torvalds on Jan 07, 2021

The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cf
("x86: Make __put_user() generate an out-of-line call").

I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more.  But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.

Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.

But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array.  So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.

To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.

Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Laight <David.Laight@aculab.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

08 Jan, 2021 3 commits

btrfs: shrink delalloc pages instead of full inodes · e076ab2a

Committed by Josef Bacik on Jan 07, 2021

Commit 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot.  However this introduced a pretty serious
performance regression.  To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush.  The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us.  This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe.  I
redid the user's original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

  5.9		0m54.258s
  5.10		1m26.212s
  5.10+patch	0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirms
the results:

  5.10.5            4m00s
  5.10.5+patch      1m08s
  5.11-rc2          5m14s
  5.11-rc2+patch    1m30s
Reported-by: René Rebe <rene@exactcode.de>
Fixes: 38d715f4 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add my test results ]
Signed-off-by: David Sterba <dsterba@suse.com>

block: pre-initialize struct block_device in bdev_alloc_inode · 2d2f6f1b

Committed by Christoph Hellwig on Jan 07, 2021

bdev_evict_inode and bdev_free_inode are also called for the root inode
of bdevfs, for which bdev_alloc is never called.  Move the zeroing of
struct block_device and the initialization of the bd_bdi field into
bdev_alloc_inode to make sure they are initialized for the root inode
as well.

Fixes: e6cb5382 ("block: initialize struct block_device in bdev_alloc")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb · 04a6a536

Committed by Satya Tangirala on Dec 24, 2020

freeze/thaw_bdev() currently use bdev->bd_fsfreeze_count to infer
whether or not bdev->bd_fsfreeze_sb is valid (it's valid iff
bd_fsfreeze_count is non-zero). thaw_bdev() doesn't nullify
bd_fsfreeze_sb.

But this means a freeze_bdev() call followed by a thaw_bdev() call can
leave bd_fsfreeze_sb with a non-null value, while bd_fsfreeze_count is
zero. If freeze_bdev() is called again, and this time
get_active_super() returns NULL (e.g. because the FS is unmounted),
we'll end up with bd_fsfreeze_count > 0, but bd_fsfreeze_sb is
*untouched* - it stays the same (now garbage) value. A subsequent
thaw_bdev() will decide that the bd_fsfreeze_sb value is legitimate
(since bd_fsfreeze_count > 0), and attempt to use it.

Fix this by always setting bd_fsfreeze_sb to NULL when
bd_fsfreeze_count is successfully decremented to 0 in thaw_bdev().
Alternatively, we could set bd_fsfreeze_sb to whatever
get_active_super() returns in freeze_bdev() whenever bd_fsfreeze_count
is successfully incremented to 1 from 0 (which can be achieved cleanly
by moving the line currently setting bd_fsfreeze_sb to immediately
after the "sync:" label, but it might be a little too subtle/easily
overlooked in future).

This fixes the currently panicking xfstests generic/085.

Fixes: 040f04bd ("fs: simplify freeze_bdev/thaw_bdev")
Signed-off-by: Satya Tangirala <satyat@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
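
The sequence described above (freeze, thaw, freeze with no active superblock, thaw) can be walked through with a toy model. The struct and functions below are hypothetical simplifications with no locking or real superblocks; they only illustrate why clearing the pointer when the count reaches zero closes the hole.

```c
#include <stddef.h>

struct toy_bdev {
    int fsfreeze_count;
    const char *fsfreeze_sb;    /* stands in for the superblock pointer */
};

/* sb may be NULL, modelling a lookup that finds no active superblock. */
static void toy_freeze(struct toy_bdev *b, const char *sb)
{
    if (++b->fsfreeze_count > 1)
        return;                 /* already frozen: nothing more to do */
    if (sb)
        b->fsfreeze_sb = sb;    /* left untouched when sb is NULL */
}

/* Returns the superblock the thaw acts on, or NULL if there is none. */
static const char *toy_thaw(struct toy_bdev *b)
{
    const char *sb;

    if (b->fsfreeze_count == 0)
        return NULL;            /* not frozen */
    if (--b->fsfreeze_count > 0)
        return NULL;            /* still frozen by another caller */
    sb = b->fsfreeze_sb;
    b->fsfreeze_sb = NULL;      /* the fix: never leave a stale pointer */
    return sb;
}
```

If the line that clears `fsfreeze_sb` is removed, the second `toy_thaw()` in the sequence returns the pointer left over from the first freeze, which is the stale value the commit message warns about.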

openeuler / Kernel synced successfully more than 1 year ago
