提交 · 6a4d07cde5778174a35ffc445c1d1388479563ee · openeuler / Kernel

16 5月, 2020 5 次提交

io_uring: file registration list and lock optimization · 6a4d07cd

由 Jens Axboe 提交于 5月 15, 2020

There's no point in using list_del_init() on entries that are going
away, and the associated lock is always used in process context so
let's not use the IRQ disabling+saving variant of the spinlock.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6a4d07cd

io_uring: add IORING_CQ_EVENTFD_DISABLED to the CQ ring flags · 7e55a19c

由 Stefano Garzarella 提交于 5月 15, 2020

This new flag should be set/clear from the application to
disable/enable eventfd notifications when a request is completed
and queued to the CQ ring.

Before this patch, notifications were always sent if an eventfd is
registered, so IORING_CQ_EVENTFD_DISABLED is not set during the
initialization.

It will be up to the application to set the flag after initialization
if no notifications are required at the beginning.
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7e55a19c

io_uring: add 'cq_flags' field for the CQ ring · 0d9b5b3a

由 Stefano Garzarella 提交于 5月 15, 2020

This patch adds the new 'cq_flags' field that should be written by
the application and read by the kernel.

This new field is available to the userspace application through
'cq_off.flags'.
We are using 4-bytes previously reserved and set to zero. This means
that if the application finds this field to zero, then the new
functionality is not supported.

In the next patch we will introduce the first flag available.
Signed-off-by: NStefano Garzarella <sgarzare@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0d9b5b3a

io_uring: allow POLL_ADD with double poll_wait() users · 18bceab1

由 Jens Axboe 提交于 5月 15, 2020

Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.

This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.

With this fix, io_uring handles multiple waitqueues.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

18bceab1

io_uring: batch reap of dead file registrations · 4a38aed2

由 Jens Axboe 提交于 5月 14, 2020

We currently embed and queue a work item per fixed_file_ref_node that
we update, but if the workload does a lot of these, then the associated
kworker-events overhead can become quite noticeable.

Since we rarely need to wait on these, batch them at 1 second intervals
instead. If we do need to wait for them, we just flush the pending
delayed work.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4a38aed2

15 5月, 2020 1 次提交

io_uring: name sq thread and ref completions · 0f158b4c

由 Jens Axboe 提交于 5月 14, 2020

We used to have three completions, now we just have two. With the two,
let's not allocate them dynamically, just embed then in the ctx and
name them appropriately.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0f158b4c

11 5月, 2020 1 次提交

io_uring: remove duplicate semicolon at the end of line · 84695089

由 Xiaoming Ni 提交于 5月 11, 2020

Remove duplicate semicolon at the end of line in io_file_from_index()
Signed-off-by: NXiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

84695089

09 5月, 2020 2 次提交

io_uring: remove obsolete 'state' parameter · 7d01bd74

由 Xiaoguang Wang 提交于 5月 08, 2020

The "struct io_submit_state *state" parameter is not used, remove it.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7d01bd74

io_uring: remove 'fd is io_uring' from close path · 904fbcb1

由 Jens Axboe 提交于 5月 08, 2020

The attempt protecting us from closing the ring itself wasn't really
complete, and we actually don't need it. The referencing of requests
themselve, and the references they hold on the ring, ensures that the
life time of the ring is sane. With the check removed, we can also
remove the need to have the close operation fget() the file.
Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

904fbcb1

08 5月, 2020 1 次提交

io_uring: don't use 'fd' for openat/openat2/statx · 63ff8223

由 Jens Axboe 提交于 5月 07, 2020

We currently make some guesses as when to open this fd, but in reality
we have no business (or need) to do so at all. In fact, it makes certain
things fail, like O_PATH.

Remove the fd lookup from these opcodes, we're just passing the 'fd' to
generic helpers anyway. With that, we can also remove the special casing
of fd values in io_req_needs_file(), and the 'fd_non_neg' check that
we have. And we can ensure that we only read sqe->fd once.

This fixes O_PATH usage with openat/openat2, and ditto statx path side
oddities.

Cc: stable@vger.kernel.org: # v5.6
Reported-by: NMax Kellermann <mk@cm4all.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

63ff8223

07 5月, 2020 1 次提交

splice: move f_mode checks to do_{splice,tee}() · 90da2e3f

由 Pavel Begunkov 提交于 5月 04, 2020

do_splice() is used by io_uring, as will be do_tee(). Move f_mode
checks from sys_{splice,tee}() to do_{splice,tee}(), so they're
enforced for io_uring as well.

Fixes: 7d67af2c ("io_uring: add splice(2) support")
Reported-by: NJann Horn <jannh@google.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

90da2e3f

06 5月, 2020 1 次提交

io_uring: handle -EFAULT properly in io_uring_setup() · 7f13657d

由 Xiaoguang Wang 提交于 5月 05, 2020

If copy_to_user() in io_uring_setup() failed, we'll leak many kernel
resources, which will be recycled until process terminates. This bug
can be reproduced by using mprotect to set params to PROT_READ. To fix
this issue, refactor io_uring_create() a bit to add a new 'struct
io_uring_params __user *params' parameter and move the copy_to_user()
in io_uring_setup() to io_uring_setup(), if copy_to_user() failed,
we can free kernel resource properly.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7f13657d

04 5月, 2020 1 次提交

io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files() · d8f1b971

由 Xiaoguang Wang 提交于 4月 26, 2020

The prepare_to_wait() and finish_wait() calls in io_uring_cancel_files()
are mismatched. Currently I don't see any issues related this bug, just
find it by learning codes.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d8f1b971

01 5月, 2020 8 次提交

io_uring: punt splice async because of inode mutex · 2fb3e822

由 Pavel Begunkov 提交于 5月 01, 2020

Nonblocking do_splice() still may wait for some time on an inode mutex.
Let's play safe and always punt it async.
Reported-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2fb3e822

io_uring: check non-sync defer_list carefully · 4ee36314

由 Pavel Begunkov 提交于 5月 01, 2020

io_req_defer() do double-checked locking. Use proper helpers for that,
i.e. list_empty_careful().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4ee36314

io_uring: fix extra put in sync_file_range() · 7759a0bf

由 Pavel Begunkov 提交于 5月 01, 2020

[   40.179474] refcount_t: underflow; use-after-free.
[   40.179499] WARNING: CPU: 6 PID: 1848 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
...
[   40.179612] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[   40.179617] Code: 28 44 0a 01 01 e8 d7 01 c2 ff 0f 0b 5d c3 80 3d 15 44 0a 01 00 75 91 48 c7 c7 b8 f5 75 be c6 05 05 44 0a 01 01 e8 b7 01 c2 ff <0f> 0b 5d c3 80 3d f3 43 0a 01 00 0f 85 6d ff ff ff 48 c7 c7 10 f6
[   40.179619] RSP: 0018:ffffb252423ebe18 EFLAGS: 00010286
[   40.179623] RAX: 0000000000000000 RBX: ffff98d65e929400 RCX: 0000000000000000
[   40.179625] RDX: 0000000000000001 RSI: 0000000000000086 RDI: 00000000ffffffff
[   40.179627] RBP: ffffb252423ebe18 R08: 0000000000000001 R09: 000000000000055d
[   40.179629] R10: 0000000000000c8c R11: 0000000000000001 R12: 0000000000000000
[   40.179631] R13: ffff98d68c434400 R14: ffff98d6a9cbaa20 R15: ffff98d6a609ccb8
[   40.179634] FS:  0000000000000000(0000) GS:ffff98d6af580000(0000) knlGS:0000000000000000
[   40.179636] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.179638] CR2: 00000000033e3194 CR3: 000000006480a003 CR4: 00000000003606e0
[   40.179641] Call Trace:
[   40.179652]  io_put_req+0x36/0x40
[   40.179657]  io_free_work+0x15/0x20
[   40.179661]  io_worker_handle_work+0x2f5/0x480
[   40.179667]  io_wqe_worker+0x2a9/0x360
[   40.179674]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[   40.179681]  kthread+0x12c/0x170
[   40.179685]  ? io_worker_handle_work+0x480/0x480
[   40.179690]  ? kthread_park+0x90/0x90
[   40.179695]  ret_from_fork+0x35/0x40
[   40.179702] ---[ end trace 85027405f00110aa ]---

Opcode handler must never put submission ref, but that's what
io_sync_file_range_finish() do. use io_steal_work() there.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7759a0bf

io_uring: use cond_resched() in io_ring_ctx_wait_and_kill() · 3fd44c86

由 Xiaoguang Wang 提交于 5月 01, 2020

While working on to make io_uring sqpoll mode support syscalls that need
struct files_struct, I got cpu soft lockup in io_ring_ctx_wait_and_kill(),

    while (ctx->sqo_thread && !wq_has_sleeper(&ctx->sqo_wait))
        cpu_relax();

above loop never has an chance to exit, it's because preempt isn't enabled
in the kernel, and the context calling io_ring_ctx_wait_and_kill() and
io_sq_thread() run in the same cpu, if io_sq_thread calls a cond_resched()
yield cpu and another context enters above loop, then io_sq_thread() will
always in runqueue and never exit.

Use cond_resched() can fix this issue.

 Reported-by: syzbot+66243bb7126c410cefe6@syzkaller.appspotmail.com
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3fd44c86

io_uring: use proper references for fallback_req locking · dd461af6

由 Bijan Mottahedeh 提交于 4月 29, 2020

Use ctx->fallback_req address for test_and_set_bit_lock() and
clear_bit_unlock().
Signed-off-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

dd461af6

io_uring: only force async punt if poll based retry can't handle it · 490e8967

由 Jens Axboe 提交于 4月 28, 2020

We do blocking retry from our poll handler, if the file supports polled
notifications. Only mark the request as needing an async worker if we
can't poll for it.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

490e8967

io_uring: enable poll retry for any file with ->read_iter / ->write_iter · af197f50

由 Jens Axboe 提交于 4月 28, 2020

We can have files like eventfd where it's perfectly fine to do poll
based retry on them, right now io_file_supports_async() doesn't take
that into account.

Pass in data direction and check the f_op instead of just always needing
an async worker.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

af197f50

NFS: Fix a race in __nfs_list_for_each_server() · 9c07b75b

由 Trond Myklebust 提交于 4月 30, 2020

The struct nfs_server gets put on the cl_superblocks list before
the server->super field has been initialised, in which case the
call to nfs_sb_active() will Oops. Add a check to ensure that
we skip such a list entry.

Fixes: 3c9e502b ("NFS: Add a helper nfs_client_for_each_server()")
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>

9c07b75b

30 4月, 2020 2 次提交

fibmap: Warn and return an error in case of block > INT_MAX · b75dfde1

由 Ritesh Harjani 提交于 4月 30, 2020

We better warn the fibmap user and not return a truncated and therefore
an incorrect block map address if the bmap() returned block address
is greater than INT_MAX (since user supplied integer pointer).

It's better to pr_warn() all user of ioctl_fibmap() and return a proper
error code rather than silently letting a FS corruption happen if the
user tries to fiddle around with the returned block map address.

We fix this by returning an error code of -ERANGE and returning 0 as the
block mapping address in case if it is > INT_MAX.

Now iomap_bmap() could be called from either of these two paths.
Either when a user is calling an ioctl_fibmap() interface to get
the block mapping address or by some filesystem via use of bmap()
internal kernel API.
bmap() kernel API is well equipped with handling of u64 addresses.

WARN condition in iomap_bmap_actor() was mainly added to warn all
the fibmap users. But now that we have directly added this warning
for all fibmap users and also made sure to return 0 as block map address
in case if addr > INT_MAX.
So we can now remove this logic from iomap_bmap_actor().
Signed-off-by: NRitesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

b75dfde1

btrfs: fix gcc-4.8 build warning for struct initializer · 9c6c723f

由 Arnd Bergmann 提交于 4月 29, 2020

Some older compilers like gcc-4.8 warn about mismatched curly braces in
a initializer:

fs/btrfs/backref.c: In function 'is_shared_data_backref':
fs/btrfs/backref.c:394:9: error: missing braces around
initializer [-Werror=missing-braces]
  struct prelim_ref target = {0};
         ^
fs/btrfs/backref.c:394:9: error: (near initialization for
'target.rbnode') [-Werror=missing-braces]

Use the GNU empty initializer extension to avoid this.

Fixes: ed58f2e6 ("btrfs: backref, don't add refs from shared block when resolving normal backref")
Reviewed-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9c6c723f

29 4月, 2020 2 次提交

Fix use after free in get_tree_bdev() · dd7bc815

由 David Howells 提交于 4月 28, 2020

Commit 6fcf0c72, a fix to get_tree_bdev() put a missing blkdev_put() in
the wrong place, before a warnf() that displays the bdev under
consideration rather after it.

This results in a silent lockup in printk("%pg") called via warnf() from
get_tree_bdev() under some circumstances when there's a race with the
blockdev being frozen.  This can be caused by xfstests/tests/generic/085 in
combination with Lukas Czerner's ext4 mount API conversion patchset.  It
looks like it ought to occur with other users of get_tree_bdev() such as
XFS, but apparently doesn't.

Fix this by switching the order of the lines.

Fixes: 6fcf0c72 ("vfs: add missing blkdev_put() in get_tree_bdev()")
Reported-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
cc: Ian Kent <raven@themaw.net>
cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dd7bc815

NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION · dff58530

由 Olga Kornievskaia 提交于 4月 24, 2020

Currently, if the client sends BIND_CONN_TO_SESSION with
NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores
that it wasn't able to enable a backchannel.

To make sure, the client sends BIND_CONN_TO_SESSION as the first
operation on the connections (ie., no other session compounds haven't
been sent before), and if the client's request to bind the backchannel
is not satisfied, then reset the connection and retry.

Cc: stable@vger.kernel.org
Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>

dff58530

28 4月, 2020 1 次提交

io_uring: statx must grab the file table for valid fd · 5b0bbee4

由 Jens Axboe 提交于 4月 27, 2020

Clay reports that OP_STATX fails for a test case with a valid fd
and empty path:

 -- Test 0: statx:fd 3: SUCCEED, file mode 100755
 -- Test 1: statx:path ./uring_statx: SUCCEED, file mode 100755
 -- Test 2: io_uring_statx:fd 3: FAIL, errno 9: Bad file descriptor
 -- Test 3: io_uring_statx:path ./uring_statx: SUCCEED, file mode 100755

This is due to statx not grabbing the process file table, hence we can't
lookup the fd in async context. If the fd is valid, ensure that we grab
the file table so we can grab the file from async context.

Cc: stable@vger.kernel.org # v5.6
Reported-by: NClay Harris <bugs@claycon.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5b0bbee4

27 4月, 2020 3 次提交

btrfs: transaction: Avoid deadlock due to bad initialization timing of fs_info::journal_info · fcc99734

由 Qu Wenruo 提交于 4月 27, 2020

[BUG]
One run of btrfs/063 triggered the following lockdep warning:
  ============================================
  WARNING: possible recursive locking detected
  5.6.0-rc7-custom+ #48 Not tainted
  --------------------------------------------
  kworker/u24:0/7 is trying to acquire lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  but task is already holding lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(sb_internal#2);
    lock(sb_internal#2);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  4 locks held by kworker/u24:0/7:
   #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
   #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
   #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
   #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]

  stack backtrace:
  CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ #48
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  Call Trace:
   dump_stack+0xc2/0x11a
   __lock_acquire.cold+0xce/0x214
   lock_acquire+0xe6/0x210
   __sb_start_write+0x14e/0x290
   start_transaction+0x66c/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   find_free_extent+0x1504/0x1a50 [btrfs]
   btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
   btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
   btrfs_copy_root+0x213/0x580 [btrfs]
   create_reloc_root+0x3bd/0x470 [btrfs]
   btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
   record_root_in_trans+0x191/0x1d0 [btrfs]
   btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
   start_transaction+0x16e/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
   finish_ordered_fn+0x15/0x20 [btrfs]
   btrfs_work_helper+0x116/0x9a0 [btrfs]
   process_one_work+0x632/0xb80
   worker_thread+0x80/0x690
   kthread+0x1a3/0x1f0
   ret_from_fork+0x27/0x50

It's pretty hard to reproduce, only one hit so far.

[CAUSE]
This is because we're calling btrfs_join_transaction() without re-using
the current running one:

btrfs_finish_ordered_io()
|- btrfs_join_transaction()		<<< Call #1
   |- btrfs_record_root_in_trans()
      |- btrfs_reserve_extent()
	 |- btrfs_join_transaction()	<<< Call #2

Normally such btrfs_join_transaction() call should re-use the existing
one, without trying to re-start a transaction.

But the problem is, in btrfs_join_transaction() call #1, we call
btrfs_record_root_in_trans() before initializing current::journal_info.

And in btrfs_join_transaction() call #2, we're relying on
current::journal_info to avoid such deadlock.

[FIX]
Call btrfs_record_root_in_trans() after we have initialized
current::journal_info.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

fcc99734

btrfs: fix partial loss of prealloc extent past i_size after fsync · f135cea3

由 Filipe Manana 提交于 4月 23, 2020

When we have an inode with a prealloc extent that starts at an offset
lower than the i_size and there is another prealloc extent that starts at
an offset beyond i_size, we can end up losing part of the first prealloc
extent (the part that starts at i_size) and have an implicit hole if we
fsync the file and then have a power failure.

Consider the following example with comments explaining how and why it
happens.

$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt

# Create our test file with 2 consecutive prealloc extents, each with a
# size of 128Kb, and covering the range from 0 to 256Kb, with a file
# size of 0.
$ xfs_io -f -c "falloc -k 0 128K" /mnt/foo
$ xfs_io -c "falloc -k 128K 128K" /mnt/foo

# Fsync the file to record both extents in the log tree.
$ xfs_io -c "fsync" /mnt/foo

# Now do a redudant extent allocation for the range from 0 to 64Kb.
# This will merely increase the file size from 0 to 64Kb. Instead we
# could also do a truncate to set the file size to 64Kb.
$ xfs_io -c "falloc 0 64K" /mnt/foo

# Fsync the file, so we update the inode item in the log tree with the
# new file size (64Kb). This also ends up setting the number of bytes
# for the first prealloc extent to 64Kb. This is done by the truncation
# at btrfs_log_prealloc_extents().
# This means that if a power failure happens after this, a write into
# the file range 64Kb to 128Kb will not use the prealloc extent and
# will result in allocation of a new extent.
$ xfs_io -c "fsync" /mnt/foo

# Now set the file size to 256K with a truncate and then fsync the file.
# Since no changes happened to the extents, the fsync only updates the
# i_size in the inode item at the log tree. This results in an implicit
# hole for the file range from 64Kb to 128Kb, something which fsck will
# complain when not using the NO_HOLES feature if we replay the log
# after a power failure.
$ xfs_io -c "truncate 256K" -c "fsync" /mnt/foo

So instead of always truncating the log to the inode's current i_size at
btrfs_log_prealloc_extents(), check first if there's a prealloc extent
that starts at an offset lower than the i_size and with a length that
crosses the i_size - if there is one, just make sure we truncate to a
size that corresponds to the end offset of that prealloc extent, so
that we don't lose the part of that extent that starts at i_size if a
power failure happens.

A test case for fstests follows soon.

Fixes: 31d11b83 ("Btrfs: fix duplicate extents after fsync of file with prealloc extents")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f135cea3

propagate_one(): mnt_set_mountpoint() needs mount_lock · b0d3869c

由 Al Viro 提交于 4月 27, 2020

... to protect the modification of mp->m_count done by it.  Most of
the places that modify that thing also have namespace_lock held,
but not all of them can do so, so we really need mount_lock here.
Kudos to Piotr Krysiuk <piotras@gmail.com>, who'd spotted a related
bug in pivot_root(2) (fixed unnoticed in 5.3); search for other
similar turds has caught out this one.

Cc: stable@kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b0d3869c

25 4月, 2020 2 次提交

NFSv4: Remove unreachable error condition due to rpc_run_task() · 6e47666e

由 Xiyu Yang 提交于 4月 25, 2020

nfs4_proc_layoutget() invokes rpc_run_task(), which return the value to
"task". Since rpc_run_task() is impossible to return an ERR pointer,
there is no need to add the IS_ERR() condition on "task" here. So we
need to remove it.
Signed-off-by: NXiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>

6e47666e

proc: Put thread_pid in release_task not proc_flush_pid · 6ade99ec

由 Eric W. Biederman 提交于 4月 24, 2020

Oleg pointed out that in the unlikely event the kernel is compiled
with CONFIG_PROC_FS unset that release_task will now leak the pid.

Move the put_pid out of proc_flush_pid into release_task to fix this
and to guarantee I don't make that mistake again.

When possible it makes sense to keep get and put in the same function
so it can easily been seen how they pair up.

Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc")
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

6ade99ec

24 4月, 2020 4 次提交

afs: Make record checking use TASK_UNINTERRUPTIBLE when appropriate · c4bfda16

由 David Howells 提交于 4月 16, 2020

When an operation is meant to be done uninterruptibly (such as
FS.StoreData), we should not be allowing volume and server record checking
to be interrupted.

Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: NDavid Howells <dhowells@redhat.com>

c4bfda16

afs: Fix to actually set AFS_SERVER_FL_HAVE_EPOCH · 69cf3978

由 David Howells 提交于 4月 24, 2020

AFS keeps track of the epoch value from the rxrpc protocol to note (a) when
a fileserver appears to have restarted and (b) when different endpoints of
a fileserver do not appear to be associated with the same fileserver
(ie. all probes back from a fileserver from all of its interfaces should
carry the same epoch).

However, the AFS_SERVER_FL_HAVE_EPOCH flag that indicates that we've
received the server's epoch is never set, though it is used.

Fix this to set the flag when we first receive an epoch value from a probe
sent to the filesystem client from the fileserver.

Fixes: 3bf0fb6f ("afs: Probe multiple fileservers simultaneously")
Signed-off-by: NDavid Howells <dhowells@redhat.com>

69cf3978

afs: Remove some unused bits · be59167c

由 David Howells 提交于 4月 24, 2020

Remove three bits:

 (1) afs_server::no_epoch is neither set nor used.

 (2) afs_server::have_result is set and a wakeup is applied to it, but
     nothing looks at it or waits on it.

 (3) afs_vl_dump_edestaddrreq() prints afs_addr_list::probed, but nothing
     sets it for VL servers.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

be59167c

dlmfs_file_write(): fix the bogosity in handling non-zero *ppos · 3815f1be

由 Al Viro 提交于 4月 23, 2020

'count' is how much you want written, not the final position.
Moreover, it can legitimately be less than the current position...

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3815f1be

23 4月, 2020 5 次提交

btrfs: fix transaction leak in btrfs_recover_relocation · 1402d17d

由 Xiyu Yang 提交于 4月 20, 2020

btrfs_recover_relocation() invokes btrfs_join_transaction(), which joins
a btrfs_trans_handle object into transactions and returns a reference of
it with increased refcount to "trans".

When btrfs_recover_relocation() returns, "trans" becomes invalid, so the
refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
btrfs_recover_relocation(). When read_fs_root() failed, the refcnt
increased by btrfs_join_transaction() is not decreased, causing a refcnt
leak.

Fix this issue by calling btrfs_end_transaction() on this error path
when read_fs_root() failed.

Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NXiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1402d17d

btrfs: fix block group leak when removing fails · f6033c5e

由 Xiyu Yang 提交于 4月 21, 2020

btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
returns a local reference of the block group that contains the given
bytenr to "block_group" with increased refcount.

When btrfs_remove_block_group() returns, "block_group" becomes invalid,
so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in several exception handling paths
of btrfs_remove_block_group(). When those error scenarios occur such as
btrfs_alloc_path() returns NULL, the function forgets to decrease its
refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
leak.

Fix this issue by jumping to "out_put_group" label and calling
btrfs_put_block_group() when those error scenarios occur.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: NXiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: NXin Tan <tanxin.ctf@gmail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f6033c5e

btrfs: drop logs when we've aborted a transaction · ef67963d

由 Josef Bacik 提交于 3月 24, 2020

Dave reported a problem where we were panicing with generic/475 with
misc-5.7.  This is because we were doing IO after we had stopped all of
the worker threads, because we do the log tree cleanup on roots at drop
time.  Cleaning up the log tree will always need to do reads if we
happened to have evicted the blocks from memory.

Because of this simply add a helper to btrfs_cleanup_transaction() that
will go through and drop all of the log roots.  This gets run before we
do the close_ctree() work, and thus we are allowed to do any reads that
we would need.  I ran this through many iterations of generic/475 with
constrained memory and I did not see the issue.

  general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  CPU: 2 PID: 12359 Comm: umount Tainted: G        W 5.6.0-rc7-btrfs-next-58 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_queue_work+0x33/0x1c0 [btrfs]
  RSP: 0018:ffff9cfb015937d8 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff8eb5e339ed80 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: ffff8eb5eb33b770 RDI: ffff8eb5e37a0460
  RBP: ffff8eb5eb33b770 R08: 000000000000020c R09: ffffffff9fc09ac0
  R10: 0000000000000007 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
  R13: ffff9cfb00229040 R14: 0000000000000008 R15: ffff8eb5d3868000
  FS:  00007f167ea022c0(0000) GS:ffff8eb5fae00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f167e5e0cb1 CR3: 0000000138c18004 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   btrfs_end_bio+0x81/0x130 [btrfs]
   __split_and_process_bio+0xaf/0x4e0 [dm_mod]
   ? percpu_counter_add_batch+0xa3/0x120
   dm_process_bio+0x98/0x290 [dm_mod]
   ? generic_make_request+0xfb/0x410
   dm_make_request+0x4d/0x120 [dm_mod]
   ? generic_make_request+0xfb/0x410
   generic_make_request+0x12a/0x410
   ? submit_bio+0x38/0x160
   submit_bio+0x38/0x160
   ? percpu_counter_add_batch+0xa3/0x120
   btrfs_map_bio+0x289/0x570 [btrfs]
   ? kmem_cache_alloc+0x24d/0x300
   btree_submit_bio_hook+0x79/0xc0 [btrfs]
   submit_one_bio+0x31/0x50 [btrfs]
   read_extent_buffer_pages+0x2fe/0x450 [btrfs]
   btree_read_extent_buffer_pages+0x7e/0x170 [btrfs]
   walk_down_log_tree+0x343/0x690 [btrfs]
   ? walk_log_tree+0x3d/0x380 [btrfs]
   walk_log_tree+0xf7/0x380 [btrfs]
   ? plist_requeue+0xf0/0xf0
   ? delete_node+0x4b/0x230
   free_log_tree+0x4c/0x130 [btrfs]
   ? wait_log_commit+0x140/0x140 [btrfs]
   btrfs_free_log+0x17/0x30 [btrfs]
   btrfs_drop_and_free_fs_root+0xb0/0xd0 [btrfs]
   btrfs_free_fs_roots+0x10c/0x190 [btrfs]
   ? do_raw_spin_unlock+0x49/0xc0
   ? _raw_spin_unlock+0x29/0x40
   ? release_extent_buffer+0x121/0x170 [btrfs]
   close_ctree+0x289/0x2e6 [btrfs]
   generic_shutdown_super+0x6c/0x110
   kill_anon_super+0xe/0x30
   btrfs_kill_super+0x12/0x20 [btrfs]
   deactivate_locked_super+0x3a/0x70
Reported-by: NDavid Sterba <dsterba@suse.com>
Fixes: 8c38938c ("btrfs: move the root freeing stuff into btrfs_put_root")
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ef67963d

btrfs: fix memory leak of transaction when deleting unused block group · 5150bf19

由 Filipe Manana 提交于 4月 17, 2020

When cleaning pinned extents right before deleting an unused block group,
we check if there's still a previous transaction running and if so we
increment its reference count before using it for cleaning pinned ranges
in its pinned extents iotree. However we ended up never decrementing the
reference count after using the transaction, resulting in a memory leak.

Fix it by decrementing the reference count.

Fixes: fe119a6e ("btrfs: switch to per-transaction pinned extents")
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5150bf19

cifs: fix uninitialised lease_key in open_shroot() · 0fe0781f

由 Paulo Alcantara 提交于 4月 20, 2020

SMB2_open_init() expects a pre-initialised lease_key when opening a
file with a lease, so set pfid->lease_key prior to calling it in
open_shroot().

This issue was observed when performing some DFS failover tests and
the lease key was never randomly generated.
Signed-off-by: NPaulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: NSteve French <stfrench@microsoft.com>
Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: NAurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>

0fe0781f

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功