1. 23 Aug 2021 (5 commits)
  2. 21 Aug 2021 (2 commits)
  3. 19 Aug 2021 (1 commit)
    • pipe: avoid unnecessary EPOLLET wakeups under normal loads · 3b844826
      Linus Torvalds committed
      I had forgotten just how sensitive hackbench is to extra pipe wakeups,
      and commit 3a34b13a ("pipe: make pipe writes always wake up
      readers") ended up causing a quite noticeable regression on larger
      machines.
      
      Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's
      not clear that this matters in real life all that much, but as Mel
      points out, it's used often enough when comparing kernels and so the
      performance regression shows up like a sore thumb.
      
      It's easy enough to fix at least for the common cases where pipes are
      used purely for data transfer, and you never have any exciting poll
      usage at all.  So set a special 'poll_usage' flag when there is polling
      activity, and make the ugly "EPOLLET has crazy legacy expectations"
      semantics explicit to only that case.
      
      I would love to limit it to just the broken EPOLLET case, but the pipe
      code can't see the difference between epoll and regular select/poll, so
      any non-read/write waiting will trigger the extra wakeup behavior.  That
      is sufficient for at least the hackbench case.
      
      Apart from making the odd extra wakeup cases more explicitly about
      EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe
      write, not at the first write chunk.  That is actually much saner
      semantics (as much as you can call any of the legacy edge-triggered
      expectations for EPOLLET "sane") since it means that you know the wakeup
      will happen once the write is done, rather than possibly in the middle
      of one.
      
      [ For stable people: I'm putting a "Fixes" tag on this, but I leave it
        up to you to decide whether you actually want to backport it or not.
        It likely has no impact outside of synthetic benchmarks  - Linus ]
      
      Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/
      Fixes: 3a34b13a ("pipe: make pipe writes always wake up readers")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Sandeep Patil <sspatil@android.com>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b844826
  4. 18 Aug 2021 (1 commit)
  5. 16 Aug 2021 (1 commit)
    • btrfs: prevent rename2 from exchanging a subvol with a directory from different parents · 3f79f6f6
      NeilBrown committed
      Cross-rename lacks a check that would prevent exchanging a
      directory and a subvolume from different parent subvolumes. This
      causes data inconsistencies and is caught before commit by the
      tree-checker, turning the filesystem read-only.
      
      Calling renameat2 with the RENAME_EXCHANGE flag like
      
        renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1))
      
      on two paths:
      
        namesrc = dir1/subvol1/dir2
       namedest = subvol2/subvol3
      
      will cause a key order problem, with the following write-time
      tree-checker report:
      
        [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258
        [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5
        [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561
        [1194842.338772]        item 0 key (256 1 0) itemoff 16123 itemsize 160
        [1194842.338793]                inode generation 3 size 16 mode 40755
        [1194842.338801]        item 1 key (256 12 256) itemoff 16111 itemsize 12
        [1194842.338809]        item 2 key (256 84 2248503653) itemoff 16077 itemsize 34
        [1194842.338817]                dir oid 258 type 2
        [1194842.338823]        item 3 key (256 84 2363071922) itemoff 16043 itemsize 34
        [1194842.338830]                dir oid 257 type 2
        [1194842.338836]        item 4 key (256 96 2) itemoff 16009 itemsize 34
        [1194842.338843]        item 5 key (256 96 3) itemoff 15975 itemsize 34
        [1194842.338852]        item 6 key (257 1 0) itemoff 15815 itemsize 160
        [1194842.338863]                inode generation 6 size 8 mode 40755
        [1194842.338869]        item 7 key (257 12 256) itemoff 15801 itemsize 14
        [1194842.338876]        item 8 key (257 84 2505409169) itemoff 15767 itemsize 34
        [1194842.338883]                dir oid 256 type 2
        [1194842.338888]        item 9 key (257 96 2) itemoff 15733 itemsize 34
        [1194842.338895]        item 10 key (258 12 256) itemoff 15719 itemsize 14
        [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected
        [1194842.339245] ------------[ cut here ]------------
        [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793
        [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008
        [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
        [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282
        [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978
        [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970
        [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003
        [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000
        [1194842.512092] FS:  0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000
        [1194842.512095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0
        [1194842.512103] Call Trace:
        [1194842.512128]  ? run_one_async_free+0x10/0x10 [btrfs]
        [1194842.631729]  btree_csum_one_bio+0x1ac/0x1d0 [btrfs]
        [1194842.631837]  run_one_async_start+0x18/0x30 [btrfs]
        [1194842.631938]  btrfs_work_helper+0xd5/0x1d0 [btrfs]
        [1194842.647482]  process_one_work+0x262/0x5e0
        [1194842.647520]  worker_thread+0x4c/0x320
        [1194842.655935]  ? process_one_work+0x5e0/0x5e0
        [1194842.655946]  kthread+0x135/0x160
        [1194842.655953]  ? set_kthread_struct+0x40/0x40
        [1194842.655965]  ret_from_fork+0x1f/0x30
        [1194842.672465] irq event stamp: 1729
        [1194842.672469] hardirqs last  enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0
        [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0
        [1194842.672482] softirqs last  enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a
        [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0
      
      The corrupted data will not be written, and the filesystem can be
      unmounted and mounted again (all changes since the last commit
      will be lost).
      
      Add the missing check for new_ino so that all non-subvolumes must
      reside under the same parent subvolume. An exception allows
      exchanging two subvolumes from any parents, as the directory
      representing a subvolume is only a logical link and does not have
      any other structures tied to the parent subvolume, unlike files,
      directories etc., which always live in the inode namespace of the
      parent subvolume.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.7+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3f79f6f6
  6. 15 Aug 2021 (1 commit)
    • io_uring: only assign io_uring_enter() SQPOLL error in actual error case · 21f96522
      Jens Axboe committed
      If an SQPOLL based ring is newly created and an application issues an
      io_uring_enter(2) system call on it, then we can return a spurious
      -EOWNERDEAD error. This happens because there's nothing to submit, and
      if the caller doesn't specify any other action, the initial error
      assignment of -EOWNERDEAD never gets overwritten. This causes us to
      return it directly, even if it isn't valid.
      
      Move the error assignment into the actual failure case instead.
      
      Cc: stable@vger.kernel.org
      Fixes: d9d05217 ("io_uring: stop SQPOLL submit on creator's death")
      Reported-by: Sherlock Holo <sherlockya@gmail.com>
      Link: https://github.com/axboe/liburing/issues/413
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      21f96522
  7. 13 Aug 2021 (2 commits)
  8. 10 Aug 2021 (11 commits)
  9. 09 Aug 2021 (3 commits)
  10. 07 Aug 2021 (1 commit)
  11. 06 Aug 2021 (4 commits)
  12. 05 Aug 2021 (3 commits)
    • io-wq: fix race between worker exiting and activating free worker · 83d6c393
      Jens Axboe committed
      Nadav correctly reports that we have a race between a worker exiting,
      and new work being queued. This can lead to work being queued behind
      an existing worker that could be sleeping on an event before it can
      run to completion, and hence introducing potential big latency gaps
      if we hit this race condition:
      
      cpu0					cpu1
      ----					----
      					io_wqe_worker()
      					schedule_timeout()
      					 // timed out
      io_wqe_enqueue()
      io_wqe_wake_worker()
      // work_flags & IO_WQ_WORK_CONCURRENT
      io_wqe_activate_free_worker()
      					 io_worker_exit()
      
      Fix this by having the exiting worker go through the normal decrement
      of a running worker, which will spawn a new one if needed.
      
      The free worker activation is modified to only return success if we
      were able to find a sleeping worker - if not, we keep looking through
      the list. If we fail, we create a new worker as per usual.
      
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/io-uring/BFF746C0-FEDE-4646-A253-3021C57C26C9@gmail.com/
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Tested-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      83d6c393
    • ceph: take snap_empty_lock atomically with snaprealm refcount change · 8434ffe7
      Jeff Layton committed
      There is a race in ceph_put_snap_realm. The change to the nref and the
      spinlock acquisition are not done atomically, so you could decrement
      nref, and before you take the spinlock, the nref is incremented again.
      At that point, you end up putting it on the empty list when it
      shouldn't be there. Eventually __cleanup_empty_realms runs and frees
      it when it's still in-use.
      
      Fix this by protecting the 1->0 transition with atomic_dec_and_lock,
      and just drop the spinlock if we can get the rwsem.
      
      Because these objects can also undergo a 0->1 refcount transition, we
      must protect that change as well with the spinlock. Increment locklessly
      unless the value is at 0, in which case we take the spinlock, increment
      and then take it off the empty list if it did the 0->1 transition.
      
      With these changes, I'm removing the dout() messages from these
      functions, as well as in __put_snap_realm. They've always been racy, and
      it's better to not print values that may be misleading.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/46419
      Reported-by: Mark Nelson <mnelson@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Luis Henriques <lhenriques@suse.de>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      8434ffe7
    • ceph: reduce contention in ceph_check_delayed_caps() · bf2ba432
      Luis Henriques committed
      Function ceph_check_delayed_caps() is called from the mdsc->delayed_work
      workqueue and it can be kept looping for quite some time if caps keep
      being added back to the mdsc->cap_delay_list.  This may result in the
      watchdog tainting the kernel with the softlockup flag.
      
      This patch breaks the loop if caps have been added back to the
      list recently (i.e. during the loop execution).  Any new caps
      added to the list will be handled in the next run.
      
      Also, allow schedule_delayed() callers to explicitly set the delay value
      instead of defaulting to 5s, so we can ensure that it runs soon
      afterward if it looks like there is more work.
      
      Cc: stable@vger.kernel.org
      URL: https://tracker.ceph.com/issues/46284
      Signed-off-by: Luis Henriques <lhenriques@suse.de>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      bf2ba432
  13. 31 Jul 2021 (3 commits)
    • pipe: make pipe writes always wake up readers · 3a34b13a
      Linus Torvalds committed
      Since commit 1b6b26ae ("pipe: fix and clarify pipe write wakeup
      logic") we have sanitized the pipe write logic, and would only try to
      wake up readers if they needed it.
      
      In particular, if the pipe already had data in it before the write,
      there was no point in trying to wake up a reader, since any existing
      readers must have been aware of the pre-existing data already.  Doing
      extraneous wakeups will only cause potential thundering herd problems.
      
      However, it turns out that some Android libraries have misused the
      EPOLL interface, and expected "edge triggered" to mean "any new
      write will trigger it".  Even if there was no edge in sight.
      
      Quoting Sandeep Patil:
       "The commit 1b6b26ae ('pipe: fix and clarify pipe write wakeup
        logic') changed pipe write logic to wakeup readers only if the pipe
        was empty at the time of write. However, there are libraries that
        relied upon the older behavior for notification scheme similar to
        what's described in [1]
      
        One such library 'realm-core'[2] is used by numerous Android
        applications. The library uses a similar notification mechanism as GNU
        Make but it never drains the pipe until it is full. When Android moved
        to v5.10 kernel, all applications using this library stopped working.
      
        The library has since been fixed[3] but it will be a while before all
        applications incorporate the updated library"
      
      Our regression rule for the kernel is that if applications break from
      new behavior, it's a regression, even if it was because the application
      did something patently wrong.  Also note the original report [4]
      by Michael Kerrisk about a test for this epoll behavior - but at
      that point we didn't know of any actual broken use case.
      
      So add the extraneous wakeup, to approximate the old behavior.
      
      [ I say "approximate", because the exact old behavior was to do a wakeup
        not for each write(), but for each pipe buffer chunk that was filled
        in. The behavior introduced by this change is not that - this is just
        "every write will cause a wakeup, whether necessary or not", which
        seems to be sufficient for the broken library use. ]
      
      It's worth noting that this adds the extraneous wakeup only for the
      write side, while the read side still considers the "edge" to be purely
      about reading enough from the pipe to allow further writes.
      
      See commit f467a6a6 ("pipe: fix and clarify pipe read wakeup logic")
      for the pipe read case, which remains that "only wake up if the pipe was
      full, and we read something from it".
      
      Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
      Link: https://github.com/realm/realm-core [2]
      Link: https://github.com/realm/realm-core/issues/4666 [3]
      Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
      Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
      Reported-by: Sandeep Patil <sspatil@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a34b13a
    • ocfs2: issue zeroout to EOF blocks · 9449ad33
      Junxiao Bi committed
      When punching holes in EOF blocks, fallocate used a buffered write
      to zero the EOF blocks in the last cluster.  But since ->writepage
      ignores EOF pages, those zeros will not be flushed.
      
      This "looks" ok, as commit 6bba4471 ("ocfs2: fix data corruption
      by fallocate") will zero the EOF blocks when extending the file
      size, but it isn't.  The problem happens on those EOF pages:
      before writeback, those pages have the DIRTY flag set, and all
      buffer_heads in them also have the DIRTY flag set; when writeback
      is run by write_cache_pages(), the DIRTY flag on the page is
      cleared, but the DIRTY flag on the buffer_heads is not.
      
      When the next write happens to those EOF pages, since the
      buffer_heads already have the DIRTY flag set, the page is not
      marked DIRTY again.  That makes writeback ignore them forever, and
      that will cause data corruption.  Even direct-io writes can't
      work, because dropping the page cache before direct io fails when
      it finds the buffer_heads for those pages still have the DIRTY
      flag set, and it then falls back to buffered-io mode.
      
      To summarize the issue: as writeback ignores EOF pages, once any
      EOF page is generated, any write to it will only go to the page
      cache, and it will never be flushed to disk even if the file size
      is extended and the page is no longer an EOF page.  The fix is to
      avoid zeroing EOF blocks with a buffered write.
      
      The following code snippet from qemu-img could trigger the corruption.
      
        656   open("6b3711ae-3306-4bdd-823c-cf1c0060a095.conv.2", O_RDWR|O_DIRECT|O_CLOEXEC) = 11
        ...
        660   fallocate(11, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 2275868672, 327680 <unfinished ...>
        660   fallocate(11, 0, 2275868672, 327680) = 0
        658   pwrite64(11, "
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-2-junxiao.bi@oracle.com
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9449ad33
    • ocfs2: fix zero out valid data · f267aeb6
      Junxiao Bi committed
      If the append-dio feature is enabled, direct-io writes and
      fallocate can run in parallel to extend the file size.  fallocate
      uses "orig_isize" to record i_size before taking "ip_alloc_sem";
      by the time ocfs2_zeroout_partial_cluster() zeroes out the EOF
      blocks, i_size may already have been extended by
      ocfs2_dio_end_io_write(), which causes valid data to be zeroed
      out.
      
      Link: https://lkml.kernel.org/r/20210722054923.24389-1-junxiao.bi@oracle.com
      Fixes: 6bba4471 ("ocfs2: fix data corruption by fallocate")
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f267aeb6
  14. 30 Jul 2021 (2 commits)
    • xfs: prevent spoofing of rtbitmap blocks when recovering buffers · 81a448d7
      Darrick J. Wong committed
      While reviewing the buffer item recovery code, the thought occurred to
      me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
      replaying older metadata updates against newer log items.  However, we
      use the magic number of the ondisk buffer to find the LSN of the ondisk
      metadata, which means that if an attacker can control the layout of the
      realtime device precisely enough that the start of an rt bitmap block
      matches the magic and UUID of some other kind of block, they can control
      the purported LSN of that spoofed block and thereby break log replay.
      
      Since realtime bitmap and summary blocks don't have headers at all, we
      have no way to tell if a block really should be replayed.  The best we
      can do is replay unconditionally and hope for the best.

      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      81a448d7
    • xfs: limit iclog tail updates · 9d110014
      Dave Chinner committed
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set.  There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log.  This is
      sufficient to violate on-disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      9d110014