Commits · 29a8bfe52d1c38bde482971250af0ba9637ddaf2 · openanolis / cloud-kernel

01 Jun, 2018 28 commits

pNFS: Refactor nfs4_layoutget_release() · 29a8bfe5

Authored by Trond Myklebust on May 30, 2018

Move the actual freeing of the struct nfs4_layoutget into fs/nfs/pnfs.c
where it can be reused by the layoutget-on-open code.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

29a8bfe5

pnfs: Add LAYOUTGET to OPEN of a new file · 2409a976

Authored by Fred Isaman on Oct 06, 2016

This triggers when we have no pre-existing inode to attach to.
The pre-existing case is saved for later.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

2409a976

pnfs: Change pnfs_alloc_init_layoutget_args call signature · 5e36e2a9

Authored by Fred Isaman on Oct 06, 2016

Don't send in a layout; instead use the (possibly NULL) inode.

This is needed for LAYOUTGET attached to an OPEN where the inode is not
yet set.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

5e36e2a9

pnfs: Move nfs4_opendata into nfs4_fs.h · 1b146fcf

Authored by Fred Isaman on Sep 21, 2016

It will be needed now by the pnfs code.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

1b146fcf

pnfs: Add conditional encode/decode of LAYOUTGET within OPEN compound · 56f487f8

Authored by Fred Isaman on Sep 21, 2016

Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

56f487f8

pnfs: move allocations out of nfs4_proc_layoutget · dacb452d

Authored by Fred Isaman on Sep 19, 2016

They work better in the new alloc_init function.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

dacb452d

pnfs: refactor send_layoutget · 587f03de

Authored by Fred Isaman on Sep 21, 2016

Pull out the alloc/init part for eventual reuse by OPEN.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

587f03de

pnfs: Add layout driver flag PNFS_LAYOUTGET_ON_OPEN · f86c3ac5

Authored by Fred Isaman on Sep 20, 2016

A driver can set this flag to allow LAYOUTGET to be sent with OPEN.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

f86c3ac5

NFS4: move ctx into nfs4_run_open_task · 3b65a30d

Authored by Fred Isaman on Sep 19, 2016

This prepares for adding a conditional LAYOUTGET to the OPEN RPC; the
LAYOUTGET will need the ctx info.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

3b65a30d

pnfs: Store return value of decode_layoutget for later processing · 808ba32a

Authored by Fred Isaman on Oct 04, 2016

This will be needed to separate the return values of OPEN and LAYOUTGET
when they are combined into a single RPC.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

808ba32a

pnfs: Remove redundant assignment from nfs4_proc_layoutget(). · 34ec9aac

Authored by Fred Isaman on Sep 20, 2016

nfs_init_sequence() will clear this for us.
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

34ec9aac

NFSv4: Don't add a new lock on an interrupted wait for LOCK · a3cf9bca

Authored by Benjamin Coddington on May 03, 2018

If the wait for a LOCK operation is interrupted, and then the file is
closed, the locks cleanup code will assume that no new locks will be added
to the inode after it has completed.  We already have a mechanism to detect
if there was a signal, so let's use that to avoid recreating the local lock
once the RPC completes.  Also skip re-sending the LOCK operation for the
various error cases if we were signaled.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
[Trond: Fix inverted test of locks_lock_inode_wait()]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

a3cf9bca

NFSv4: Always clear the pNFS layout when handling ESTALE · cf61eb26

Authored by Trond Myklebust on May 29, 2018

If we get an ESTALE error in response to an RPC call operating on the
file on the MDS, we should immediately cancel the layout for that file.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

cf61eb26

NFSv4: Fix possible 1-byte stack overflow in nfs_idmap_read_and_verify_message · d6889480

Authored by Dave Wysochanski on May 29, 2018

In nfs_idmap_read_and_verify_message there is an incorrect sprintf '%d'
that converts the __u32 'im_id' from struct idmap_msg to 'id_str', which
is a stack char array variable of length NFS_UINT_MAXLEN == 11.
If a uid or gid value is > 2147483647 = 0x7fffffff, the conversion
overflows into a negative value, for example:
crash> p (unsigned) (0x80000000)
$1 = 2147483648
crash> p (signed) (0x80000000)
$2 = -2147483648
The '-' sign is written to the buffer and this causes a 1-byte overflow
when the terminating NUL byte is written, which corrupts kernel stack memory.
If CONFIG_CC_STACKPROTECTOR_STRONG is set we see a stack-protector panic:

[11558053.616565] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: ffffffffa05b8a8c
[11558053.639063] CPU: 6 PID: 9423 Comm: rpc.idmapd Tainted: G        W      ------------ T 3.10.0-514.el7.x86_64 #1
[11558053.641990] Hardware name: Red Hat OpenStack Compute, BIOS 1.10.2-3.el7_4.1 04/01/2014
[11558053.644462]  ffffffff818c7bc0 00000000b1f3aec1 ffff880de0f9bd48 ffffffff81685eac
[11558053.646430]  ffff880de0f9bdc8 ffffffff8167f2b3 ffffffff00000010 ffff880de0f9bdd8
[11558053.648313]  ffff880de0f9bd78 00000000b1f3aec1 ffffffff811dcb03 ffffffffa05b8a8c
[11558053.650107] Call Trace:
[11558053.651347]  [<ffffffff81685eac>] dump_stack+0x19/0x1b
[11558053.653013]  [<ffffffff8167f2b3>] panic+0xe3/0x1f2
[11558053.666240]  [<ffffffff811dcb03>] ? kfree+0x103/0x140
[11558053.682589]  [<ffffffffa05b8a8c>] ? idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.689710]  [<ffffffff810855db>] __stack_chk_fail+0x1b/0x30
[11558053.691619]  [<ffffffffa05b8a8c>] idmap_pipe_downcall+0x1cc/0x1e0 [nfsv4]
[11558053.693867]  [<ffffffffa00209d6>] rpc_pipe_write+0x56/0x70 [sunrpc]
[11558053.695763]  [<ffffffff811fe12d>] vfs_write+0xbd/0x1e0
[11558053.702236]  [<ffffffff810acccc>] ? task_work_run+0xac/0xe0
[11558053.704215]  [<ffffffff811fec4f>] SyS_write+0x7f/0xe0
[11558053.709674]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

Fix this by calling the internally defined nfs_map_numeric_to_string()
function which properly uses '%u' to convert this __u32.  For consistency,
also replace the one other place where snprintf is called.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Reported-by: Stephen Johnston <sjohnsto@redhat.com>
Fixes: cf4ab538 ("NFSv4: Fix the string length returned by the idmapper")
Cc: stable@vger.kernel.org # v3.4+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

d6889480
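
Note (illustrative sketch, not part of the commit): the size arithmetic behind the d6889480 message above can be checked in user space. The function names and the `ID_STR_LEN` macro below are hypothetical; `ID_STR_LEN` is only assumed to mirror the NFS_UINT_MAXLEN == 11 value quoted in the message. The sketch also assumes that converting a value above 0x7fffffff to `int` wraps to a negative number, which holds on common two's-complement platforms.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: mirrors the NFS_UINT_MAXLEN == 11 quoted in the commit message. */
#define ID_STR_LEN 11

/* Bytes needed (including the terminating NUL) when the id is printed with "%d". */
int bytes_needed_signed(uint32_t id)
{
	/* snprintf(NULL, 0, ...) only measures; it writes nothing. */
	return snprintf(NULL, 0, "%d", (int)id) + 1;
}

/* Bytes needed (including the terminating NUL) when the id is printed with "%u". */
int bytes_needed_unsigned(uint32_t id)
{
	return snprintf(NULL, 0, "%u", (unsigned int)id) + 1;
}
```

Under those assumptions, an id of 0x80000000 needs 12 bytes with "%d" (ten digits, the '-' sign and the NUL), one more than the 11-byte buffer, while "%u" needs at most 11 bytes for any 32-bit value.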

NFS: Fix up nfs_post_op_update_inode() to force ctime updates · d554168f

Authored by Trond Myklebust on May 29, 2018

We do not want to ignore ctime updates that originate from functions
such as link().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

d554168f

NFS: Ensure we revalidate the inode correctly after setacl · 472f761e

Authored by Trond Myklebust on Apr 08, 2018

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

472f761e

NFS: Ensure we revalidate the inode correctly after remove or rename · 59a707b0

Authored by Trond Myklebust on Apr 08, 2018

We may need to revalidate the change attribute, ctime and the nlinks count.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

59a707b0

NFS: Set the force revalidate flag if the inode is not completely initialised · 821a868a

Authored by Trond Myklebust on Mar 27, 2018

Ensure that a delegation doesn't cause us to skip initialising the inode
if it was incomplete when we exited nfs_fhget().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

821a868a

NFS: Fix up sillyrename() · 3cb3fd6d

Authored by Trond Myklebust on Apr 09, 2018

Ensure that we register the fact that the inode ctime has changed.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

3cb3fd6d

NFSv4: Fix sillyrename to return the delegation when appropriate · ed7e9ad0

Authored by Trond Myklebust on May 30, 2018

Ensure that we pass down the inode of the file being deleted so
that we can return any delegation being held.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

ed7e9ad0

NFSv4: Only pass the delegation to setattr if we're sending a truncate · 991eedb1

Authored by Trond Myklebust on Apr 09, 2018

Even then it isn't really necessary. The reason why we may not want to
pass in a stateid in other cases is that we cannot use the delegation
credential.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

991eedb1

NFS: Merge nfs41_free_stateid() with _nfs41_free_stateid() · 2f261020

Authored by Anna Schumaker on May 15, 2018

Having these exist as two functions doesn't seem to add anything useful,
and I think merging them together makes this easier to follow.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

2f261020

NFS: Pass "privileged" value to nfs4_init_sequence() · fba83f34

Authored by Anna Schumaker on May 04, 2018

We currently have a separate function just to set this, but I think it
makes more sense to set it at the same time as the other values in
nfs4_init_sequence().
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

fba83f34

NFS: Move call to nfs4_state_protect() to nfs4_commit_setup() · e9ae1ee2

Authored by Anna Schumaker on May 04, 2018

Rather than doing this in the generic NFS client code, let's put this
with the other v4 stuff so it's all in one place.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

e9ae1ee2

NFS: Move call to nfs4_state_protect_write() to nfs4_write_setup() · fb91fb0e

Authored by Anna Schumaker on May 04, 2018

This doesn't really need to be in the generic NFS client code, and I
think it makes more sense to keep the v4 code in one place.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

fb91fb0e

NFS: Avoid quadratic search when freeing delegations. · e04bbf6b

Authored by NeilBrown on Apr 30, 2018

There are three places that walk all delegations for an nfs_client and
restart whenever they find something interesting - potentially
resulting in a quadratic search: If there are 10,000 uninteresting
delegations followed by 10,000 interesting ones, then the code
skips over 100,000,000 delegations, which can take a noticeable amount
of time.

Of these nfs_delegation_reap_unclaimed() and
nfs_reap_expired_delegations() are only called during unusual events:
a server reboots or reports expired delegations, probably due to a
network partition. Optimizing these is not particularly important.

The third, nfs_client_return_marked_delegations(), is called
periodically via nfs_expire_unreferenced_delegations(). It could
cause periodic problems on a busy server.

New delegations are added to the end of the list, so if there are
10,000 open files with delegations, and 10,000 more recently opened files
that received delegations but are now closed, then
nfs_client_return_marked_delegations() can take seconds to skip over
the 10,000 open files 10,000 times. That is a waste of time.

To avoid this waste, a place-holder (an inode) is kept when locks are
dropped, so that the place can usually be found again after taking
rcu_read_lock(). This place-holder ensures that we find the right
starting point in the list of nfs_servers, and makes it probable that
we find the right starting point in the list of delegations.
We might need to occasionally restart at the head of that list.

It might be possible that the place_holder inode could lose its
delegation separately, and then get a new one using the same (freed
and then reallocated) 'struct nfs_delegation'. Were this to happen,
the new delegation would be at the end of the list and we would miss
returning some other delegations. This would have the effect of
unnecessarily delaying the return of some unused delegations until the
next time this function is called - typically 90 seconds later. As
this is not a correctness issue and is vanishingly unlikely to happen,
it does not seem worth addressing.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

e04bbf6b
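
Note (illustrative sketch, not part of the commit): the cost difference described in e04bbf6b can be modelled in user space. Everything below is hypothetical: a plain array stands in for the RCU-protected delegation list, an array index stands in for the place-holder inode, and the names `struct deleg`, `fill()` and `walk()` are invented for this example. It is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a delegation on a client's list. */
struct deleg {
	bool marked;	/* "interesting": should be returned */
	bool gone;	/* already returned and removed from the list */
};

/* Lay out `unmarked` uninteresting entries followed by `marked` interesting ones. */
void fill(struct deleg *d, size_t unmarked, size_t marked)
{
	for (size_t i = 0; i < unmarked + marked; i++) {
		d[i].marked = i >= unmarked;
		d[i].gone = false;
	}
}

/*
 * Handle one marked entry per pass. With use_placeholder == false every
 * pass restarts at the head; with true it resumes just after the last
 * entry handled. Returns how many live, unmarked entries were skipped.
 */
unsigned long walk(struct deleg *d, size_t n, bool use_placeholder)
{
	unsigned long skipped = 0;
	size_t resume = 0;
	bool found;

	do {
		found = false;
		for (size_t i = use_placeholder ? resume : 0; i < n; i++) {
			if (d[i].gone)
				continue;	/* no longer on the list */
			if (!d[i].marked) {
				skipped++;
				continue;
			}
			d[i].gone = true;	/* "return" the delegation */
			resume = i + 1;		/* remember our place */
			found = true;
			break;			/* locks were dropped: new pass */
		}
	} while (found);

	return skipped;
}
```

In this model, 100 uninteresting entries followed by 100 interesting ones cost 10,100 skips when every pass restarts at the head (100 skips for each of the 100 successful passes, plus 100 for the final pass that finds nothing) and 100 skips when the walk resumes from the place-holder.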

NFS: use cond_resched() when restarting walk of delegation list. · 3ca951b6

Authored by NeilBrown on Apr 30, 2018

In three places we walk the list of delegations for an nfs_client
until an interesting one is found, then we act on that delegation
and restart the walk.

New delegations are added to the end of a list and the interesting
delegations are usually old, so in many cases we won't repeat
a long walk over and over again, but it is possible - particularly if
the first server in the list has a large number of uninteresting
delegations.

In each case the work done on interesting delegations will often
complete without sleeping, so this could loop many times without
giving up the CPU.

So add a cond_resched() at an appropriate point to avoid hogging the
CPU for too long.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

3ca951b6

NFS: slight optimization for walking list for delegations · f3893491

Authored by NeilBrown on May 31, 2018

There are 3 places where we walk the list of delegations
for an nfs_client.
In each case there are two nested loops, one for nfs_servers
and one for nfs_delegations.

When we find an interesting delegation we try to get an active
reference to the server.  If that fails, it is pointless to
continue to look at the other delegations for the server as
we will never be able to get an active reference.
So instead of continuing in the inner loop, break out
and continue in the outer loop.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

f3893491
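
Note (illustrative sketch, not part of the commit): the control-flow change in f3893491 amounts to replacing a `continue` in the inner loop with a `break`. The model below is hypothetical; `struct server`, its fields and `count_failed_attempts()` are invented names, and every delegation is assumed to be interesting so that each one triggers an attempt to take an active reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an nfs_server and its delegations. */
struct server {
	bool can_activate;	/* would taking an active reference succeed? */
	size_t ndeleg;		/* number of (interesting) delegations */
};

/* Count how many times we fail to take an active reference. */
size_t count_failed_attempts(const struct server *s, size_t n, bool break_out)
{
	size_t failed = 0;

	for (size_t i = 0; i < n; i++) {			/* outer: servers */
		for (size_t j = 0; j < s[i].ndeleg; j++) {	/* inner: delegations */
			if (!s[i].can_activate) {
				failed++;
				if (break_out)
					break;	/* new behaviour: next server */
				continue;	/* old behaviour: next delegation */
			}
			/* ... act on delegation j of server i ... */
		}
	}
	return failed;
}
```

For a server with five delegations whose reference can never be taken, the old behaviour makes five failing attempts and the new behaviour makes one.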

29 May, 2018 3 commits

NFS: Optimise away lookups for rename targets · 9f6d44d4

Authored by Trond Myklebust on May 10, 2018

We can optimise away any lookup for a rename target, unless we're
being asked to revalidate a dentry that might be in use.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

9f6d44d4

NFS: If the VFS sets LOOKUP_REVAL then force a lookup of the dentry · 73dd684a

Authored by Trond Myklebust on May 10, 2018

If nfs_lookup_revalidate() is called with LOOKUP_REVAL because a
previous path lookup failed, then we ought to force a full lookup
of the component name.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

73dd684a

NFS: Optimise away the close-to-open GETATTR when we have NFSv4 OPEN · 47921921

Authored by Trond Myklebust on May 10, 2018

NFSv4 should not need to perform an extra close-to-open GETATTR as part
of the process of looking up a regular file, since the OPEN call will
do that for us.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

47921921

26 May, 2018 2 commits

proc: fix smaps and meminfo alignment · 6c04ab0e

Authored by Hugh Dickins on May 25, 2018

The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
numbers (commonly 0) are misaligned.

Remove seq_put_decimal_ull_width()'s leftover optimization for single
digits: it's wrong now that num_to_str() takes care of the width.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
Fixes: d1be35cb ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6c04ab0e

ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" · 3373de20

Authored by Changwei Ge on May 25, 2018

This reverts commit ba16ddfb ("ocfs2/o2hb: check len for
bio_add_page() to avoid getting incorrect bio").

In my testing, this patch introduces a problem: mkfs cannot use more
than 16 slots with a 4k block size.

The original logic is actually safe in the situation that commit
describes, so revert it.

Attach test log:
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192
  (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
  (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
  (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5

Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3373de20

24 May, 2018 1 commit

Btrfs: fix error handling in btrfs_truncate() · d5014738

Authored by Omar Sandoval on May 22, 2018

Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).

To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
        char buf[256] = { 0 };
        int ret;
        int fd;

        fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
        if (fd == -1) {
                perror("open");
                return EXIT_FAILURE;
        }

        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                close(fd);
                return EXIT_FAILURE;
        }

        if (fsync(fd) == -1) {
                perror("fsync");
                close(fd);
                return EXIT_FAILURE;
        }

        ret = ftruncate(fd, 128);
        if (ret) {
                printf("ftruncate() returned %d\n", ret);
                close(fd);
                return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
}

Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: Jun Wu <quark@fb.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d5014738
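
Note (illustrative sketch, not part of the commit): the ret/err handling described in d5014738 comes down to one comparison. The helper names below are hypothetical and the code is a reduced user-space model, not the body of btrfs_truncate(). The value 1 for NEED_TRUNCATE_BLOCK is taken from the commit message, which reports a return value of 1 leaking out of ftruncate().

```c
/* Positive, non-error status; the value 1 is what the commit message reports. */
#define NEED_TRUNCATE_BLOCK 1

/* Old behaviour (as described): any non-zero ret is recorded as the error. */
int truncate_status_old(int ret)
{
	int err = 0;

	if (ret)
		err = ret;
	return err;
}

/* Fixed behaviour (as described): only a negative ret is an error. */
int truncate_status_fixed(int ret)
{
	int err = 0;

	if (ret < 0)
		err = ret;
	return err;
}
```

With ret == NEED_TRUNCATE_BLOCK the old helper reports 1 and the fixed helper reports 0, while a negative errno value such as -5 passes through both unchanged.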

22 May, 2018 6 commits

aio: fix io_destroy(2) vs. lookup_ioctx() race · baf10564

Authored by Al Viro on May 20, 2018

kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx(). As a result, we could get ctx freed right under
lookup_ioctx(). Tejun has fixed that in a6d7cff4 ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.

Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs(). That does
INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.

In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().

Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get(). Sure, we'd increment the counter before
ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.

The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.

Cc: stable@kernel.org
Fixes: a6d7cff4 "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

baf10564
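
Note (illustrative sketch, not part of the commit): the difference between an unconditional get and a "tryget live" in baf10564 can be shown with a deliberately simplified, single-threaded model. This is not the kernel's percpu_ref: there are no atomics, no per-CPU counters and no RCU here, and `struct ref`, `ref_get()`, `ref_tryget_live()` and `ref_kill()` are hypothetical names chosen to echo the calls mentioned in the message.

```c
#include <stdbool.h>

/* Hypothetical, non-atomic model of a killable reference count. */
struct ref {
	long count;	/* starts at 1: the table's own reference */
	bool dead;	/* set once the object has been killed */
};

/* Unconditional get: will happily bump a count that already hit zero. */
void ref_get(struct ref *r)
{
	r->count++;
}

/* Get only if the object has not been killed; failure is treated as a miss. */
bool ref_tryget_live(struct ref *r)
{
	if (r->dead)
		return false;
	r->count++;
	return true;
}

/* Mark dead and drop the initial reference; true means teardown has begun. */
bool ref_kill(struct ref *r)
{
	r->dead = true;
	return --r->count == 0;
}
```

In this model, calling ref_get() after ref_kill() takes the count from 0 back to 1 on an object whose teardown has already started, which is the situation the message describes. ref_tryget_live() refuses instead and leaves the count untouched.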

ext2: fix a block leak · 5aa1437d

Authored by Al Viro on May 17, 2018

Open a file, unlink it, then use ioctl(2) to make it immutable or
append only.  Now close it and watch the blocks *not* freed...

Immutable/append-only checks belong in ->setattr().
Note: the bug is old and a backport to anything prior to 737f2e93
("ext2: convert to use the new truncate convention") will need
these checks lifted into ext2_setattr().

Cc: stable@kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5aa1437d

nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed · 3819bb0d

Authored by Al Viro on May 11, 2018

That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.

Some vfs_mkdir() callers forget about that possibility...
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3819bb0d

cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed · 9c3e9025

Authored by Al Viro on May 10, 2018

That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.

Some vfs_mkdir() callers forget about that possibility...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9c3e9025

unfuck sysfs_mount() · 7b745a4e

Authored by Al Viro on May 14, 2018

new_sb is left uninitialized in case of early failures in kernfs_mount_ns(),
and while IS_ERR(root) is true in all such cases, using IS_ERR(root) || !new_sb
is not a solution - IS_ERR(root) is true in some cases when new_sb is true.

Make sure new_sb is initialized (and matches the reality) in all cases and
fix the condition for dropping kobj reference - we want it done precisely
in those situations where the reference has not been transferred into a new
super_block instance.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

7b745a4e

kernfs: deal with kernfs_fill_super() failures · 82382ace

Authored by Al Viro on Apr 03, 2018

Make sure that info->node is initialized early, so that kernfs_kill_sb()
can list_del() it safely.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

82382ace

openanolis / cloud-kernel synced successfully more than 1 year ago
