提交 · 6fc4749d25738e1a5e5b02d04a0a60bbae516652 · openeuler / Kernel

29 5月, 2018 29 次提交

btrfs: make success path out of btrfs_init_dev_replace_tgtdev more clear · 6fc4749d

由 David Sterba 提交于 3月 20, 2018

This is a preparatory cleanup that will make clear that the only
successful way out of btrfs_init_dev_replace_tgtdev will also set the
device_out to a valid pointer. With this guarantee, the callers can be
simplified.
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

6fc4749d

btrfs: squeeze btrfs_dev_replace_continue_on_mount to its caller · 00251a52

由 David Sterba 提交于 3月 20, 2018

The function is called once and is fairly small, we can merge it with
the caller.
Reviewed-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

00251a52

btrfs: cleanup btrfs_rm_device() promote fs_devices pointer · b5185197

由 Anand Jain 提交于 4月 12, 2018

This function uses fs_info::fs_devices number of time, however we
declare and use it only at the end, instead do it in the beginning of
the function and use it.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b5185197

btrfs: cleanup find_device() drop list_head pointer · 636d2c9d

由 Anand Jain 提交于 4月 12, 2018

find_device() declares struct list_head *head pointer and used only once,
instead just use it directly.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

636d2c9d

btrfs: rename __btrfs_open_devices to open_fs_devices · 897fb573

由 Anand Jain 提交于 4月 12, 2018

__btrfs_open_devices() is un-exported drop __ prefix and rename it to
open_fs_devices().
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

897fb573

btrfs: rename __btrfs_close_devices to close_fs_devices · 0226e0eb

由 Anand Jain 提交于 4月 12, 2018

__btrfs_close_devices() is un-exported, drop the __ prefix and rename it
to close_fs_devices().
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0226e0eb

btrfs: cleanup __btrfs_open_devices() drop head pointer · f117e290

由 Anand Jain 提交于 4月 12, 2018

__btrfs_open_devices() declares struct list_head *head, however head is
used only once, instead use btrfs_fs_devices::devices directly.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f117e290

btrfs: rename struct btrfs_fs_devices::list · c4babc5e

由 Anand Jain 提交于 4月 12, 2018

btrfs_fs_devices::list is the list of BTRFS fsid in the kernel, a generic
name 'list' makes it's search very difficult, rename it to fs_list.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c4babc5e

btrfs: Drop fs_info parameter from btrfs_merge_delayed_refs · be97f133

由 Nikolay Borisov 提交于 4月 19, 2018

It's provided by the transaction handle.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

be97f133

btrfs: Drop fs_info parameter from add_delayed_data_ref · f033798d

由 Nikolay Borisov 提交于 4月 19, 2018

It's provided by the transaction handle.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f033798d

btrfs: Drop add_delayed_ref_head fs_info parameter · 1acda0c2

由 Nikolay Borisov 提交于 4月 19, 2018

It's provided by the transaction handle.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1acda0c2

btrfs: Remove btrfs_wait_and_free_delalloc_work · 40012f96

由 Nikolay Borisov 提交于 4月 19, 2018

This function is called from only 1 place and is effectively a wrapper
over wait_completion/kfree. It doesn't really bring any value having
those two calls in a separate function. Just open code it and remove it.
No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

40012f96

btrfs: Remove tree argument from extent_writepages · 8ae225a8

由 Nikolay Borisov 提交于 4月 19, 2018

It can be directly referenced from the passed address_space so do that.
No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

8ae225a8

btrfs: Use list_empty instead of list_empty_careful · 81f1d390

由 Nikolay Borisov 提交于 4月 19, 2018

list_empty_careful usually is a signal of something tricky going on. Its
usage in btrfs is actually not needed since both lists it's used on are
local to a function and cannot be modified concurrently. So switch to
plain list_empty. No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

81f1d390

btrfs: Remove redundant tree argument from extent_readpages · 2a3ff0ad

由 Nikolay Borisov 提交于 4月 19, 2018

This function is called only from btrfs_readpage and is already passed
the mapping. Simplify its signature by moving the code obtaining
reference to the extent tree in the function. No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2a3ff0ad

btrfs: Remove map argument from try_release_extent_state · 29c68b2d

由 Nikolay Borisov 提交于 4月 19, 2018

It's not used in the function so just remove it. No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

29c68b2d

btrfs: Sink extent_tree arguments in try_release_extent_mapping · 477a30ba

由 Nikolay Borisov 提交于 4月 19, 2018

This function already gets the page from which the two extent trees
are referenced. Simplify its signature by moving the code getting the
trees inside the function. No functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

477a30ba

btrfs: Allow rmdir(2) to delete an empty subvolume · a79a464d

由 Misono Tomohiro 提交于 4月 18, 2018

Change the behavior of rmdir(2) and allow it to delete an empty
subvolume by using btrfs_delete_subvolume() which is used by
btrfs_ioctl_snap_destroy().

This is a change in behaviour and has been requested by users. Deleting
the subvolume by ioctl requires root permissions while the rmdir way
does works with standard tools and syscalls for all users that can
access the subvolume.

The main usecase is to allow 'rm -rf /path/with/subvols' to simply work.
We were not able to find any nasty usability surprises, the intention is
to do the destructive rm. Without allowing rmdir, this would have to be
followed by the ioctl subvolume deletion, which is more of an annoyance.

Implementation details:

The required lock for @dir and inode of @dentry is already acquired in
vfs layer.

We need some check before deleting a subvolume. Permission check is done
in vfs layer, emptiness check is in btrfs_rmdir() and additional check
(i.e. neither the subvolume is a default subvolume nor send is in progress)
is in btrfs_delete_subvolume().

Note that in btrfs_ioctl_snap_destroy(), d_delete() is called after
btrfs_delete_subvolume(). For rmdir(2), d_delete() is called in vfs
layer later.
Tested-by: NGoffredo Baroncelli <kreijack@inwind.it>
Signed-off-by: NTomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ enhance changelog ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a79a464d

btrfs: Factor out the main deletion process from btrfs_ioctl_snap_destroy() · f60a2364

由 Misono Tomohiro 提交于 4月 18, 2018

Factor out the second half of btrfs_ioctl_snap_destroy() as
btrfs_delete_subvolume(), which performs some subvolume specific checks
before deletion:

1. send is not in progress
2. the subvolume is not the default subvolume
3. the subvolume does not contain other subvolumes

and actual deletion process. btrfs_delete_subvolume() requires
inode_lock for both @dir and inode of @dentry. The remaining part of
btrfs_ioctl_snap_destroy() is mainly permission checks.

Note that call of d_delete() is not included in btrfs_delete_subvolume()
as this function will also be used by btrfs_rmdir() to delete an empty
subvolume and in that case d_delete() is called in VFS layer.

As a result, btrfs_unlink_subvol() and may_destroy_subvol()
become static functions. No functional changes.
Signed-off-by: NTomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f60a2364

btrfs: Move may_destroy_subvol() from ioctl.c to inode.c · ec42f167

由 Misono Tomohiro 提交于 4月 18, 2018

This is a preparation work to refactor btrfs_ioctl_snap_destroy()
and to allow rmdir(2) to delete an empty subvolume.
Signed-off-by: NTomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ minor update of the function comment ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ec42f167

btrfs: remove unused le_test_bit() · 3b079a91

由 Howard McLauchlan 提交于 4月 18, 2018

With commit b18253ec57c0 ("btrfs: optimize free space tree bitmap
conversion"), there are no more callers to le_test_bit(). This patch
removes le_test_bit().
Signed-off-by: NHoward McLauchlan <hmclauchlan@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3b079a91

btrfs: optimize free space tree bitmap conversion · a565971f

由 Howard McLauchlan 提交于 4月 18, 2018

Presently, convert_free_space_to_extents() does a linear scan of the
bitmap. We can speed this up with find_next_{bit,zero_bit}_le().

This patch replaces the linear scan with find_next_{bit,zero_bit}_le().
Testing shows a 20-33% decrease in execution time for
convert_free_space_to_extents().

Since we change bitmap to be unsigned long, we have to do some casting
for the bitmap cursor. In le_bitmap_set() it makes sense to use u8, as
we are doing bit operations. Everywhere else, we're just using it for
pointer arithmetic and not directly accessing it, so char seems more
appropriate.
Suggested-by: NOmar Sandoval <osandov@osandov.com>
Signed-off-by: NHoward McLauchlan <hmclauchlan@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a565971f

btrfs: clean up le_bitmap_{set, clear}() · 6faa8f47

由 Howard McLauchlan 提交于 4月 18, 2018

le_bitmap_set() is only used by free-space-tree, so move it there and
make it static. le_bitmap_clear() is not used, so remove it.
Signed-off-by: NHoward McLauchlan <hmclauchlan@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

6faa8f47

btrfs: use fs_info for btrfs_handle_em_exist tracepoint · f46b24c9

由 David Sterba 提交于 4月 03, 2018

We really want to know to which filesystem the extent map events belong,
but as it cannot be reached from the extent_map pointers, we need to
pass it down the callchain.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f46b24c9

btrfs: tests: pass fs_info to extent_map tests · 0e08eb9b

由 David Sterba 提交于 4月 03, 2018

Preparatory work to pass fs_info to btrfs_add_extent_mapping so we can
get a better tracepoint message. Extent maps do not need fs_info for
anything so we only add a dummy one without any other initialization.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0e08eb9b

btrfs: Consolidate error checking for btrfs_alloc_chunk · 57f1642e

由 Nikolay Borisov 提交于 4月 11, 2018

The second if is really a subcase of ret being less than 0. So
introduce a generic if (ret < 0) check, and inside have another if
which explicitly handles the -ENOSPC and any other errors. No
functional changes.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

57f1642e

btrfs: Fix lock release order · 1e7a1421

由 Nikolay Borisov 提交于 4月 11, 2018

Locks should generally be released in the oppposite order they are
acquired. Generally lock acquisiton ordering is used to ensure
deadlocks don't happen. However, as becomes more complicated it's
best to also maintain proper unlock order so as to avoid possible dead
locks. This was found by code inspection and doesn't necessarily lead
to a deadlock scenario.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

1e7a1421

btrfs: Use while loop instead of labels in __endio_write_update_ordered · b25f0d00

由 Nikolay Borisov 提交于 4月 11, 2018

Currently __endio_write_update_ordered uses labels to implement
what is essentially a simple while loop. This makes the code more
cumbersome to follow than it actually has to be. No functional
changes. No xfstest regressions were found during testing.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b25f0d00

btrfs: add comment about BTRFS_FS_EXCL_OP · 89595e80

由 Anand Jain 提交于 4月 18, 2018

Adds comments about BTRFS_FS_EXCL_OP to existing comments
about the device locks.
Signed-off-by: NAnand Jain <anand.jain@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ minor updates ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

89595e80

28 5月, 2018 3 次提交

btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq · 41d0bd3b

由 Nikolay Borisov 提交于 4月 04, 2018

It's used to print its pointer in a debug statement but doesn't really
bring any useful information to the error message.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

41d0bd3b

btrfs: rename btrfs_get_block_group_info and make it static · c065f5b1

由 Su Yue 提交于 4月 02, 2018

The function btrfs_get_block_group_info() was introduced by the
commit 5af3e8cc ("Btrfs: make filesystem read-only when submitting
 barrier fails") which used it in disk-io.c.

However, the function is only called in ioctl.c now.
Its parameter type btrfs_ioctl_space_info* is only for ioctl.

So, make it static and rename it to be original name
get_block_group_info.

No functional change.
Signed-off-by: NSu Yue <suy.fnst@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c065f5b1

btrfs: Replace owner argument in add_pinned_bytes with a boolean · 29d2b84c

由 Nikolay Borisov 提交于 3月 30, 2018

add_pinned_bytes really cares whether the bytes being pinned are either
data or metadata. To that effect it checks whether the 'owner' argument
is less than BTRFS_FIRST_FREE_OBJECTID (256). This works because
owner can really have 2 types of values:

 a) For metadata extents it holds the level at which the parent is in
    the btree. This amounts to owner having the values 0-7

 b) In case of modifying data extents, owner is the inode number
    to which those extents belongs.

Let's make this more explicit byt converting the owner parameter to a
boolean value and either pass it directly when we know the type of
extents we are working with (i.e. in btrfs_free_tree_block). In cases
when the parent function can be called on both metadata/data extents
perform the check in the caller. This hopefully makes the interface
of add_pinned_bytes more intuitive.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

29d2b84c

26 5月, 2018 2 次提交

proc: fix smaps and meminfo alignment · 6c04ab0e

由 Hugh Dickins 提交于 5月 25, 2018

The 4.17-rc /proc/meminfo and /proc/<pid>/smaps look ugly: single-digit
numbers (commonly 0) are misaligned.

Remove seq_put_decimal_ull_width()'s leftover optimization for single
digits: it's wrong now that num_to_str() takes care of the width.

Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805241554210.1326@eggly.anvils
Fixes: d1be35cb ("proc: add seq_put_decimal_ull_width to speed up /proc/pid/smaps")
Signed-off-by: NHugh Dickins <hughd@google.com>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6c04ab0e

ocfs2: revert "ocfs2/o2hb: check len for bio_add_page() to avoid getting incorrect bio" · 3373de20

由 Changwei Ge 提交于 5月 25, 2018

This reverts commit ba16ddfb ("ocfs2/o2hb: check len for
bio_add_page() to avoid getting incorrect bio").

In my testing, this patch introduces a problem that mkfs can't have
slots more than 16 with 4k block size.

And the original logic is safe actually with the situation it mentions
so revert this commit.

Attach test log:
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 0, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 1, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 2, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 3, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 4, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 5, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 6, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 7, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 8, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 9, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 10, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 11, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 12, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 13, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 14, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 15, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:463 page 16, vec_len = 4096, vec_start = 0
  (mkfs.ocfs2,27479,2):o2hb_setup_one_bio:471 ERROR: Adding page[16] to bio failed, page ffffea0002d7ed40, len 0, vec_len 4096, vec_start 0,bi_sector 8192
  (mkfs.ocfs2,27479,2):o2hb_read_slots:500 ERROR: status = -5
  (mkfs.ocfs2,27479,2):o2hb_populate_slot_data:1911 ERROR: status = -5
  (mkfs.ocfs2,27479,2):o2hb_region_dev_write:2012 ERROR: status = -5

Link: http://lkml.kernel.org/r/SIXPR06MB0461721F398A5A92FC68C39ED5920@SIXPR06MB0461.apcprd06.prod.outlook.comSigned-off-by: NChangwei Ge <ge.changwei@h3c.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3373de20

24 5月, 2018 1 次提交

Btrfs: fix error handling in btrfs_truncate() · d5014738

由 Omar Sandoval 提交于 5月 22, 2018

Jun Wu at Facebook reported that an internal service was seeing a return
value of 1 from ftruncate() on Btrfs in some cases. This is coming from
the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().

btrfs_truncate() uses two variables for error handling, ret and err.
When btrfs_truncate_inode_items() returns non-zero, we set err to the
return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
only set err if ret is an error (i.e., negative).

To reproduce the issue: mount a filesystem with -o compress-force=zstd
and the following program will encounter return value of 1 from
ftruncate:

int main(void) {
        char buf[256] = { 0 };
        int ret;
        int fd;

        fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
        if (fd == -1) {
                perror("open");
                return EXIT_FAILURE;
        }

        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                perror("write");
                close(fd);
                return EXIT_FAILURE;
        }

        if (fsync(fd) == -1) {
                perror("fsync");
                close(fd);
                return EXIT_FAILURE;
        }

        ret = ftruncate(fd, 128);
        if (ret) {
                printf("ftruncate() returned %d\n", ret);
                close(fd);
                return EXIT_FAILURE;
        }

        close(fd);
        return EXIT_SUCCESS;
}

Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
CC: stable@vger.kernel.org # 4.15+
Reported-by: NJun Wu <quark@fb.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d5014738

22 5月, 2018 5 次提交

aio: fix io_destroy(2) vs. lookup_ioctx() race · baf10564

由 Al Viro 提交于 5月 20, 2018

kill_ioctx() used to have an explicit RCU delay between removing the
reference from ->ioctx_table and percpu_ref_kill() dropping the refcount.
At some point that delay had been removed, on the theory that
percpu_ref_kill() itself contained an RCU delay. Unfortunately, that was
the wrong kind of RCU delay and it didn't care about rcu_read_lock() used
by lookup_ioctx(). As the result, we could get ctx freed right under
lookup_ioctx(). Tejun has fixed that in a6d7cff4 ("fs/aio: Add explicit
RCU grace period when freeing kioctx"); however, that fix is not enough.

Suppose io_destroy() from one thread races with e.g. io_setup() from another;
CPU1 removes the reference from current->mm->ioctx_table[...] just as CPU2
has picked it (under rcu_read_lock()). Then CPU1 proceeds to drop the
refcount, getting it to 0 and triggering a call of free_ioctx_users(),
which proceeds to drop the secondary refcount and once that reaches zero
calls free_ioctx_reqs(). That does
INIT_RCU_WORK(&ctx->free_rwork, free_ioctx);
queue_rcu_work(system_wq, &ctx->free_rwork);
and schedules freeing the whole thing after RCU delay.

In the meanwhile CPU2 has gotten around to percpu_ref_get(), bumping the
refcount from 0 to 1 and returned the reference to io_setup().

Tejun's fix (that queue_rcu_work() in there) guarantees that ctx won't get
freed until after percpu_ref_get(). Sure, we'd increment the counter before
ctx can be freed. Now we are out of rcu_read_lock() and there's nothing to
stop freeing of the whole thing. Unfortunately, CPU2 assumes that since it
has grabbed the reference, ctx is *NOT* going away until it gets around to
dropping that reference.

The fix is obvious - use percpu_ref_tryget_live() and treat failure as miss.
It's not costlier than what we currently do in normal case, it's safe to
call since freeing *is* delayed and it closes the race window - either
lookup_ioctx() comes before percpu_ref_kill() (in which case ctx->users
won't reach 0 until the caller of lookup_ioctx() drops it) or lookup_ioctx()
fails, ctx->users is unaffected and caller of lookup_ioctx() doesn't see
the object in question at all.

Cc: stable@kernel.org
Fixes: a6d7cff4 "fs/aio: Add explicit RCU grace period when freeing kioctx"
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

baf10564

ext2: fix a block leak · 5aa1437d

由 Al Viro 提交于 5月 17, 2018

open file, unlink it, then use ioctl(2) to make it immutable or
append only.  Now close it and watch the blocks *not* freed...

Immutable/append-only checks belong in ->setattr().
Note: the bug is old and backport to anything prior to 737f2e93
("ext2: convert to use the new truncate convention") will need
these checks lifted into ext2_setattr().

Cc: stable@kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5aa1437d

nfsd: vfs_mkdir() might succeed leaving dentry negative unhashed · 3819bb0d

由 Al Viro 提交于 5月 11, 2018

That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.

Some vfs_mkdir() callers forget about that possibility...
Acked-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3819bb0d

cachefiles: vfs_mkdir() might succeed leaving dentry negative unhashed · 9c3e9025

由 Al Viro 提交于 5月 10, 2018

That can (and does, on some filesystems) happen - ->mkdir() (and thus
vfs_mkdir()) can legitimately leave its argument negative and just
unhash it, counting upon the lookup to pick the object we'd created
next time we try to look at that name.

Some vfs_mkdir() callers forget about that possibility...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9c3e9025

unfuck sysfs_mount() · 7b745a4e

由 Al Viro 提交于 5月 14, 2018

new_sb is left uninitialized in case of early failures in kernfs_mount_ns(),
and while IS_ERR(root) is true in all such cases, using IS_ERR(root) || !new_sb
is not a solution - IS_ERR(root) is true in some cases when new_sb is true.

Make sure new_sb is initialized (and matches the reality) in all cases and
fix the condition for dropping kobj reference - we want it done precisely
in those situations where the reference has not been transferred into a new
super_block instance.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7b745a4e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功