提交 · ed46ff3d423780fa5173b38a844bf0fdb210a2a7 · openeuler / Kernel

17 12月, 2018 22 次提交

由 Omar Sandoval 提交于 11月 03, 2016

Btrfs has not allowed swap files since commit 35054394 ("Btrfs: stop
providing a bmap operation to avoid swapfile corruptions"). However, now
that the proper restrictions are in place, Btrfs can support swap files
through the swap file a_ops, similar to iomap in commit 67482129
("iomap: add a swapfile activation function").

For Btrfs, activation needs to make sure that the file can be used as a
swap file, which currently means that it must be fully allocated as
NOCOW with no compression on one device. It must also do the proper
tracking so that ioctls will not interfere with the swap file.
Deactivation clears this tracking.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ed46ff3d

Btrfs: rename and export get_chunk_map · 60ca842e

由 Omar Sandoval 提交于 5月 16, 2018

The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
make it non-static.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

60ca842e

Btrfs: prevent ioctls from interfering with a swap file · eede2bf3

由 Omar Sandoval 提交于 11月 03, 2016

A later patch will implement swap file support for Btrfs, but before we
do that, we need to make sure that the various Btrfs ioctls cannot
change a swap file.

When a swap file is active, we must make sure that the extents of the
file are not moved and that they don't become shared. That means that
the following are not safe:

- chattr +c (enable compression)
- reflink
- dedupe
- snapshot
- defrag

Don't allow those to happen on an active swap file.

Additionally, balance, resize, device remove, and device replace are
also unsafe if they affect an active swapfile. Add a red-black tree of
block groups and devices which contain an active swapfile. Relocation
checks each block group against this tree and skips it or errors out for
balance or resize, respectively. Device remove and device replace check
the tree for the device they will operate on.

Note that we don't have to worry about chattr -C (disable nocow), which
we ignore for non-empty files, because an active swapfile must be
non-empty and can't be truncated. We also don't have to worry about
autodefrag because it's only done on COW files. Truncate and fallocate
are already taken care of by the generic code. Device add doesn't do
relocation so it's not an issue, either.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

eede2bf3

btrfs: Remove extent_io_ops::split_extent_hook callback · abbb55f4

由 Nikolay Borisov 提交于 11月 01, 2018

This is the counterpart to merge_extent_hook, similarly, it's used only
for data/freespace inodes so let's remove it, rename it and call it
directly where necessary. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

abbb55f4

btrfs: Remove extent_io_ops::merge_extent_hook callback · 5c848198

由 Nikolay Borisov 提交于 11月 01, 2018

This callback is used only for data and free space inodes. Such inodes
are guaranteed to have their extent_io_tree::private_data set to the
inode struct. Exploit this fact to directly call the function. Also give
it a more descriptive name. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5c848198

btrfs: Remove extent_io_ops::clear_bit_hook callback · a36bb5f9

由 Nikolay Borisov 提交于 11月 01, 2018

This is the counterpart to ex-set_bit_hook (now btrfs_set_delalloc_extent),
similar to what was done before remove clear_bit_hook and rename the
function. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

a36bb5f9

btrfs: Remove extent_io_ops::set_bit_hook extent_io callback · e06a1fc9

由 Nikolay Borisov 提交于 11月 01, 2018

This callback is used to properly account delalloc extents for data
inodes (ordinary file inodes and freespace v1 inodes). Those can be
easily identified since they have their extent_io trees ->private_data
member point to the inode. Let's exploit this fact to remove the
needless indirection through extent_io_hooks and directly call the
function. Also give the function a name which reflects its purpose -
btrfs_set_delalloc_extent.

This patch also modified test_find_delalloc so that the extent_io_tree
used for testing doesn't have its ->private_data set which would have
caused a crash in btrfs_set_delalloc_extent due to the btrfs_inode->root
member not being initialised. The old version of the code also didn't
call set_bit_hook since the extent_io ops weren't set for the inode.  No
functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e06a1fc9

btrfs: Remove extent_io_ops::check_extent_io_range callback · 65a680f6

由 Nikolay Borisov 提交于 11月 01, 2018

This callback was only used in debug builds by btrfs_leak_debug_check.
A better approach is to move its implementation in
btrfs_leak_debug_check and ensure the latter is only executed for extent
tree which have ->private_data set i.e. relate to a data node and not
the btree one. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

65a680f6

btrfs: Remove extent_io_ops::writepage_end_io_hook · 7087a9d8

由 Nikolay Borisov 提交于 11月 01, 2018

This callback is ony ever called for data page writeout so there is no
need to actually abstract it via extent_io_ops. Lets just export it,
remove the definition of the callback and call it directly in the
functions that invoke the callback. Also rename the function to
btrfs_writepage_endio_finish_ordered since what it really does is
account finished io in the ordered extent data structures.  No
functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

7087a9d8

btrfs: Remove extent_io_ops::writepage_start_hook · d75855b4

由 Nikolay Borisov 提交于 11月 01, 2018

This hook is called only from __extent_writepage_io which is already
called only from the data page writeout path. So there is no need to
make an indirect call via extent_io_ops. This patch just removes the
callback definition, exports the callback function and calls it directly
at the only call site. Also give the function a more descriptive name.
No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d75855b4

btrfs: Remove extent_io_ops::fill_delalloc · 5eaad97a

由 Nikolay Borisov 提交于 11月 01, 2018

This callback is called only from writepage_delalloc which in turn is
guaranteed to be called from the data page writeout path. In the end
there is no reason to have the call to this function to be indrected via
the extent_io_ops structure. This patch removes the callback definition,
exports the function and calls it directly. No functional changes.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ rename to btrfs_run_delalloc_range ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5eaad97a

btrfs: Add function to distinguish between data and btree inode · 06f2548f

由 Nikolay Borisov 提交于 11月 09, 2018

This will be used in future patches that remove the optional
extent_io_ops callbacks.
Signed-off-by: NNikolay Borisov <nborisov@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

06f2548f

btrfs: volumes: Make sure no dev extent is beyond device boundary · 05a37c48

由 Qu Wenruo 提交于 10月 05, 2018

Add extra dev extent end check against device boundary.
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

05a37c48

btrfs: volumes: Make sure there is no overlap of dev extents at mount time · 5eb19381

由 Qu Wenruo 提交于 10月 05, 2018

Enhance btrfs_verify_dev_extents() to remember previous checked dev
extents, so it can verify no dev extents can overlap.

Analysis from Hans:

"Imagine allocating a DATA|DUP chunk.

 In the chunk allocator, we first set...
   max_stripe_size = SZ_1G;
   max_chunk_size = BTRFS_MAX_DATA_CHUNK_SIZE
 ... which is 10GiB.

 Then...
   /* we don't want a chunk larger than 10% of writeable space */
   max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
       		 max_chunk_size);

 Imagine we only have one 7880MiB block device in this filesystem. Now
 max_chunk_size is down to 788MiB.

 The next step in the code is to search for max_stripe_size * dev_stripes
 amount of free space on the device, which is in our example 1GiB * 2 =
 2GiB. Imagine the device has exactly 1578MiB free in one contiguous
 piece. This amount of bytes will be put in devices_info[ndevs - 1].max_avail

 Next we recalculate the stripe_size (which is actually the device extent
 length), based on the actual maximum amount of available raw disk space:
   stripe_size = div_u64(devices_info[ndevs - 1].max_avail, dev_stripes);

 stripe_size is now 789MiB

 Next we do...
   data_stripes = num_stripes / ncopies
 ...where data_stripes ends up as 1, because num_stripes is 2 (the amount
 of device extents we're going to have), and DUP has ncopies 2.

 Next there's a check...
   if (stripe_size * data_stripes > max_chunk_size)
 ...which matches because 789MiB * 1 > 788MiB.

 We go into the if code, and next is...
   stripe_size = div_u64(max_chunk_size, data_stripes);
 ...which resets stripe_size to max_chunk_size: 788MiB

 Next is a fun one...
   /* bump the answer up to a 16MB boundary */
   stripe_size = round_up(stripe_size, SZ_16M);
 ...which changes stripe_size from 788MiB to 800MiB.

 We're not done changing stripe_size yet...
   /* But don't go higher than the limits we found while searching
    * for free extents
    */
   stripe_size = min(devices_info[ndevs - 1].max_avail,
       	      stripe_size);

 This is bad. max_avail is twice the stripe_size (we need to fit 2 device
 extents on the same device for DUP).

 The result here is that 800MiB < 1578MiB, so it's unchanged. However,
 the resulting DUP chunk will need 1600MiB disk space, which isn't there,
 and the second dev_extent might extend into the next thing (next
 dev_extent? end of device?) for 22MiB.

 The last shown line of code relies on a situation where there's twice
 the value of stripe_size present as value for the variable stripe_size
 when it's DUP. This was actually the case before commit 92e222df
 "btrfs: alloc_chunk: fix DUP stripe size handling", from which I quote:
   "[...] in the meantime there's a check to see if the stripe_size does
 not exceed max_chunk_size. Since during this check stripe_size is twice
 the amount as intended, the check will reduce the stripe_size to
 max_chunk_size if the actual correct to be used stripe_size is more than
 half the amount of max_chunk_size."

 In the previous version of the code, the 16MiB alignment (why is this
 done, by the way?) would result in a 50% chance that it would actually
 do an 8MiB alignment for the individual dev_extents, since it was
 operating on double the size. Does this matter?

 Does it matter that stripe_size can be set to anything which is not
 16MiB aligned because of the amount of remaining available disk space
 which is just taken?

 What is the main purpose of this round_up?

 The most straightforward thing to do seems something like...
   stripe_size = min(
       div_u64(devices_info[ndevs - 1].max_avail, dev_stripes),
       stripe_size
   )
 ..just putting half of the max_avail into stripe_size."

Link: https://lore.kernel.org/linux-btrfs/b3461a38-e5f8-f41d-c67c-2efac8129054@mendix.com/Reported-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
[ add analysis from report ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5eb19381

btrfs: Refactor find_free_extent loops update into find_free_extent_update_loop · e72d79d6

由 Qu Wenruo 提交于 11月 02, 2018

We have a complex loop design for find_free_extent(), that has different
behavior for each loop, some even includes new chunk allocation.

Instead of putting such a long code into find_free_extent() and makes it
harder to read, just extract them into find_free_extent_update_loop().

With all the cleanups, the main find_free_extent() should be pretty
barebone:

find_free_extent()
|- Iterate through all block groups
|  |- Get a valid block group
|  |- Try to do clustered allocation in that block group
|  |- Try to do unclustered allocation in that block group
|  |- Check if the result is valid
|  |  |- If valid, then exit
|  |- Jump to next block group
|
|- Push harder to find free extents
   |- If not found, re-iterate all block groups
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com>
[ copy callchain from changelog to function comment ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e72d79d6

btrfs: Refactor unclustered extent allocation into find_free_extent_unclustered() · e1a41848

由 Qu Wenruo 提交于 11月 02, 2018

This patch will extract unclsutered extent allocation code into
find_free_extent_unclustered().

And this helper function will use return value to indicate what to do
next.

This should make find_free_extent() a little easier to read.
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
[Update merge conflict with fb5c39d7 ("btrfs: don't use ctl->free_space for max_extent_size")]
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e1a41848

btrfs: Refactor clustered extent allocation into find_free_extent_clustered · d06e3bb6

由 Qu Wenruo 提交于 11月 02, 2018

We have two main methods to find free extents inside a block group:

1) clustered allocation
2) unclustered allocation

This patch will extract the clustered allocation into
find_free_extent_clustered() to make it a little easier to read.

Instead of jumping between different labels in find_free_extent(), the
helper function will use return value to indicate different behavior.
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d06e3bb6

btrfs: Introduce find_free_extent_ctl structure for later rework · b4bd745d

由 Qu Wenruo 提交于 11月 02, 2018

Instead of tons of different local variables in find_free_extent(),
extract them into find_free_extent_ctl structure, and add better
explanation for them.

Some modification may looks redundant, but will later greatly simplify
function parameter list during find_free_extent() refactor.

Also add two comments to co-operate with fb5c39d7 ("btrfs: don't use
ctl->free_space for max_extent_size"), to make ffe_ctl->max_extent_size
update more reader-friendly.
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NSu Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

b4bd745d

btrfs: extent-tree: Detect bytes_pinned underflow earlier · e2907c1a

由 Lu Fengqi 提交于 10月 24, 2018

Introduce a new wrapper update_bytes_pinned to replace open coded
bytes_pinned modifiers. Now the underflows of space_info::bytes_pinned
get detected and reported.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e2907c1a

btrfs: extent-tree: Detect bytes_may_use underflow earlier · 9f9b8e8d

由 Qu Wenruo 提交于 10月 24, 2018

Although we have space_info::bytes_may_use underflow detection in
btrfs_free_reserved_data_space_noquota(), we have more callers who are
subtracting number from space_info::bytes_may_use.

So instead of doing underflow detection for every caller, introduce a
new wrapper update_bytes_may_use() to replace open coded bytes_may_use
modifiers.

This also introduce a macro to declare more wrappers, but currently
space_info::bytes_may_use is the mostly interesting one.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

9f9b8e8d

Btrfs: remove no longer used stuff for tracking pending ordered extents · 85dd506c

由 Filipe Manana 提交于 10月 26, 2018

Tracking pending ordered extents per transaction was introduced in commit
50d9aa99 ("Btrfs: make sure logged extents complete in the current
transaction V3") and later updated in commit 161c3549 ("Btrfs: change
how we wait for pending ordered extents").

However now that on fsync we always wait for ordered extents to complete
before logging, done in commit 5636cf7d ("btrfs: remove the logged
extents infrastructure"), we no longer need the stuff to track for pending
ordered extents, which was not completely removed in the mentioned commit.
So remove the remaining of the pending ordered extents infrastructure.
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

85dd506c

Btrfs: remove no longer used logged range variables when logging extents · ce02f032

由 Filipe Manana 提交于 10月 26, 2018

The logged_start and logged_end variables, at btrfs_log_changed_extents,
were added in commit 8c6c5928 ("btrfs: log csums for all modified
extents"). However since the recent simplification for fsync, which makes
us wait for all ordered extents to complete before logging extents, we
no longer need those variables. Commit a2120a47 ("btrfs: clean up the
left over logged_list usage") forgot to remove them.
Reviewed-by: NNikolay Borisov <nborisov@suse.com>
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ce02f032

15 12月, 2018 2 次提交

userfaultfd: check VM_MAYWRITE was set after verifying the uffd is registered · 01e881f5

由 Andrea Arcangeli 提交于 12月 14, 2018

Calling UFFDIO_UNREGISTER on virtual ranges not yet registered in uffd
could trigger an harmless false positive WARN_ON.  Check the vma is
already registered before checking VM_MAYWRITE to shut off the false
positive warning.

Link: http://lkml.kernel.org/r/20181206212028.18726-2-aarcange@redhat.com
Cc: <stable@vger.kernel.org>
Fixes: 29ec9066 ("userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas")
Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot+06c7092e7d71218a2c16@syzkaller.appspotmail.com
Acked-by: NMike Rapoport <rppt@linux.ibm.com>
Acked-by: NHugh Dickins <hughd@google.com>
Acked-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

01e881f5

fs/iomap.c: get/put the page in iomap_page_create/release() · 61c6de66

由 Piotr Jaroszynski 提交于 12月 14, 2018

migrate_page_move_mapping() expects pages with private data set to have
a page_count elevated by 1.  This is what used to happen for xfs through
the buffer_heads code before the switch to iomap in commit 82cb1417
("xfs: add support for sub-pagesize writeback without buffer_heads").
Not having the count elevated causes move_pages() to fail on memory
mapped files coming from xfs.

Make iomap compatible with the migrate_page_move_mapping() assumption by
elevating the page count as part of iomap_page_create() and lowering it
in iomap_page_release().

It causes the move_pages() syscall to misbehave on memory mapped files
from xfs.  It does not not move any pages, which I suppose is "just" a
perf issue, but it also ends up returning a positive number which is out
of spec for the syscall.  Talking to Michal Hocko, it sounds like
returning positive numbers might be a necessary update to move_pages()
anyway though
(https://lkml.kernel.org/r/20181116114955.GJ14706@dhcp22.suse.cz).

I only hit this in tests that verify that move_pages() actually moved
the pages.  The test also got confused by the positive return from
move_pages() (it got treated as a success as positive numbers were not
expected and not handled) making it a bit harder to track down what's
going on.

Link: http://lkml.kernel.org/r/20181115184140.1388751-1-pjaroszynski@nvidia.com
Fixes: 82cb1417 ("xfs: add support for sub-pagesize writeback without buffer_heads")
Signed-off-by: NPiotr Jaroszynski <pjaroszynski@nvidia.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Brian Foster <bfoster@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

61c6de66

12 12月, 2018 3 次提交

fuse: continue to send FUSE_RELEASEDIR when FUSE_OPEN returns ENOSYS · 2e64ff15

由 Chad Austin 提交于 12月 10, 2018

When FUSE_OPEN returns ENOSYS, the no_open bit is set on the connection.

Because the FUSE_RELEASE and FUSE_RELEASEDIR paths share code, this
incorrectly caused the FUSE_RELEASEDIR request to be dropped and never sent
to userspace.

Pass an isdir bool to distinguish between FUSE_RELEASE and FUSE_RELEASEDIR
inside of fuse_file_put.

Fixes: 7678ac50 ("fuse: support clients that don't implement 'open'")
Cc: <stable@vger.kernel.org> # v3.14
Signed-off-by: NChad Austin <chadaustin@fb.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

2e64ff15

aio: fix spectre gadget in lookup_ioctx · a538e3ff

由 Jeff Moyer 提交于 12月 11, 2018

Matthew pointed out that the ioctx_table is susceptible to spectre v1,
because the index can be controlled by an attacker.  The below patch
should mitigate the attack for all of the aio system calls.

Cc: stable@vger.kernel.org
Reported-by: NMatthew Wilcox <willy@infradead.org>
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a538e3ff

ceph: make 'nocopyfrom' a default mount option · 6f9718fe

由 Luis Henriques 提交于 12月 10, 2018

Since we found a problem with the 'copy-from' operation after objects have
been truncated, offloading object copies to OSDs should be discouraged
until the issue is fixed.

Thus, this patch adds the 'nocopyfrom' mount option to the default mount
options which effectily means that remote copies won't be done in
copy_file_range unless they are explicitly enabled at mount time.

[ Adjust ceph_show_options() accordingly. ]

Link: https://tracker.ceph.com/issues/37378Signed-off-by: NLuis Henriques <lhenriques@suse.com>
Reviewed-by: NIlya Dryomov <idryomov@gmail.com>
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>

6f9718fe

10 12月, 2018 1 次提交

fuse: Fix memory leak in fuse_dev_free() · d72f70da

由 Takeshi Misawa 提交于 12月 09, 2018

When ntfs is unmounted, the following leak is
reported by kmemleak.

kmemleak report:

unreferenced object 0xffff880052bf4400 (size 4096):
  comm "mount.ntfs", pid 16530, jiffies 4294861127 (age 3215.836s)
  hex dump (first 32 bytes):
    00 44 bf 52 00 88 ff ff 00 44 bf 52 00 88 ff ff  .D.R.....D.R....
    10 44 bf 52 00 88 ff ff 10 44 bf 52 00 88 ff ff  .D.R.....D.R....
  backtrace:
    [<00000000bf4a2f8d>] fuse_fill_super+0xb22/0x1da0 [fuse]
    [<000000004dde0f0c>] mount_bdev+0x263/0x320
    [<0000000025aebc66>] mount_fs+0x82/0x2bf
    [<0000000042c5a6be>] vfs_kern_mount.part.33+0xbf/0x480
    [<00000000ed10cd5b>] do_mount+0x3de/0x2ad0
    [<00000000d59ff068>] ksys_mount+0xba/0xd0
    [<000000001bda1bcc>] __x64_sys_mount+0xba/0x150
    [<00000000ebe26304>] do_syscall_64+0x151/0x490
    [<00000000d25f2b42>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<000000002e0abd2c>] 0xffffffffffffffff

fuse_dev_alloc() allocate fud->pq.processing.
But this hash table is not freed.

Fix this by freeing fud->pq.processing.
Signed-off-by: NTakeshi Misawa <jeliantsurux@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: be2ff42c ("fuse: Use hash table to link processing request")

d72f70da

07 12月, 2018 1 次提交

CIFS: Avoid returning EBUSY to upper layer VFS · 6ac79291

由 Long Li 提交于 12月 06, 2018

EBUSY is not handled by VFS, and will be passed to user-mode. This is not
correct as we need to wait for more credits.

This patch also fixes a bug where rsize or wsize is used uninitialized when
the call to server->ops->wait_mtu_credits() fails.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NLong Li <longli@microsoft.com>
Signed-off-by: NSteve French <stfrench@microsoft.com>
Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>

6ac79291

06 12月, 2018 2 次提交

cifs: Fix separator when building path from dentry · c988de29

由 Paulo Alcantara 提交于 11月 15, 2018

Make sure to use the CIFS_DIR_SEP(cifs_sb) as path separator for
prefixpath too. Fixes a bug with smb1 UNIX extensions.

Fixes: a6b5058f ("fs/cifs: make share unaccessible at root level mountable")
Signed-off-by: NPaulo Alcantara <palcantara@suse.com>
Reviewed-by: NAurelien Aptel <aaptel@suse.com>
Signed-off-by: NSteve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>

c988de29

cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs) · 6e785302

由 Steve French 提交于 11月 03, 2018

Missing a dependency.  Shouldn't show cifs posix extensions
in Kconfig if CONFIG_CIFS_ALLOW_INSECURE_DIALECTS (ie SMB1
protocol) is disabled.
Signed-off-by: NSteve French <stfrench@microsoft.com>
Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>

6e785302

05 12月, 2018 6 次提交

dax: Fix unlock mismatch with updated API · 27359fd6

由 Matthew Wilcox 提交于 11月 30, 2018

Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to
store a replacement entry in the Xarray at the given xas-index with the
DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked
value of the entry relative to the current Xarray state to be specified.

In most contexts dax_unlock_entry() is operating in the same scope as
the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry()
case the implementation needs to recall the original entry. In the case
where the original entry is a 'pmd' entry it is possible that the pfn
performed to do the lookup is misaligned to the value retrieved in the
Xarray.

Change the api to return the unlock cookie from dax_lock_page() and pass
it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was
assuming that the page was PMD-aligned if the entry was a PMD entry with
signatures like:

 WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
 RIP: 0010:dax_insert_entry+0x2b2/0x2d0
 [..]
 Call Trace:
  dax_iomap_pte_fault.isra.41+0x791/0xde0
  ext4_dax_huge_fault+0x16f/0x1f0
  ? up_read+0x1c/0xa0
  __do_fault+0x1f/0x160
  __handle_mm_fault+0x1033/0x1490
  handle_mm_fault+0x18b/0x3d0

Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NMatthew Wilcox <willy@infradead.org>
Tested-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

27359fd6

iomap: partially revert (simulated directio short read on EFAULT) · 8f67b5ad

由 Darrick J. Wong 提交于 12月 02, 2018

In commit 4721a601, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.  This causes infinite
splice() loops and assertion failures on generic/095 on overlayfs
because xfs only permit total success or total failure of a directio
operation.  The underlying issue in the pipe splice code has now been
fixed by changing the pipe splice loop to avoid avoid reading more data
than there is space in the pipe.

Therefore, it's no longer necessary to simulate the short directio, so
remove the hack from iomap.

Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
Ranted-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

8f67b5ad

splice: don't read more than available pipe space · 17614445

由 Darrick J. Wong 提交于 11月 30, 2018

In commit 4721a601, we tried to fix a problem wherein directio reads
into a splice pipe will bounce EFAULT/EAGAIN all the way out to
userspace by simulating a zero-byte short read.  This happens because
some directio read implementations (xfs) will call
bio_iov_iter_get_pages to grab pipe buffer pages and issue asynchronous
reads, but as soon as we run out of pipe buffers that _get_pages call
returns EFAULT, which the splice code translates to EAGAIN and bounces
out to userspace.

In that commit, the iomap code catches the EFAULT and simulates a
zero-byte read, but that causes assertion errors on regular splice reads
because xfs doesn't allow short directio reads.

The brokenness is compounded by splice_direct_to_actor immediately
bailing on do_splice_to returning <= 0 without ever calling ->actor
(which empties out the pipe), so if userspace calls back we'll EFAULT
again on the full pipe, and nothing ever gets copied.

Therefore, teach splice_direct_to_actor to clamp its requests to the
amount of free space in the pipe and remove the simulated short read
from the iomap directio code.

Fixes: 4721a601 ("iomap: dio data corruption and spurious errors when pipes fill")
Reported-by: NMurphy Zhou <jencce.kernel@gmail.com>
Ranted-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

17614445

vfs: allow some remap flags to be passed to vfs_clone_file_range · 6744557b

由 Darrick J. Wong 提交于 11月 30, 2018

In overlayfs, ovl_remap_file_range calls vfs_clone_file_range on the
lower filesystem's inode, passing through whatever remap flags it got
from its caller.  Since vfs_copy_file_range first tries a filesystem's
remap function with REMAP_FILE_CAN_SHORTEN, this can get passed through
to the second vfs_copy_file_range call, and this isn't an issue.
Change the WARN_ON to look only for the DEDUP flag.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

6744557b

xfs: fix inverted return from xfs_btree_sblock_verify_crc · 7d048df4

由 Eric Sandeen 提交于 11月 30, 2018

xfs_btree_sblock_verify_crc is a bool so should not be returning
a failaddr_t; worse, if xfs_log_check_lsn fails it returns
__this_address which looks like a boolean true (i.e. success)
to the caller.

(interestingly xfs_btree_lblock_verify_crc doesn't have the issue)
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

7d048df4

xfs: fix PAGE_MASK usage in xfs_free_file_space · a579121f

由 Darrick J. Wong 提交于 11月 27, 2018

In commit e53c4b59, I *tried* to teach xfs to force writeback when we
fzero/fpunch right up to EOF so that if EOF is in the middle of a page,
the post-EOF part of the page gets zeroed before we return to userspace.
Unfortunately, I missed the part where PAGE_MASK is ~(PAGE_SIZE - 1),
which means that we totally fail to zero if we're fpunching and EOF is
within the first page. Worse yet, the same PAGE_MASK thinko plagues the
filemap_write_and_wait_range call, so we'd initiate writeback of the
entire file, which (mostly) masked the thinko.

Drop the tricky PAGE_MASK and replace it with correct usage of PAGE_SIZE
and the proper rounding macros.

Fixes: e53c4b59 ("xfs: ensure post-EOF zeroing happens after zeroing part of a file")
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

a579121f

04 12月, 2018 3 次提交

Revert "exec: make de_thread() freezable" · a72173ec

由 Rafael J. Wysocki 提交于 12月 03, 2018

Revert commit c2239788 "exec: make de_thread() freezable" as
requested by Ingo Molnar:

"So there's a new regression in v4.20-rc4, my desktop produces this
lockdep splat:

[ 1772.588771] WARNING: pkexec/4633 still has locks held!
[ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
[ 1772.588775] ------------------------------------
[ 1772.588776] 1 lock held by pkexec/4633:
[ 1772.588778]  #0: 00000000ed85fbf8 (&sig->cred_guard_mutex){+.+.}, at: prepare_bprm_creds+0x2a/0x70
[ 1772.588786] stack backtrace:
[ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 4.20.0-rc4-custom-00213-g93a49841322b #1
[ 1772.588792] Call Trace:
[ 1772.588800]  dump_stack+0x85/0xcb
[ 1772.588803]  flush_old_exec+0x116/0x890
[ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
[ 1772.588809]  load_elf_binary+0x291/0x1620
[ 1772.588815]  ? sched_clock+0x5/0x10
[ 1772.588817]  ? search_binary_handler+0x6d/0x240
[ 1772.588820]  search_binary_handler+0x80/0x240
[ 1772.588823]  load_script+0x201/0x220
[ 1772.588825]  search_binary_handler+0x80/0x240
[ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
[ 1772.588832]  ? strncpy_from_user+0x40/0x180
[ 1772.588835]  __x64_sys_execve+0x34/0x40
[ 1772.588838]  do_syscall_64+0x60/0x1c0

The warning gets triggered by an ancient lockdep check in the freezer:

(gdb) list *0xffffffff812ece06
0xffffffff812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
52	 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
53	 * If try_to_freeze causes a lockdep warning it means the caller may deadlock
54	 */
55	static inline bool try_to_freeze_unsafe(void)
56	{
57		might_sleep();
58		if (likely(!freezing(current)))
59			return false;
60		return __refrigerator(false);
61	}

I reviewed the ->cred_guard_mutex code, and the mutex is held across all
of exec() - and we always did this.

But there's this recent -rc4 commit:

> Chanho Min (1):
>       exec: make de_thread() freezable

  c2239788: exec: make de_thread() freezable

I believe this commit is bogus, you cannot call try_to_freeze() from
de_thread(), because it's holding the ->cred_guard_mutex."
Reported-by: NIngo Molnar <mingo@kernel.org>
Tested-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a72173ec

btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable · 10950929

由 Qu Wenruo 提交于 11月 23, 2018

[BUG]
A completely valid btrfs will refuse to mount, with error message like:
  BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
    bg_start=12018974720 bg_len=10888413184, invalid block group size, \
    have 10888413184 expect (0, 10737418240]

This has been reported several times as the 4.19 kernel is now being
used. The filesystem refuses to mount, but is otherwise ok and booting
4.18 is a workaround.

Btrfs check returns no error, and all kernels used on this fs is later
than 2011, which should all have the 10G size limit commit.

[CAUSE]
For a 12 devices btrfs, we could allocate a chunk larger than 10G due to
stripe stripe bump up.

__btrfs_alloc_chunk()
|- max_stripe_size = 1G
|- max_chunk_size = 10G
|- data_stripe = 11
|- if (1G * 11 > 10G) {
       stripe_size = 976128930;
       stripe_size = round_up(976128930, SZ_16M) = 989855744

However the final stripe_size (989855744) * 11 = 10888413184, which is
still larger than 10G.

[FIX]
For the comprehensive check, we need to do the full check at chunk read
time, and rely on bg <-> chunk mapping to do the check.

We could just skip the length check for now.

Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
Cc: stable@vger.kernel.org # v4.19+
Reported-by: NWang Yugui <wangyugui@e16-tech.com>
Signed-off-by: NQu Wenruo <wqu@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

10950929

Revert "ovl: relax permission checking on underlying layers" · ec7ba118

由 Miklos Szeredi 提交于 12月 04, 2018

This reverts commit 007ea448.

The commit broke some selinux-testsuite cases, and it looks like there's no
straightforward fix keeping the direction of this patch, so revert for now.

The original patch was trying to fix the consistency of permission checks, and
not an observed bug.  So reverting should be safe.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

ec7ba118

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功