1. 02 Nov 2017 (2 commits)
    • btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents · c995ab3c
      Authored by Zygo Blaxell
      The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
      offset (encoded as a single logical address) to a list of extent refs.
      LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
      (extent ref -> extent bytenr and offset, or logical address).  These are
      useful capabilities for programs that manipulate extents and extent
      references from userspace (e.g. dedup and defrag utilities).
      
      When the extents are uncompressed (and not encrypted or otherwise
      encoded), check_extent_in_eb filters the extent refs to remove any
      that do not contain the same extent offset as the 'logical'
      parameter's extent offset.  This prevents LOGICAL_INO from returning
      references to more than a single block.
      
      To find the set of extent references to an uncompressed extent from [a, b),
      userspace has to run a loop like this pseudocode:
      
      	for (i = a; i < b; i += block_size)
      		extent_ref_set += LOGICAL_INO(i);
      
      At each iteration of the loop (up to 32768 iterations for a 128M extent),
      data we are interested in is collected in the kernel, then deleted by
      the filter in check_extent_in_eb.
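
      As an illustration, a minimal userspace sketch of that loop (assuming
      the v1 BTRFS_IOC_LOGICAL_INO ioctl from <linux/btrfs.h>, a 4K block
      size, and leaving out error handling and merging of results):

      	#include <string.h>
      	#include <stdint.h>
      	#include <sys/ioctl.h>
      	#include <linux/btrfs.h>

      	#define CONTAINER_BYTES 65536	/* output buffer size, arbitrary */

      	static int logical_ino(int fd, __u64 logical,
      			       struct btrfs_data_container *out)
      	{
      		struct btrfs_ioctl_logical_ino_args args;

      		memset(&args, 0, sizeof(args));
      		args.logical = logical;		/* the 'i' of the loop above */
      		args.size = CONTAINER_BYTES;	/* bytes behind args.inodes */
      		args.inodes = (uintptr_t)out;	/* root/inode/offset results */
      		return ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args);
      	}

      	/* for (i = a; i < b; i += 4096) if (!logical_ino(fd, i, out)) ... */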
      
      When the extents are compressed (or encrypted or otherwise encoded), the
      'logical' parameter must be an extent bytenr (the 'a' parameter in the loop).
      No filtering by extent offset is done (or possible?) so the result is
      the complete set of extent refs for the entire extent.  This removes
      the need for the loop, since we get all the extent refs in one call.
      
      Add an 'ignore_offset' argument to iterate_inodes_from_logical,
      [...several levels of function call graph...], and check_extent_in_eb, so
      that we can disable the extent offset filtering for uncompressed extents.
      This flag can be set by an improved version of the LOGICAL_INO ioctl to
      get either behavior as desired.
      
      There is no functional change in this patch.  The new flag is always
      false.
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor coding style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow to set compression level for zlib · f51d2b59
      Authored by David Sterba
      Preliminary support for setting the compression level for zlib; the
      following works:
      
      	$ mount -o compress=zlib                # default
      	$ mount -o compress=zlib0               # same
      	$ mount -o compress=zlib9               # level 9, slower sync, less data
      	$ mount -o compress=zlib1               # level 1, faster sync, more data
      	$ mount -o remount,compress=zlib3       # level set by remount

      The compress-force option works the same way as compress.  The level is
      visible in the same format in /proc/mounts.  Setting the level via the
      file property does not work yet.
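
      For illustration only, parsing the level suffix could look roughly like
      this sketch (the function name and exact behaviour are assumptions, not
      the kernel's actual option parser):

      	#include <string.h>

      	/* map "zlib", "zlib0" ... "zlib9" to a level, 0 meaning default */
      	static unsigned int parse_zlib_level(const char *opt)
      	{
      		unsigned int level = 0;

      		if (strncmp(opt, "zlib", 4) == 0 &&
      		    opt[4] >= '1' && opt[4] <= '9')
      			level = opt[4] - '0';
      		return level;
      	}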
      
      Required patch: "btrfs: prepare for extensions in compression options"
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 30 Oct 2017 (8 commits)
  3. 26 Sep 2017 (3 commits)
    • Btrfs: fix unexpected result when dio reading corrupted blocks · 99c4e3b9
      Authored by Liu Bo
      commit 4246a0b6 ("block: add a bi_error field to struct bio")
      changed the logic of how dio read endio reports errors.
      
      For a single-stripe dio read, %bio->bi_status reflects the error before
      the checksum is verified; we currently update it when the data block
      matches its checksum, but in the mismatch case %bio->bi_status is not
      updated to reflect the failure.
      
      When some blocks in a file have been corrupted on disk, reading such a
      file ends up with
      
      1) checksum errors are reported in kernel log
      2) read(2) returns successfully with some content being 0x01.
      
      In order to fix it, we need to report its checksum mismatch error to
      the upper layer (dio layer in this case) as well.
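
      Conceptually the fix amounts to something like the following sketch
      (illustrative only; the real change lives in the btrfs dio read
      endio/checksum code and the helper name here is made up):

      	#include <linux/bio.h>

      	/* if the checksum did not match, make the bio carry the error */
      	static void report_csum_mismatch(struct bio *bio, bool csum_ok)
      	{
      		if (!csum_ok && !bio->bi_status)
      			bio->bi_status = BLK_STS_IOERR;	/* seen by dio completion */
      	}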
      
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
      Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: finish ordered extent cleaning if no progress is found · 67c003f9
      Authored by Naohiro Aota
      __endio_write_update_ordered() repeats the search until it reaches the end
      of the specified range. This works well for the direct IO path, because
      before the function is called it is ensured that ordered extents fill the
      whole range. That is not the case, however, when it is called from
      run_delalloc_range(): an error can occur in the middle of the loop in
      e.g. run_delalloc_nocow(), leaving a range that is not covered by any
      ordered extent. When cleaning up such an incomplete range,
      __endio_write_update_ordered() gets stuck at an offset where there are no
      ordered extents.
      
      Since the ordered extents are created from head to tail, we can stop the
      search once the offset makes no progress.
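
      A simplified sketch of that idea (the helper name is made up; the real
      loop is in __endio_write_update_ordered()):

      	static void cleanup_ordered_range(struct inode *inode, u64 start, u64 end)
      	{
      		u64 prev = start;

      		while (start < end) {
      			/* advances @start past the next ordered extent, if any */
      			if (!advance_past_next_ordered_extent(inode, &start, end))
      				break;	/* nothing found in the range */
      			if (start <= prev)
      				break;	/* no forward progress, stop the search */
      			prev = start;
      		}
      	}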
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      Cc: <stable@vger.kernel.org> # 4.12
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: clear ordered flag on cleaning up ordered extents · 63d71450
      Authored by Naohiro Aota
      Commit 52427260 ("btrfs: Handle delalloc error correctly to avoid
      ordered extent hang") introduced btrfs_cleanup_ordered_extents() to cleanup
      submitted ordered extents. However, it does not clear the ordered bit
      (Private2) of the corresponding pages. Thus, the following BUG occurs in
      free_pages_check_bad() (on btrfs/125 with nospace_cache):
      
      BUG: Bad page state in process btrfs  pfn:3fa787
      page:ffffdf2acfe9e1c0 count:0 mapcount:0 mapping:          (null) index:0xd
      flags: 0x8000000000002008(uptodate|private_2)
      raw: 8000000000002008 0000000000000000 000000000000000d 00000000ffffffff
      raw: ffffdf2acf5c1b20 ffffb443802238b0 0000000000000000 0000000000000000
      page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
      bad because of flags: 0x2000(private_2)
      
      This patch clears the flag, in the same way as the other places that call
      btrfs_dec_test_ordered_pending(), for every page in the specified range.
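
      In sketch form (page-cache details simplified, loop bounds illustrative):

      	#include <linux/pagemap.h>
      	#include <linux/page-flags.h>

      	/* clear the ordered (Private2) bit on every page of [start, end] */
      	static void clear_ordered_bit_range(struct address_space *mapping,
      					    u64 start, u64 end)
      	{
      		pgoff_t index = start >> PAGE_SHIFT;
      		pgoff_t last = end >> PAGE_SHIFT;

      		for (; index <= last; index++) {
      			struct page *page = find_get_page(mapping, index);

      			if (!page)
      				continue;
      			ClearPagePrivate2(page);	/* what the cleanup path was missing */
      			put_page(page);
      		}
      	}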
      
      Fixes: 52427260 ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
      Cc: <stable@vger.kernel.org> # 4.12
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 24 Aug 2017 (1 commit)
  5. 22 Aug 2017 (1 commit)
    • btrfs: remove unnecessary memory barrier in btrfs_direct_IO · dc59215d
      Authored by Nikolay Borisov
      Commit 38851cc1 ("Btrfs: implement unlocked dio write") implemented
      unlocked dio write, allowing multiple dio writers to write to
      non-overlapping and non-eof-extending regions. In doing so it also
      introduced a broken memory barrier. It is broken for two reasons:

      1. Memory barriers _MUST_ always be paired; this is clearly not the case
         here.

      2. Checkpatch actually produces a warning if a memory barrier is
         introduced that doesn't have a comment explaining how it is being
         paired.
      
      Specifically for inode::i_dio_count, which is wrapped inside
      inode_dio_begin, there are no explicit barrier semantics attached, so
      removing the barrier is fine, as the atomic is used in the common
      waiter/wakeup pattern.
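
      For reference, the i_dio_count usage follows the plain begin/end plus
      wait pattern, roughly (illustrative, not the btrfs_direct_IO() code):

      	inode_dio_begin(inode);	/* atomic_inc(&inode->i_dio_count) */
      	/* ... submit and complete the direct IO ... */
      	inode_dio_end(inode);	/* atomic_dec(); wakes inode_dio_wait() callers */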
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ enhance changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 18 Aug 2017 (2 commits)
  7. 16 Aug 2017 (11 commits)
  8. 17 Jul 2017 (1 commit)
    • VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c
      Authored by David Howells
      Firstly by applying the following with coccinelle's spatch:
      
      	@@ expression SB; @@
      	-SB->s_flags & MS_RDONLY
      	+sb_rdonly(SB)
      
      to effect the conversion to sb_rdonly(sb), then by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(!sb_rdonly(SB)) && A
      	+!sb_rdonly(SB) && A
      	|
      	-A != (sb_rdonly(SB))
      	+A != sb_rdonly(SB)
      	|
      	-A == (sb_rdonly(SB))
      	+A == sb_rdonly(SB)
      	|
      	-!(sb_rdonly(SB))
      	+!sb_rdonly(SB)
      	|
      	-A && (sb_rdonly(SB))
      	+A && sb_rdonly(SB)
      	|
      	-A || (sb_rdonly(SB))
      	+A || sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) != A
      	+sb_rdonly(SB) != A
      	|
      	-(sb_rdonly(SB)) == A
      	+sb_rdonly(SB) == A
      	|
      	-(sb_rdonly(SB)) && A
      	+sb_rdonly(SB) && A
      	|
      	-(sb_rdonly(SB)) || A
      	+sb_rdonly(SB) || A
      	)
      
      	@@ expression A, B, SB; @@
      	(
      	-(sb_rdonly(SB)) ? 1 : 0
      	+sb_rdonly(SB)
      	|
      	-(sb_rdonly(SB)) ? A : B
      	+sb_rdonly(SB) ? A : B
      	)
      
      to remove left-over excess brackets, and finally by applying:
      
      	@@ expression A, SB; @@
      	(
      	-(A & MS_RDONLY) != sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
      	|
      	-(A & MS_RDONLY) == sb_rdonly(SB)
      	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
      	)
      
      to make comparisons against the result of sb_rdonly() (which is a bool)
      work correctly.
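
      For reference, sb_rdonly() is just a bool-returning accessor, roughly of
      this shape (the real definition lives in include/linux/fs.h):

      	static inline bool sb_rdonly(const struct super_block *sb)
      	{
      		return sb->s_flags & MS_RDONLY;
      	}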
      Signed-off-by: David Howells <dhowells@redhat.com>
  9. 15 Jul 2017 (2 commits)
  10. 30 Jun 2017 (3 commits)
    • btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Authored by Qu Wenruo
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserve_data()
                                       | |     Range aligned to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analysis)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add a @reserved parameter so that it only
      frees data ranges reserved by a previous btrfs_qgroup_reserve_data().
      In the above case, Task B will then free 0 bytes, so there is no underflow.
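
      In sketch form, the error path then looks roughly like this (call shapes
      illustrative, error handling trimmed):

      	struct extent_changeset *reserved = NULL;

      	ret = btrfs_qgroup_reserve_data(inode, &reserved, start, len);
      	if (ret < 0)
      		goto out;
      	ret = btrfs_delalloc_reserve_metadata(BTRFS_I(inode), len);
      	if (ret < 0)
      		/* free only the ranges recorded in @reserved, not all of [start, len) */
      		btrfs_qgroup_free_data(inode, reserved, start, len);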
      Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Authored by Qu Wenruo
      Introduce a new parameter, struct extent_changeset, for
      btrfs_qgroup_reserve_data() and its callers.

      The extent_changeset is used in btrfs_qgroup_reserve_data() to record
      which ranges it reserved in the current reserve call, so that they can
      be freed in error paths.

      The reason we need to expose it to callers is that, on the buffered write
      error path, without knowing exactly which ranges were reserved by the
      current allocation, we could free space that was not reserved by us.

      That would lead to a qgroup reserved space underflow.
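
      The changeset records roughly the following (a shape sketch only; see the
      btrfs headers for the real definition):

      	struct extent_changeset {
      		unsigned int bytes_changed;	/* how many bytes this reserve added */
      		struct ulist range_changed;	/* the ranges it actually touched */
      	};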
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled · a12b877b
      Authored by Qu Wenruo
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release @ram_bytes
      worth of qgroup reserved space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), tell qgroup to release the actual
      number of bytes it released.
      In the above case, since we don't have the reserved space, we tell
      qgroup to release 0 bytes, so the problem is fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework
      and the previous patch returning the released bytes, the fix can be as
      small as 10 lines.
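
      Roughly (illustrative call shape, not the exact
      insert_reserved_file_extent() code):

      	ret = btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
      	if (ret < 0)
      		goto out;
      	qgroup_released = ret;	/* may be 0 if nothing was reserved for this range */
      	/* account qgroup_released instead of assuming the full ram_bytes */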
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 22 Jun 2017 (1 commit)
    • btrfs: Check name_len with boundary in verify dir_item · e79a3327
      Authored by Su Yue
      Originally, verify_dir_item verifies the name_len of a dir_item against
      fixed values but not against the item boundary.
      If a corrupted name_len is not bigger than the fixed value, for example
      255, the function considers the dir_item fine, and reading beyond the
      item boundary then causes a crash.
      
      Example:
      	1. Corrupt one dir_item name_len to be 255.
      	2. Run 'ls -lar /mnt/test/ > /dev/null'
      dmesg:
      [   48.451449] BTRFS info (device vdb1): disk space caching is enabled
      [   48.451453] BTRFS info (device vdb1): has skinny extents
      [   48.489420] general protection fault: 0000 [#1] SMP
      [   48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq
      [   48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5
      [   48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
      [   48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000
      [   48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs]
      [   48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202
      [   48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000
      [   48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368
      [   48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000
      [   48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288
      [   48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20
      [   48.490008] FS:  00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000
      [   48.490008] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0
      [   48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   48.490008] Call Trace:
      [   48.490008]  btrfs_real_readdir+0x3b7/0x4a0 [btrfs]
      [   48.490008]  iterate_dir+0x181/0x1b0
      [   48.490008]  SyS_getdents+0xa7/0x150
      [   48.490008]  ? fillonedir+0x150/0x150
      [   48.490008]  entry_SYSCALL_64_fastpath+0x18/0xad
      [   48.490008] RIP: 0033:0x7fef5032546b
      [   48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e
      [   48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b
      [   48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003
      [   48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000
      [   48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38
      [   48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e
      [   48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98
      [   48.499455] ---[ end trace 321920d8e8339505 ]---
      
      Fix it by adding a @slot parameter and checking name_len against the
      item boundary by calling btrfs_is_name_len_valid().
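
      The check is essentially "does the name fit inside its item?", roughly
      (the body below is illustrative, not the final btrfs_is_name_len_valid()):

      	static bool name_fits_in_item(struct extent_buffer *leaf, int slot,
      				      unsigned long name_start, u16 name_len)
      	{
      		unsigned long item_end = btrfs_item_ptr_offset(leaf, slot) +
      					 btrfs_item_size_nr(leaf, slot);

      		return name_start + name_len <= item_end;
      	}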
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 21 Jun 2017 (1 commit)
    • percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch · 104b4e51
      Authored by Nikolay Borisov
      Currently, percpu_counter_add is a wrapper around __percpu_counter_add
      which is preempt safe due to explicit calls to preempt_disable.  Given
      how __ prefix is used in percpu related interfaces, the naming
      unfortunately creates the false sense that __percpu_counter_add is
      less safe than percpu_counter_add.  In terms of context-safety,
      they're equivalent.  The only difference is that the __ version takes
      a batch parameter.
      
      Make this a bit more explicit by just renaming __percpu_counter_add to
      percpu_counter_add_batch.
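
      For reference, the wrapper relationship is roughly (a sketch; see
      include/linux/percpu_counter.h for the real definitions):

      	static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
      	{
      		percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
      	}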
      
      This patch doesn't cause any functional changes.
      
      tj: Minor updates to patch description for clarity.  Cosmetic
          indentation updates.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: linux-mm@kvack.org
      Cc: "David S. Miller" <davem@davemloft.net>
  13. 20 Jun 2017 (4 commits)