Commits · 4dbd80fb9176f23c78cecd0a8285001cd2066425 · openeuler / Kernel

26 Apr, 2017 1 commit

btrfs: Fix metadata underflow caused by btrfs_reloc_clone_csum error · 4dbd80fb

Committed by Qu Wenruo on Mar 08, 2017

[BUG]
When btrfs_reloc_clone_csum() reports an error, it can underflow metadata
and lead to a kernel assertion on outstanding extents in
run_delalloc_nocow() and cow_file_range().

 BTRFS info (device vdb5): relocating block group 12582912 flags data
 BTRFS info (device vdb5): found 1 extents
 assertion failed: inode->outstanding_extents >= num_extents, file: fs/btrfs//extent-tree.c, line: 5858

Currently, due to another bug blocking ordered extents, the bug is only
reproducible under a certain block group layout and with error injection.

a) Create one data block group with one 4K extent in it.
   To avoid the bug that hangs btrfs due to ordered extent which never
   finishes
b) Make btrfs_reloc_clone_csum() always fail
c) Relocate that block group

[CAUSE]
run_delalloc_nocow() and cow_file_range() handle errors from
btrfs_reloc_clone_csum() incorrectly:

(The ascii chart shows a more generic case of this bug other than the
bug mentioned above)

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
                    |<----------- cleanup range --------------->|
|<-----------  ----------->|
             \/
 btrfs_finish_ordered_io() range

So the error handler, which calls extent_clear_unlock_delalloc() with the
EXTENT_DELALLOC and EXTENT_DO_ACCOUNT bits, and btrfs_finish_ordered_io()
will both cover OE n and free its metadata, causing metadata underflow.

[FIX]
The fix is to ensure that, after calling btrfs_add_ordered_extent(), we only
call the error handler after increasing the iteration offset, so that the
cleanup range won't cover any created ordered extent.

|<------------------ delalloc range --------------------------->|
| OE 1 | OE 2 | ... | OE n |
|<-----------  ----------->|<---------- cleanup range --------->|
             \/
 btrfs_finish_ordered_io() range
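
The double release can be illustrated with a toy model. The sketch below is not kernel code: the `Inode` class, `simulate()`, and all of their parameters are hypothetical, and the model simply assumes that every extent-sized slot holds one outstanding-extent reservation, released either by the error-path cleanup or by ordered extent completion.

```python
class Inode:
    """Toy stand-in for per-inode accounting; not the kernel structure."""

    def __init__(self, outstanding_extents):
        self.outstanding_extents = outstanding_extents

    def release(self, num_extents):
        # Same condition as the kernel assertion quoted in the log above.
        if self.outstanding_extents < num_extents:
            raise ValueError("metadata underflow")
        self.outstanding_extents -= num_extents


def simulate(extent_size, range_end, fail_at, advance_before_cleanup):
    """Walk the delalloc range [0, range_end), one ordered extent per slot.

    fail_at is the index of the ordered extent whose csum cloning fails.
    Returns the final outstanding_extents value.
    """
    num_slots = range_end // extent_size
    inode = Inode(outstanding_extents=num_slots)
    created = []  # start offsets of the ordered extents created so far
    cur = 0
    for i in range(num_slots):
        created.append(cur)  # the ordered extent now exists
        if i == fail_at:
            if advance_before_cleanup:
                cur += extent_size  # the fix: step past OE n first
            break
        cur += extent_size
    # Error handler: releases every slot in the cleanup range [cur, range_end).
    for _ in range(cur, range_end, extent_size):
        inode.release(1)
    # Ordered extent completion: releases every created ordered extent.
    for _ in created:
        inode.release(1)
    return inode.outstanding_extents
```

With `advance_before_cleanup=False` the slot of the failing ordered extent is released twice and the model raises `ValueError`; with `True` each slot is released exactly once and the counter ends at zero.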

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

12 Apr, 2017 2 commits

Btrfs: fix segmentation fault when doing dio read · 97bf5a55

Committed by Liu Bo on Apr 07, 2017

Commit 2dabb324 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this bug while iterating over bio pages in dio read's endio hook,
and it could end up with a segmentation fault of the dio reading task.

The cause is 'if (nr_sectors--)', which makes the code assume that
there is one more block in the same page, so the page offset is increased and
the bio which is created to repair the bad block then has an incorrect
bvec.bv_offset, and a later access of the page content would throw a
segmentation fault.

This also adds ASSERT to check page offset against page size.
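
A rough illustration of why the post-decrement test walks one block too far. This is a Python model of the C loop condition, not the btrfs code: `walk_blocks()` and its parameters are hypothetical, and the non-buggy branch simply decrements before testing, which is one way to obtain the intended behaviour.

```python
def walk_blocks(nr_sectors, sectorsize, buggy):
    """Return the page offsets visited for one bio vector of nr_sectors blocks."""
    pgoff = 0
    visited = []
    while True:
        visited.append(pgoff)  # process the block at pgoff
        if buggy:
            # C: if (nr_sectors--) tests the old value, then decrements.
            more = nr_sectors != 0
            nr_sectors -= 1
        else:
            # Decrement first, then test the new value.
            nr_sectors -= 1
            more = nr_sectors != 0
        if not more:
            break
        pgoff += sectorsize  # move on to the next block in the same page
    return visited
```

For a single 4096-byte block, the buggy variant also visits offset 4096, which on a 4096-byte page lies past the end of the page; that is the kind of offset the added ASSERT is meant to catch.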

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Btrfs: fix invalid dereference in btrfs_retry_endio · 2e949b0a

Committed by Liu Bo on Apr 05, 2017

When doing directIO repair, we have this oops:

[ 1458.532816] general protection fault: 0000 [#1] SMP
...
[ 1458.536291] Workqueue: btrfs-endio-repair btrfs_endio_repair_helper [btrfs]
[ 1458.536893] task: ffff88082a42d100 task.stack: ffffc90002b3c000
[ 1458.537499] RIP: 0010:btrfs_retry_endio+0x7e/0x1a0 [btrfs]
...
[ 1458.543261] Call Trace:
[ 1458.543958]  ? rcu_read_lock_sched_held+0xc4/0xd0
[ 1458.544374]  bio_endio+0xed/0x100
[ 1458.544750]  end_workqueue_fn+0x3c/0x40 [btrfs]
[ 1458.545257]  normal_work_helper+0x9f/0x900 [btrfs]
[ 1458.545762]  btrfs_endio_repair_helper+0x12/0x20 [btrfs]
[ 1458.546224]  process_one_work+0x34d/0xb70
[ 1458.546570]  ? process_one_work+0x29e/0xb70
[ 1458.546938]  worker_thread+0x1cf/0x960
[ 1458.547263]  ? process_one_work+0xb70/0xb70
[ 1458.547624]  kthread+0x17d/0x180
[ 1458.547909]  ? kthread_create_on_node+0x70/0x70
[ 1458.548300]  ret_from_fork+0x31/0x40

It turns out that btrfs_retry_endio is trying to get inode from a directIO
page.

This fixes the problem by using the saved inode pointer, done->inode.
btrfs_retry_endio_nocsum has the same problem, and it's fixed as well.

Also clean up the unused @start (which is too trivial for a separate patch).

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

29 Mar, 2017 1 commit

Btrfs: bring back repair during read · 9d0d1c8b

Committed by Liu Bo on Mar 24, 2017

Commit 20a7db8a ("btrfs: add dummy callback for readpage_io_failed
and drop checks") made a cleanup around readpage_io_failed_hook, and
it was supposed to keep the original semantics, but it also
unexpectedly disabled repair during read for dup, raid1 and raid10.

This fixes the problem by letting data's inode call the generic
readpage_io_failed callback by returning -EAGAIN from its
readpage_io_failed_hook in order to notify end_bio_extent_readpage to
do the rest.  We don't call it directly because the generic one takes
an offset from end_bio_extent_readpage() to calculate the index in the
checksum array and inode's readpage_io_failed_hook doesn't offer that
offset.
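
The dispatch pattern described here could look roughly like the following. It is a Python sketch of the control flow only: `data_inode_hook`, `other_hook`, `generic_io_failed`, and `end_read` are hypothetical stand-ins for the inode's readpage_io_failed_hook, some other hook, the generic callback, and end_bio_extent_readpage(); the 4 KiB sector size and every return value other than -EAGAIN are assumptions.

```python
import errno


def data_inode_hook(page, failed_mirror):
    """Data inode hook: it has no offset, so it defers to the caller."""
    return -errno.EAGAIN


def other_hook(page, failed_mirror):
    """Hypothetical hook that handles the failure itself and reports an error."""
    return -errno.EIO


def generic_io_failed(page, offset, failed_mirror):
    """Stand-in for the generic repair path; it needs the offset to index csums."""
    csum_index = offset // 4096  # assumes 4 KiB sectors
    return ("repair", csum_index)


def end_read(hook, page, offset, failed_mirror):
    """Stand-in for the read completion path that owns the offset."""
    ret = hook(page, failed_mirror)
    if ret == -errno.EAGAIN:
        # Only the caller knows the offset, so it invokes the generic path.
        return generic_io_failed(page, offset, failed_mirror)
    return ("failed", ret)
```

In this model, `end_read(data_inode_hook, None, 8192, 1)` falls through to the generic path with checksum index 2, while a hook that returns any other value keeps the failure to itself.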

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ keep the const function attribute ]
Signed-off-by: David Sterba <dsterba@suse.com>

18 Mar, 2017 1 commit

btrfs: add missing memset while reading compressed inline extents · e1699d2d

Committed by Zygo Blaxell on Mar 10, 2017

This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b97818 ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d57 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a4 ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption found in
the wild:

        item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                inode generation 50 transid 50 size 47424 nbytes 49141
                block group 0 mode 100644 links 1 uid 0 gid 0
                rdev 0 flags 0x0(none)
        item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                inode ref index 3 namelen 10 name: DB_File.so
        item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                inline extent data size 1341 ram 4085 compress(zlib)
        item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                extent data disk byte 5367308288 nr 20480
                extent data offset 0 nr 45056 ram 45056
                extent compression(zlib)

Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096.  The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.
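
To make the arithmetic concrete, here is a small Python model of reading that inline extent into a page. It is an illustration rather than the kernel code: `read_inline_page()`, the 4096-byte page size, and the 0xAA stale-page pattern are assumptions chosen to match the example above.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, matching the 4096-byte offset of item 64


def read_inline_page(compressed, zero_tail):
    """Decompress an inline extent into a page that starts with stale content."""
    page = bytearray(b"\xaa" * PAGE_SIZE)  # stale data left in the page
    data = zlib.decompress(compressed)[:PAGE_SIZE]
    page[:len(data)] = data
    if zero_tail and len(data) < PAGE_SIZE:
        # Equivalent of the missing memset: clear from end of data to end of page.
        page[len(data):] = bytes(PAGE_SIZE - len(data))
    return page


inline = zlib.compress(b"\xcd" * 4085)  # 4085 bytes of file data, as in item 63
fixed = read_inline_page(inline, zero_tail=True)
stale = read_inline_page(inline, zero_tail=False)
```

Here `fixed[4085:]` is 11 zero bytes, while `stale[4085:]` still holds whatever was in the page before the read, which is what userspace sees without the memset.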

Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

	# touch foo
	# chattr +c foo
	# xfs_io -f -c "pwrite -W 0 1000" foo
	# xfs_io -f -c "falloc 4 8188" foo
	# od -x foo
	# echo 3 >/proc/sys/vm/drop_caches
	# od -x foo

This produces the following on my box:

Correct output:  file contains 1000 data bytes followed
by zeros:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
	0001760 0000 0000 0000 0000 0000 0000 0000 0000
	*
	0020000

Actual output:  the data after the first 1000 bytes
will be different each run:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Chris Mason <clm@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>

28 Feb, 2017 34 commits

btrfs: add dummy callback for readpage_io_failed and drop checks · 20a7db8a

Committed by David Sterba on Feb 17, 2017

Make extent_io_ops::readpage_io_failed_hook callback mandatory and
define a dummy function for btrfs_extent_io_ops. As the failed IO
callback is not performance critical, the branch vs extra trade off does
not hurt.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: document existence of extent_io ops callbacks · 4d53dddb

Committed by David Sterba on Feb 17, 2017

Some of the callbacks defined in btree_extent_io_ops and
btrfs_extent_io_ops do always exist so we don't need to check the
existence before each call. This patch just reorders the definition and
documents which are mandatory/optional.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: let writepage_end_io_hook return void · c3988d63

Committed by David Sterba on Feb 17, 2017

There's no error path in any of the instances, always return 0.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: derive maximum output size in the compression implementation · e5d74902

Committed by David Sterba on Feb 14, 2017

The value of max_out can be calculated from the parameters passed to the
compressors, which is number of pages and the page size, and we don't
have to needlessly pass it around.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use predefined limits for calculating maximum number of pages for compression · 069eac78

Committed by David Sterba on Feb 14, 2017

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: export compression buffer limits in a header · ff763866

Committed by David Sterba on Feb 14, 2017

Move the buffer limit definitions out of compress_file_range.

Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge nr_pages input and output parameter in compress_pages · 4d3a800e

Committed by David Sterba on Feb 14, 2017

The parameter saying how many pages can be allocated at maximum can be
merged with the output page counter, to save some stack space.  The
compression implementation will sink the parameter to a local variable
so everything works as before.

The nr_pages variables can also be simply merged in compress_file_range
into one.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: merge length input and output parameter in compress_pages · 38c31464

Committed by David Sterba on Feb 14, 2017

The length parameter is basically duplicated for input and output in the
top level caller of the compress_pages chain. We can simply use one
variable for that and reduce stack consumption. The compression
implementation will sink the parameter to a local variable so everything
works as before.

Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode · 0b581701

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode · abcefb1e

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_add_nondir take btrfs_inode · cef415af

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_add_link take btrfs_inode · db0a669f

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_del_delalloc_inode take btrfs_inode · 9e3e97f4

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make get_extent_t take btrfs_inode · fc4f21b1

Committed by Nikolay Borisov on Feb 20, 2017

In addition to changing the signature, this patch also switches
all the functions which are used as an argument to also take btrfs_inode.
Namely those are: btrfs_get_extent and btrfs_get_extent_filemap.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_clear_bit_hook take btrfs_inode · 6fc0ef68

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_extent_item_to_extent_map take btrfs_inode · 9cdc5124

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_orphan_add take btrfs_inode · 73f2e545

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_orphan_del take btrfs_inode · 3d6ae7bb

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_free_io_failure_record take btrfs_inode · 7ab7956e

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make clean_io_failure take btrfs_inode · b30cb441

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_print_data_csum_error take btrfs_inode · 0970a22e

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make free_io_failure take btrfs_inode · 4ac1f4ac

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_lookup_ordered_range take btrfs_inode · a776c6fa

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_mark_extent_written take btrfs_inode · 7a6d7067

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_drop_extent_cache take btrfs_inode · dcdbc059

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make (__)btrfs_add_inode_defrag take btrfs_inode · 6158e1ce

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: all btrfs_delalloc_release_metadata take btrfs_inode · 691fa059

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_orphan_release_metadata take btrfs_inode · 703b391a

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode · 8ed7a2a0

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_is_free_space_inode take btrfs_inode · 70ddc553

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_i_size_write take btrfs_inode · 6ef06d27

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_set_inode_index take btrfs_inode · 877574e2

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make btrfs_set_inode_index_count take btrfs_inode · 4c570655

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: Make btrfs_insert_dir_item take btrfs_inode · 8e7611cf

Committed by Nikolay Borisov on Feb 20, 2017

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

24 Feb, 2017 1 commit

Btrfs: fix data loss after truncate when using the no-holes feature · 76b42abb

Committed by Filipe Manana on Feb 14, 2017

If we have a file with an implicit hole (NO_HOLES feature enabled) that
has an extent following the hole, delayed writes against regions of the
file behind the hole happened before but were not yet flushed and then
we truncate the file to a smaller size that lies inside the hole, we
end up persisting a wrong disk_i_size value for our inode that leads to
data loss after umounting and mounting again the filesystem or after
the inode is evicted and loaded again.

This happens because at inode.c:btrfs_truncate_inode_items() we end up
setting last_size to the offset of the extent that we deleted and that
followed the hole. We then pass that value to btrfs_ordered_update_i_size()
which updates the inode's disk_i_size to a value smaller than the offset
of the buffered (delayed) writes.

Example reproducer:

 $ mkfs.btrfs -f /dev/sdb
 $ mount /dev/sdb /mnt

 $ xfs_io -f -c "pwrite -S 0x01 0K 32K" /mnt/foo
 $ xfs_io -d -c "pwrite -S 0x02 -b 32K 64K 32K" /mnt/foo
 $ xfs_io -c "truncate 60K" /mnt/foo
   --> inode's disk_i_size updated to 0

 $ md5sum /mnt/foo
 3c5ca3c3ab42f4b04d7e7eb0b0d4d806  /mnt/foo

 $ umount /dev/sdb
 $ mount /dev/sdb /mnt

 $ md5sum /mnt/foo
 d41d8cd98f00b204e9800998ecf8427e  /mnt/foo
   --> Empty file, all data lost!

Cc: <stable@vger.kernel.org>  # 3.14+
Fixes: 16e7549f ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
