提交 · fc4adbff823f76577ece26dcb88bf6f8392dbd43 · openeuler / Kernel

10 6月, 2014 5 次提交

btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking · fc4adbff

由 Alex Gartrell 提交于 5月 20, 2014

In these instances, we are trying to determine if a page has been accessed
since we began the operation for the sake of retry.  This is easily
accomplished by doing a gang lookup in the page mapping radix tree, and it
saves us the dependency on the flag (so that we might eventually delete
it).

btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
the radix tree look up with a gang lookup of 1, so that we can find the
next highest page >= index and see if it falls into our lock range.
Signed-off-by: NChris Mason <clm@fb.com>
Signed-off-by: NAlex Gartrell <agartrell@fb.com>

fc4adbff

Btrfs: rework qgroup accounting · fcebe456

由 Josef Bacik 提交于 5月 13, 2014

Currently qgroups account for space by intercepting delayed ref updates to fs
trees. It does this by adding sequence numbers to delayed ref updates so that
it can figure out how the tree looked before the update so we can adjust the
counters properly. The problem with this is that it does not allow delayed refs
to be merged, so if you say are defragging an extent with 5k snapshots pointing
to it we will thrash the delayed ref lock because we need to go back and
manually merge these things together. Instead we want to process quota changes
when we know they are going to happen, like when we first allocate an extent, we
free a reference for an extent, we add new references etc. This patch
accomplishes this by only adding qgroup operations for real ref changes. We
only modify the sequence number when we need to lookup roots for bytenrs, this
reduces the amount of churn on the sequence number and allows us to merge
delayed refs as we add them most of the time. This patch encompasses a bunch of
architectural changes

1) qgroup ref operations: instead of tracking qgroup operations through the
delayed refs we simply add new ref operations whenever we notice that we need to
when we've modified the refs themselves.

2) tree mod seq: we no longer have this separation of major/minor counters.
this makes the sequence number stuff much more sane and we can remove some
locking that was needed to protect the counter.

3) delayed ref seq: we now read the tree mod seq number and use that as our
sequence. This means each new delayed ref doesn't have it's own unique sequence
number, rather whenever we go to lookup backrefs we inc the sequence number so
we can make sure to keep any new operations from screwing up our world view at
that given point. This allows us to merge delayed refs during runtime.

With all of these changes the delayed ref stuff is a little saner and the qgroup
accounting stuff no longer goes negative in some cases like it was before.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

fcebe456

Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root · 27cdeb70

由 Miao Xie 提交于 4月 02, 2014

Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

27cdeb70

Btrfs: fix leaf corruption caused by ENOSPC while hole punching · fc19c5e7

由 Filipe Manana 提交于 4月 29, 2014

While running a stress test with multiple threads writing to the same btrfs
file system, I ended up with a situation where a leaf was corrupted in that
it had 2 file extent item keys that had the same exact key. I was able to
detect this quickly thanks to the following patch which triggers an assertion
as soon as a leaf is marked dirty if there are duplicated keys or out of order
keys:

    Btrfs: check if items are ordered when a leaf is marked dirty
    (https://patchwork.kernel.org/patch/3955431/)

Basically while running the test, I got the following in dmesg:

    [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
    (...)
    [28877.415917] Call Trace:
    [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
    [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
    [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
    [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
    [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
    [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
    [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
    (...)
    [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
    [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
    (...)
    [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
    [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
    [29854.132639] 		extent data offset 0 nr 286720 ram 286720
    [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
    [29854.132641] 		extent data disk bytenr 0 nr 0
    [29854.132643] 		extent data offset 0 nr 0 ram 0
    [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
    [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132646] 		extent data offset 0 nr 77824 ram 77824
    [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
    [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
    [29854.132649] 		extent data offset 0 nr 77824 ram 77824
    (...)
    [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
    (...)
    [29854.132771] Call Trace:
    [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
    [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
    [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
    [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
    [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
    [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
    [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
    [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
    (...)

So this is caused by getting an -ENOSPC error while punching a file hole, more
specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop
of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
or more file extent items due to lack of enough free space. When this happens,
in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
block reservation object to root->fs_info->trans_block_rsv, end our transaction and
start a new transaction basically - and, we keep increasing our current offset
(cur_offset) as long as it's smaller than the end of the target range (lockend) -
this makes use leave the loop with cur_offset == drop_end which in turn makes us
call fill_holes() for inserting a file extent item that represents a 0 bytes range
hole (and this insertion succeeds, as in the meanwhile more space became available).

This 0 bytes file hole extent item is a problem because any subsequent caller of
__btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a
start file offset that is equal to the offset of the hole, will not remove this
extent item due to the following conditional in the while loop of
__btrfs_drop_extents:

    if (extent_end <= search_start) {
            path->slots[0]++;
            goto next_slot;
    }

This later makes the call to setup_items_for_insert() (at the very end of
__btrfs_drop_extents), insert a new file extent item with the same offset as
the 0 bytes file hole extent item that follows it. Needless is to say that this
causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
where we perform leaf sanity checks or in subsequent operations that manipulate
file extent items, as in the fallocate call as shown by the dmesg trace above.

Without my other patch to perform the leaf sanity checks once a leaf is marked
as dirty (if the integrity checker is enabled), it would have been much harder
to debug this issue.

This change might fix a few similar issues reported by users in the mailing
list regarding assertion failures in btrfs_set_item_key_safe calls performed
by __btrfs_drop_extents, such as the following report:

    http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938

Asking fill_holes() to create a 0 bytes wide file hole item also produced the
first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
that has an end smaller (by -1) than its start.

On 3.14 kernels this issue manifests itself through leaf corruption, as we get
duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
instead we have callers of __btrfs_drop_extents(), namely the functions
inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
btrfs_insert_empty_item() to insert the new file extent item, which would fail with
error -EEXIST, instead of inserting a duplicated key - which is still a serious
issue as it would make all similar file extent item replace operations keep
failing if they target the same file range.

Cc: stable@vger.kernel.org
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NChris Mason <clm@fb.com>

fc19c5e7

Btrfs: read inode size after acquiring the mutex when punching a hole · a1a50f60

由 Filipe Manana 提交于 4月 26, 2014

In a previous change, commit 12870f1c,
I accidentally moved the roundup of inode->i_size to outside of the
critical section delimited by the inode mutex, which is not atomic and
not correct since the size can be changed by other task before we acquire
the mutex. Therefore fix it.
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NChris Mason <clm@fb.com>

a1a50f60

25 4月, 2014 2 次提交

btrfs: replace error code from btrfs_drop_extents · 3f9e3df8

由 David Sterba 提交于 4月 15, 2014

There's a case which clone does not handle and used to BUG_ON instead,
(testcase xfstests/btrfs/035), now returns EINVAL. This error code is
confusing to the ioctl caller, as it normally signifies errorneous
arguments.

Change it to ENOPNOTSUPP which allows a fall back to copy instead of
clone. This does not affect the common reflink operation.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

3f9e3df8

btrfs: Change the hole range to a more accurate value. · c5f7d0bb

由 Qu Wenruo 提交于 4月 15, 2014

Commit 3ac0d7b9 fixed the btrfs expanding
write problem but the hole punched is sometimes too large for some
iovec, which has unmapped data ranges.
This patch will change to hole range to a more accurate value using the
counts checked by the write check routines.
Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

c5f7d0bb

08 4月, 2014 2 次提交

mm: implement ->map_pages for page cache · f1820361

由 Kirill A. Shutemov 提交于 4月 07, 2014

filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f1820361

btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9

由 Qu Wenruo 提交于 3月 27, 2014

When testing fsstress with snapshot making background, some snapshot
following problem.

Snapshot 270:
inode 323: size 0

Snapshot 271:
inode 323: size 349145
|-------Hole---|---------Empty gap-------|-------Hole-----|
0	    122880			172032	      349145

Snapshot 272:
inode 323: size 349145
|-------Hole---|------------Data---------|-------Hole-----|
0	    122880			172032	      349145

The fsstress operation on inode 323 is the following:
write: 		offset 	126832 	len 43124
truncate: 	size 	349145

Since the write with offset is consist of 2 operations:
1. punch hole
2. write data
Hole punching is faster than data write, so hole punching in write
and truncate is done first and then buffered write, so the snapshot 271 got
empty gap, which will not pass btrfsck.

To fix the bug, this patch will change the write sequence which will
first punch a hole covering the write end if a hole is needed.
Reported-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

3ac0d7b9

04 4月, 2014 1 次提交

fs/direct-io.c: remove some left over checks · 45d4f855

由 Dan Carpenter 提交于 4月 03, 2014

We know that "ret > 0" is true here.  These tests were left over from
commit 02afc27f ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

45d4f855

02 4月, 2014 3 次提交
- A
  generic_file_direct_write(): get rid of ppos argument · 5cb6c6c7
  由 Al Viro 提交于 2月 11, 2014
```
always equal to &iocb->ki_pos.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  5cb6c6c7
- A
  btrfs_file_aio_write(): get rid of ppos · 867c4f93
  由 Al Viro 提交于 2月 11, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  867c4f93
- A
  callers of iov_copy_from_user_atomic() don't need pagecache_disable() · 9e8c2af9
  由 Al Viro 提交于 2月 02, 2014
```
... it does that itself (via kmap_atomic())
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  9e8c2af9
22 3月, 2014 1 次提交

Btrfs: fix a crash of clone with inline extents's split · 00fdf13a

由 Liu Bo 提交于 3月 10, 2014

xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
of inline extents in __btrfs_drop_extents().

For inline extents, we cannot duplicate another EXTENT_DATA item, because
it breaks the rule of inline extents, that is, 'start offset' needs to be 0.

We have set limitations for the source inode's compressed inline extents,
because it needs to decompress and recompress.  Now the destination inode's
inline extents also need similar limitations.

With this, xfstests btrfs/035 doesn't run into panic.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <clm@fb.com>

00fdf13a

11 3月, 2014 8 次提交

M
Btrfs: remove the unnecessary flush when preparing the pages · b88935bf
由 Miao Xie 提交于 3月 06, 2014
```
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
```
b88935bf

Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc

由 Miao Xie 提交于 3月 06, 2014

If the snapshot creation happened after the nocow write but before the dirty
data flush, we would fail to flush the dirty data because of no space.

So we must keep track of when those nocow write operations start and when they
end, if there are nocow writers, the snapshot creators must wait. In order
to implement this function, I introduce btrfs_{start, end}_nocow_write(),
which is similar to mnt_{want,drop}_write().

These two functions are only used for nocow file write operations.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

8257b2dc

Btrfs: fix preallocate vs double nocow write · 7b2b7085

由 Miao Xie 提交于 2月 27, 2014

We can not release the reserved metadata space for the first write if we
find the write position is pre-allocated. Because the kernel might write
the data on the disk before we do the second write but after the can-nocow
check, if we release the space for the first write, we might fail to update
the metadata because of no space.

Fix this problem by end nocow write if there is dirty data in the range whose
space is pre-allocated.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

7b2b7085

Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d

由 Miao Xie 提交于 2月 27, 2014

The write range may not be sector-aligned, for example:

       |--------|--------|	<- write range, sector-unaligned, size: 2blocks
  |--------|--------|--------|  <- correct lock range, size: 3blocks

But according to the old code, we used the size of write range to calculate
the lock range directly, not considered the offset, we would get a wrong lock
range:

       |--------|--------|	<- write range, sector-unaligned, size: 2blocks
  |--------|--------|		<- wrong lock range, size: 2blocks

And besides that, the old code also had the same problem when calculating
the real write size. Correct them.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

c933956d

Btrfs: more efficient btrfs_drop_extent_cache · 176840b3

由 Filipe Manana 提交于 2月 25, 2014

While droping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.

This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.

The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:

    sysbench --test=fileio --file-num=1 --file-total-size=10G \
        --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
        --max-requests=500000 --file-rw-ratio=2 [prepare|run]

I captured the following histogram capturing the number of extent_map items
in the red black tree while that test was running:

    Count: 122462
    Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
    Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
       1.000 -    5.231:   452 |
       5.231 -  187.392:    87 |
     187.392 -  585.911:   206 |
     585.911 - 1827.438:   623 |
    1827.438 - 5695.245:  1962 #
    5695.245 - 17744.861:  6204 ####
   17744.861 - 55283.764: 21115 ############
   55283.764 - 172231.000: 91813 #####################################################

Benchmark:

    sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
        --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
        --file-io-mode=sync --file-fsync-freq=0 [prepare|run]

Before this change: 122.1Mb/sec
After this change:  125.07Mb/sec
(averages of 5 test runs)

Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

176840b3

Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c

由 Filipe Manana 提交于 2月 15, 2014

If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
but we'll also insert file extent items representing holes (disk bytenr == 0) that start
with a key offset that lies beyond the inode's size and are not contiguous with the last
file extent item.

Example:

  $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo

btrfs-debug-tree output:

  item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
  item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
	inode ref index 2 namelen 3 name: foo
  item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
	extent data disk byte 0 nr 0 gen 6
	extent data offset 0 nr 90112 ram 122880
	extent compression 0
  item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
	extent data disk byte 12845056 nr 4096 gen 6
	extent data offset 0 nr 45056 ram 45056
	extent compression 2
  item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
	extent data disk byte 0 nr 0 gen 6
	extent data offset 0 nr 860160 ram 860160
	extent compression 0

The last extent item, which represents a hole, is useless as it lies beyond the inode's
size.
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

12870f1c

Btrfs: fix skipped error handle when log sync failed · 8b050d35

由 Miao Xie 提交于 2月 20, 2014

It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work, the others will wait for it. But those
wait tasks didn't get the result of the log sync, and returned 0 when they
ended the wait. It caused those tasks skipped the error handle, and the
serious problem was they told the users the file sync succeeded but in
fact they failed.

This patch fixes this problem by introducing a log context structure,
we insert it into the a global list. When the sync fails, we will set
the error number of every log context in the list, then the waiting tasks
get the error number of the log context and handle the error if need.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

8b050d35

Btrfs: faster/more efficient insertion of file extent items · d5f37527

由 Filipe David Borba Manana 提交于 2月 09, 2014

This is an extension to my previous commit titled:

  "Btrfs: faster file extent item replace operations"
  (hash 1acae57b)

Instead of inserting the new file extent item if we deleted existing
file extent items covering our target file range, also allow to insert
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0, since in this case our caller would do another
tree search to insert the new file extent item anyway, therefore just
combine the two tree searches into a single one, saving cpu time, reducing
lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

d5f37527

29 1月, 2014 9 次提交

Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad

由 Chris Mason 提交于 1月 03, 2014

If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size.  The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.
Reported-by: NJens Axboe <axboe@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>
CC: stable@vger.kernel.org

514ac8ad

Btrfs: fix the race between write back and nocow buffered write · f1de9683

由 Miao Xie 提交于 1月 09, 2014

When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0

It is caused by the race between the write back and nocow buffered
write:
  Task1				Task2
  __btrfs_buffered_write()
    skip data reservation
    reserve the metadata space
    copy the data
    dirty the pages
    unlock the pages
				write back the pages
				release the data space
   				  becasue there is no
				  noreserve flag
   set the noreserve flag

This patch fixes this problem by unlocking the pages after
the noreserve flag is set.
Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

f1de9683

Btrfs: make fsync latency less sucky · 5039eddc

由 Josef Bacik 提交于 1月 15, 2014

Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

5039eddc

Btrfs: faster file extent item replace operations · 1acae57b

由 Filipe David Borba Manana 提交于 1月 07, 2014

When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.

Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.

By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.

Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.

The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:

  sysbench --test=fileio --file-num=4096 --file-total-size=8G \
     --file-test-mode=[rndwr|seqwr] --num-threads=512 \
     --file-block-size=8192 \ --max-requests=1000000 \
     --file-fsync-freq=0 --file-io-mode=sync [prepare|run]

All results below are averages of 10 runs of the respective test.

** SSD sequential writes

Before this change: 225.88 Mb/sec
After this change:  277.26 Mb/sec

** SSD random writes

Before this change: 49.91 Mb/sec
After this change:  56.39 Mb/sec

** HDD sequential writes

Before this change: 68.53 Mb/sec
After this change:  69.87 Mb/sec

** HDD random writes

Before this change: 13.04 Mb/sec
After this change:  14.39 Mb/sec
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

1acae57b

Btrfs: fix use of uninitialized err variable · fc28b62d

由 Filipe David Borba Manana 提交于 12月 13, 2013

fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

fc28b62d

Btrfs: fix ordered extent check in btrfs_punch_hole · 6126e3ca

由 Filipe David Borba Manana 提交于 11月 19, 2013

If the ordered extent's last byte was 1 less than our region's
start byte, we would unnecessarily wait for the completion of
that ordered extent, because it doesn't intersect our target
range.
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

6126e3ca

Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io · 376cc685

由 Miao Xie 提交于 12月 10, 2013

When we ran sysbench on the fs with compression, the following WARN_ONs were
triggered:
 fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
 fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
 fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync prepare
 # cd -
 # umount <mnt>
 # mount -o compress <dev> <mnt>
 # cd <mnt>
 # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
 > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
 > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
 > --file-test-mode=sync run
 # cd -
 # umount <mnt>

The reason of this problem is:
Task0				Task1
btrfs_direct_IO
  unlock(&inode->i_mutex)
				lock(&inode->i_mutex)
				reserve_space()
				prepare_pages()
				  lock_extent()
				  clear_extent()
				  unlock_extent()
  lock_extent()
  test_extent(uptodate)
    return false
				copy_data()
				set_delalloc_extent()
  extent need compress
    go back to buffered write
  clear_extent(DELALLOC | DIRTY)
  unlock_extent()

Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
was set by task1, it made the dirty pages in that extents couldn't be flushed
into the disk, so the reserved space for that extent was not released at
the end.

This patch fixes the above bug by unlocking the extent after the delalloc.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

376cc685

Btrfs: cleanup unnecessary parameter and variant of prepare_pages() · b37392ea

由 Miao Xie 提交于 12月 10, 2013

- the caller has gotten the inode object, needn't pass the file object.
  And if so, we needn't define a inode pointer variant.
- the position should be aligned by the page size not sector size, so
  we also needn't pass the root object into prepare_pages().
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

b37392ea

Btrfs: incompatible format change to remove hole extents · 16e7549f

由 Josef Bacik 提交于 10月 22, 2013

Btrfs has always had these filler extent data items for holes in inodes. This
has made somethings very easy, like logging hole punches and sending hole
punches. However for large holey files these extent data items are pure
overhead. So add an incompatible feature to no longer add hole extents to
reduce the amount of metadata used by these sort of files. This has a few
changes for logging and send obviously since they will need to detect holes and
log/send the holes if there are any. I've tested this thoroughly with xfstests
and it doesn't cause any issues with and without the incompat format set.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <clm@fb.com>

16e7549f

12 11月, 2013 4 次提交

btrfs: Fix checkpatch.pl warning of spacing issues · 67871254

由 Dulshani Gunawardhana 提交于 10月 31, 2013

Fix spacing issues detected via checkpatch.pl in accordance with the
kernel style guidelines.
Signed-off-by: NDulshani Gunawardhana <dulshani.gunawardhana89@gmail.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

67871254

Btrfs: return an error from btrfs_wait_ordered_range · 0ef8b726

由 Josef Bacik 提交于 10月 25, 2013

I noticed that if the free space cache has an error writing out it's data it
won't actually error out, it will just carry on. This is because it doesn't
check the return value of btrfs_wait_ordered_range, which didn't actually return
anything. So fix this in order to keep us from making free space cache look
valid when it really isnt. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

0ef8b726

btrfs: remove fs/btrfs/compat.h · 8b558c5f

由 Zach Brown 提交于 10月 16, 2013

fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink()
and inc_nlink().  This doesn't belong in mainline.
Signed-off-by: NZach Brown <zab@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

8b558c5f

Btrfs: fix up seek_hole/seek_data handling · 7f4ca37c

由 Josef Bacik 提交于 10月 18, 2013

Whoever wrote this was braindead. Also it doesn't work right if you have
VACANCY's since we assumed you would only have that at the end of the file,
which won't be the case in the near future. I tested this with generic/285 and
generic/286 as well as the btrfs tests that use fssum since it uses
seek_hole/seek_data to verify things are ok. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

7f4ca37c

21 9月, 2013 1 次提交

Btrfs: don't leak transaction in btrfs_sync_file() · a0634be5

由 Filipe David Borba Manana 提交于 9月 11, 2013

In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

a0634be5

04 9月, 2013 1 次提交

direct-io: Handle O_(D)SYNC AIO · 02afc27f

由 Christoph Hellwig 提交于 9月 04, 2013

Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request.  Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.

Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

02afc27f

01 9月, 2013 3 次提交

Btrf: cleanup: don't check for root_refs == 0 twice · 23fa76b0

由 Stefan Behrens 提交于 8月 23, 2013

btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.
Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

23fa76b0

Btrfs: avoid starting a transaction in the write path · 00361589

由 Josef Bacik 提交于 8月 14, 2013

I noticed while looking at a deadlock that we are always starting a transaction
in cow_file_range(). This isn't really needed since we only need a transaction
if we are doing an inline extent, or if the allocator needs to allocate a chunk.
So push down all the transaction start stuff to be closer to where we actually
need a transaction in all of these cases. This will hopefully reduce our write
latency when we are committing often. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

00361589

Btrfs: don't bother autodefragging if our root is going away · 2112ac80

由 Josef Bacik 提交于 7月 26, 2013

We can end up with inodes on the auto defrag list that exist on roots that are
going to be deleted.  This is extra work we don't need to do, so just bail if
our root has 0 root refs.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
Signed-off-by: NChris Mason <chris.mason@fusionio.com>

2112ac80

openeuler / Kernel 12 个月 前同步成功

openeuler / Kernel
12 个月前同步成功