Commits · afe3d24267926eb78ba863016bdd65cfe718aef5 · openeuler / raspberrypi-kernel

11 Mar, 2014 · 40 commits

btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue · afe3d242

Committed by Qu Wenruo on Feb 28, 2014

As with fs_info->workers, replace fs_info->delalloc_workers with the same
btrfs_workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

afe3d242

btrfs: Replace fs_info->workers with btrfs_workqueue. · 5cdc7ad3

Committed by Qu Wenruo on Feb 28, 2014

Use the newly created btrfs_workqueue_struct to replace the original
fs_info->workers.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

5cdc7ad3

btrfs: Add threshold workqueue based on kernel workqueue · 0bd9289c

Committed by Qu Wenruo on Feb 28, 2014

The original btrfs_workers has thresholding functions to dynamically
create or destroy kthreads.

The kernel workqueue has no such function because its workers are not
created manually, but we can still use workqueue_set_max_active to
simulate the behavior, mainly to achieve better HDD performance by
setting a high threshold on submit_workers.
(Sadly, no resources are saved this way.)

So this patch introduces extra pending-work counters that dynamically
change the max active of each btrfs_workqueue_struct, hoping to restore
the behavior of the original thresholding function.

Also, workqueue_set_max_active uses a mutex to protect workqueue_struct
and is not meant to be called too frequently, so a new interval
mechanism is applied that only calls workqueue_set_max_active after a
certain number of works have been queued, hoping to balance random and
sequential performance on HDD.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

0bd9289c
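
Illustration (not part of the commit): the thresholding idea described in the message above can be sketched as a toy model in Python. The class name, the field names and the exact adjustment rule (raise the limit when the backlog exceeds the threshold, lower it when the backlog falls below half of it) are assumptions made for this sketch, not the kernel code.

```python
class ThresholdQueue:
    """Toy model of a work queue whose concurrency limit follows its backlog."""

    def __init__(self, max_active, thresh, interval):
        self.max_active = max_active  # hard upper bound on concurrency
        self.thresh = thresh          # backlog size that counts as "busy"
        self.interval = interval      # re-evaluate only once per this many works
        self.current_max = 1          # current concurrency limit
        self.pending = 0              # works queued but not yet finished
        self.count = 0                # works queued since the last re-evaluation
        self.adjustments = 0          # how many times the limit was re-evaluated

    def queue_work(self):
        self.pending += 1
        self.count += 1
        # The (hypothetically expensive) adjustment runs once per interval,
        # mirroring the "do not call it too frequently" concern.
        if self.count >= self.interval:
            self.count = 0
            self._adjust()

    def finish_work(self):
        self.pending -= 1

    def _adjust(self):
        self.adjustments += 1
        new_max = self.current_max
        if self.pending > self.thresh:
            new_max += 1
        elif self.pending < self.thresh // 2:
            new_max -= 1
        # Keep the limit inside [1, max_active].
        self.current_max = max(1, min(new_max, self.max_active))
```

With `ThresholdQueue(max_active=4, thresh=8, interval=4)`, queueing 16 works without finishing any re-evaluates the limit four times and raises it from 1 to 3.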

btrfs: Add high priority workqueue support for btrfs_workqueue_struct · 1ca08976

Committed by Qu Wenruo on Feb 28, 2014

Add high priority function to btrfs_workqueue.

This is implemented by embedding a btrfs_workqueue into a
btrfs_workqueue and using some helper functions to distinguish the normal
priority wq from the high priority wq.
So the high priority wq is completely independent from the normal
workqueue.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

1ca08976

btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue · 08a9ff32

Committed by Qu Wenruo on Feb 28, 2014

Use kernel workqueue to implement a new btrfs_workqueue_struct, which
has the ordered execution feature like the btrfs_worker.

The func is executed concurrently, and the
ordered_func/ordered_free are executed in the sequence they were queued,
after the corresponding func is done.

The new btrfs_workqueue works much like the original one: one workqueue
for normal work and a list for ordered work.
When a work is queued, the ordered work is added to the list and a helper
function is queued into the workqueue.
The helper function executes a normal work and then checks and executes as many
ordered works as possible in the sequence they were queued.

In this patch, the high priority work queue and thresholding are not added yet.
They will be added in the following patches.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

08a9ff32
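
Illustration (not part of the commit): a minimal Python sketch of the ordered-execution scheme described above, assuming a thread pool in place of the kernel workqueue. `OrderedQueue`, `queue_work` and `flush` are hypothetical names. For simplicity the ordered callbacks run while a lock is held, which the kernel code does not necessarily do.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class OrderedQueue:
    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._lock = threading.Lock()
        self._ordered = deque()  # works in the order they were queued

    def queue_work(self, func, ordered_func):
        work = {"func": func, "ordered_func": ordered_func, "done": False}
        with self._lock:
            self._ordered.append(work)      # remember the queueing order
        self._pool.submit(self._helper, work)

    def _helper(self, work):
        work["func"]()                      # may overlap with other funcs
        with self._lock:
            work["done"] = True
            # Run ordered_func for every finished work at the head of the
            # list; stop at the first work whose func is still running.
            while self._ordered and self._ordered[0]["done"]:
                head = self._ordered.popleft()
                head["ordered_func"]()

    def flush(self):
        # Wait for all queued helpers to finish.
        self._pool.shutdown(wait=True)
```

Because the ordered callbacks run under the lock in this sketch, they must not call `queue_work` themselves.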

btrfs: Cleanup the unused struct async_sched. · f5961d41

Committed by Qu Wenruo on Feb 28, 2014

The struct async_sched is not used by any code and can be removed.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Josef Bacik <jbacik@fusionio.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

f5961d41

Btrfs: skip search tree for REG files · 644d1940

Committed by Liu Bo on Feb 27, 2014

It is unnecessary to search the tree again for @gen, @mode and @rdev
when creating REG inodes, as we already have the btrfs_inode_item in sctx,
from which @gen, @mode and @rdev can easily be fetched.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

644d1940

Btrfs: fix preallocate vs double nocow write · 7b2b7085

Committed by Miao Xie on Feb 27, 2014

We cannot release the reserved metadata space for the first write if we
find the write position is pre-allocated. The kernel might write
the data to disk before we do the second write but after the can-nocow
check; if we release the space for the first write, we might fail to update
the metadata because there is no space.

Fix this problem by ending the nocow write if there is dirty data in the range whose
space is pre-allocated.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

7b2b7085

Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d

Committed by Miao Xie on Feb 27, 2014

The write range may not be sector-aligned, for example:

       |--------|--------|	<- write range, sector-unaligned, size: 2blocks
  |--------|--------|--------|  <- correct lock range, size: 3blocks

But the old code used the size of the write range to calculate the lock
range directly, without considering the offset, so we would get a wrong lock
range:

       |--------|--------|	<- write range, sector-unaligned, size: 2blocks
  |--------|--------|		<- wrong lock range, size: 2blocks

Besides that, the old code had the same problem when calculating
the real write size. Correct both.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

c933956d
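
Illustration (not part of the commit): the two diagrams above can be reproduced with a little arithmetic. This Python sketch assumes a 4096-byte sector and uses hypothetical function names. `naive_lock_range` models the mistake the message describes, and `lock_range` rounds the end of the write rather than its size.

```python
def naive_lock_range(pos, write_bytes, sectorsize=4096):
    """Lock range derived from the write size only (the described mistake)."""
    lockstart = pos - pos % sectorsize
    rounded_size = -(-write_bytes // sectorsize) * sectorsize  # round up size
    return lockstart, lockstart + rounded_size - 1


def lock_range(pos, write_bytes, sectorsize=4096):
    """Lock range that also accounts for the unaligned start offset."""
    lockstart = pos - pos % sectorsize                 # round start down
    end = pos + write_bytes                            # exclusive end of write
    lockend = -(-end // sectorsize) * sectorsize - 1   # round end up, inclusive
    return lockstart, lockend


# A 2-block write starting in the middle of a block touches 3 blocks.
print(naive_lock_range(2048, 8192))  # (0, 8191): only 2 blocks locked
print(lock_range(2048, 8192))        # (0, 12287): all 3 blocks locked
```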

btrfs: send: simplify allocation code in fs_path_ensure_buf · 9c9ca00b

Committed by David Sterba on Feb 25, 2014

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

9c9ca00b

btrfs: send: fix old buffer length in fs_path_ensure_buf · 1b2782c8

Committed by David Sterba on Feb 25, 2014

In "btrfs: send: lower memory requirements in common case" the code to
save the old_buf_len was moved to the wrong place, which broke
the original logic.

Reported-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

1b2782c8

Btrfs: more efficient btrfs_drop_extent_cache · 176840b3

Committed by Filipe Manana on Feb 25, 2014

While dropping extent map structures from the extent cache that cover our
target range, we would remove each extent map structure from the red black
tree and then add either 1 or 2 new extent map structures if the former
extent map covered sections outside our target range.

This change simply attempts to replace the existing extent map structure
with a new one that covers the subsection we're not interested in, instead
of doing a red black remove operation followed by an insertion operation.

The number of elements in an inode's extent map tree can get very high for large
files under random writes. For example, while running the following test:

    sysbench --test=fileio --file-num=1 --file-total-size=10G \
        --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
        --max-requests=500000 --file-rw-ratio=2 [prepare|run]

I captured the following histogram of the number of extent_map items
in the red black tree while that test was running:

    Count: 122462
    Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
    Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
       1.000 -    5.231:   452 |
       5.231 -  187.392:    87 |
     187.392 -  585.911:   206 |
     585.911 - 1827.438:   623 |
    1827.438 - 5695.245:  1962 #
    5695.245 - 17744.861:  6204 ####
   17744.861 - 55283.764: 21115 ############
   55283.764 - 172231.000: 91813 #####################################################

Benchmark:

    sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
        --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
        --file-io-mode=sync --file-fsync-freq=0 [prepare|run]

Before this change: 122.1Mb/sec
After this change:  125.07Mb/sec
(averages of 5 test runs)

Test machine: quad core Intel i5-3570K, 32GB of RAM, SSD

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

176840b3

Btrfs: more efficient split extent state insertion · f2071b21

Committed by Filipe Manana on Feb 12, 2014

When we split an extent state there's no need to start the rbtree search
from the root node - we can start it from the original extent state node,
since we would end up in its subtree if we do the search starting at the
root node anyway.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

f2071b21

Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928

Committed by Filipe Manana on Feb 25, 2014

We don't need to have an unsigned int field in the extent_map struct
to tell us whether the extent map is in the inode's extent_map tree or
not. We can use the rb_node struct field and the RB_CLEAR_NODE and
RB_EMPTY_NODE macros to achieve the same task.

This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
64-bit system).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fb.com>

cbc0e928

Btrfs: skip locking when searching commit root · e84752d4

Committed by Wang Shilong on Feb 13, 2014

The commit root is never changed, so skip the locking dance on the commit root
when walking backrefs. This can speed up btrfs send operations.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

e84752d4

Btrfs: wake up @scrub_pause_wait as much as we can · 32a44789

Committed by Wang Shilong on Feb 19, 2014

Checking the @scrubs_running=@scrubs_paused condition inside wait_event()
is not an atomic operation, which means @scrub_running/@scrub_paused may be
incremented or decremented at any time. Let's wake up @scrub_pause_wait as
often as we can so that transaction commit is blocked less.

An example below:

Thread1				Thread2
|->scrub_blocked_if_needed()	|->scrub_pending_trans_workers_inc
  |->increase @scrub_paused
                                       |->increase @scrub_running
  |->wake up scrub_pause_wait list
                                       |->scrub blocked
                                       |->increase @scrub_paused

Thread3 is committing a transaction and is blocked at btrfs_scrub_pause().
So after Thread2 increases @scrub_paused, we meet the condition
@scrub_paused=@scrub_running, but the transaction will still be blocked until
another call wakes up @scrub_pause_wait.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

32a44789

Btrfs: cancel scrub on transaction abortion · c0af8f0b

Committed by Wang Shilong on Feb 19, 2014

If we fail to commit a transaction, we had better
cancel scrub operations.

Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

c0af8f0b

Btrfs: device_replace: fix deadlock for nocow case · 12cf9372

Committed by Wang Shilong on Feb 19, 2014

commit cb7ab021 causes the following deadlock, found by
xfstests btrfs/011:

Thread1 is committing a transaction and is blocked at
btrfs_scrub_pause().

Thread2 is calling btrfs_file_aio_write(), which holds the
inode's @i_mutex and commits the transaction (blocked because
Thread1 is committing the transaction).

Thread3 is the copy_nocow_page worker, which also tries to
hold the inode's @i_mutex, so Thread3 will wait for Thread1 to finish.

Thread4 is waiting for pending workers to finish, which means waiting
for Thread3 to finish. So the problem looks like this:

Thread1--->Thread4--->Thread3--->Thread2---->Thread1

Deadlock happens! We fix it by letting Thread1 go first,
which means we won't block transaction commit while we are
waiting for pending workers to finish.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

12cf9372

Btrfs: fix a possible deadlock between scrub and transaction committing · 6cf7f77e

Committed by Wang Shilong on Feb 19, 2014

btrfs_scrub_continue() will be called when cleaning up a transaction. However,
it may only be called if btrfs_scrub_pause() was called before.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

6cf7f77e

btrfs: Use PTR_ERR_OR_ZERO · 886322e8

Committed by Sachin Kamat on Feb 17, 2014

PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it,
also include the missing err.h header.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>

886322e8

Btrfs: fix send issuing outdated paths for utimes, chown and chmod · bf0d1f44

Committed by Filipe Manana on Feb 21, 2014

When doing an incremental send, if we had a directory pending a move/rename
operation and none of its parents, except for the immediate parent, were
pending a move/rename, after processing the directory's references, we would
be issuing utimes, chown and chmod instructions against an outdated path - a
path which matched the one in the parent root.

This change also slightly simplifies the code that builds a path
for a directory which has a delayed move/rename operation.

Steps to reproduce:

    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ mkdir -p /mnt/btrfs/a/b/c/d/e
    $ mkdir /mnt/btrfs/a/b/c/f
    $ chmod 0777 /mnt/btrfs/a/b/c/d/e
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
    $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
    $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
    $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
    $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
    $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
    $ chmod 0700 /mnt/btrfs/a/b/f2/e2
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
    $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send

    $ umount /mnt/btrfs
    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ btrfs receive /mnt/btrfs -f /tmp/base.send
    $ btrfs receive /mnt/btrfs -f /tmp/incremental.send

The second btrfs receive command failed with:

    ERROR: chmod a/b/c/d/e failed. No such file or directory

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

bf0d1f44

Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293

Committed by Filipe Manana on Feb 20, 2014

Just comparing the pointers (logical disk addresses) of the btree nodes is
not completely bulletproof; we have to check if their generation numbers
match too.

It is guaranteed that a COW operation will result in a block with a different
logical disk address than the original block's address, but over time we can
reuse that former logical disk address.

For example, creating a 2GB filesystem on a loop device, and having a script
running in a loop always updating the access timestamp of a file, resulted in
the same logical disk address being reused for the same fs btree block in only
about 4 minutes.

This could make us skip entire subtrees when doing an incremental send (which
is currently the only user of btrfs_compare_trees). However the odds of getting
2 blocks at the same tree level, with the same logical disk address, equal first
slot keys and different generations, should hopefully be very low.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

6baa4293
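
Illustration (not part of the commit): the check described above amounts to comparing two fields instead of one. In this Python sketch, `BlockRef` and `blocks_shared` are hypothetical names and the numbers are made up.

```python
from typing import NamedTuple


class BlockRef(NamedTuple):
    bytenr: int       # logical disk address of the tree block
    generation: int   # transaction generation that wrote the block


def blocks_shared(left: BlockRef, right: BlockRef) -> bool:
    """Treat two tree blocks as shared only if address AND generation match."""
    return left.bytenr == right.bytenr and left.generation == right.generation


# Same address, different generation: the address was reused after a COW,
# so the subtree below it must still be compared.
old = BlockRef(bytenr=29360128, generation=7)
new = BlockRef(bytenr=29360128, generation=42)
print(blocks_shared(old, new))  # False
```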

Btrfs: fix send attempting to rmdir non-empty directories · 9dc44214

Committed by Filipe Manana on Feb 19, 2014

The incremental send algorithm assumed that it was possible to issue
a directory remove (rmdir) if the inode number it was currently
processing was greater than or equal to the number of any inode that referenced
the directory's inode. This wasn't a valid assumption because any such
inode might be a child directory that is pending a move/rename operation,
because it was moved into a directory that has a higher inode number and
was moved/renamed too - in other words, the case the following commit
addressed:

    9f03740a
    (Btrfs: fix infinite path build loops in incremental send)

This made an incremental send issue an rmdir operation before the
target directory was actually empty, which made btrfs receive fail.
Therefore it needs to wait for all pending child directory inodes to
be moved/renamed before sending an rmdir operation.

Simple steps to reproduce this issue:

    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ mkdir -p /mnt/btrfs/a/b/c/x
    $ mkdir /mnt/btrfs/a/b/y
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
    $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
    $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
    $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
    $ rmdir /mnt/btrfs/a/b/c
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
    $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send

    $ umount /mnt/btrfs
    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ btrfs receive /mnt/btrfs -f /tmp/base.send
    $ btrfs receive /mnt/btrfs -f /tmp/incremental.send

The second btrfs receive command failed with:

    ERROR: rmdir o259-6-0 failed. Directory not empty

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

9dc44214

Btrfs: send, don't send rmdir for same target multiple times · 29d6d30f

Committed by Filipe Manana on Feb 16, 2014

When doing an incremental send, if we delete a directory that has N > 1
hardlinks for the same file and that file has the highest inode number
inside the directory contents, an incremental send would send the rmdir
operation against the directory N times. This made the btrfs receive command
fail on the second rmdir instruction, as the target directory didn't exist
anymore.

Steps to reproduce the issue:

    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ mkdir -p /mnt/btrfs/a/b/c
    $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
    $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
    $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
    $ rm -f /mnt/btrfs/a/b/c/foo.txt
    $ rm -f /mnt/btrfs/a/b/c/bar.txt
    $ rmdir /mnt/btrfs/a/b/c
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
    $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send

    $ umount /mnt/btrfs
    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ btrfs receive /mnt/btrfs -f /tmp/base.send
    $ btrfs receive /mnt/btrfs -f /tmp/incremental.send

The second btrfs receive command failed with:

    ERROR: rmdir o259-6-0 failed. No such file or directory

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

29d6d30f

Btrfs: incremental send, fix invalid path after dir rename · 2b863a13

Committed by Filipe Manana on Feb 16, 2014

This fixes yet one more case not caught by the commit titled:

   Btrfs: fix infinite path build loops in incremental send

In this case, even before the initial full send, we have a directory
which is a child of a directory with a higher inode number. Then we
perform the initial send, and afterwards we rename both the child and the
parent, without moving them around. After doing these 2 renames, an
incremental send sent a rename instruction for the child directory
which contained an invalid "from" path (referenced the parent's old
name, not the new one), which made the btrfs receive command fail.

Steps to reproduce:

    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ mkdir -p /mnt/btrfs/a/b
    $ mkdir /mnt/btrfs/d
    $ mkdir /mnt/btrfs/a/b/c
    $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
    $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
    $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
    $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
    $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
    $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send

    $ umount /mnt/btrfs
    $ mkfs.btrfs -f /dev/sdb3
    $ mount /dev/sdb3 /mnt/btrfs
    $ btrfs receive /mnt/btrfs -f /tmp/base.send
    $ btrfs receive /mnt/btrfs -f /tmp/incremental.send

The second btrfs receive command failed with:
  "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

2b863a13

Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c

Committed by Filipe Manana on Feb 15, 2014

If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
but we'll also insert file extent items representing holes (disk bytenr == 0) that start
with a key offset that lies beyond the inode's size and are not contiguous with the last
file extent item.

Example:

  $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
  $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo

btrfs-debug-tree output:

  item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
  item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
	inode ref index 2 namelen 3 name: foo
  item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
	extent data disk byte 0 nr 0 gen 6
	extent data offset 0 nr 90112 ram 122880
	extent compression 0
  item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
	extent data disk byte 12845056 nr 4096 gen 6
	extent data offset 0 nr 45056 ram 45056
	extent compression 2
  item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
	extent data disk byte 0 nr 0 gen 6
	extent data offset 0 nr 860160 ram 860160
	extent compression 0

The last extent item, which represents a hole, is useless as it lies beyond the inode's
size.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

12870f1c

Btrfs: cleanup delayed-ref.c:find_ref_head() · 85fdfdf6

Committed by Filipe Manana on Feb 12, 2014

The argument last wasn't used; all callers supplied a NULL value
for it. Also removed unnecessary intermediate storage of the result
of key comparisons.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

85fdfdf6

Btrfs: remove unnecessary ref heads rb tree search · 6103fb43

Committed by Filipe Manana on Feb 12, 2014

When we didn't find the exact ref head we were looking for, if
return_bigger != 0 we set a new search key to match either the
next node after the last one we found or the first one in the
ref heads rb tree, and then did another full tree search. For both
cases this ended up being pointless as we would end up returning
an entry we already had before repeating the search.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

6103fb43

btrfs: wake up transaction thread upon remount · 2c6a92b0

Committed by Justin Maggard on Feb 20, 2014

Now that we can adjust the commit interval with a remount, we need
to wake up the transaction thread or else it will continue to sleep
until the previous transaction interval has elapsed before waking
up.  So, if we go from a large commit interval to something smaller,
the transaction thread will not wake up until the large interval has
expired.  This also causes the cleaner thread to stay sleeping, since
it gets woken up by the transaction thread.

Fix it by simply waking up the transaction thread during a remount.

Signed-off-by: Justin Maggard <jmaggard10@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

2c6a92b0

Btrfs: stop joining the log transaction if sync log fails · 50471a38

Committed by Miao Xie on Feb 20, 2014

If the log sync fails, there is something wrong in the log tree, and we
should not continue to join the log transaction and log the metadata.
What we should do is a full commit.

This patch fixes the problem by setting ->last_trans_log_full_commit
to the current transaction id, which tells tasks not to join
the log transaction and to do a full commit instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

50471a38

Btrfs: just wait or commit our own log sub-transaction · d1433deb

Committed by Miao Xie on Feb 20, 2014

We might commit a log sub-transaction that didn't contain the metadata we
logged. This was because we didn't record the log transid and just selected
the current log sub-transaction to commit, but the right one might already have
been committed by another task. In that case we needn't do anything
and it is safe to return directly.

This patch improves the log sync based on the above idea. We record the transid
of the log sub-transaction in which we log the metadata, and the transid
of the log sub-transaction we have committed. If the committed transid
is >= the transid we recorded when logging the metadata, we just return.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

d1433deb
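
Illustration (not part of the commit): the early-return rule from the last paragraph of the message, written as a one-line predicate in Python. The function and parameter names are hypothetical.

```python
def needs_log_sync(logged_transid: int, last_committed_transid: int) -> bool:
    """Return True only if our log sub-transaction is not committed yet.

    logged_transid: sub-transaction id recorded when we logged our metadata.
    last_committed_transid: id of the newest committed log sub-transaction.
    """
    return last_committed_transid < logged_transid


# Another task already committed sub-transaction 5, which contains our
# metadata logged in sub-transaction 5, so there is nothing left to do.
print(needs_log_sync(logged_transid=5, last_committed_transid=5))  # False
```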

Btrfs: fix skipped error handle when log sync failed · 8b050d35

Committed by Miao Xie on Feb 20, 2014

It is possible that many tasks sync the log tree at the same time, but
only one task can do the sync work; the others wait for it. But those
waiting tasks didn't get the result of the log sync and returned 0 when they
finished waiting. This caused those tasks to skip the error handling, and the
serious problem was that they told users the file sync succeeded when in
fact it had failed.

This patch fixes the problem by introducing a log context structure, which
we insert into a global list. When the sync fails, we set
the error number of every log context in the list; the waiting tasks then
get the error number from their log context and handle the error if needed.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

8b050d35
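
Illustration (not part of the commit): a single-threaded Python sketch of the log-context idea described above. `LogCtx`, `LogRoot`, `join` and `finish_sync` are hypothetical names, and the real code also has to deal with locking and waiting, which this sketch leaves out.

```python
class LogCtx:
    """Per-task context that carries the result of the log sync."""

    def __init__(self):
        self.log_ret = 0  # 0 means success, a negative value is an errno


class LogRoot:
    def __init__(self):
        self.ctxs = []  # contexts of all tasks waiting on the current sync

    def join(self, ctx):
        self.ctxs.append(ctx)

    def finish_sync(self, error):
        # The one task that did the sync publishes its result to every
        # waiter, so a failure can no longer be mistaken for success.
        for ctx in self.ctxs:
            ctx.log_ret = error
        self.ctxs.clear()


root = LogRoot()
waiter_a, waiter_b = LogCtx(), LogCtx()
root.join(waiter_a)
root.join(waiter_b)
root.finish_sync(-5)  # -5 stands in for -EIO
print(waiter_a.log_ret, waiter_b.log_ret)  # -5 -5
```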

Btrfs: use signed integer instead of unsigned long integer for log transid · bb14a59b

Committed by Miao Xie on Feb 20, 2014

The log trans id is initialized to 0 every time we create a log tree,
and the log tree needs to be re-created after a new transaction is started.
This means the log trans id is unlikely to be a huge number, so we can use a
signed integer instead of an unsigned long integer to save a bit of space.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

bb14a59b

Btrfs: remove unnecessary memory barrier in btrfs_sync_log() · 7483e1a4

Committed by Miao Xie on Feb 20, 2014

Mutex unlock implies certain memory barriers to make sure all the memory
operations complete before the unlock, and the next mutex lock implies memory
barriers to make sure all the memory operations happen after the lock. So together
they act as a full memory barrier (smp_mb) and we needn't add memory barriers.
Remove them.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

7483e1a4

Btrfs: don't start the log transaction if the log tree init fails · e87ac136

Committed by Miao Xie on Feb 20, 2014

The old code would start the log transaction even if the log tree init
failed, which was unnecessary. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

e87ac136

Btrfs: fix the skipped transaction commit during the file sync · 48cab2e0

Committed by Miao Xie on Feb 20, 2014

We may abort the wait earlier if ->last_trans_log_full_commit was set to
the current transaction id. In this case, we need to commit the current
transaction instead of the log sub-transaction. But the current code
didn't tell the caller to do so (it returned 0, not -EAGAIN). Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

48cab2e0

Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit · 5c902ba6

Committed by Miao Xie on Feb 20, 2014

->last_trans_log_full_commit may be changed by other tasks without a lock,
so we need to prevent the compiler from optimizing the accesses like this:
	tmp = fs_info->last_trans_log_full_commit
	if (tmp == ...)
		...

	<do something>

	if (tmp == ...)
		...

In fact, we need to get the new value of ->last_trans_log_full_commit during
the second access. Fix it by using ACCESS_ONCE().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

5c902ba6

Btrfs: avoid warning bomb of btrfs_invalidate_inodes · 7813b3db

Committed by Liu Bo on Feb 10, 2014

After a transaction is aborted, we need to clean up inode resources by
calling btrfs_invalidate_inodes(). btrfs_invalidate_inodes() used to expect
the roots' refs to be zero and sets a WARN_ON(); however, this
is not always true while cleaning up a transaction, so we detect
transaction abortion and do not warn at all.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

7813b3db

Btrfs: fix possible deadlock in btrfs_cleanup_transaction · 2a85d9ca

Committed by Liu Bo on Feb 10, 2014

[13654.480669] ======================================================
[13654.480905] [ INFO: possible circular locking dependency detected ]
[13654.481003] 3.12.0+ #4 Tainted: G        W  O
[13654.481060] -------------------------------------------------------
[13654.481060] btrfs-transacti/9347 is trying to acquire lock:
[13654.481060]  (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060] but task is already holding lock:
[13654.481060]  (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
[13654.481060] which lock already depends on the new lock.

[13654.481060] the existing dependency chain (in reverse order) is:
[13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs]
[13654.481060]        [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs]
[13654.481060]        [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs]
[13654.481060]        [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs]
[13654.481060]        [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs]
[13654.481060]        [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs]
[13654.481060]        [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs]
[13654.481060]        [<ffffffff8114be91>] do_writepages+0x21/0x50
[13654.481060]        [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60
[13654.481060]        [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20
[13654.481060]        [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs]
[13654.481060]        [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs]
[13654.481060]        [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs]
[13654.481060]        [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs]
[13654.481060]        [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs]
[13654.481060]        [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs]
[13654.481060]        [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs]
[13654.481060]        [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs]
[13654.481060]        [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs]
[13654.481060]        [<ffffffff811b2409>] mount_fs+0x39/0x1b0
[13654.481060]        [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0
[13654.481060]        [<ffffffff811cfcce>] do_mount+0x23e/0xa90
[13654.481060]        [<ffffffff811d05a3>] SyS_mount+0x83/0xc0
[13654.481060]        [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b
[13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}:
[13654.481060]        [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70
[13654.481060]        [<ffffffff810c4103>] lock_acquire+0x93/0x130
[13654.481060]        [<ffffffff81689991>] _raw_spin_lock+0x41/0x50
[13654.481060]        [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060]        [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs]
[13654.481060]        [<ffffffff81079efa>] kthread+0xea/0xf0
[13654.481060]        [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0
[13654.481060] other info that might help us debug this:

[13654.481060]  Possible unsafe locking scenario:

[13654.481060]        CPU0                    CPU1
[13654.481060]        ----                    ----
[13654.481060]   lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]				 lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]				 lock(&(&fs_info->ordered_root_lock)->rlock);
[13654.481060]   lock(&(&root->ordered_extent_lock)->rlock);
[13654.481060]
 *** DEADLOCK ***
[...]

======================================================

btrfs_destroy_all_ordered_extents()
gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock,
while btrfs_[add,remove]_ordered_extent()
acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock.

This patch fixes the above problem.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

2a85d9ca
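
Illustration (not part of the commit): the message does not say how the patch resolves the inverted lock order. One common way to avoid this kind of inversion is to never hold both locks at once on the cleanup path, dropping the global lock before taking the per-root one. The Python sketch below shows that pattern under that assumption. `Root`, `destroy_all` and the lock variables are hypothetical stand-ins, not the kernel code.

```python
import threading

# Stands in for fs_info->ordered_root_lock (assumption for this sketch).
ordered_root_lock = threading.Lock()


class Root:
    def __init__(self, name):
        self.name = name
        # Stands in for root->ordered_extent_lock.
        self.ordered_extent_lock = threading.Lock()
        self.extents = ["e1", "e2"]


def destroy_all(roots):
    """Drain every root's extents without ever holding both locks."""
    destroyed = []
    ordered_root_lock.acquire()
    while roots:
        root = roots.pop(0)
        ordered_root_lock.release()      # drop the global lock first
        with root.ordered_extent_lock:   # then take the per-root lock
            destroyed.extend(f"{root.name}:{e}" for e in root.extents)
            root.extents.clear()
        ordered_root_lock.acquire()      # re-take it to pick the next root
    ordered_root_lock.release()
    return destroyed


print(destroy_all([Root("a"), Root("b")]))
```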

Btrfs: faster/more efficient insertion of file extent items · d5f37527

Committed by Filipe David Borba Manana on Feb 09, 2014

This is an extension to my previous commit titled:

  "Btrfs: faster file extent item replace operations"
  (hash 1acae57b)

Instead of inserting the new file extent item only if we deleted existing
file extent items covering our target file range, also allow inserting
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0. In this case our caller would do another
tree search to insert the new file extent item anyway, so just
combine the two tree searches into a single one, saving CPU time, reducing
lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

d5f37527