提交 · ba929b6646c5b87c7bb15cd8d3e51617725c983b · openanolis / cloud-kernel

27 7月, 2016 1 次提交

mm, memcg: use consistent gfp flags during readahead · 8a5c743e

由 Michal Hocko 提交于 7月 26, 2016

Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8a5c743e

21 7月, 2016 2 次提交

Btrfs: fix delalloc accounting after copy_from_user faults · 8b8b08cb

由 Chris Mason 提交于 7月 19, 2016

Commit 56244ef1 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.
Signed-off-by: NChris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v4.6+

8b8b08cb

Btrfs: avoid deadlocks during reservations in btrfs_truncate_block · bac357dc

由 Josef Bacik 提交于 7月 20, 2016

The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction.  This enforces
the correct flush type to avoid both deadlocks and assertions
Signed-off-by: NChris Mason <clm@fb.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

bac357dc

19 7月, 2016 1 次提交

Btrfs: fix comparison in __btrfs_map_block() · df5c82a8

由 Vincent Stehlé 提交于 7月 15, 2016

Add missing comparison to op in expression, which was forgotten when doing
the REQ_OP transition.

Fixes: b3d3fa51 ("btrfs: update __btrfs_map_block for REQ_OP transition")
Signed-off-by: NVincent Stehlé <vincent.stehle@intel.com>
Reviewed-by: NMike Christie <mchristi@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

df5c82a8

08 7月, 2016 17 次提交

Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f

由 Josef Bacik 提交于 5月 27, 2016

We used to allow you to set FLUSH_ALL and then just wouldn't do things like
commit transactions or wait on ordered extents if we noticed you were in a
transaction. However now that all the flushing for FLUSH_ALL is asynchronous
we've lost the ability to tell, and we could end up deadlocking. So instead use
FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
we error out to preserve the previous behavior. I've also added an ASSERT() to
catch anybody else who tries to do this. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

8ca17f0f

Btrfs: fill relocation block rsv after allocation · ac2fabac

由 Josef Bacik 提交于 5月 27, 2016

Since we set the reloc control before we've reserved our space for relocation we
could race with a root being dirtied and not actually have space to do our init
reloc root. So once we've allocated it and set it up go ahead and make our
reservation before setting the relocate control, that way anybody who tries to
do the reloc root init has space to use. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ac2fabac

Btrfs: always use trans->block_rsv for orphans · 40acc3ee

由 Josef Bacik 提交于 5月 27, 2016

This is the case all the time anyway except for relocation which could be doing
a reloc root for a non ref counted root, in which case we'd end up with some
random block rsv rather than the one we have our reservation in. If there isn't
enough space in the block rsv we are trying to steal from we'll BUG() because we
expect there to be space for the orphan to make its reservation. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

40acc3ee

Btrfs: change how we calculate the global block rsv · ae2e4728

由 Josef Bacik 提交于 5月 27, 2016

Traditionally we've calculated the global block rsv by guessing how much of the
metadata used amount was the extent tree, and then taking the data size and
figuring out how large the csum tree would have to be to hold that much data.

This is imprecise and falls down on MIXED file systems as we can't trust the
data used amount. This resulted in failures for xfstests generic/333 because it
creates lots of clones, which explodes out the extent tree. Our global reserve
calculations were woefully inaccurate in this case which meant we got into a
situation where we did not have enough reserved to do our work.

We know we only use the global block rsv for the extent, csum, and root trees,
so just get the bytes used for these trees and use that as the basis of our
global reserve. Since these are not reference counted trees the bytes_used
value will be accurate. This fixed the transaction aborts seen with
generic/333. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ae2e4728

Btrfs: use root when checking need_async_flush · 87241c2e

由 Josef Bacik 提交于 4月 25, 2016

Instead of doing fs_info->fs_root in need_async_flush, which may not be set
during recovery when mounting, just pass the root itself in, which makes more
sense as thats what btrfs_calc_reclaim_metadata_size takes.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reported-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

87241c2e

Btrfs: don't bother kicking async if there's nothing to reclaim · d38b349c

由 Josef Bacik 提交于 3月 25, 2016

We do this check when we start the async reclaimer thread, might as well check
before we kick it off to save us some cycles.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d38b349c

Btrfs: fix release reserved extents trace points · 31bada7c

由 Josef Bacik 提交于 3月 25, 2016

We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space. We
were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
only actually free the reservation instead of pinning the extent. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

31bada7c

Btrfs: add tracepoints for flush events · f376df2b

由 Josef Bacik 提交于 3月 25, 2016

We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f376df2b

Btrfs: fix delalloc reservation amount tracepoint · f485c9ee

由 Josef Bacik 提交于 3月 25, 2016

We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

f485c9ee

Btrfs: trace pinned extents · c51e7bb1

由 Josef Bacik 提交于 3月 25, 2016

Pinned extents are an important metric to keep track of for enospc.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c51e7bb1

Btrfs: introduce ticketed enospc infrastructure · 957780eb

由 Josef Bacik 提交于 5月 17, 2016

Our enospc flushing sucks. It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out. So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation. This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations. Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread. This async flusher thread does the same old
flushing we always did, just asynchronously. As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks. These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

957780eb

Btrfs: add tracepoint for adding block groups · c83f8eff

由 Josef Bacik 提交于 3月 25, 2016

I'm writing a tool to visualize the enospc system inside btrfs, I need this
tracepoint in order to keep track of the block groups in the system.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c83f8eff

Btrfs: warn_on for unaccounted spaces · d555b6c3

由 Josef Bacik 提交于 3月 25, 2016

These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d555b6c3

Btrfs: change delayed reservation fallback behavior · c48f49d6

由 Josef Bacik 提交于 3月 25, 2016

We reserve space for the inode update when we first reserve space for writing to
a file. However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents. Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve. The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly. In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations. Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c48f49d6

Btrfs: always reserve metadata for delalloc extents · 48c3d480

由 Josef Bacik 提交于 3月 25, 2016

There are a few races in the metadata reservation stuff. First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes. So use the normal
btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation. So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

48c3d480

Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8

由 Josef Bacik 提交于 3月 25, 2016

So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv. This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size. So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

25d609f8

Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d

由 Josef Bacik 提交于 3月 25, 2016

For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info. This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit. Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e40edf2d

25 6月, 2016 1 次提交

Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99

由 Omar Sandoval 提交于 5月 20, 2016

Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

02dbfc99

24 6月, 2016 4 次提交

Btrfs: Force stripesize to the value of sectorsize · b7f67055

由 Chandan Rajendra 提交于 6月 23, 2016

Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.
Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

b7f67055

btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610

由 Wang Xiaoguang 提交于 6月 22, 2016

When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

And below test case can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs  -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
	i=$((i + 1))
	dst_offset=$((blocksize * i))
	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
		testfile > /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.
Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

c0d2f610

Btrfs: fix error handling in map_private_extent_buffer · 415b35a5

由 Liu Bo 提交于 6月 17, 2016

map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
   than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

415b35a5

Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a

由 Wei Yongjun 提交于 6月 17, 2016

Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

04e1b65a

23 6月, 2016 3 次提交

Btrfs: don't do nocow check unless we have to · c6887cd1

由 Josef Bacik 提交于 3月 25, 2016

Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
	--end_fsync=1

we get the worst case scenario where we have to fall back on to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

c6887cd1

btrfs: fix deadlock in delayed_ref_async_start · 0f873eca

由 Chris Mason 提交于 4月 27, 2016

"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code.  This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

0f873eca

Btrfs: track transid for delayed ref flushing · 31b9655f

由 Josef Bacik 提交于 4月 11, 2016

Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling. This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore. So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling. If the transaction we
get is different then we can just bail out. This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes). Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

31b9655f

18 6月, 2016 8 次提交

Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize · dd5c9311

由 Chandan Rajendra 提交于 6月 16, 2016

Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.

Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.
Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

dd5c9311

btrfs: remove build fixup for qgroup_account_snapshot · 89c5a544

由 David Sterba 提交于 6月 16, 2016

Introduced in 2c1984f2 ("btrfs: build fixup for
qgroup_account_snapshot") as temporary bisectability build fixup.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

89c5a544

D
btrfs: use new error message helper in qgroup_account_snapshot · f7af3934
由 David Sterba 提交于 6月 17, 2016
```
We've renamed btrfs_std_error, this one is left from last merge.
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
f7af3934

btrfs: avoid blocking open_ctree from cleaner_kthread · 90c711ab

由 Zygo Blaxell 提交于 6月 12, 2016

This fixes a problem introduced in commit 2f3165ec
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".

open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.

Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.

When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.
Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

90c711ab

Btrfs: don't BUG_ON() in btrfs_orphan_add · 3b6571c1

由 Josef Bacik 提交于 5月 27, 2016

This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3b6571c1

btrfs: account for non-CoW'd blocks in btrfs_abort_transaction · 64c12921

由 Jeff Mahoney 提交于 6月 08, 2016

The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor.  btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction.  trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.

In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion.  In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

64c12921

Btrfs: check if extent buffer is aligned to sectorsize · c871b0f2

由 Liu Bo 提交于 6月 06, 2016

Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c871b0f2

btrfs: Use correct format specifier · 16ff4b45

由 Heinrich Schuchardt 提交于 6月 11, 2016

Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.
Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

16ff4b45

08 6月, 2016 3 次提交

block, drivers, fs: rename REQ_FLUSH to REQ_PREFLUSH · 28a8f0d3

由 Mike Christie 提交于 6月 05, 2016

To avoid confusion between REQ_OP_FLUSH, which is handled by
request_fn drivers, and upper layers requesting the block layer
perform a flush sequence along with possibly a WRITE, this patch
renames REQ_FLUSH to REQ_PREFLUSH.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

28a8f0d3

block, drivers, fs: shrink bi_rw from long to int · 6296b960

由 Mike Christie 提交于 6月 05, 2016

We don't need bi_rw to be so large on 64 bit archs, so
reduce it to unsigned int.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

6296b960

btrfs: use bio fields for op and flags · 81a75f67

由 Mike Christie 提交于 6月 05, 2016

The bio REQ_OP and bi_rw rq_flag_bits are now always setup, so there is
no need to pass around the rq_flag_bits bits too. btrfs users should
should access the bio insead.
Signed-off-by: NMike Christie <mchristi@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

81a75f67

openanolis / cloud-kernel 12 个月 前同步成功

openanolis / cloud-kernel
12 个月前同步成功