Commits · 0fd8c3dae14fb64947842472940b807ca0781da9 · xiphi1978 / linux

26 Jul, 2016 19 commits

Btrfs: fix panic in balance due to EIO · 0fd8c3da

Committed by Liu Bo on Jul 12, 2016

During build_backref_tree(), if we fail to read a btree node,
we can eventually run into BUG_ON(cache->nr_nodes) that we put
in backref_cache_cleanup(), meaning we have at least one
memory leak.

This frees the backref_node that we've allocated at the very
beginning of build_backref_tree().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0fd8c3da

Btrfs: fix eb memory leak due to readpage failure · baf863b9

Committed by Liu Bo on Jul 11, 2016

eb->io_pages is set in read_extent_buffer_pages().

In case of readpage failure, for pages that have been added to the bio,
it calls bio_endio() and later readpage_io_failed_hook() does the work.

When one of this eb's pages (it cannot be the first page) fails to add itself
to the bio due to a failure in merge_bio(), eb->io_pages cannot be decreased
via bio_endio(), and we eventually end up with a memory leak.

This lets __do_readpage() propagate errors to its callers and adds the
missing 'atomic_dec(&eb->io_pages)'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

baf863b9

Btrfs: change BUG_ON()'s to ASSERT()'s in backref_cache_cleanup() · f4907095

Committed by Liu Bo on Jul 11, 2016

Since it is just an in-memory building of the backrefs of several
btree blocks, nothing is fatal other than memory leaks, so this
changes BUG_ON()'s to ASSERT()'s.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f4907095

btrfs: fix free space calculation in dump_space_info() · 39581a3a

Committed by Wang Xiaoguang on Jul 11, 2016

In btrfs, btrfs_space_info's bytes_may_use is treated as used
space, as is done in reserve_metadata_bytes() and
btrfs_alloc_data_chunk_ondemand(). So in dump_space_info(), when
calculating free space, we should also subtract btrfs_space_info's
bytes_may_use.
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

39581a3a
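
The calculation described in the dump_space_info() commit above can be illustrated with a small userspace sketch. This is not the kernel code: the struct and function names are hypothetical stand-ins, and the set of counters is an assumption about which fields take part in the subtraction. The only point taken from the commit message is that bytes_may_use is subtracted along with the other "used" counters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the counters in btrfs_space_info. */
struct space_info_sketch {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_pinned;
	uint64_t bytes_reserved;
	uint64_t bytes_readonly;
	uint64_t bytes_may_use;
};

/* Free space with bytes_may_use counted as used, as the commit describes. */
static uint64_t space_info_free(const struct space_info_sketch *s)
{
	uint64_t used = s->bytes_used + s->bytes_pinned + s->bytes_reserved +
			s->bytes_readonly + s->bytes_may_use;

	/* The sketch assumes the counters never exceed the total. */
	assert(used <= s->total_bytes);
	return s->total_bytes - used;
}
```

With a total of 1000 bytes and counters of 400, 50, 30, 20 and 100, the function reports 400 bytes free; leaving out bytes_may_use would have reported 500.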

Btrfs: subpage-blocksize: Rate limit scrub error message · 751bebbe

Committed by Chandan Rajendra on Jul 04, 2016

btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize
scenario this results in a lot of "scrub: size assumption sectorsize !=
PAGE_SIZE " messages being printed on the console. To reduce the number
of such messages this commit uses btrfs_err_rl() instead of
btrfs_err().
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>

751bebbe

btrfs: expand cow_file_range() to support in-band dedup and subpage-blocksize · dda3245e

Committed by Wang Xiaoguang on Jul 11, 2016

Extract the new cow_file_range() parameters needed by both the in-band
dedupe and the subpage sector size patchsets.

This should keep conflicts between the two patchsets to a minimum and reduce
the effort needed to rebase them.

Cc: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dda3245e

Btrfs: fix BUG_ON in btrfs_submit_compressed_write · f5daf2c7

Committed by Liu Bo on Jun 22, 2016

This is similar to btrfs_submit_compressed_read(): if we fail after the
bio is allocated, we can use bio_endio() and the error is saved
in bio->bi_error. Note that we don't return errors to the caller,
because the caller assumes that, when an error is returned, endio has
not been called to clean up.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f5daf2c7

btrfs: make sure device is synced before return · e2bf6e89

Committed by Anand Jain on Jun 23, 2016

Inconsistent behavior due to stale reads from the
disk was reported:

  mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html

This patch makes sure devices are synced before
returning in the unmount thread.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e2bf6e89

btrfs: reorg btrfs_close_one_device() · f448341a

Committed by Anand Jain on Jun 14, 2016

Moves btrfs_close_one_device() closer to its caller and removes the declaration.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f448341a

btrfs: Cleanup compress_file_range() · c8bb0c8b

Committed by Ashish Samant on Mar 25, 2016

Remove unnecessary checks in compress_file_range().
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
[ minor coding style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>

c8bb0c8b

Btrfs: cleanup BUG_ON in merge_bio · 6f034ece

Committed by Liu Bo on Jun 22, 2016

One can use btrfs-corrupt-block to hit the BUG_ON() in merge_bio(),
so this aims to stop anyone from panicking the whole system by using
their btrfs.

Since the error in merge_bio() can only come from __btrfs_map_block()
when the chunk tree mapping has something insane, and __btrfs_map_block()
has already printed the reason, we can just return errors in
merge_bio().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6f034ece

btrfs: Fix slab accounting flags · fba4b697

Committed by Nikolay Borisov on Jun 23, 2016

BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs from being shrunk. This
means those caches are not in fact reclaimable.

To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fba4b697

btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl() · a60617d0

Committed by Salah Triki on Jul 03, 2016

size contains the value returned by posix_acl_from_xattr(), which
returns -ERANGE, -ENODATA, zero, or an integer greater than zero. So
replace -ENOENT by -ERANGE.
Signed-off-by: Salah Triki <salah.triki@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a60617d0

btrfs: Handle uninitialised inode eviction · 3d48d981

Committed by Nikolay Borisov on Jun 29, 2016

The code flow in btrfs_new_inode() allows btrfs_evict_inode() to be
called with a not fully initialised inode (e.g. with the ->root member
not set). This can happen when btrfs_set_inode_index() in
btrfs_new_inode() fails, which in turn calls iput() on the newly
allocated inode. The VFS then calls into btrfs_evict_inode(), which
leads to a null pointer dereference. To handle this situation, check whether
the passed inode has root set and just free it in case it doesn't.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3d48d981

Btrfs: fix read_node_slot to return errors · fb770ae4

Committed by Liu Bo on Jul 05, 2016

We use read_node_slot() to read a btree node, and it has two failure cases:
a) the slot is out of range, which means 'no such entry'
b) we fail to read the block, because of a checksum failure, corrupted
   content, or a missing uptodate flag.
We were returning NULL in both cases. This makes it return -ENOENT
in case a) and -EIO in case b), and fixes its callers
as well as btrfs_search_forward()'s caller to catch the new errors.

The problem was reported by Peter Becker, and I managed to
hit the same BUG_ON by mounting my fuzz image.
Reported-by: Peter Becker <floyd.net@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fb770ae4
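
The two error cases in the read_node_slot() commit above can be sketched in userspace C. Everything here is illustrative: the ERR_PTR(), PTR_ERR() and IS_ERR() helpers are small re-creations of the kernel idiom rather than the kernel headers, and node_sketch and read_slot_sketch() are hypothetical names that model a parent node whose children are plain pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-creations of the kernel's error-pointer idiom (illustrative). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical node: a NULL child models a block that failed to read. */
struct node_sketch {
	int nritems;
	int *children[4];
};

/* Returns the child, ERR_PTR(-ENOENT) for a bad slot, ERR_PTR(-EIO) on read failure. */
static int *read_slot_sketch(const struct node_sketch *parent, int slot)
{
	assert(parent != NULL);

	if (slot < 0 || slot >= parent->nritems)
		return ERR_PTR(-ENOENT);   /* case a): no such entry */
	if (parent->children[slot] == NULL)
		return ERR_PTR(-EIO);      /* case b): block could not be read */
	return parent->children[slot];
}
```

A caller can then tell the cases apart with IS_ERR() and PTR_ERR() instead of treating every NULL the same way, which is the distinction the commit message says was missing.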

Btrfs: fix double free of fs root · 876d2cf1

Committed by Liu Bo on Jun 28, 2016

I got this warning while mounting a btrfs image:

[ 3020.509606] ------------[ cut here ]------------
[ 3020.510107] WARNING: CPU: 3 PID: 5581 at lib/idr.c:1051 ida_remove+0xca/0x190
[ 3020.510853] ida_remove called for id=42 which is not allocated.
[ 3020.511466] Modules linked in:
[ 3020.511802] CPU: 3 PID: 5581 Comm: mount Not tainted 4.7.0-rc5+ #274
[ 3020.512438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 3020.513385]  0000000000000286 0000000021295d86 ffff88006c66b8f0 ffffffff8182ba5a
[ 3020.514153]  0000000000000000 0000000000000009 ffff88006c66b930 ffffffff810e0ed7
[ 3020.514928]  0000041b00000000 ffffffff8289a8c0 ffff88007f437880 0000000000000000
[ 3020.515717] Call Trace:
[ 3020.515965]  [<ffffffff8182ba5a>] dump_stack+0xc9/0x13f
[ 3020.516487]  [<ffffffff810e0ed7>] __warn+0x147/0x160
[ 3020.517005]  [<ffffffff810e0f4f>] warn_slowpath_fmt+0x5f/0x80
[ 3020.517572]  [<ffffffff8182e6ca>] ida_remove+0xca/0x190
[ 3020.518075]  [<ffffffff813a2bcc>] free_anon_bdev+0x2c/0x60
[ 3020.518609]  [<ffffffff81657a9f>] free_fs_root+0x13f/0x160
[ 3020.519138]  [<ffffffff8165c679>] btrfs_get_fs_root+0x379/0x3d0
[ 3020.519710]  [<ffffffff81e6e975>] ? __mutex_unlock_slowpath+0x155/0x2c0
[ 3020.520366]  [<ffffffff816615b1>] open_ctree+0x2e91/0x3200
[ 3020.520965]  [<ffffffff8161ede2>] btrfs_mount+0x1322/0x15b0
[ 3020.521536]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
[ 3020.522167]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.522780]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
[ 3020.523305]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
[ 3020.523872]  [<ffffffff8161dee1>] btrfs_mount+0x421/0x15b0
[ 3020.524402]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
[ 3020.525045]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.525657]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
[ 3020.526289]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
[ 3020.526803]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
[ 3020.527365]  [<ffffffff813dc27a>] do_mount+0x41a/0x1770
[ 3020.527899]  [<ffffffff812e800d>] ? strndup_user+0x6d/0xc0
[ 3020.528447]  [<ffffffff812e7f68>] ? memdup_user+0x78/0xb0
[ 3020.528987]  [<ffffffff813ddad0>] SyS_mount+0x150/0x160
[ 3020.529493]  [<ffffffff81e72b7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd

It turns out that we free the fs root twice: btrfs_init_fs_root() calls
free_anon_bdev(root->anon_dev), and later btrfs_get_fs_root() calls
free_fs_root(), which does another free_anon_bdev() and ends up with the
above warning.

Instead of resetting root->anon_dev to 0 after free_anon_bdev(), we can let
btrfs_init_fs_root() return directly, since its callers have already done
the free job by calling free_fs_root().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>

876d2cf1

Btrfs: error out if generic_bin_search get invalid arguments · 5e24e9af

Committed by Liu Bo on Jun 23, 2016

With btrfs-corrupt-block, one can set a btree node's or leaf's fields. If
we assign a negative value to one of them, we can get various hangs,
e.g. if extent_root's nritems is -2ULL, then we get stuck in
btrfs_read_block_groups() because it has a while loop and
btrfs_search_slot() on extent_root will always return the first
child.

This lets us know what's happening and returns -EINVAL to callers
instead of returning the first item.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5e24e9af
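
A minimal sketch of the argument check described in the generic_bin_search commit above, written as a standalone binary search over a sorted array. It is an assumption that the invalid case shows up as low > high (a corrupted item count read as a negative upper bound); the function name, the key type and the return convention are hypothetical and only loosely modelled on the description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Search keys[low..high) for key. Returns 0 and sets *slot on an exact
 * match, 1 and sets *slot to the insertion point when the key is absent,
 * and -EINVAL when the bounds are inconsistent (low > high).
 */
static int bin_search_sketch(const unsigned long long *keys, int low, int high,
			     unsigned long long key, int *slot)
{
	assert(keys != NULL && slot != NULL);

	if (low > high)
		return -EINVAL;   /* corrupted bounds: refuse to guess a slot */

	while (low < high) {
		int mid = low + (high - low) / 2;

		if (keys[mid] < key)
			low = mid + 1;
		else if (keys[mid] > key)
			high = mid;
		else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}
```

Without the early return, a negative upper bound would skip the loop and hand back slot 0 as if it were a valid answer, which matches the "always return the first child" behavior the commit message describes.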

Btrfs: check inconsistence between chunk and block group · 6fb37b75

Committed by Liu Bo on Jun 22, 2016

With btrfs-corrupt-block, one can drop one chunk item and mounting
will end up with a panic in btrfs_full_stripe_len().

This doesn't remove the BUG_ON, but instead checks it a bit
earlier when we find the block group item.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6fb37b75

btrfs: add missing bytes_readonly attribute file in sysfs · c1fd5c30

Committed by Wang Xiaoguang on Jun 21, 2016

Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c1fd5c30

21 Jul, 2016 2 commits

Btrfs: fix delalloc accounting after copy_from_user faults · 8b8b08cb

Committed by Chris Mason on Jul 19, 2016

Commit 56244ef1 was almost but not quite enough to fix the
reservation math after btrfs_copy_from_user returned partial copies.

Some users are still seeing warnings in btrfs_destroy_inode, and with a
long enough test run I'm able to trigger them as well.

This patch fixes the accounting math again, bringing it much closer to
the way it was before the sectorsize conversion Chandan did.  The
problem is accounting for the offset into the page/sector when we do a
partial copy.  This one just uses the dirty_sectors variable which
should already be updated properly.
Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v4.6+

8b8b08cb

Btrfs: avoid deadlocks during reservations in btrfs_truncate_block · bac357dc

Committed by Josef Bacik on Jul 20, 2016

The new enospc code makes it possible to deadlock if we don't use
FLUSH_LIMIT during reservations inside a transaction. This enforces
the correct flush type to avoid both deadlocks and assertions.
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>

bac357dc

08 Jul, 2016 17 commits

Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f

Committed by Josef Bacik on May 27, 2016

We used to allow you to set FLUSH_ALL and then just wouldn't do things like
commit transactions or wait on ordered extents if we noticed you were in a
transaction. However now that all the flushing for FLUSH_ALL is asynchronous
we've lost the ability to tell, and we could end up deadlocking. So instead use
FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
we error out to preserve the previous behavior. I've also added an ASSERT() to
catch anybody else who tries to do this. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8ca17f0f

Btrfs: fill relocation block rsv after allocation · ac2fabac

Committed by Josef Bacik on May 27, 2016

Since we set the reloc control before we've reserved our space for relocation we
could race with a root being dirtied and not actually have space to do our init
reloc root. So once we've allocated it and set it up go ahead and make our
reservation before setting the relocate control, that way anybody who tries to
do the reloc root init has space to use. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ac2fabac

Btrfs: always use trans->block_rsv for orphans · 40acc3ee

Committed by Josef Bacik on May 27, 2016

This is the case all the time anyway except for relocation which could be doing
a reloc root for a non ref counted root, in which case we'd end up with some
random block rsv rather than the one we have our reservation in. If there isn't
enough space in the block rsv we are trying to steal from we'll BUG() because we
expect there to be space for the orphan to make its reservation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

40acc3ee

Btrfs: change how we calculate the global block rsv · ae2e4728

Committed by Josef Bacik on May 27, 2016

Traditionally we've calculated the global block rsv by guessing how much of the
metadata used amount was the extent tree, and then taking the data size and
figuring out how large the csum tree would have to be to hold that much data.

This is imprecise and falls down on MIXED file systems as we can't trust the
data used amount. This resulted in failures for xfstests generic/333 because it
creates lots of clones, which explodes out the extent tree. Our global reserve
calculations were woefully inaccurate in this case which meant we got into a
situation where we did not have enough reserved to do our work.

We know we only use the global block rsv for the extent, csum, and root trees,
so just get the bytes used for these trees and use that as the basis of our
global reserve. Since these are not reference counted trees the bytes_used
value will be accurate. This fixed the transaction aborts seen with
generic/333. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ae2e4728
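
The new basis for the global block rsv described in the commit above can be pictured roughly as follows. This is a userspace sketch under stated assumptions, not the kernel function: the name is hypothetical, and the min_size and max_size clamps are illustrative parameters that do not come from the commit message. Only the idea of summing the bytes used by the extent, csum and root trees is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sum the bytes used by the three trees the commit names and clamp the
 * result. min_size and max_size are hypothetical tuning parameters.
 */
static uint64_t global_rsv_size_sketch(uint64_t extent_root_used,
				       uint64_t csum_root_used,
				       uint64_t tree_root_used,
				       uint64_t min_size, uint64_t max_size)
{
	uint64_t size = extent_root_used + csum_root_used + tree_root_used;

	assert(min_size <= max_size);
	if (size < min_size)
		size = min_size;
	if (size > max_size)
		size = max_size;
	return size;
}
```

Because the inputs are exact per-tree usage counters rather than guesses derived from the data size, a workload that inflates the extent tree (such as many clones) is reflected in the reserve directly.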

Btrfs: use root when checking need_async_flush · 87241c2e

Committed by Josef Bacik on Apr 25, 2016

Instead of doing fs_info->fs_root in need_async_flush, which may not be set
during recovery when mounting, just pass the root itself in, which makes more
sense as that's what btrfs_calc_reclaim_metadata_size takes.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

87241c2e

Btrfs: don't bother kicking async if there's nothing to reclaim · d38b349c

Committed by Josef Bacik on Mar 25, 2016

We do this check when we start the async reclaimer thread, might as well check
before we kick it off to save us some cycles.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d38b349c

Btrfs: fix release reserved extents trace points · 31bada7c

Committed by Josef Bacik on Mar 25, 2016

We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
isn't quite right because we will go through and free that extent later when we
unpin, so it messes up apps that are accounting for the reservation space. We
were also doing it unconditionally in __btrfs_free_reserved_extent(), when it
should only be done if we actually free the reservation instead of pinning the
extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

31bada7c

Btrfs: add tracepoints for flush events · f376df2b

Committed by Josef Bacik on Mar 25, 2016

We want to track when we're triggering flushing from our reservation code and
what flushing is being done when we start flushing.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f376df2b

Btrfs: fix delalloc reservation amount tracepoint · f485c9ee

Committed by Josef Bacik on Mar 25, 2016

We can sometimes drop the reservation we had for our inode, so we need to remove
that amount from to_reserve so that our tracepoint reports a valid amount of
space.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f485c9ee

Btrfs: trace pinned extents · c51e7bb1

Committed by Josef Bacik on Mar 25, 2016

Pinned extents are an important metric to keep track of for enospc.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c51e7bb1

Btrfs: introduce ticketed enospc infrastructure · 957780eb

Committed by Josef Bacik on May 17, 2016

Our enospc flushing sucks. It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out. So I came up with this solution
to block any other reservations from happening while one guy tried to flush
stuff to satisfy his reservation. This gives us pretty good correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations. Basically we try to
make our reservation, and if we can't we put a ticket on a list in order and
kick off an async flusher thread. This async flusher thread does the same old
flushing we always did, just asynchronously. As space is freed and added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async flusher
could do anything we need to avoid deadlocks. These guys get priority for
having their reservation made, and will still do manual flushing themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

957780eb
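
The ticket mechanism described in the commit above can be approximated with a small single-threaded model. This is only a sketch of the idea: there is no locking, no async flusher thread and no priority list, and every name (ticket_sketch, pool_sketch, reserve_sketch, add_space_sketch) is hypothetical. It shows a reservation being queued in arrival order when it cannot be met, and freed space being handed to tickets from the front of the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One waiting reservation; bytes counts what is still missing. */
struct ticket_sketch {
	uint64_t bytes;
	int satisfied;              /* set when the waiter may proceed */
	struct ticket_sketch *next;
};

/* Hypothetical space pool with a FIFO list of tickets. */
struct pool_sketch {
	uint64_t free_bytes;
	struct ticket_sketch *head, *tail;
};

/* Try to reserve now; otherwise queue the ticket in arrival order. */
static int reserve_sketch(struct pool_sketch *p, struct ticket_sketch *t,
			  uint64_t bytes)
{
	assert(p != NULL && t != NULL);
	t->bytes = bytes;
	t->satisfied = 0;
	t->next = NULL;
	if (p->head == NULL && p->free_bytes >= bytes) {
		p->free_bytes -= bytes;
		t->bytes = 0;
		t->satisfied = 1;
		return 1;
	}
	if (p->tail)
		p->tail->next = t;
	else
		p->head = t;
	p->tail = t;
	return 0;   /* a real caller would kick the flusher and sleep */
}

/* Called as space is freed: feed tickets from the front of the list. */
static void add_space_sketch(struct pool_sketch *p, uint64_t bytes)
{
	p->free_bytes += bytes;
	while (p->head && p->free_bytes > 0) {
		struct ticket_sketch *t = p->head;
		uint64_t give = p->free_bytes < t->bytes ? p->free_bytes : t->bytes;

		t->bytes -= give;
		p->free_bytes -= give;
		if (t->bytes > 0)
			break;          /* partially filled; wait for more space */
		t->satisfied = 1;       /* a real implementation would wake the waiter */
		p->head = t->next;
		if (p->head == NULL)
			p->tail = NULL;
	}
}
```

Queuing a later, smaller request behind an earlier, larger one (rather than letting it jump ahead) is the design choice that keeps waiters from starving each other in this model.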

Btrfs: add tracepoint for adding block groups · c83f8eff

Committed by Josef Bacik on Mar 25, 2016

I'm writing a tool to visualize the enospc system inside btrfs, and I need this
tracepoint in order to keep track of the block groups in the system. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c83f8eff

Btrfs: warn_on for unaccounted spaces · d555b6c3

Committed by Josef Bacik on Mar 25, 2016

These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d555b6c3

Btrfs: change delayed reservation fallback behavior · c48f49d6

Committed by Josef Bacik on Mar 25, 2016

We reserve space for the inode update when we first reserve space for writing to
a file. However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents. Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve. The problem with this is we can easily
just return ENOSPC and fall back to updating the inode item directly. In the
worst case (assuming 4k nodesize) we'd steal 64KiB from the global reserve if we
fall all the way through, however if we just fall back and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case, which is 32KiB.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so more often than not we only need one or two nodesizes' worth of
reservations. Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

c48f49d6
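
The worst-case figures quoted in the commit above can be checked with a few lines of arithmetic. The sketch assumes that the level limit the message calls BTRFS_PATH_MAX is 8, which is what makes 4k times the limit equal 32KiB, and that the full one-item reservation is twice the direct-update cost, which is what yields 64KiB. Both factors are inferred from the numbers in the message; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tree height limit; the commit message calls it BTRFS_PATH_MAX. */
#define MAX_LEVEL_SKETCH 8

/* Worst case for updating the inode item directly: one node per level. */
static uint64_t direct_update_worst_case(uint64_t nodesize)
{
	assert(nodesize > 0);
	return nodesize * MAX_LEVEL_SKETCH;
}

/* Worst case for a full one-item reservation, assumed to be twice that. */
static uint64_t full_reservation_worst_case(uint64_t nodesize)
{
	return 2 * direct_update_worst_case(nodesize);
}
```

For a 4096-byte nodesize these give 32768 bytes (32KiB) and 65536 bytes (64KiB), matching the two figures in the message.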

Btrfs: always reserve metadata for delalloc extents · 48c3d480

Committed by Josef Bacik on Mar 25, 2016

There are a few races in the metadata reservation stuff. First we add the bytes
to the block_rsv well after we've set the bit on the inode saying that we have
space for it and after we've reserved the bytes. So use the normal
btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc
extents when we try to reserve space for our write, which means that we could
have used up the space for the inode and we wouldn't know because we only check
before the reservation. So instead make sure we are always reserving space for
the inode update, and then if we don't need it release those bytes afterward.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

48c3d480

Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8

Committed by Josef Bacik on Mar 25, 2016

So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv. This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size. So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency's sake. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

25d609f8
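
The effect of the update_size argument described in the commit above can be illustrated with a simplified model. This is a sketch, not the kernel function: rsv_sketch and rsv_migrate_sketch() are hypothetical names, the struct keeps only a target size and a reserved amount, and returning -ENOSPC when the source is short is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block reserve: size is the target, reserved is what it holds. */
struct rsv_sketch {
	uint64_t size;
	uint64_t reserved;
};

/*
 * Move num_bytes from src to dst. dst->size only grows when update_size is
 * non-zero, so fixed-size reserves keep their intended size.
 */
static int rsv_migrate_sketch(struct rsv_sketch *src, struct rsv_sketch *dst,
			      uint64_t num_bytes, int update_size)
{
	assert(src != NULL && dst != NULL);

	if (src->reserved < num_bytes)
		return -ENOSPC;
	src->reserved -= num_bytes;
	dst->reserved += num_bytes;
	if (update_size)
		dst->size += num_bytes;
	return 0;
}
```

A truncate-style caller would pass 0 so that repeated migrations leave the destination's size untouched, while a caller that wants the reserve to grow would pass 1.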

Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d

Committed by Josef Bacik on Mar 25, 2016

For some reason we're adding bytes_readonly to the space info after we update
the space info with the block group info. This creates a tiny race where we
could over-reserve space because we haven't yet taken out the bytes_readonly
bit. Since we already know this information at the time we call
update_space_info, just pass it along so it can be updated all at once. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e40edf2d

25 Jun, 2016 1 commit

Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99

Committed by Omar Sandoval on May 20, 2016

Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean -dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

02dbfc99

24 Jun, 2016 1 commit

Btrfs: Force stripesize to the value of sectorsize · b7f67055

Committed by Chandan Rajendra on Jun 23, 2016

Btrfs code currently assumes stripesize to be the same as
sectorsize. However, Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

b7f67055
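
The two steps in the stripesize commit above (validate the on-disk value, then use sectorsize regardless) can be sketched in userspace C. The function names are hypothetical, and rejecting a bad value with -EINVAL is an assumption; the commit message only says the value is checked to be a power of 2.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* True when exactly one bit is set. */
static int is_power_of_2_sketch(uint32_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Validate the on-disk stripesize, then ignore its value and use sectorsize.
 * Returns 0 on success or -EINVAL for a stripesize that is not a power of 2.
 */
static int pick_stripesize_sketch(uint32_t super_stripesize, uint32_t sectorsize,
				  uint32_t *stripesize_out)
{
	assert(stripesize_out != NULL);

	if (!is_power_of_2_sketch(super_stripesize))
		return -EINVAL;
	*stripesize_out = sectorsize;
	return 0;
}
```

With an on-disk stripesize of 4096 and a sectorsize of 65536, the in-memory stripesize ends up as 65536, so an older filesystem that recorded 4096 no longer contradicts the assumption that the two values are equal.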