提交 · e07222c7d20633b86d3af9d55f97d687837b7c02 · openanolis / cloud-kernel

14 2月, 2017 8 次提交

N
btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode · e07222c7
由 Nikolay Borisov 提交于 1月 10, 2017
```
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
e07222c7
N
btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode · e67bbbb9
由 Nikolay Borisov 提交于 1月 10, 2017
```
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
e67bbbb9
N
btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode · 6f45d185
由 Nikolay Borisov 提交于 1月 10, 2017
```
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
6f45d185
N
btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode · fcabdd1c
由 Nikolay Borisov 提交于 1月 10, 2017
```
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
fcabdd1c
N
btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode · e5517a7b
由 Nikolay Borisov 提交于 1月 10, 2017
```
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
e5517a7b

btrfs: Make btrfs_get_delayed_node take btrfs_inode · 340c6ca9

由 Nikolay Borisov 提交于 1月 10, 2017

This function is internal to btrfs and doesn't really deal with any
VFS members, as such it needn't take a struct inode refrence but
btrfs_inode.
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

340c6ca9

btrfs: Make btrfs_ino take a struct btrfs_inode · 4a0cc7ca

由 Nikolay Borisov 提交于 1月 10, 2017

Currently btrfs_ino takes a struct inode and this causes a lot of
internal btrfs functions which consume this ino to take a VFS inode,
rather than btrfs' own struct btrfs_inode. In order to fix this "leak"
of VFS structs into the internals of btrfs first it's necessary to
eliminate all uses of struct inode for the purpose of inode. This patch
does that by using BTRFS_I to convert an inode to btrfs_inode. With
this problem eliminated subsequent patches will start eliminating the
passing of struct inode altogether, eventually resulting in a lot cleaner
code.
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
[ fix btrfs_get_extent tracepoint prototype ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

4a0cc7ca

Btrfs: ACCESS_ONCE cleanup · 20c7bcec

由 Seraphime Kirkovski 提交于 12月 15, 2016

This replaces ACCESS_ONCE macro with the corresponding
READ|WRITE macros
Signed-off-by: NSeraphime Kirkovski <kirkseraph@gmail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

20c7bcec

14 12月, 2016 1 次提交

btrfs: limit async_work allocation and worker func duration · 2939e1a8

由 Maxim Patlasov 提交于 12月 12, 2016

Problem statement: unprivileged user who has read-write access to more than
one btrfs subvolume may easily consume all kernel memory (eventually
triggering oom-killer).

Reproducer (./mkrmdir below essentially loops over mkdir/rmdir):

[root@kteam1 ~]# cat prep.sh

DEV=/dev/sdb
mkfs.btrfs -f $DEV
mount $DEV /mnt
for i in `seq 1 16`
do
	mkdir /mnt/$i
	btrfs subvolume create /mnt/SV_$i
	ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2`
	mount -t btrfs -o subvolid=$ID $DEV /mnt/$i
	chmod a+rwx /mnt/$i
done

[root@kteam1 ~]# sh prep.sh

[maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done

[root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done
kmalloc-128        10144  10144    128   32    1 : tunables    0    0    0 : slabdata    317    317      0
kmalloc-128       9992352 9992352    128   32    1 : tunables    0    0    0 : slabdata 312261 312261      0
kmalloc-128       24226752 24226752    128   32    1 : tunables    0    0    0 : slabdata 757086 757086      0
kmalloc-128       42754240 42754240    128   32    1 : tunables    0    0    0 : slabdata 1336070 1336070      0

The huge numbers above come from insane number of async_work-s allocated
and queued by btrfs_wq_run_delayed_node.

The problem is caused by btrfs_wq_run_delayed_node() queuing more and more
works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The
worker func (btrfs_async_run_delayed_root) processes at least
BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery
works as expected while the list is almost empty. As soon as it is getting
bigger, worker func starts to process more than one item at a time, it takes
longer, and the chances to have async_works queued more than needed is getting
higher.

The problem above is worsened by another flaw of delayed-inode implementation:
if async_work was queued in a throttling branch (number of items >=
BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until
the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that
the func occupies CPU infinitely (up to 30sec in my experiments): while the
func is trying to drain the list, the user activity may add more and more
items to the list.

The patch fixes both problems in straightforward way: refuse queuing too
many works in btrfs_wq_run_delayed_node and bail out of worker func if
at least BTRFS_DELAYED_WRITEBACK items are processed.

Changed in v2: remove support of thresh == NO_THRESHOLD.
Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: NChris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v3.15+

2939e1a8

06 12月, 2016 5 次提交

btrfs: remove root parameter from transaction commit/end routines · 3a45bb20

由 Jeff Mahoney 提交于 9月 09, 2016

Now we only use the root parameter to print the root objectid in
a tracepoint.  We can use the root parameter from the transaction
handle for that.  It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3a45bb20

btrfs: take an fs_info directly when the root is not used otherwise · 2ff7e61e

由 Jeff Mahoney 提交于 6月 22, 2016

There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer.  Let's convert those to
just accept an fs_info pointer directly.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2ff7e61e

btrfs: root->fs_info cleanup, access fs_info->delayed_root directly · ccdf9b30

由 Jeff Mahoney 提交于 6月 22, 2016

This results in btrfs_assert_delayed_root_empty and
btrfs_destroy_delayed_inode taking an fs_info instead of a root.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ccdf9b30

btrfs: root->fs_info cleanup, add fs_info convenience variables · 0b246afa

由 Jeff Mahoney 提交于 6月 22, 2016

In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable.  This makes the code considerably
more readable.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

0b246afa

J
btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size · 27965b6c
由 Jeff Mahoney 提交于 6月 16, 2016
```
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
27965b6c

30 11月, 2016 1 次提交

btrfs: increment ctx->pos for every emitted or skipped dirent in readdir · d2fbb2b5

由 Jeff Mahoney 提交于 11月 05, 2016

If we process the last item in the leaf and hit an I/O error while
reading the next leaf, we return -EIO without having adjusted the
position.  Since we have emitted dirents, getdents() will return
the byte count to the user instead of the error.  Subsequent callers
will emit the last successful dirent again, and return -EIO again,
with the same result.  Callers loop forever.

Instead, if we always increment ctx->pos after emitting or skipping
the dirent, we'll be sure that we won't hit the same one again.  When
we go to process the next leaf, we won't have emitted any dirents
and the -EIO will be returned to the user properly.  We also don't
need to track if we've emitted a dirent already or if we've changed
the position yet.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

d2fbb2b5

27 9月, 2016 2 次提交

btrfs: unsplit printed strings · 5d163e0e

由 Jeff Mahoney 提交于 9月 20, 2016

CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5d163e0e

btrfs: squash lines for simple wrapper functions · e2c89907

由 Masahiro Yamada 提交于 9月 13, 2016

Remove unneeded variables and assignments.
Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e2c89907

26 9月, 2016 1 次提交

Btrfs: add a flags field to btrfs_fs_info · afcdd129

由 Josef Bacik 提交于 9月 02, 2016

We have a lot of random ints in btrfs_fs_info that can be put into flags. This
is mostly equivalent with the exception of how we deal with quota going on or
off, now instead we set a flag when we are turning it on or off and deal with
that appropriately, rather than just having a pending state that the current
quota_enabled gets set to. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

afcdd129

26 7月, 2016 2 次提交

btrfs: btrfs_abort_transaction, drop root parameter · 66642832

由 Jeff Mahoney 提交于 6月 10, 2016

__btrfs_abort_transaction doesn't use its root parameter except to
obtain an fs_info pointer.  We can obtain that from trans->root->fs_info
for now and from trans->fs_info in a later patch.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

66642832

btrfs: Fix slab accounting flags · fba4b697

由 Nikolay Borisov 提交于 6月 23, 2016

BTRFS is using a variety of slab caches to satisfy internal needs.
Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT,
meaning allocations from the caches are going to be accounted as
SReclaimable. At the same time btrfs is not registering any shrinkers
whatsoever, thus preventing memory from the slabs to be shrunk. This
means those caches are not in fact reclaimable.

To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the
inode cache, since this one is being freed by the generic VFS super_block
shrinker. Also set the transaction related caches as SLAB_TEMPORARY,
to better document the lifetime of the objects (it just translates
to SLAB_RECLAIM_ACCOUNT).
Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

fba4b697

08 7月, 2016 2 次提交

Btrfs: change delayed reservation fallback behavior · c48f49d6

由 Josef Bacik 提交于 3月 25, 2016

We reserve space for the inode update when we first reserve space for writing to
a file. However there are lots of ways that we can use this reservation and not
have it for subsequent ordered extents. Previously we'd fall through and try to
reserve metadata bytes for this, then we'd just steal the full reservation from
the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
reservation from the global reserve. The problem with this is we can easily
just return ENOSPC and fallback to updating the inode item directly. In the
worst case (assuming 4k nodesize) we'd steal 64kib from the global reserve if we
fall all the way through, however if we just fallback and update the inode
directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32kib.

We would have also just added the extent item for the inode so we likely will
have already cow'ed down most of the way to the leaf containing the inode item,
so we are more often than not only need one or two nodesize's worth of
reservations. Given the reservation for the extent itself is also a worst case
we will likely already have space to cover the inode update.

This change will make us behave better in the theoretical worst case, and much
better in the case that we don't have our reservation and cannot reserve more
metadata. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c48f49d6

Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8

由 Josef Bacik 提交于 3月 25, 2016

So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
Not only this but it unconditionally changes the size of the block_rsv. This
isn't a bug strictly speaking, but it makes truncate block rsv's look funny
because every time we migrate bytes over its size grows, even though we only
want it to be a specific size. So collapse this into one function that takes an
update_size argument and make truncate and evict not update the size for
consistency sake. Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

25d609f8

25 6月, 2016 1 次提交

Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99

由 Omar Sandoval 提交于 5月 20, 2016

Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

02dbfc99

10 5月, 2016 1 次提交

btrfs: GFP_NOFS does not GFP_HIGHMEM · e1860a77

由 David Sterba 提交于 5月 09, 2016

Masking HIGHMEM out of NOFS does not make sense.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

e1860a77

14 3月, 2016 1 次提交

btrfs: Print Warning only if ENOSPC_DEBUG is enabled · 2e3fcb1c

由 Ashish Samant 提交于 3月 11, 2016

Dont print warning for ENOSPC error unless ENOSPC_DEBUG is enabled. Use
btrfs_debug if it is enabled.
Signed-off-by: NAshish Samant <ashish.samant@oracle.com>
[ preserve the WARN_ON ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2e3fcb1c

18 2月, 2016 1 次提交

btrfs: drop null testing before destroy functions · 5598e900

由 Kinglong Mee 提交于 1月 29, 2016

Cleanup.

kmem_cache_destroy has support NULL argument checking,
so drop the double null testing before calling it.
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

5598e900

11 2月, 2016 1 次提交

btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759

由 David Sterba 提交于 11月 13, 2015

The value of ctx->pos in the last readdir call is supposed to be set to
INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a
larger value, then it's LLONG_MAX.

There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
the increment.

We can get to that situation like that:

* emit all regular readdir entries
* still in the same call to readdir, bump the last pos to INT_MAX
* next call to readdir will not emit any entries, but will reach the
  bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX

Normally this is not a problem, but if we call readdir again, we'll find
'pos' set to LLONG_MAX and the unconditional increment will overflow.

The report from Victor at
(http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
print shows that pattern:

 Overflow: e
 Overflow: 7fffffff
 Overflow: 7fffffffffffffff
 PAX: size overflow detected in function btrfs_real_readdir
   fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
   context: dir_context;
 CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
  ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
  ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
  ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
 Call Trace:
  [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
  [<ffffffff811cb706>] report_size_overflow+0x36/0x40
  [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
  [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
  [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
  [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
 Overflow: 1a
  [<ffffffff811db070>] ? iterate_dir+0x150/0x150
  [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83

The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
are not yet synced and are processed from the delayed list. Then the code
could go to the bump section again even though it might not emit any new
dir entries from the delayed list.

The fix avoids entering the "bump" section again once we've finished
emitting the entries, both for synced and delayed entries.

References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com>
CC: stable@vger.kernel.org
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com>
Signed-off-by: NChris Mason <clm@fb.com>

bc4ef759

07 1月, 2016 1 次提交

btrfs: zero out delayed node upon allocation · 352dd9c8

由 Alexandru Moise 提交于 10月 25, 2015

It's slightly cleaner to zero-out the delayed node upon allocation
than to do it by hand in btrfs_init_delayed_node() for a few members
Signed-off-by: NAlexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

352dd9c8

11 10月, 2015 1 次提交

btrfs: comment the rest of implicit barriers before waitqueue_active · ee863954

由 David Sterba 提交于 2月 16, 2015

There are atomic operations that imply the barrier for waitqueue_active
mixed in an if-condition.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

ee863954

26 4月, 2015 1 次提交

Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b

由 Yang Dongsheng 提交于 4月 09, 2015

We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:

Problem:
	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
	(2). pwrite(test_fd, buf, 4096, 0)
	(3). close(test_fd)
	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
	(6). fsync(test_fd);
				<-------- we did not get the correct log entry for the file
Reason:
	When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.

Fix:
	This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.
Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: NMiao Xie <miaoxie@huawei.com>
Signed-off-by: NChris Mason <clm@fb.com>

6e17d30b

17 2月, 2015 1 次提交

Btrfs: delayed-inode: replace root args iff only fs_info used · a585e948

由 Daniel Dressler 提交于 11月 17, 2014

This is the second independent patch of a larger project to cleanup
btrfs's internal usage of btrfs_root. Many functions take btrfs_root
only to grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) btrfs_wq_run_delayed_node
Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>

a585e948

03 2月, 2015 2 次提交

Btrfs: Add code to support file creation time · 9cc97d64

由 chandan r 提交于 7月 04, 2012

This patch adds a new member to the 'struct btrfs_inode' structure to hold
the file creation time.
Signed-off-by: Nchandan <chandanrmail@gmail.com>
[refreshed, removed btrfs_inode_otime]
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

9cc97d64

btrfs: kill btrfs_inode_*time helpers · a937b979

由 David Sterba 提交于 12月 12, 2014

They just opencode taking address of the timespec member.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

a937b979

03 1月, 2015 1 次提交

Btrfs: don't delay inode ref updates during log replay · 6f896054

由 Chris Mason 提交于 12月 31, 2014

Commit 1d52c78a (Btrfs: try not to ENOSPC on log replay) added a
check to skip delayed inode updates during log replay because it
confuses the enospc code.  But the delayed processing will end up
ignoring delayed refs from log replay because the inode itself wasn't
put through the delayed code.

This can end up triggering a warning at commit time:

WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()

Which is repeated for each commit because we never process the delayed
inode ref update.

The fix used here is to change btrfs_delayed_delete_inode_ref to return
an error if we're currently in log replay.  The caller will do the ref
deletion immediately and everything will work properly.
Signed-off-by: NChris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.18 and any stable series that picked 1d52c78a

6f896054

18 9月, 2014 1 次提交

btrfs: kill the key type accessor helpers · 962a298f

由 David Sterba 提交于 6月 04, 2014

btrfs_set_key_type and btrfs_key_type are used inconsistently along with
open coded variants. Other members of btrfs_key are accessed directly
without any helpers anyway.
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

962a298f

24 8月, 2014 1 次提交

Btrfs: fix task hang under heavy compressed write · 9e0af237

由 Liu Bo 提交于 8月 15, 2014

This has been reported and discussed for a long time, and this hang occurs in
both 3.15 and 3.16.

Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.

Btrfs has a kind of work queued as an ordered way, which means that its
ordered_func() must be processed in the way of FIFO, so it usually looks like --

normal_work_helper(arg)
    work = container_of(arg, struct btrfs_work, normal_work);

    work->func() <---- (we name it work X)
    for ordered_work in wq->ordered_list
            ordered_work->ordered_func()
            ordered_work->ordered_free()

The hang is a rare case, first when we find free space, we get an uncached block
group, then we go to read its free space cache inode for free space information,
so it will

file a readahead request
    btrfs_readpages()
         for page that is not in page cache
                __do_readpage()
                     submit_extent_page()
                           btrfs_submit_bio_hook()
                                 btrfs_bio_wq_end_io()
                                 submit_bio()
                                 end_workqueue_bio() <--(ret by the 1st endio)
                                      queue a work(named work Y) for the 2nd
                                      also the real endio()

So the hang occurs when work Y's work_struct and work X's work_struct happens
to share the same address.

A bit more explanation,

A,B,C -- struct btrfs_work
arg   -- struct work_struct

kthread:
worker_thread()
    pick up a work_struct from @worklist
    process_one_work(arg)
	worker->current_work = arg;  <-- arg is A->normal_work
	worker->current_func(arg)
		normal_work_helper(arg)
		     A = container_of(arg, struct btrfs_work, normal_work);

		     A->func()
		     A->ordered_func()
		     A->ordered_free()  <-- A gets freed

		     B->ordered_func()
			  submit_compressed_extents()
			      find_free_extent()
				  load_free_space_inode()
				      ...   <-- (the above readhead stack)
				      end_workqueue_bio()
					   btrfs_queue_work(work C)
		     B->ordered_free()

As if work A has a high priority in wq->ordered_list and there are more ordered
works queued after it, such as B->ordered_func(), its memory could have been
freed before normal_work_helper() returns, which means that kernel workqueue
code worker_thread() still has worker->current_work pointer to be work
A->normal_work's, ie. arg's address.

Meanwhile, work C is allocated after work A is freed, work C->normal_work
and work A->normal_work are likely to share the same address(I confirmed this
with ftrace output, so I'm not just guessing, it's rare though).

When another kthread picks up work C->normal_work to process, and finds our
kthread is processing it(see find_worker_executing_work()), it'll think
work C as a collision and skip then, which ends up nobody processing work C.

So the situation is that our kthread is waiting forever on work C.

Besides, there're other cases that can lead to deadlock, but the real problem
is that all btrfs workqueue shares one work->func, -- normal_work_helper,
so this makes each workqueue to have its own helper function, but only a
wraper pf normal_work_helper.

With this patch, I no long hit the above hang.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <clm@fb.com>

9e0af237

10 6月, 2014 1 次提交

btrfs: free delayed node outside of root->inode_lock · 96493031

由 Jeff Mahoney 提交于 5月 27, 2014

On heavy workloads, we're seeing soft lockup warnings on
root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit
is to reduce the size of the critical section.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

96493031

11 3月, 2014 2 次提交

btrfs: Cleanup the "_struct" suffix in btrfs_workequeue · d458b054

由 Qu Wenruo 提交于 2月 28, 2014

Since the "_struct" suffix is mainly used for distinguish the differnt
btrfs_work between the original and the newly created one,
there is no need using the suffix since all btrfs_workers are changed
into btrfs_workqueue.

Also this patch fixed some codes whose code style is changed due to the
too long "_struct" suffix.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

d458b054

btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue. · 5b3bc44e

由 Qu Wenruo 提交于 2月 28, 2014

Replace the fs_info->delayed_workers with the newly created
btrfs_workqueue.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Tested-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NJosef Bacik <jbacik@fb.com>

5b3bc44e

29 1月, 2014 1 次提交

Btrfs: introduce the delayed inode ref deletion for the single link inode · 67de1176

由 Miao Xie 提交于 12月 26, 2013

The inode reference item is close to inode item, so we insert it simultaneously
with the inode item insertion when we create a file/directory.. In fact, we also
can handle the inode reference deletion by the same way. So we made this patch to
introduce the delayed inode reference deletion for the single link inode(At most
case, the file doesn't has hard link, so we don't take the hard link into account).

This function is based on the delayed inode mechanism. After applying this patch,
we can reduce the time of the file/directory deletion by ~10%.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

67de1176

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功