Commits · 5723cb01f0295ace2b029b0737dd6525a2de337f · openeuler / Kernel

11 May, 2015 11 commits

debugfs: switch to simple_follow_link() · 5723cb01

Committed by Al Viro on May 02, 2015

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

jffs2: switch to simple_follow_link() · a8db149f

Committed by Al Viro on May 02, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ext4: switch to simple_follow_link() · 75e7566b

Committed by Al Viro on May 02, 2015

for fast symlinks only, of course...

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ext3: switch to simple_follow_link() · 115b4205

Committed by Al Viro on May 02, 2015

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

befs: switch to simple_follow_link() · d0deec19

Committed by Al Viro on May 02, 2015

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ext2: use simple_follow_link() · cbe0fa38

Committed by Al Viro on May 02, 2015

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

libfs: simple_follow_link() · 61ba64fc

Committed by Al Viro on May 02, 2015

Let "fast" symlinks store the pointer to the body in ->i_link and
use simple_follow_link() for ->follow_link().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ext4: split inode_operations for encrypted symlinks off the rest · a7a67e8a

Committed by Al Viro on April 27, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ovl: rearrange ovl_follow_link so it doesn't need to call ->put_link · 3188b295

Committed by NeilBrown on March 23, 2015

ovl_follow_link currently calls ->put_link on an error path.
However, ->put_link is about to change in a way that will make it
impossible to call from ovl_follow_link.

So rearrange the code to avoid the need for that error path.
Specifically: move the kmalloc() call before the ->follow_link()
call to the subordinate filesystem.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9p: don't bother with __getname() in ->follow_link() · 90e4fc88

Committed by Al Viro on April 14, 2015

We copy a kmalloc'ed string there and then kfree that string immediately
afterwards. Easier to just feed that string to nd_set_link() and _not_
kfree it until ->put_link() (which becomes kfree_put_link() in that case).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9p: don't bother with 4K allocation for 24-byte local array... · b46c267e

Committed by Al Viro on April 14, 2015

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

10 May, 2015 1 commit

mnt: Fix fs_fully_visible to verify the root directory is visible · 7e96c1b0

Committed by Eric W. Biederman on May 08, 2015

This fixes a dumb bug in fs_fully_visible that allows proc or sys to
be mounted if there is a bind mount of part of /proc/ or /sys/ visible.

Cc: stable@vger.kernel.org
Reported-by: Eric Windisch <ewindisch@docker.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

09 May, 2015 2 commits

path_openat(): fix double fput() · f15133df

Committed by Al Viro on May 08, 2015

path_openat() jumps to the wrong place after do_tmpfile() - it has
already done path_cleanup() (as part of path_lookupat() called by
do_tmpfile()), so doing that again can lead to double fput().

Cc: stable@vger.kernel.org	# v3.11+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

namei: d_is_negative() should be checked before ->d_seq validation · 766c4cbf

Committed by Al Viro on May 07, 2015

Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
be true does *not* mean that the inode we'd fetched had been NULL - that
holds only while ->d_seq is still unchanged.

Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
verification.

Reported-by: Steven Rostedt <rostedt@goodmis.org>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

07 May, 2015 1 commit

Btrfs: fix wrong mapping flags for free space inode · 1d3c61c2

Committed by Filipe Manana on May 06, 2015

We were passing a flags value that differed from the intention in commit
2b108268 ("Btrfs: don't use highmem for free space cache pages").

This caused problems on an ARM machine, leaving btrfs unusable there.

Reported-by: Merlijn Wajer <merlijn@wizzup.org>
Tested-by: Merlijn Wajer <merlijn@wizzup.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

06 May, 2015 4 commits

splice: sendfile() at once fails for big files · 0ff28d9f

Committed by Christophe Leroy on May 06, 2015

When the small program below uses sendfile() to compute the MD5 sums of
some files, big files (over 64 kbytes on a system with 4k pages) get a
wrong MD5 sum while small files get the correct sum.
The program uses sendfile() to send a file to an AF_ALG socket
for hashing.

/* md5sum2.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <linux/if_alg.h>

int main(int argc, char **argv)
{
	/* AF_ALG socket bound to the kernel's MD5 hash implementation */
	int sk = socket(AF_ALG, SOCK_SEQPACKET, 0);
	struct stat st;
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type = "hash",
		.salg_name = "md5",
	};
	int n;

	bind(sk, (struct sockaddr*)&sa, sizeof(sa));

	for (n = 1; n < argc; n++) {
		int size;
		off_t offset = 0;		/* sendfile() expects an off_t pointer */
		unsigned char buf[4096];	/* unsigned, so each byte prints as two hex digits */
		int fd;
		int sko;
		int i;

		fd = open(argv[n], O_RDONLY);
		sko = accept(sk, NULL, 0);
		fstat(fd, &st);
		size = st.st_size;
		/* hand the whole file to the hash socket in a single call */
		sendfile(sko, fd, &offset, size);
		/* read back the digest */
		size = read(sko, buf, sizeof(buf));
		for (i = 0; i < size; i++)
			printf("%2.2x", buf[i]);
		printf("  %s\n", argv[n]);
		close(fd);
		close(sko);
	}
	exit(0);
}

Test below is done using official linux patch files. First result is
with a software based md5sum. Second result is with the program above.

root@vgoip:~# ls -l patch-3.6.*
-rw-r--r--    1 root     root         64011 Aug 24 12:01 patch-3.6.2.gz
-rw-r--r--    1 root     root         94131 Aug 24 12:01 patch-3.6.3.gz

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
5fd77b24e68bb24dcc72d6e57c64790e  patch-3.6.3.gz

After investigation, it appears that sendfile() sends the files in blocks
of 64 kbytes (16 times PAGE_SIZE). The problem is that at the end of each
block, the SPLICE_F_MORE flag is missing, therefore the hashing operation
is reset as if it was the end of the file.

This patch adds SPLICE_F_MORE to the flags when more data is pending.
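
The idea can be sketched in user-space C as follows. This is only an illustration under stated assumptions, not the kernel change itself: `CHUNK_BYTES`, `FLAG_MORE`, `chunk_flags()`, and `chunks_with_more()` are hypothetical names standing in for the 64-kbyte block size, SPLICE_F_MORE, and the per-block flag computation.

```c
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; not the kernel's definitions. */
#define CHUNK_BYTES (16u * 4096u)	/* 16 pages of 4k, as in the report */
#define FLAG_MORE   0x04u		/* plays the role of SPLICE_F_MORE */

/*
 * Return the flags for the chunk starting at `offset` in a transfer of
 * `total` bytes (offset must not exceed total). FLAG_MORE is set whenever
 * bytes remain after this chunk, so a consumer such as a hash socket does
 * not finalize early.
 */
static unsigned int chunk_flags(size_t offset, size_t total)
{
	size_t remaining = total - offset;
	size_t len = remaining < CHUNK_BYTES ? remaining : CHUNK_BYTES;

	return (remaining > len) ? FLAG_MORE : 0u;
}

/* Walk the transfer chunk by chunk and count the chunks sent with FLAG_MORE. */
static size_t chunks_with_more(size_t total)
{
	size_t offset = 0, count = 0;

	while (offset < total) {
		size_t remaining = total - offset;
		size_t len = remaining < CHUNK_BYTES ? remaining : CHUNK_BYTES;

		if (chunk_flags(offset, total) & FLAG_MORE)
			count++;
		offset += len;
	}
	return count;
}
```

For the 94131-byte file in the report, the first 64-kbyte block would carry the flag and the final block would not; the 64011-byte file fits in a single block and never needs it, which matches the observation that only the larger file was hashed incorrectly.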

With the patch applied, we get the correct sums:

root@vgoip:~# md5sum patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz

root@vgoip:~# ./md5sum2 patch-3.6.*
b3ffb9848196846f31b2ff133d2d6443  patch-3.6.2.gz
c5e8f687878457db77cb7158c38a7e43  patch-3.6.3.gz

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>

ocfs2: dlm: fix race between purge and get lock resource · b1432a2a

Committed by Junxiao Bi on May 05, 2015

There is a race window in dlm_get_lock_resource(), which may return a
lock resource which has been purged.  This will cause the process to
hang forever in dlmlock() as the ast msg can't be handled due to its
lock resource not existing.

    dlm_get_lock_resource {
        ...
        spin_lock(&dlm->spinlock);
        tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash);
        if (tmpres) {
             spin_unlock(&dlm->spinlock);
             >>>>>>>> race window, dlm_run_purge_list() may run and purge
                              the lock resource
             spin_lock(&tmpres->spinlock);
             ...
             spin_unlock(&tmpres->spinlock);
        }
    }

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

nilfs2: fix sanity check of btree level in nilfs_btree_root_broken() · d8fd150f

Committed by Ryusuke Konishi on May 05, 2015

The range check for b-tree level parameter in nilfs_btree_root_broken()
is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even
though the level is limited to values in the range of 0 to
(NILFS_BTREE_LEVEL_MAX - 1).

Since the level parameter is read from storage device and used to index
nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it
can cause memory overrun during btree operations if the boundary value
is set to the level parameter on device.

This fixes the broken sanity check and adds a comment to clarify that
the upper bound NILFS_BTREE_LEVEL_MAX is exclusive.
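
A minimal user-space sketch of an exclusive upper-bound check of this kind is shown below. The names `BTREE_LEVEL_MAX`, `struct btree_path_entry`, `level_is_valid()`, and `path_slot()` are hypothetical, and the value 14 is chosen only for illustration; the real definitions live in the nilfs2 sources.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical value for illustration; the upper bound is exclusive. */
#define BTREE_LEVEL_MAX 14

/* Hypothetical per-level path element. */
struct btree_path_entry {
	int index;
};

/* Valid levels are 0 .. BTREE_LEVEL_MAX - 1; BTREE_LEVEL_MAX itself is rejected. */
static bool level_is_valid(int level)
{
	return level >= 0 && level < BTREE_LEVEL_MAX;
}

/*
 * Return the path slot for `level`, or NULL when the level (for example,
 * one read from an untrusted device) would index past the array.
 */
static struct btree_path_entry *path_slot(struct btree_path_entry *path, int level)
{
	if (!level_is_valid(level))
		return NULL;
	return &path[level];
}
```

With an inclusive comparison (`level <= BTREE_LEVEL_MAX`), the boundary value would pass the check and `path[BTREE_LEVEL_MAX]` would be one element past the end of the array.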

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

configfs: init configfs module earlier at boot time · f5b69770

Committed by Daniel Baluta on May 05, 2015

We need this earlier in the boot process to allow various subsystems to
use configfs (e.g. Industrial I/O, IIO).

Also, debugfs is at core_initcall level and configfs should be on the same
level from an infrastructure point of view.

Signed-off-by: Daniel Baluta <daniel.baluta@intel.com>
Suggested-by: Lars-Peter Clausen <lars@metafoo.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

05 May, 2015 2 commits

f2fs: fix wrong error handler in f2fs_follow_link · 7263b1bd

Committed by Jaegeuk Kim on April 22, 2015

page_follow_link_light() can return NULL, in which case its error pointer
remains in nd->path.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>

Revert "f2fs: enhance multi-threads performance" · 5463e7c1

Committed by Jaegeuk Kim on April 21, 2015

Yuanhan Liu reported a performance regression caused by this commit.
The basic idea was to reduce a one-point mutex, but it turns out this causes
another kind of contention, such as context switches.

https://lkml.org/lkml/2015/4/21/11

Until the analysis of this issue is finished, I'd like to revert this for a while.

This reverts commit 78373b73.

03 May, 2015 3 commits

ext4: fix growing of tiny filesystems · 2c869b26

Committed by Jan Kara on May 02, 2015

The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT blocks is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.

Make the estimation more precise. That reduces the number of requested
credits enough that we can grow a 20 MB filesystem (which has a 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>

ext4: move check under lock scope to close a race. · 280227a7

Committed by Davide Italiano on May 02, 2015

fallocate() checks that the file is extent-based and returns
EOPNOTSUPP if it is not. Other tasks can convert the file between
indirect and extent mapping, so the check is safe only after grabbing
the inode mutex.

Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d

Committed by Lukas Czerner on May 02, 2015

Currently it is possible to lose a whole file system block worth of data
when we hit a specific interaction between unwritten and delayed extents
in the extent status tree.

The problem is that when we insert a delayed extent into the extent status
tree, the only way to get rid of it is to write out the delayed buffer.
However, there is a limitation in the extent status tree implementation:
when inserting an unwritten extent, should there be even a single
delayed block, the whole unwritten extent is marked as delayed.

At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when we write
into said unwritten extent we will convert it to written, but it still
remains delayed.

When we try to write into that block later, ext4_da_map_blocks() will set
the buffer new and delayed and map it to an invalid block, which causes
the rest of the block to be zeroed, losing already written data.

For now we can fix this by simply not allowing the delayed status to be
set on a written extent in the extent status tree. Also add WARN_ON() to
make sure that we notice if this happens in the future.

This problem can be easily reproduced by running the following xfs_io
commands.

xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
          -c "falloc 0 131072" \
          -c "pwrite -S 0xbb 65536 2048" \
          -c "fsync" /mnt/test/fff

echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff

In theory this can also be reproduced at random by running fsx,
but that is not very reliable, though on machines with a bigger page size
(like ppc) it can be seen more often (especially with xfstest generic/127).

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

02 May, 2015 4 commits

ext4 crypto: remove duplicated encryption mode definitions · 9402bdca

Committed by Chanho Park on May 02, 2015

This patch removes duplicated encryption modes which were already in
ext4.h. They were duplicated from commit 3edc18d8 and commit f542fb.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4 crypto: do not select from EXT4_FS_ENCRYPTION · fb63e548

Committed by Herbert Xu on May 02, 2015

This patch adds a tristate EXT4_ENCRYPTION to do the selections
for EXT4_FS_ENCRYPTION because selecting from a bool causes all
the selected options to be built-in, even if EXT4 itself is a
module.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4 crypto: add padding to filenames before encrypting · a44cd7a0

Committed by Theodore Ts'o on May 01, 2015

This obscures the length of the filenames, to decrease the amount of
information leakage. By default, we pad the filenames to the next 4-byte
boundary. This costs nothing, since the directory entries are
aligned to 4 byte boundaries anyway. Filenames can also be padded to
8, 16, or 32 bytes, which will consume more directory space.
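
The rounding described above can be sketched as a small user-space helper. `padded_name_len()` is a hypothetical name introduced for illustration, and the sketch covers only the round-up step, not any other length handling the ext4 code may perform.

```c
#include <stddef.h>

/*
 * Round `len` up to the next multiple of `pad`, where `pad` is one of the
 * padding sizes named in the commit message (4, 8, 16, or 32 bytes).
 * Returns 0 for any other padding value.
 */
static size_t padded_name_len(size_t len, size_t pad)
{
	if (pad != 4 && pad != 8 && pad != 16 && pad != 32)
		return 0;
	/* pad is a power of two, so masking rounds up to its multiple */
	return (len + pad - 1) & ~(pad - 1);
}
```

With 4-byte padding a 5-byte name occupies 8 bytes, the same space an aligned directory entry would need anyway; with 32-byte padding every name up to 32 bytes long looks the same size on disk, at the cost of more directory space.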

Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4 crypto: simplify and speed up filename encryption · 5de0b4d0

Committed by Theodore Ts'o on May 01, 2015

Avoid using SHA-1 when calculating the user-visible filename when the
encryption key is available, and avoid decrypting lots of filenames
when searching for a directory entry in a directory block.

Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

30 April, 2015 1 commit

Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent · 5d2361db

Committed by Forrest Liu on February 09, 2015

btrfs_release_extent_buffer_page() can't properly handle a dummy extent
buffer allocated by btrfs_clone_extent_buffer(). That is because the
reference count of pages allocated by btrfs_clone_extent_buffer()
is 2: one from alloc_page() and another from attach_extent_buffer_page().

Running the following command repeatedly exposes this memory leak:

    btrfs inspect-internal inode-resolve 256 /mnt/btrfs

Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

26 April, 2015 9 commits

Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b

Committed by Yang Dongsheng on April 09, 2015

We need to fill the inode when we find a node for it in delayed_nodes_tree.
But currently we do not fill ->last_trans, which causes the
xfstests generic/311 test to fail. The scenario in 311 is shown below:

Problem:
	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
	(2). pwrite(test_fd, buf, 4096, 0)
	(3). close(test_fd)
	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
	(6). fsync(test_fd);
				<-------- we did not get the correct log entry for the file
Reason:
	When we re-open this file in (5), we find a node
in delayed_nodes_tree and fill the inode we are looking up with its
information. But ->last_trans is not filled, so fsync()
checks ->last_trans, finds it is 0, concludes that this inode
is already in our committed tree, and does not record the extents
for it.

Fix:
	This patch fills ->last_trans properly and sets
runtime_flags if needed in this situation. Then we get the
log entries we expect after (6) and generic/311 passes.

Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>

btrfs: unlock i_mutex after attempting to delete subvolume during send · 909e26dc

Committed by Omar Sandoval on April 10, 2015

Whenever the check for a send in progress introduced in commit
521e0546 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.000059] ================================================
[  +0.000028] [ BUG: lock held when returning to user space! ]
[  +0.000029] 4.0.0-rc5-00096-g3c435c1e #93 Not tainted
[  +0.000026] ------------------------------------------------
[  +0.000029] btrfs/211 is leaving the kernel with locks still held!
[  +0.000029] 1 lock held by btrfs/211:
[  +0.000023]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>

btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache · b8605454

Committed by Omar Sandoval on February 24, 2015

If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.

Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>

btrfs: fix race on ENOMEM in alloc_extent_buffer · 5ca64f45

Committed by Omar Sandoval on February 24, 2015

Consider the following interleaving of overlapping calls to
alloc_extent_buffer:

Call 1:

- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
  mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.

The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>

btrfs: handle ENOMEM in btrfs_alloc_tree_block · 67b7859e

Committed by Omar Sandoval on February 24, 2015

This is one of the first places to give out when memory is tight. Handle
it properly rather than with a BUG_ON.

Also fix the comment about the return value, which is an ERR_PTR, not
NULL, on error.
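
The difference between returning NULL and returning an error pointer can be illustrated with a user-space analogue of the kernel's ERR_PTR convention. The helpers below (`err_ptr()`, `ptr_err()`, `is_err()`) and the `alloc_block()` function with its `fail_alloc` switch are hypothetical names for this sketch and are not the btrfs code.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Error codes are assumed to fit in the top 4095 values of the address space. */
#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the negative errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer encodes an error rather than a real address. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Hypothetical allocator: `fail_alloc` simulates memory pressure.
 * On failure it reports -ENOMEM to the caller through the returned
 * pointer instead of aborting or returning NULL.
 */
static void *alloc_block(size_t size, int fail_alloc)
{
	void *buf = fail_alloc ? NULL : malloc(size);

	if (!buf)
		return err_ptr(-ENOMEM);
	return buf;
}
```

A caller written against this convention checks `is_err()` rather than comparing with NULL, which is why a comment that documents the wrong failure value can lead to missed error handling.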

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole · 1b984508

Committed by Forrest Liu on February 09, 2015

If the device tree has a hole, find_free_dev_extent() cannot find an
available address properly.

The problem can be reproduced with the following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
        fallocate -l 1g $mntpath/$i
    done
    sync
    for i in `seq 1 1 95`; do
        rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

The script above creates a device tree with one big hole. Only one chunk can be
allocated per transaction, so allocating a new chunk for $mntpath/bbb fails.

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: don't check for delalloc_bytes in cache_save_setup · e4c88f00

Committed by Chris Mason on April 18, 2015

Now that we're doing free space cache writeback outside the critical
section in the commit, there is a bigger window for delalloc_bytes to
be added after a cache has been written.  find_free_extent may do this
without putting the block group back into the dirty list, and also
without a transaction running.

Checking for delalloc_bytes in cache_save_setup means we might leave the
cache marked as written without invalidating it.  Consistency checks
during mount will toss the cache, but it's better to get rid of the
check in cache_save_setup and let it get invalidated by the checks
already done during cache write out.

Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix deadlock when starting writeback of bg caches · 24b89d08

Committed by Filipe Manana on April 25, 2015

While starting the writes of the dirty block group caches, if we don't
find a block group item in the extent tree we were leaving without
releasing our path, running delayed references and then looping again to
process any new dirty block groups. However this second iteration of the
loop could cause a deadlock because it tries to lock some other extent
tree node/leaf which another task already locked and it's blocked because
it's waiting for a lock on some node/leaf that is in our path that was not
released before.
We could also deadlock when running the delayed references - as we could
end up trying to lock the same nodes/leafs that we have in our local path
(with a different lock type).

Got into such case when running xfstests:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
(...)
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly
[20892.298222] ------------[ cut here ]------------
[20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]()
(...)
[20892.326253] Call Trace:
[20892.326904]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.329503]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.330815]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.332556]  [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.333955]  [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c
[20892.335562]  [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs]
[20892.336849]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[20892.338222]  [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs]
[20892.339823]  [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs]
[20892.341275]  [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46
[20892.342810]  [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs]
[20892.344184]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.347162]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
(...)
[20892.361015] ---[ end trace 597f77e664245374 ]---
[21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds.
[21120.689881]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.703696] Call Trace:
[21120.704310]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.705490]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.706757]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.708156]  [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs]
[21120.709892]  [<ffffffffa054bb86>] ? btree_write_cache_pages+0x273/0x385 [btrfs]
[21120.711605]  [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs]
[21120.723440]  [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs]
[21120.724943]  [<ffffffff8110c4c8>] do_writepages+0x23/0x2c
[21120.726008]  [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa
[21120.727230]  [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b
[21120.728526]  [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b
[21120.729701]  [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b
(...)
[21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds.
[21120.749459]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.768457] Call Trace:
[21120.769039]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.770107]  [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs]
[21120.771558]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.773659]  [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs]
[21120.776257]  [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs]
[21120.777755]  [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs]
[21120.779459]  [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs]
[21120.781153]  [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs]
[21120.783918]  [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs]
[21120.785436]  [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f
[21120.786434]  [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs]
(...)
[21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds.
[21120.890526]       Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(...)
[21120.899960] Call Trace:
[21120.900743]  [<ffffffff8143107e>] schedule+0x74/0x83
[21120.903004]  [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs]
[21120.904383]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[21120.905608]  [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs]
[21120.906812]  [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c
[21120.907874]  [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb
[21120.909551]  [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[21120.910914]  [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs]
[21120.912181]  [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs]
[21120.913784]  [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs]
[21120.915374]  [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[21120.916735]  [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs]
[21120.917996]  [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[21120.919478]  [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs]
[21120.921226]  [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs]
[21120.923121]  [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs]
[21120.924449]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.926602]  [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc
[21120.927769]  [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs]
[21120.929324]  [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs]
[21120.930723]  [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs]
[21120.931897]  [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e
[21120.934446]  [<ffffffff811534c3>] new_sync_write+0x7c/0xa0
[21120.935528]  [<ffffffff81153b58>] vfs_write+0xb2/0x117
(...)

Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                      outside critical section in commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

Btrfs: fix race between start dirty bg cache writeout and bg deletion · b58d1a9e

Committed by Filipe Manana on April 25, 2015

While running xfstests I ran into the following:

[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
[20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
[20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G        W       4.0.0-rc5-btrfs-next-9+ #2
[20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20892.264738]  0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
[20892.266244]  ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
[20892.267761]  ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
[20892.269378] Call Trace:
[20892.269915]  [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097]  [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173]  [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386]  [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857]  [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851]  [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341]  [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628]  [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191]  [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[20892.281781]  [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[20892.282873]  [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
[20892.284111]  [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[20892.285203]  [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[20892.286290]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20892.287469]  [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[20892.288412]  [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[20892.289348]  [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[20892.290255]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly

This happens because in btrfs_start_dirty_block_groups() we splice the
transaction's list of dirty block groups into a local list and then we
keep extracting the first element of the list without holding the
cache_write_mutex mutex. This means that before we acquire that mutex
the first block group on the list might be removed by a concurrent task
running btrfs_remove_block_group(). So make sure we extract the first
element (and test the list emptiness) while holding that mutex.
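
The locking rule described above can be illustrated with a minimal userspace sketch. It is not the btrfs code: `struct bg_node`, `struct dirty_list` and the `dirty_list_*` helpers are hypothetical names, and a pthread mutex stands in for the kernel's cache_write_mutex. The point it shows is that the emptiness test and the removal of the first element share one critical section with the concurrent remover.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group sitting on the dirty list. */
struct bg_node {
    int id;
    struct bg_node *next;
};

/* Hypothetical list guarded by a mutex named after the one in the commit. */
struct dirty_list {
    pthread_mutex_t cache_write_mutex;
    struct bg_node *head;
};

static void dirty_list_init(struct dirty_list *l)
{
    int rc = pthread_mutex_init(&l->cache_write_mutex, NULL);
    assert(rc == 0);
    l->head = NULL;
}

static void dirty_list_push(struct dirty_list *l, struct bg_node *n)
{
    pthread_mutex_lock(&l->cache_write_mutex);
    n->next = l->head;
    l->head = n;
    pthread_mutex_unlock(&l->cache_write_mutex);
}

/* Concurrent remover: unlinks a node by id, always under the mutex. */
static int dirty_list_remove(struct dirty_list *l, int id)
{
    int removed = 0;

    pthread_mutex_lock(&l->cache_write_mutex);
    for (struct bg_node **pp = &l->head; *pp; pp = &(*pp)->next) {
        if ((*pp)->id == id) {
            struct bg_node *victim = *pp;
            *pp = victim->next;
            victim->next = NULL;
            removed = 1;
            break;
        }
    }
    pthread_mutex_unlock(&l->cache_write_mutex);
    return removed;
}

/*
 * Writer side: test for emptiness and take the first element inside the
 * same critical section, so a concurrent dirty_list_remove() cannot pull
 * the node out between the check and the extraction.
 */
static struct bg_node *dirty_list_pop_first(struct dirty_list *l)
{
    struct bg_node *first;

    pthread_mutex_lock(&l->cache_write_mutex);
    first = l->head;
    if (first) {
        l->head = first->next;
        first->next = NULL;
    }
    pthread_mutex_unlock(&l->cache_write_mutex);
    return first;
}
```

A writeout loop would call `dirty_list_pop_first()` until it returns NULL. Reading `l->head` before taking the mutex would reintroduce the window the commit message describes.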

Fixes: 1bbc621e ("Btrfs: allow block group cache writeout
                      outside critical section in commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>

b58d1a9e

25 Apr, 2015 2 commits

RCU pathwalk breakage when running into a symlink overmounting something · 3cab989a

Committed by Al Viro on Apr 24, 2015

Calling unlazy_walk() in walk_component() and do_last() when we find
a symlink that needs to be followed doesn't acquire a reference to vfsmount.
That's fine when the symlink is on the same vfsmount as the parent directory
(which is almost always the case), but it's not always true - one _can_
manage to bind a symlink on top of something.  And in such cases we end up
with excessive mntput().

Cc: stable@vger.kernel.org # since 2.6.39
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3cab989a

direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0

Committed by Jens Axboe on Apr 15, 2015

do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
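
As a rough illustration of the idea, the sketch below gates a shared atomic counter on a caller-supplied flag. It is a userspace model, not the kernel implementation: `struct sketch_inode`, `sketch_dio_begin()` and `sketch_dio_end()` are hypothetical names, and the numeric value given to DIO_SKIP_DIO_COUNT here is chosen for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Flag name from the commit message; the value here is illustrative. */
#define DIO_SKIP_DIO_COUNT 0x08

/* Hypothetical stand-in for the part of the inode that matters here. */
struct sketch_inode {
    atomic_int i_dio_count; /* in-flight direct I/O operations */
};

/* Start of a direct I/O: only touch the shared counter if asked to. */
static void sketch_dio_begin(struct sketch_inode *inode, int flags)
{
    if (!(flags & DIO_SKIP_DIO_COUNT))
        atomic_fetch_add(&inode->i_dio_count, 1);
}

/* End of a direct I/O: mirror the decision made in sketch_dio_begin(). */
static void sketch_dio_end(struct sketch_inode *inode, int flags)
{
    if (!(flags & DIO_SKIP_DIO_COUNT)) {
        int prev = atomic_fetch_sub(&inode->i_dio_count, 1);
        assert(prev > 0); /* a decrement must pair with an increment */
    }
}
```

A caller that does not need truncate protection, such as the block device path in the commit, would pass the flag on both calls so that no thread ever writes to the shared counter.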

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fe0f07d0

openeuler / Kernel · synced successfully more than 1 year ago
