提交 · 4c2e07c6a29e0129e975727b9f57eede813eea85 · openeuler / Kernel

25 6月, 2016 7 次提交

Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99

由 Omar Sandoval 提交于 5月 20, 2016

Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes.

This is a temporary fix which upgrades the shared inode lock to an
exclusive lock only when we have delayed items until we come up with a
more complete solution. While we're here, rename the
btrfs_{get,put}_delayed_items functions to make it very clear that
they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

02dbfc99

autofs: don't get stuck in a loop if vfs_write() returns an error · 5a9294e5

由 Andrey Vagin 提交于 6月 24, 2016

__vfs_write() returns a negative value in a error case.

Link: http://lkml.kernel.org/r/20160616083108.6278.65815.stgit@pluto.themaw.netSigned-off-by: NAndrey Vagin <avagin@openvz.org>
Signed-off-by: NIan Kent <raven@themaw.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5a9294e5

fs/nilfs2: fix potential underflow in call to crc32_le · 63d2f95d

由 Torsten Hilbrich 提交于 6月 24, 2016

The value `bytes' comes from the filesystem which is about to be
mounted.  We cannot trust that the value is always in the range we
expect it to be.

Check its value before using it to calculate the length for the crc32_le
call.  It value must be larger (or equal) sumoff + 4.

This fixes a kernel bug when accidentially mounting an image file which
had the nilfs2 magic value 0x3434 at the right offset 0x406 by chance.
The bytes 0x01 0x00 were stored at 0x408 and were interpreted as a
s_bytes value of 1.  This caused an underflow when substracting sumoff +
4 (20) in the call to crc32_le.

  BUG: unable to handle kernel paging request at ffff88021e600000
  IP:  crc32_le+0x36/0x100
  ...
  Call Trace:
    nilfs_valid_sb.part.5+0x52/0x60 [nilfs2]
    nilfs_load_super_block+0x142/0x300 [nilfs2]
    init_nilfs+0x60/0x390 [nilfs2]
    nilfs_mount+0x302/0x520 [nilfs2]
    mount_fs+0x38/0x160
    vfs_kern_mount+0x67/0x110
    do_mount+0x269/0xe00
    SyS_mount+0x9f/0x100
    entry_SYSCALL_64_fastpath+0x16/0x71

Link: http://lkml.kernel.org/r/1466778587-5184-2-git-send-email-konishi.ryusuke@lab.ntt.co.jpSigned-off-by: NTorsten Hilbrich <torsten.hilbrich@secunet.com>
Tested-by: NTorsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

63d2f95d

ocfs2: disable BUG assertions in reading blocks · 7186ee06

由 Gang He 提交于 6月 24, 2016

According to some high-load testing, these two BUG assertions were
encountered, this led system panic.  Actually, there were some
discussions about removing these two BUG() assertions, it would not
bring any side effect.

Then, I did the the following changes,

1) use the existing macro CATCH_BH_JBD_RACES to wrap BUG() in the
   ocfs2_read_blocks_sync function like before.

2) disable the macro CATCH_BH_JBD_RACES in Makefile by default.

Link: http://lkml.kernel.org/r/1466574294-26863-1-git-send-email-ghe@suse.comSigned-off-by: NGang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7186ee06

jbd2: get rid of superfluous __GFP_REPEAT · f2db1971

由 Michal Hocko 提交于 6月 24, 2016

jbd2_alloc is explicit about its allocation preferences wrt.  the
allocation size.  Sub page allocations go to the slab allocator and
larger are using either the page allocator or vmalloc.  This is all good
but the logic is unnecessarily complex.

1) as per Ted, the vmalloc fallback is a left-over:

 : jbd2_alloc is only passed in the bh->b_size, which can't be PAGE_SIZE, so
 : the code path that calls vmalloc() should never get called.  When we
 : conveted jbd2_alloc() to suppor sub-page size allocations in commit
 : d2eecb03, there was an assumption that it could be called with a size
 : greater than PAGE_SIZE, but that's certaily not true today.

Moreover vmalloc allocation might even lead to a deadlock because the
callers expect GFP_NOFS context while vmalloc is GFP_KERNEL.

2) __GFP_REPEAT for requests <= PAGE_ALLOC_COSTLY_ORDER is ignored
   since the flag was introduced.

Let's simplify the code flow and use the slab allocator for sub-page
requests and the page allocator for others.  Even though order > 0 is
not currently used as per above leave that option open.

Link: http://lkml.kernel.org/r/1464599699-30131-18-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f2db1971

nfsd: check permissions when setting ACLs · 99965378

由 Ben Hutchings 提交于 6月 22, 2016

Use set_posix_acl, which includes proper permission checks, instead of
calling ->set_acl directly.  Without this anyone may be able to grant
themselves permissions to a file by setting the ACL.

Lock the inode to make the new checks atomic with respect to set_acl.
(Also, nfsd was the only caller of set_acl not locking the inode, so I
suspect this may fix other races.)

This also simplifies the code, and ensures our ACLs are checked by
posix_acl_valid.

The permission checks and the inode locking were lost with commit
4ac7249e, which changed nfsd to use the set_acl inode operation directly
instead of going through xattr handlers.
Reported-by: NDavid Sinquin <david@sinquin.eu>
[agreunba@redhat.com: use set_posix_acl]
Fixes: 4ac7249e
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

99965378

posix_acl: Add set_posix_acl · 485e71e8

由 Andreas Gruenbacher 提交于 6月 22, 2016

Factor out part of posix_acl_xattr_set into a common function that takes
a posix_acl, which nfsd can also call.

The prototype already exists in include/linux/posix_acl.h.
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

485e71e8

24 6月, 2016 4 次提交

Btrfs: Force stripesize to the value of sectorsize · b7f67055

由 Chandan Rajendra 提交于 6月 23, 2016

Btrfs code currently assumes stripesize to be same as
sectorsize. However Btrfs-progs (until commit
df05c7ed455f519e6e15e46196392e4757257305) has been setting
btrfs_super_block->stripesize to a value of 4096.

This commit makes sure that the value of btrfs_super_block->stripesize
is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
to sectorsize.
Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

b7f67055

btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610

由 Wang Xiaoguang 提交于 6月 22, 2016

When doing truncate operation, btrfs_setsize() will first call
truncate_setsize() to set new inode->i_size, but if later
btrfs_truncate() fails, btrfs_setsize() will call
"i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
inmemory inode size, now bug occurs. It's because for truncate
case btrfs_ordered_update_i_size() directly uses inode->i_size
to update BTRFS_I(inode)->disk_i_size, indeed we should use the
"offset" argument to update disk_i_size. Here is the call graph:
==>btrfs_truncate()
====>btrfs_truncate_inode_items()
======>btrfs_ordered_update_i_size(inode, last_size, NULL);
Here btrfs_ordered_update_i_size()'s offset argument is last_size.

And below test case can reveal this bug:

dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
dev=$(losetup --show -f fs.img)
mkdir -p /mnt/mntpoint
mkfs.btrfs  -f $dev
mount $dev /mnt/mntpoint
cd /mnt/mntpoint

echo "workdir is: /mnt/mntpoint"
blocksize=$((128 * 1024))
dd if=/dev/zero of=testfile bs=$blocksize count=1
sync
count=$((17*1024*1024*1024/blocksize))
echo "file size is:" $((count*blocksize))
for ((i = 1; i <= $count; i++)); do
	i=$((i + 1))
	dst_offset=$((blocksize * i))
	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
		testfile > /dev/null
done
sync

truncate --size 0 testfile
ls -l testfile
du -sh testfile
exit

In this case, truncate operation will fail for enospc reason and
"du -sh testfile" returns value greater than 0, but testfile's
size is 0, we need to reflect correct inode->i_size.
Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

c0d2f610

Btrfs: fix error handling in map_private_extent_buffer · 415b35a5

由 Liu Bo 提交于 6月 17, 2016

map_private_extent_buffer() can return -EINVAL in two different cases,
1. when the requested contents span two pages if nodesize is larger
   than pagesize,
2. when it detects something insane.

The 2nd one used to be only a WARN_ON(1), and we decided to return a error
to callers, but we didn't fix up all its callers, which will be
addressed by this patch.

Without this, btrfs may end up with 'general protection', ie.
reading invalid memory.
Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

415b35a5

Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a

由 Wei Yongjun 提交于 6月 17, 2016

Fix to return a negative error code from the kern_mount() error handling
case instead of 0(ret is set to 0 by register_filesystem), as done
elsewhere in this function.
Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

04e1b65a

23 6月, 2016 4 次提交

Btrfs: don't do nocow check unless we have to · c6887cd1

由 Josef Bacik 提交于 3月 25, 2016

Before we write into prealloc/nocow space we have to make sure that there are no
references to the extents we are writing into, which means checking the extent
tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
unless we can't reserve data space, since it's a serious drag on performance.
With the following sequence

fallocate -l10737418240 /mnt/btrfs-test/file
cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
	--end_fsync=1

we get the worst case scenario where we have to fall back on to doing the check
anyway.

Without this patch
lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec

With this patch
lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec

We get twice the throughput, half of the runtime, and half of the average
latency.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
[ PAGE_CACHE_ removal related fixups ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

c6887cd1

btrfs: fix deadlock in delayed_ref_async_start · 0f873eca

由 Chris Mason 提交于 4月 27, 2016

"Btrfs: track transid for delayed ref flushing" was deadlocking on
btrfs_attach_transaction because its not safe to call from the async
delayed ref start code.  This commit brings back btrfs_join_transaction
instead and checks for a blocked commit.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

0f873eca

Btrfs: track transid for delayed ref flushing · 31b9655f

由 Josef Bacik 提交于 4月 11, 2016

Using the offwakecputime bpf script I noticed most of our time was spent waiting
on the delayed ref throttling. This is what is supposed to happen, but
sometimes the transaction can commit and then we're waiting for throttling that
doesn't matter anymore. So change this stuff to be a little smarter by tracking
the transid we were in when we initiated the throttling. If the transaction we
get is different then we can just bail out. This resulted in a 50% speedup in
my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
over the entire run (which is about 30 minutes). Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>

31b9655f

UBIFS: Implement ->migratepage() · 4ac1c17b

由 Kirill A. Shutemov 提交于 6月 16, 2016

During page migrations UBIFS might get confused
and the following assert triggers:
[  213.480000] UBIFS assert failed in ubifs_set_page_dirty at 1451 (pid 436)
[  213.490000] CPU: 0 PID: 436 Comm: drm-stress-test Not tainted 4.4.4-00176-geaa802524636-dirty #1008
[  213.490000] Hardware name: Allwinner sun4i/sun5i Families
[  213.490000] [<c0015e70>] (unwind_backtrace) from [<c0012cdc>] (show_stack+0x10/0x14)
[  213.490000] [<c0012cdc>] (show_stack) from [<c02ad834>] (dump_stack+0x8c/0xa0)
[  213.490000] [<c02ad834>] (dump_stack) from [<c0236ee8>] (ubifs_set_page_dirty+0x44/0x50)
[  213.490000] [<c0236ee8>] (ubifs_set_page_dirty) from [<c00fa0bc>] (try_to_unmap_one+0x10c/0x3a8)
[  213.490000] [<c00fa0bc>] (try_to_unmap_one) from [<c00fadb4>] (rmap_walk+0xb4/0x290)
[  213.490000] [<c00fadb4>] (rmap_walk) from [<c00fb1bc>] (try_to_unmap+0x64/0x80)
[  213.490000] [<c00fb1bc>] (try_to_unmap) from [<c010dc28>] (migrate_pages+0x328/0x7a0)
[  213.490000] [<c010dc28>] (migrate_pages) from [<c00d0cb0>] (alloc_contig_range+0x168/0x2f4)
[  213.490000] [<c00d0cb0>] (alloc_contig_range) from [<c010ec00>] (cma_alloc+0x170/0x2c0)
[  213.490000] [<c010ec00>] (cma_alloc) from [<c001a958>] (__alloc_from_contiguous+0x38/0xd8)
[  213.490000] [<c001a958>] (__alloc_from_contiguous) from [<c001ad44>] (__dma_alloc+0x23c/0x274)
[  213.490000] [<c001ad44>] (__dma_alloc) from [<c001ae08>] (arm_dma_alloc+0x54/0x5c)
[  213.490000] [<c001ae08>] (arm_dma_alloc) from [<c035cecc>] (drm_gem_cma_create+0xb8/0xf0)
[  213.490000] [<c035cecc>] (drm_gem_cma_create) from [<c035cf20>] (drm_gem_cma_create_with_handle+0x1c/0xe8)
[  213.490000] [<c035cf20>] (drm_gem_cma_create_with_handle) from [<c035d088>] (drm_gem_cma_dumb_create+0x3c/0x48)
[  213.490000] [<c035d088>] (drm_gem_cma_dumb_create) from [<c0341ed8>] (drm_ioctl+0x12c/0x444)
[  213.490000] [<c0341ed8>] (drm_ioctl) from [<c0121adc>] (do_vfs_ioctl+0x3f4/0x614)
[  213.490000] [<c0121adc>] (do_vfs_ioctl) from [<c0121d30>] (SyS_ioctl+0x34/0x5c)
[  213.490000] [<c0121d30>] (SyS_ioctl) from [<c000f2c0>] (ret_fast_syscall+0x0/0x34)

UBIFS is using PagePrivate() which can have different meanings across
filesystems. Therefore the generic page migration code cannot handle this
case correctly.
We have to implement our own migration function which basically does a
plain copy but also duplicates the page private flag.
UBIFS is not a block device filesystem and cannot use buffer_migrate_page().

Cc: stable@vger.kernel.org
Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
[rw: Massaged changelog, build fixes, etc...]
Signed-off-by: NRichard Weinberger <richard@nod.at>
Acked-by: NChristoph Hellwig <hch@lst.de>

4ac1c17b

20 6月, 2016 1 次提交

fix idiotic braino in d_alloc_parallel() · e7d6ef97

由 Al Viro 提交于 6月 20, 2016

Check for d_unhashed() while searching in in-lookup hash was absolutely
wrong.  Worse, it masked a deadlock on dget() done under bitlock that
nests inside ->d_lock.  Thanks to J. R. Okajima for spotting it.
Spotted-by: N"J. R. Okajima" <hooanon05g@gmail.com>
Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e7d6ef97

18 6月, 2016 8 次提交

Btrfs: btrfs_check_super_valid: Allow 4096 as stripesize · dd5c9311

由 Chandan Rajendra 提交于 6月 16, 2016

Older btrfs-progs/mkfs.btrfs sets 4096 as the stripesize. Hence
restricting stripesize to be equal to sectorsize would cause super block
validation to return an error on architectures where PAGE_SIZE is not
equal to 4096.

Hence as a workaround, this commit allows stripesize to be set to 4096
bytes.
Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

dd5c9311

btrfs: remove build fixup for qgroup_account_snapshot · 89c5a544

由 David Sterba 提交于 6月 16, 2016

Introduced in 2c1984f2 ("btrfs: build fixup for
qgroup_account_snapshot") as temporary bisectability build fixup.
Signed-off-by: NDavid Sterba <dsterba@suse.com>

89c5a544

D
btrfs: use new error message helper in qgroup_account_snapshot · f7af3934
由 David Sterba 提交于 6月 17, 2016
```
We've renamed btrfs_std_error, this one is left from last merge.
Signed-off-by: NDavid Sterba <dsterba@suse.com>
```
f7af3934

btrfs: avoid blocking open_ctree from cleaner_kthread · 90c711ab

由 Zygo Blaxell 提交于 6月 12, 2016

This fixes a problem introduced in commit 2f3165ec
"btrfs: don't force mounts to wait for cleaner_kthread to delete one or more subvolumes".

open_ctree eventually calls btrfs_replay_log which in turn calls
btrfs_commit_super which tries to lock the cleaner_mutex, causing a
recursive mutex deadlock during mount.

Instead of playing whack-a-mole trying to keep up with all the
functions that may want to lock cleaner_mutex, put all the cleaner_mutex
lockers back where they were, and attack the problem more directly:
keep cleaner_kthread asleep until the filesystem is mounted.

When filesystems are mounted read-only and later remounted read-write,
open_ctree did not set fs_info->open and neither does anything else.
Set this flag in btrfs_remount so that neither btrfs_delete_unused_bgs
nor cleaner_kthread get confused by the common case of "/" filesystem
read-only mount followed by read-write remount.
Signed-off-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

90c711ab

Btrfs: don't BUG_ON() in btrfs_orphan_add · 3b6571c1

由 Josef Bacik 提交于 5月 27, 2016

This is just a screwup for developers, so change it to an ASSERT() so developers
notice when things go wrong and deal with the error appropriately if ASSERT()
isn't enabled.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

3b6571c1

btrfs: account for non-CoW'd blocks in btrfs_abort_transaction · 64c12921

由 Jeff Mahoney 提交于 6月 08, 2016

The test for !trans->blocks_used in btrfs_abort_transaction is
insufficient to determine whether it's safe to drop the transaction
handle on the floor.  btrfs_cow_block, informed by should_cow_block,
can return blocks that have already been CoW'd in the current
transaction.  trans->blocks_used is only incremented for new block
allocations. If an operation overlaps the blocks in the current
transaction entirely and must abort the transaction, we'll happily
let it clean up the trans handle even though it may have modified
the blocks and will commit an incomplete operation.

In the long-term, I'd like to do closer tracking of when the fs
is actually modified so we can still recover as gracefully as possible,
but that approach will need some discussion.  In the short term,
since this is the only code using trans->blocks_used, let's just
switch it to a bool indicating whether any blocks were used and set
it when should_cow_block returns false.

Cc: stable@vger.kernel.org # 3.4+
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

64c12921

Btrfs: check if extent buffer is aligned to sectorsize · c871b0f2

由 Liu Bo 提交于 6月 06, 2016

Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
via alloc_extent_buffer().  An unaligned eb can have more pages than it
should have, which ends up extent buffer's leak or some corrupted content
in extent buffer.

This adds a warning to let us quickly know what was happening.

Now that alloc_extent_buffer() no more returns NULL, this changes its
caller and callers of its caller to match with the new error
handling.
Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

c871b0f2

btrfs: Use correct format specifier · 16ff4b45

由 Heinrich Schuchardt 提交于 6月 11, 2016

Component mirror_num of struct btrfsic_block is defined
as unsigned int. Use %u as format specifier.
Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

16ff4b45

16 6月, 2016 3 次提交

nfsd: Make init_open_stateid() a bit more whole · 8c7245ab

由 Oleg Drokin 提交于 6月 14, 2016

Move the state selection logic inside from the caller,
always making it return correct stp to use.
Signed-off-by: NJ . Bruce Fields <bfields@fieldses.org>
Signed-off-by: NOleg Drokin <green@linuxhacker.ru>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

8c7245ab

nfsd: Extend the mutex holding region around in nfsd4_process_open2() · 5cc1fb2a

由 Oleg Drokin 提交于 6月 14, 2016

To avoid racing entry into nfs4_get_vfs_file().
Make init_open_stateid() return with locked stateid to be unlocked
by the caller.
Signed-off-by: NOleg Drokin <green@linuxhacker.ru>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

5cc1fb2a

nfsd: Always lock state exclusively. · feb9dad5

由 Oleg Drokin 提交于 6月 14, 2016

It used to be the case that state had an rwlock that was locked for write
by downgrades, but for read for upgrades (opens). Well, the problem is
if there are two competing opens for the same state, they step on
each other toes potentially leading to leaking file descriptors
from the state structure, since access mode is a bitmap only set once.
Signed-off-by: NOleg Drokin <green@linuxhacker.ru>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

feb9dad5

15 6月, 2016 5 次提交

nfsd4/rpc: move backchannel create logic into rpc code · d50039ea

由 J. Bruce Fields 提交于 5月 16, 2016

Also simplify the logic a bit.

Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Acked-by: NTrond Myklebust <trondmy@primarydata.com>

d50039ea

ovl: fix uid/gid when creating over whiteout · d0e13f5b

由 Miklos Szeredi 提交于 6月 15, 2016

Fix a regression when creating a file over a whiteout.  The new
file/directory needs to use the current fsuid/fsgid, not the ones from the
mounter's credentials.

The refcounting is a bit tricky: prepare_creds() sets an original refcount,
override_creds() gets one more, which revert_cred() drops.  So

  1) we need to expicitly put the mounter's credentials when overriding
     with the updated one

  2) we need to put the original ref to the updated creds (and this can
     safely be done before revert_creds(), since we'll still have the ref
     from override_creds()).
Reported-by: NStephen Smalley <sds@tycho.nsa.gov>
Fixes: 3fe6e52f ("ovl: override creds with the ones from the superblock mounter")
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

d0e13f5b

debugfs: open_proxy_open(): avoid double fops release · 75f0b68b

由 Nicolai Stange 提交于 5月 24, 2016

Debugfs' open_proxy_open(), the ->open() installed at all inodes created
through debugfs_create_file_unsafe(),
- grabs a reference to the original file_operations instance passed to
  debugfs_create_file_unsafe() via fops_get(),
- installs it at the file's ->f_op by means of replace_fops()
- and calls fops_put() on it.

Since the semantics of replace_fops() are such that the reference's
ownership is transferred, the subsequent fops_put() will result in a double
release when the file is eventually closed.

Currently, this is not an issue since fops_put() basically does a
module_put() on the file_operations' ->owner only and there don't exist any
modules calling debugfs_create_file_unsafe() yet. This is expected to
change in the future though, c.f. commit c6468808 ("debugfs: add
support for self-protecting attribute file fops").

Remove the call to fops_put() from open_proxy_open().

Fixes: 9fd4dcec ("debugfs: prevent access to possibly dead
                      file_operations at file open")
Signed-off-by: NNicolai Stange <nicstange@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

75f0b68b

debugfs: full_proxy_open(): free proxy on ->open() failure · b10e3e90

由 Nicolai Stange 提交于 5月 24, 2016

Debugfs' full_proxy_open(), the ->open() installed at all inodes created
through debugfs_create_file(),
- grabs a reference to the original struct file_operations instance passed
  to debugfs_create_file(),
- dynamically allocates a proxy struct file_operations instance wrapping
  the original
- and installs this at the file's ->f_op.

Afterwards, it calls the original ->open() and passes its return value back
to the VFS layer.

Now, if that return value indicates failure, the VFS layer won't ever call
->release() and thus, neither the reference to the original file_operations
nor the memory for the proxy file_operations will get released, i.e. both
are leaked.

Upon failure of the original fops' ->open(), undo the proxy installation.
That is:
- Set the struct file ->f_op to what it had been when full_proxy_open()
  was entered.
- Drop the reference to the original file_operations.
- Free the memory holding the proxy file_operations.

Fixes: 49d200de ("debugfs: prevent access to removed files' private
                      data")
Signed-off-by: NNicolai Stange <nicstange@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

b10e3e90

mnt: Account for MS_RDONLY in fs_fully_visible · 695e9df0

由 Eric W. Biederman 提交于 6月 10, 2016

In rare cases it is possible for s_flags & MS_RDONLY to be set but
MNT_READONLY to be clear.  This starting combination can cause
fs_fully_visible to fail to ensure that the new mount is readonly.
Therefore force MNT_LOCK_READONLY in the new mount if MS_RDONLY
is set on the source filesystem of the mount.

In general both MS_RDONLY and MNT_READONLY are set at the same for
mounts so I don't expect any programs to care.  Nor do I expect
MS_RDONLY to be set on proc or sysfs in the initial user namespace,
which further decreases the likelyhood of problems.

Which means this change should only affect system configurations by
paranoid sysadmins who should welcome the additional protection
as it keeps people from wriggling out of their policies.

Cc: stable@vger.kernel.org
Fixes: 8c6cf9cc ("mnt: Modify fs_fully_visible to deal with locked ro nodev and atime")
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

695e9df0

14 6月, 2016 1 次提交

nfsd: Fix NFSD_MDS_PR_KEY on 32-bit by adding ULL postfix · eee93016

由 Geert Uytterhoeven 提交于 3月 25, 2016

On 32-bit:

    fs/nfsd/blocklayout.c: In function ‘nfsd4_block_get_device_info_scsi’:
    fs/nfsd/blocklayout.c:337: warning: integer constant is too large for ‘long’ type
    fs/nfsd/blocklayout.c:344: warning: integer constant is too large for ‘long’ type
    fs/nfsd/blocklayout.c: In function ‘nfsd4_scsi_fence_client’:
    fs/nfsd/blocklayout.c:385: warning: integer constant is too large for ‘long’ type

Add the missing "ULL" postfix to 64-bit constant NFSD_MDS_PR_KEY to fix
this.

Fixes: f99d4fbd ("nfsd: add SCSI layout support")
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

eee93016

12 6月, 2016 1 次提交

autofs races · ea01a184

由 Al Viro 提交于 6月 12, 2016

* make autofs4_expire_indirect() skip the dentries being in process of
expiry
* do *not* mess with list_move(); making sure that dentry with
AUTOFS_INF_EXPIRING are not picked for expiry is enough.
* do not remove NO_RCU when we set EXPIRING, don't bother with smp_mb()
there.  Clear it at the same time we clear EXPIRING.  Makes a bunch of
tests simpler.
* rename NO_RCU to WANT_EXPIRE, which is what it really is.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ea01a184

11 6月, 2016 2 次提交

ecryptfs: forbid opening files without mmap handler · 2f36db71

由 Jann Horn 提交于 6月 01, 2016

This prevents users from triggering a stack overflow through a recursive
invocation of pagefault handling that involves mapping procfs files into
virtual memory.
Signed-off-by: NJann Horn <jannh@google.com>
Acked-by: NTyler Hicks <tyhicks@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2f36db71

proc: prevent stacking filesystems on top · e54ad7f1

由 Jann Horn 提交于 6月 01, 2016

This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem.  There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.

(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)
Signed-off-by: NJann Horn <jannh@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e54ad7f1

10 6月, 2016 1 次提交

much milder d_walk() race · ba65dc5e

由 Al Viro 提交于 6月 10, 2016

d_walk() relies upon the tree not getting rearranged under it without
rename_lock being touched.  And we do grab rename_lock around the
places that change the tree topology.  Unfortunately, branch reordering
is just as bad from d_walk() POV and we have two places that do it
without touching rename_lock - one in handling of cursors (for ramfs-style
directories) and another in autofs.  autofs one is a separate story; this
commit deals with the cursors.
	* mark cursor dentries explicitly at allocation time
	* make __dentry_kill() leave ->d_child.next pointing to the next
non-cursor sibling, making sure that it won't be moved around unnoticed
before the parent is relocked on ascend-to-parent path in d_walk().
	* make d_walk() skip cursors explicitly; strictly speaking it's
not necessary (all callbacks we pass to d_walk() are no-ops on cursors),
but it makes analysis easier.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ba65dc5e

08 6月, 2016 3 次提交

coredump: fix dumping through pipes · 1607f09c

由 Mateusz Guzik 提交于 6月 05, 2016

The offset in the core file used to be tracked with ->written field of
the coredump_params structure. The field was retired in favour of
file->f_pos.

However, ->f_pos is not maintained for pipes which leads to breakage.

Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.

Fixes: a0083939 ("get rid of coredump_params->written").
Reported-by: NZbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl>
Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1607f09c

fix a regression in atomic_open() · a01e718f

由 Al Viro 提交于 6月 07, 2016

open("/foo/no_such_file", O_RDONLY | O_CREAT) on should fail with
EACCES when /foo is not writable; failing with ENOENT is obviously
wrong.  That got broken by a braino introduced when moving the
creat_error logics from atomic_open() to lookup_open().  Easy to
fix, fortunately.
Spotted-by: N"Yan, Zheng" <ukernel@gmail.com>
Tested-by: N"Yan, Zheng" <ukernel@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a01e718f

fix d_walk()/non-delayed __d_free() race · 3d56c25e

由 Al Viro 提交于 6月 07, 2016

Ascend-to-parent logics in d_walk() depends on all encountered child
dentries not getting freed without an RCU delay.  Unfortunately, in
quite a few cases it is not true, with hard-to-hit oopsable race as
the result.

Fortunately, the fix is simiple; right now the rule is "if it ever
been hashed, freeing must be delayed" and changing it to "if it
ever had a parent, freeing must be delayed" closes that hole and
covers all cases the old rule used to cover.  Moreover, pipes and
sockets remain _not_ covered, so we do not introduce RCU delay in
the cases which are the reason for having that delay conditional
in the first place.

Cc: stable@vger.kernel.org # v3.2+ (and watch out for __d_materialise_dentry())
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3d56c25e

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功