提交 · 1a8550332a7f0111306b6b34e44a7c696ef68c4d · openanolis / cloud-kernel

04 11月, 2014 4 次提交

GFS2: If we use up our block reservation, request more next time · 1a855033

由 Bob Peterson 提交于 10月 29, 2014

If we run out of blocks for a given multi-block allocation, we obviously
did not reserve enough. We should reserve more blocks for the next
reservation to reduce fragmentation. This patch increases the size hint
for reservations when they run out.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

1a855033

GFS2: Only increase rs_sizehint · 33ad5d54

由 Bob Peterson 提交于 10月 29, 2014

If an application does a sequence of (1) big write, (2) little write
we don't necessarily want to reset the size hint based on the smaller
size. The fact that they did any big writes implies they may do more,
and therefore we should try to allocate bigger block reservations, even
if the last few were small writes. Therefore this patch changes function
gfs2_size_hint so that the size hint can only grow; it cannot shrink.
This is especially important where there are multiple writers.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

33ad5d54

GFS2: Set of distributed preferences for rgrps · 0e27c18c

由 Bob Peterson 提交于 10月 29, 2014

This patch tries to use the journal numbers to evenly distribute
which node prefers which resource group for block allocations. This
is to help performance.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

0e27c18c

GFS2: directly return gfs2_dir_check() · 37975f15

由 Fabian Frederick 提交于 10月 09, 2014

No need to store gfs2_dir_check result and test it before returning.
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NSteven Whitehouse <swhiteho@redhat.com>

37975f15

01 11月, 2014 1 次提交

ovl: initialize ->is_cursor · 9f2f7d4c

由 Miklos Szeredi 提交于 10月 31, 2014

Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9f2f7d4c

31 10月, 2014 3 次提交

Return short read or 0 at end of a raw device, not EIO · b2de525f

由 David Jeffery 提交于 9月 29, 2014

Author: David Jeffery <djeffery@redhat.com>
Changes to the basic direct I/O code have broken the raw driver when reading
to the end of a raw device.  Instead of returning a short read for a read that
extends partially beyond the device's end or 0 when at the end of the device,
these reads now return EIO.

The raw driver needs the same end of device handling as was added for normal
block devices.  Using blkdev_read_iter, which has the needed size checks,
prevents the EIO conditions at the end of the device.
Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b2de525f

isofs: don't bother with ->d_op for normal case · b0afd8e5

由 Al Viro 提交于 10月 28, 2014

we only need it for joliet and case-insensitive mounts
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b0afd8e5

fs: allow open(dir, O_TMPFILE|..., 0) with mode 0 · 69a91c23

由 Eric Rannaud 提交于 10月 30, 2014

The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:

	Note that this mode applies only to future accesses of the newly
	created file; the open() call that creates a read-only file
	may well return a read/write file descriptor.

The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.

O_TMPFILE, however, behaves differently:

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
	assert(fd == -1);
	assert(errno == EACCES);

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
	assert(fd > 0);

For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

	if (*opened & FILE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		will_truncate = false;
		acc_mode = MAY_OPEN;
		path_to_nameidata(path, nd);
		goto finish_open_created;
	}

But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().

This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().

A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523

The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.

On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.

But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.
Signed-off-by: NEric Rannaud <e@nanocritical.com>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69a91c23

30 10月, 2014 12 次提交

ext4: make ext4_ext_convert_to_initialized() return proper number of blocks · ae9e9c6a

由 Jan Kara 提交于 10月 30, 2014

ext4_ext_convert_to_initialized() can return more blocks than are
actually allocated from map->m_lblk in case where initial part of the
on-disk extent is zeroed out. Luckily this doesn't have serious
consequences because the caller currently uses the return value
only to unmap metadata buffers. Anyway this is a data
corruption/exposure problem waiting to happen so fix it.

Coverity-id: 1226848
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

ae9e9c6a

ext4: bail early when clearing inode journal flag fails · 4f879ca6

由 Jan Kara 提交于 10月 30, 2014

When clearing inode journal flag, we call jbd2_journal_flush() to force
all the journalled data to their final locations. Currently we ignore
when this fails and continue clearing inode journal flag. This isn't a
big problem because when jbd2_journal_flush() fails, journal is likely
aborted anyway. But it can still lead to somewhat confusing results so
rather bail out early.

Coverity-id: 989044
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

4f879ca6

ext4: bail out from make_indexed_dir() on first error · 6050d47a

由 Jan Kara 提交于 10月 30, 2014

When ext4_handle_dirty_dx_node() or ext4_handle_dirty_dirent_node()
fail, there's really something wrong with the fs and there's no point in
continuing further. Just return error from make_indexed_dir() in that
case. Also initialize frames array so that if we return early due to
error, dx_release() doesn't try to dereference uninitialized memory
(which could happen also due to error in do_split()).

Coverity-id: 741300
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

6050d47a

jbd2: use a better hash function for the revoke table · d48458d4

由 Theodore Ts'o 提交于 10月 30, 2014

The old hash function didn't work well for 64-bit block numbers, and
used undefined (negative) shift right behavior.  Use the generic
64-bit hash function instead.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reported-by: NAndrey Ryabinin <a.ryabinin@samsung.com>

d48458d4

ext4: prevent bugon on race between write/fcntl · a41537e6

由 Dmitry Monakhov 提交于 10月 30, 2014

O_DIRECT flags can be toggeled via fcntl(F_SETFL). But this value checked
twice inside ext4_file_write_iter() and __generic_file_write() which
result in BUG_ON inside ext4_direct_IO.

Let's initialize iocb->private unconditionally.

TESTCASE: xfstest:generic/036  https://patchwork.ozlabs.org/patch/402445/

#TYPICAL STACK TRACE:
kernel BUG at fs/ext4/inode.c:2960!
invalid opcode: 0000 [#1] SMP
Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 5505 Comm: aio-dio-fcntl-r Not tainted 3.17.0-rc2-00176-gff5c017 #161
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
task: ffff88080e95a7c0 ti: ffff88080f908000 task.ti: ffff88080f908000
RIP: 0010:[<ffffffff811fabf2>]  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
RSP: 0018:ffff88080f90bb58  EFLAGS: 00010246
RAX: 0000000000000400 RBX: ffff88080fdb2a28 RCX: 00000000a802c818
RDX: 0000040000080000 RSI: ffff88080d8aeb80 RDI: 0000000000000001
RBP: ffff88080f90bbc8 R08: 0000000000000000 R09: 0000000000001581
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88080d8aeb80
R13: ffff88080f90bbf8 R14: ffff88080fdb28c8 R15: ffff88080fdb2a28
FS:  00007f23b2055700(0000) GS:ffff880818400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f23b2045000 CR3: 000000080cedf000 CR4: 00000000000407e0
Stack:
 ffff88080f90bb98 0000000000000000 7ffffffffffffffe ffff88080fdb2c30
 0000000000000200 0000000000000200 0000000000000001 0000000000000200
 ffff88080f90bbc8 ffff88080fdb2c30 ffff88080f90be08 0000000000000200
Call Trace:
 [<ffffffff8112ca9d>] generic_file_direct_write+0xed/0x180
 [<ffffffff8112f2b2>] __generic_file_write_iter+0x222/0x370
 [<ffffffff811f495b>] ext4_file_write_iter+0x34b/0x400
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff811bd709>] ? aio_run_iocb+0x239/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810abd94>] ? __lock_acquire+0x274/0x700
 [<ffffffff811f4610>] ? ext4_unwritten_wait+0xb0/0xb0
 [<ffffffff811bd756>] aio_run_iocb+0x286/0x410
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811bc05b>] ? lookup_ioctx+0x4b/0xf0
 [<ffffffff811bde3b>] do_io_submit+0x55b/0x740
 [<ffffffff811bdcaa>] ? do_io_submit+0x3ca/0x740
 [<ffffffff811be030>] SyS_io_submit+0x10/0x20
 [<ffffffff815ce192>] system_call_fastpath+0x16/0x1b
Code: 01 48 8b 80 f0 01 00 00 48 8b 18 49 8b 45 10 0f 85 f1 01 00 00 48 03 45 c8 48 3b 43 48 0f 8f e3 01 00 00 49 83 7c
24 18 00 75 04 <0f> 0b eb fe f0 ff 83 ec 01 00 00 49 8b 44 24 18 8b 00 85 c0 89
RIP  [<ffffffff811fabf2>] ext4_direct_IO+0x162/0x3d0
 RSP <ffff88080f90bb58>
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org

a41537e6

ext4: remove extent status procfs files if journal load fails · 50460fe8

由 Darrick J. Wong 提交于 10月 30, 2014

If we can't load the journal, remove the procfs files for the extent
status information file to avoid leaking resources.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

50460fe8

ext4: disallow changing journal_csum option during remount · 6b992ff2

由 Darrick J. Wong 提交于 10月 30, 2014

ext4 does not permit changing the metadata or journal checksum feature
flag while mounted.  Until we decide to support that, don't allow a
remount to change the journal_csum flag (right now we silently fail to
change anything).
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

6b992ff2

ext4: enable journal checksum when metadata checksum feature enabled · 98c1a759

由 Darrick J. Wong 提交于 10月 30, 2014

If metadata checksumming is turned on for the FS, we need to tell the
journal to use checksumming too.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

98c1a759

ext4: fix oops when loading block bitmap failed · 599a9b77

由 Jan Kara 提交于 10月 30, 2014

When we fail to load block bitmap in __ext4_new_inode() we will
dereference NULL pointer in ext4_journal_get_write_access(). So check
for error from ext4_read_block_bitmap().

Coverity-id: 989065
Cc: stable@vger.kernel.org
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

599a9b77

ext4: fix overflow when updating superblock backups after resize · 9378c676

由 Jan Kara 提交于 10月 30, 2014

When there are no meta block groups update_backups() will compute the
backup block in 32-bit arithmetics thus possibly overflowing the block
number and corrupting the filesystem. OTOH filesystems without meta
block groups larger than 16 TB should be rare. Fix the problem by doing
the counting in 64-bit arithmetics.

Coverity-id: 741252
CC: stable@vger.kernel.org
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NLukas Czerner <lczerner@redhat.com>

9378c676

ocfs2: fix d_splice_alias() return code checking · d3556bab

由 Richard Weinberger 提交于 10月 29, 2014

d_splice_alias() can return a valid dentry, NULL or an ERR_PTR.
Currently the code checks not for ERR_PTR and will cuase an oops in
ocfs2_dentry_attach_lock().  Fix this by using IS_ERR_OR_NULL().
Signed-off-by: NRichard Weinberger <richard@nod.at>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d3556bab

fsnotify: next_i is freed during fsnotify_unmount_inodes. · 6424babf

由 Jerry Hoemann 提交于 10月 29, 2014

During file system stress testing on 3.10 and 3.12 based kernels, the
umount command occasionally hung in fsnotify_unmount_inodes in the
section of code:

                spin_lock(&inode->i_lock);
                if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }

As this section of code holds the global inode_sb_list_lock, eventually
the system hangs trying to acquire the lock.

Multiple crash dumps showed:

The inode->i_state == 0x60 and i_count == 0 and i_sb_list would point
back at itself.  As this is not the value of list upon entry to the
function, the kernel never exits the loop.

To help narrow down problem, the call to list_del_init in
inode_sb_list_del was changed to list_del.  This poisons the pointers in
the i_sb_list and causes a kernel to panic if it transverse a freed
inode.

Subsequent stress testing paniced in fsnotify_unmount_inodes at the
bottom of the list_for_each_entry_safe loop showing next_i had become
free.

We believe the root cause of the problem is that next_i is being freed
during the window of time that the list_for_each_entry_safe loop
temporarily releases inode_sb_list_lock to call fsnotify and
fsnotify_inode_delete.

The code in fsnotify_unmount_inodes attempts to prevent the freeing of
inode and next_i by calling __iget.  However, the code doesn't do the
__iget call on next_i

	if i_count == 0 or
	if i_state & (I_FREEING | I_WILL_FREE)

The patch addresses this issue by advancing next_i in the above two cases
until we either find a next_i which we can __iget or we reach the end of
the list.  This makes the handling of next_i more closely match the
handling of the variable "inode."

The time to reproduce the hang is highly variable (from hours to days.) We
ran the stress test on a 3.10 kernel with the proposed patch for a week
without failure.

During list_for_each_entry_safe, next_i is becoming free causing
the loop to never terminate.  Advance next_i in those cases where
__iget is not done.
Signed-off-by: NJerry Hoemann <jerry.hoemann@hp.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Ken Helias <kenhelias@firemail.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6424babf

29 10月, 2014 5 次提交

A
isofs_cmp(): we'll never see a dentry for . or .. · f643ff55
由 Al Viro 提交于 10月 28, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
f643ff55

overlayfs: fix lockdep misannotation · d1b72cc6

由 Miklos Szeredi 提交于 10月 27, 2014

In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:

 	touch /mnt/a/empty102/x
 	unlink /mnt/a/empty102/x
 	rmdir /mnt/a/empty102

It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Tested-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d1b72cc6

ovl: fix check for cursor · c2096537

由 Miklos Szeredi 提交于 10月 27, 2014

ovl_cache_entry.name is now an array not a pointer, so it makes no sense
test for it being NULL.

Detected by coverity.

From: Miklos Szeredi <mszeredi@suse.cz>
Fixes: 68bf8611 ("overlayfs: make ovl_cache_entry->name an array instead of
+pointer")
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c2096537

overlayfs: barriers for opening upper-layer directory · d45f00ae

由 Al Viro 提交于 10月 28, 2014

make sure that
	a) all stores done by opening struct file don't leak past storing
the reference in od->upperfile
	b) the lockless side has read dependency barrier
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d45f00ae

Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items · d05a2b4c

由 Filipe Manana 提交于 10月 27, 2014

We have a race that can lead us to miss skinny extent items in the function
btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
So basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns > 0
(not found);

2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;

4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and completed
after we released out path in step 3) before doing the second search.

Fix this by removing completely the path release and re-search logic. This is
safe, because if we seach for a metadata item and we don't find it, we have the
guarantee that the returned leaf is the one where the item would be inserted,
and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
non-skinny extent item is if it exists. The only case where path->slots[0] is
zero is when there are no smaller keys in the tree (i.e. no left siblings for
our leaf), in which case the re-search logic isn't needed as well.

This race has been present since the introduction of skinny metadata (change
3173a18f).
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

d05a2b4c

28 10月, 2014 3 次提交

Btrfs: properly clean up btrfs_end_io_wq_cache · 5ed5f588

由 Josef Bacik 提交于 10月 15, 2014

In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
unload, which makes us unable to unload and then re-load the btrfs module.  This
fixes the problem.  Thanks,
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.cz>
Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

5ed5f588

Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd

由 Filipe Manana 提交于 10月 27, 2014

If we couldn't find our extent item, we accessed the current slot
(path->slots[0]) to check if it corresponds to an equivalent skinny
metadata item. However this slot could be beyond our last item in the
leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data
extents, fix this by removing completely the logic that looks up for an
equivalent skinny metadata item, since it can not exist.
Signed-off-by: NFilipe Manana <fdmanana@suse.com>
Signed-off-by: NChris Mason <clm@fb.com>

1a4ed8fd

btrfs: use macro accessors in superblock validation checks · 21e7626b

由 David Sterba 提交于 10月 27, 2014

The initial patch c926093e (btrfs: add more superblock checks)
did not properly use the macro accessors that wrap endianness and the
code would not work correctly on big endian machines.
Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
Signed-off-by: NChris Mason <clm@fb.com>

21e7626b

25 10月, 2014 4 次提交
- A
  overlayfs: embed middle into overlay_readdir_data · db6ec212
  由 Al Viro 提交于 10月 23, 2014
```
same story...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  db6ec212
- A
  overlayfs: embed root into overlay_readdir_data · 49be4fb9
  由 Al Viro 提交于 10月 23, 2014
```
no sense having it a pointer - all instances have it pointing to
local variable in the same stack frame
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  49be4fb9
- A
  overlayfs: make ovl_cache_entry->name an array instead of pointer · 68bf8611
  由 Al Viro 提交于 10月 23, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  68bf8611
- A
  overlayfs: don't hold ->i_mutex over opening the real directory · 3d268c9b
  由 Al Viro 提交于 10月 23, 2014
```
just use it to serialize the assignment
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  3d268c9b
24 10月, 2014 8 次提交

fix inode leaks on d_splice_alias() failure exits · 51486b90

由 Al Viro 提交于 10月 23, 2014

d_splice_alias() callers expect it to either stash the inode reference
into a new alias, or drop the inode reference.  That makes it possible
to just return d_splice_alias() result from ->lookup() instance, without
any extra housekeeping required.

Unfortunately, that should include the failure exits.  If d_splice_alias()
returns an error, it leaves the dentry it has been given negative and
thus it *must* drop the inode reference.  Easily fixed, but it goes way
back and will need backporting.

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

51486b90

fs: limit filesystem stacking depth · 69c433ed

由 Miklos Szeredi 提交于 10月 24, 2014

Add a simple read-only counter to super_block that indicates how deep this
is in the stack of filesystems.  Previously ecryptfs was the only stackable
filesystem and it explicitly disallowed multiple layers of itself.

Overlayfs, however, can be stacked recursively and also may be stacked
on top of ecryptfs or vice versa.

To limit the kernel stack usage we must limit the depth of the
filesystem stack.  Initially the limit is set to 2.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

69c433ed

overlayfs: implement show_options · f45827e8

由 Erez Zadok 提交于 10月 24, 2014

This is useful because of the stacking nature of overlayfs.  Users like to
find out (via /proc/mounts) which lower/upper directory were used at mount
time.

AV: even failing ovl_parse_opt() could've done some kstrdup()
AV: failure of ovl_alloc_entry() should end up with ENOMEM, not EINVAL
Signed-off-by: NErez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

f45827e8

overlayfs: add statfs support · cc259639

由 Andy Whitcroft 提交于 10月 24, 2014

Add support for statfs to the overlayfs filesystem.  As the upper layer
is the target of all write operations assume that the space in that
filesystem is the space in the overlayfs.  There will be some inaccuracy as
overwriting a file will copy it up and consume space we were not expecting,
but it is better than nothing.

Use the upper layer dentry and mount from the overlayfs root inode,
passing the statfs call to that filesystem.
Signed-off-by: NAndy Whitcroft <apw@canonical.com>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

cc259639

overlay filesystem · e9be9d5e

由 Miklos Szeredi 提交于 10月 24, 2014

Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree.  All modifications
go to the upper, writable layer.

This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.

The implementation differs from other "union filesystem"
implementations in that after a file is opened all operations go
directly to the underlying, lower or upper, filesystems.  This
simplifies the implementation and allows native performance in these
cases.

The dentry tree is duplicated from the underlying filesystems, this
enables fast cached lookups without adding special support into the
VFS.  This uses slightly more memory than union mounts, but dentries
are relatively small.

Currently inodes are duplicated as well, but it is a possible
optimization to share inodes for non-directories.

Opening non directories results in the open forwarded to the
underlying filesystem.  This makes the behavior very similar to union
mounts (with the same limitations vs. fchmod/fchown on O_RDONLY file
descriptors).

Usage:

  mount -t overlayfs overlayfs -olowerdir=/lower,upperdir=/upper/upper,workdir=/upper/work /overlay

The following cotributions have been folded into this patch:

Neil Brown <neilb@suse.de>:
 - minimal remount support
 - use correct seek function for directories
 - initialise is_real before use
 - rename ovl_fill_cache to ovl_dir_read

Felix Fietkau <nbd@openwrt.org>:
 - fix a deadlock in ovl_dir_read_merged
 - fix a deadlock in ovl_remove_whiteouts

Erez Zadok <ezk@fsl.cs.sunysb.edu>
 - fix cleanup after WARN_ON

Sedat Dilek <sedat.dilek@googlemail.com>
 - fix up permission to confirm to new API

Robin Dong <hao.bigrat@gmail.com>
 - fix possible leak in ovl_new_inode
 - create new inode in ovl_link

Andy Whitcroft <apw@canonical.com>
 - switch to __inode_permission()
 - copy up i_uid/i_gid from the underlying inode

AV:
 - ovl_copy_up_locked() - dput(ERR_PTR(...)) on two failure exits
 - ovl_clear_empty() - one failure exit forgetting to do unlock_rename(),
   lack of check for udir being the parent of upper, dropping and regaining
   the lock on udir (which would require _another_ check for parent being
   right).
 - bogus d_drop() in copyup and rename [fix from your mail]
 - copyup/remove and copyup/rename races [fix from your mail]
 - ovl_dir_fsync() leaving ERR_PTR() in ->realfile
 - ovl_entry_free() is pointless - it's just a kfree_rcu()
 - fold ovl_do_lookup() into ovl_lookup()
 - manually assigning ->d_op is wrong.  Just use ->s_d_op.
 [patches picked from Miklos]:
 * copyup/remove and copyup/rename races
 * bogus d_drop() in copyup and rename

Also thanks to the following people for testing and reporting bugs:

  Jordi Pujol <jordipujolp@gmail.com>
  Andy Whitcroft <apw@canonical.com>
  Michal Suchanek <hramrach@centrum.cz>
  Felix Fietkau <nbd@openwrt.org>
  Erez Zadok <ezk@fsl.cs.sunysb.edu>
  Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

e9be9d5e

ext4: support RENAME_WHITEOUT · cd808dec

由 Miklos Szeredi 提交于 10月 24, 2014

Add whiteout support to ext4_rename().  A whiteout inode (chrdev/0,0) is
created before the rename takes place.  The whiteout inode is added to the
old entry instead of deleting it.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

cd808dec

vfs: add RENAME_WHITEOUT · 0d7a8555

由 Miklos Szeredi 提交于 10月 24, 2014

This adds a new RENAME_WHITEOUT flag.  This flag makes rename() create a
whiteout of source.  The whiteout creation is atomic relative to the
rename.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

0d7a8555

vfs: add whiteout support · 787fb6bc

由 Miklos Szeredi 提交于 10月 24, 2014

Whiteout isn't actually a new file type, but is represented as a char
device (Linus's idea) with 0/0 device number.

This has several advantages compared to introducing a new whiteout file
type:

 - no userspace API changes (e.g. trivial to make backups of upper layer
   filesystem, without losing whiteouts)

 - no fs image format changes (you can boot an old kernel/fsck without
   whiteout support and things won't break)

 - implementation is trivial
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

787fb6bc

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功