提交 · 1b53cf9815bb4744958d41f3795d5d5a1d365e2d · openanolis / cloud-kernel

16 3月, 2017 1 次提交

fscrypt: remove broken support for detecting keyring key revocation · 1b53cf98

由 Eric Biggers 提交于 2月 21, 2017

Filesystem encryption ostensibly supported revoking a keyring key that
had been used to "unlock" encrypted files, causing those files to become
"locked" again. This was, however, buggy for several reasons, the most
severe of which was that when key revocation happened to be detected for
an inode, its fscrypt_info was immediately freed, even while other
threads could be using it for encryption or decryption concurrently.
This could be exploited to crash the kernel or worse.

This patch fixes the use-after-free by removing the code which detects
the keyring key having been revoked, invalidated, or expired. Instead,
an encrypted inode that is "unlocked" now simply remains unlocked until
it is evicted from memory. Note that this is no worse than the case for
block device-level encryption, e.g. dm-crypt, and it still remains
possible for a privileged user to evict unused pages, inodes, and
dentries by running 'sync; echo 3 > /proc/sys/vm/drop_caches', or by
simply unmounting the filesystem. In fact, one of those actions was
already needed anyway for key revocation to work even somewhat sanely.
This change is not expected to break any applications.

In the future I'd like to implement a real API for fscrypt key
revocation that interacts sanely with ongoing filesystem operations ---
waiting for existing operations to complete and blocking new operations,
and invalidating and sanitizing key material and plaintext from the VFS
caches. But this is a hard problem, and for now this bug must be fixed.

This bug affected almost all versions of ext4, f2fs, and ubifs
encryption, and it was potentially reachable in any kernel configured
with encryption support (CONFIG_EXT4_ENCRYPTION=y,
CONFIG_EXT4_FS_ENCRYPTION=y, CONFIG_F2FS_FS_ENCRYPTION=y, or
CONFIG_UBIFS_FS_ENCRYPTION=y). Note that older kernels did not use the
shared fs/crypto/ code, but due to the potential security implications
of this bug, it may still be worthwhile to backport this fix to them.

Fixes: b7236e21 ("ext4 crypto: reorganize how we store keys in the inode")
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Acked-by: NMichael Halcrow <mhalcrow@google.com>

1b53cf98

21 2月, 2017 1 次提交

ext4: rename EXT4_IOC_GOINGDOWN to EXT4_IOC_SHUTDOWN · e9be2ac7

由 Theodore Ts'o 提交于 2月 20, 2017

It's very likely the file system independent ioctl name will be
FS_IOC_SHUTDOWN, so let's use the same name for the ext4 ioctl name.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

e9be2ac7

17 2月, 2017 1 次提交

vfs: fix uninitialized flags in splice_to_pipe() · 5a81e6a1

由 Miklos Szeredi 提交于 2月 16, 2017

Flags (PIPE_BUF_FLAG_PACKET, PIPE_BUF_FLAG_GIFT) could remain on the
unused part of the pipe ring buffer.  Previously splice_to_pipe() left
the flags value alone, which could result in incorrect behavior.

Uninitialized flags appears to have been there from the introduction of
the splice syscall.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # 2.6.17+
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5a81e6a1

16 2月, 2017 2 次提交

fuse: fix uninitialized flags in pipe_buffer · 84588a93

由 Miklos Szeredi 提交于 2月 16, 2017

Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: d82718e3 ("fuse_dev_splice_read(): switch to add_to_pipe()")
Cc: <stable@vger.kernel.org> # 4.9+

84588a93

rhashtable: Revert nested table changes. · bf3f14d6

由 David S. Miller 提交于 2月 15, 2017

This reverts commits:

6a254780
9dbbfb0a
40137906

It's too risky to put in this late in the release
cycle.  We'll put these changes into the next merge
window instead.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf3f14d6

15 2月, 2017 3 次提交

fuse: fix use after free issue in fuse_dev_do_read() · 6ba4d272

由 Sahitya Tummala 提交于 2月 08, 2017

There is a potential race between fuse_dev_do_write()
and request_wait_answer() contexts as shown below:

TASK 1:
__fuse_request_send():
  |--spin_lock(&fiq->waitq.lock);
  |--queue_request();
  |--spin_unlock(&fiq->waitq.lock);
  |--request_wait_answer():
       |--if (test_bit(FR_SENT, &req->flags))
       <gets pre-empted after it is validated true>
                                   TASK 2:
                                   fuse_dev_do_write():
                                     |--clears bit FR_SENT,
                                     |--request_end():
                                        |--sets bit FR_FINISHED
                                        |--spin_lock(&fiq->waitq.lock);
                                        |--list_del_init(&req->intr_entry);
                                        |--spin_unlock(&fiq->waitq.lock);
                                        |--fuse_put_request();
       |--queue_interrupt();
       <request gets queued to interrupts list>
            |--wake_up_locked(&fiq->waitq);
       |--wait_event_freezable();
       <as FR_FINISHED is set, it returns and then
       the caller frees this request>

Now, the next fuse_dev_do_read(), see interrupts list is not empty
and then calls fuse_read_interrupt() which tries to access the request
which is already free'd and gets the below crash:

[11432.401266] Unable to handle kernel paging request at virtual address
6b6b6b6b6b6b6b6b
...
[11432.418518] Kernel BUG at ffffff80083720e0
[11432.456168] PC is at __list_del_entry+0x6c/0xc4
[11432.463573] LR is at fuse_dev_do_read+0x1ac/0x474
...
[11432.679999] [<ffffff80083720e0>] __list_del_entry+0x6c/0xc4
[11432.687794] [<ffffff80082c65e0>] fuse_dev_do_read+0x1ac/0x474
[11432.693180] [<ffffff80082c6b14>] fuse_dev_read+0x6c/0x78
[11432.699082] [<ffffff80081d5638>] __vfs_read+0xc0/0xe8
[11432.704459] [<ffffff80081d5efc>] vfs_read+0x90/0x108
[11432.709406] [<ffffff80081d67f0>] SyS_read+0x58/0x94

As FR_FINISHED bit is set before deleting the intr_entry with input
queue lock in request completion path, do the testing of this flag and
queueing atomically with the same lock in queue_interrupt().
Signed-off-by: NSahitya Tummala <stummala@codeaurora.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: fd22d62e ("fuse: no fc->lock for iqueue parts")
Cc: <stable@vger.kernel.org> # 4.2+

6ba4d272

ext4: fix fencepost in s_first_meta_bg validation · 2ba3e6e8

由 Theodore Ts'o 提交于 2月 15, 2017

It is OK for s_first_meta_bg to be equal to the number of block group
descriptor blocks.  (It rarely happens, but it shouldn't cause any
problems.)

https://bugzilla.kernel.org/show_bug.cgi?id=194567

Fixes: 3a4b77cdSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

2ba3e6e8

ext4: don't BUG when truncating encrypted inodes on the orphan list · 0d06863f

由 Theodore Ts'o 提交于 2月 14, 2017

Fix a BUG when the kernel tries to mount a file system constructed as
follows:

echo foo > foo.txt
mke2fs -Fq -t ext4 -O encrypt foo.img 100
debugfs -w foo.img << EOF
write foo.txt a
set_inode_field a i_flags 0x80800
set_super_value s_last_orphan 12
quit
EOF

root@kvm-xfstests:~# mount -o loop foo.img /mnt
[  160.238770] ------------[ cut here ]------------
[  160.240106] kernel BUG at /usr/projects/linux/ext4/fs/ext4/inode.c:3874!
[  160.240106] invalid opcode: 0000 [#1] SMP
[  160.240106] Modules linked in:
[  160.240106] CPU: 0 PID: 2547 Comm: mount Tainted: G        W       4.10.0-rc3-00034-gcdd33b941b67 #227
[  160.240106] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1 04/01/2014
[  160.240106] task: f4518000 task.stack: f47b6000
[  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4
[  160.240106] EFLAGS: 00010246 CPU: 0
[  160.240106] EAX: 00000001 EBX: f7be4b50 ECX: f47b7dc0 EDX: 00000007
[  160.240106] ESI: f43b05a8 EDI: f43babec EBP: f47b7dd0 ESP: f47b7dac
[  160.240106]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  160.240106] CR0: 80050033 CR2: bfd85b08 CR3: 34a00680 CR4: 000006f0
[  160.240106] Call Trace:
[  160.240106]  ext4_truncate+0x1e9/0x3e5
[  160.240106]  ext4_fill_super+0x286f/0x2b1e
[  160.240106]  ? set_blocksize+0x2e/0x7e
[  160.240106]  mount_bdev+0x114/0x15f
[  160.240106]  ext4_mount+0x15/0x17
[  160.240106]  ? ext4_calculate_overhead+0x39d/0x39d
[  160.240106]  mount_fs+0x58/0x115
[  160.240106]  vfs_kern_mount+0x4b/0xae
[  160.240106]  do_mount+0x671/0x8c3
[  160.240106]  ? _copy_from_user+0x70/0x83
[  160.240106]  ? strndup_user+0x31/0x46
[  160.240106]  SyS_mount+0x57/0x7b
[  160.240106]  do_int80_syscall_32+0x4f/0x61
[  160.240106]  entry_INT80_32+0x2f/0x2f
[  160.240106] EIP: 0xb76b919e
[  160.240106] EFLAGS: 00000246 CPU: 0
[  160.240106] EAX: ffffffda EBX: 08053838 ECX: 08052188 EDX: 080537e8
[  160.240106] ESI: c0ed0000 EDI: 00000000 EBP: 080537e8 ESP: bfa13660
[  160.240106]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[  160.240106] Code: 59 8b 00 a8 01 0f 84 09 01 00 00 8b 07 66 25 00 f0 66 3d 00 80 75 61 89 f8 e8 3e e2 ff ff 84 c0 74 56 83 bf 48 02 00 00 00 75 02 <0f> 0b 81 7d e8 00 10 00 00 74 02 0f 0b 8b 43 04 8b 53 08 31 c9
[  160.240106] EIP: ext4_block_zero_page_range+0x1a7/0x2b4 SS:ESP: 0068:f47b7dac
[  160.317241] ---[ end trace d6a773a375c810a5 ]---

The problem is that when the kernel tries to truncate an inode in
ext4_truncate(), it tries to clear any on-disk data beyond i_size.
Without the encryption key, it can't do that, and so it triggers a
BUG.

E2fsck does *not* provide this service, and in practice most file
systems have their orphan list processed by e2fsck, so to avoid
crashing, this patch skips this step if we don't have access to the
encryption key (which is the case when processing the orphan list; in
all other cases, we will have the encryption key, or the kernel
wouldn't have allowed the file to be opened).

An open question is whether the fact that e2fsck isn't clearing the
bytes beyond i_size causing problems --- and if we've lived with it
not doing it for so long, can we drop this from the kernel replay of
the orphan list in all cases (not just when we don't have the key for
encrypted inodes).

Addresses-Google-Bug: #35209576
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

0d06863f

14 2月, 2017 1 次提交

gfs2: Use rhashtable walk interface in glock_hash_walk · 6a254780

由 Herbert Xu 提交于 2月 11, 2017

The function glock_hash_walk walks the rhashtable by hand.  This
is broken because if it catches the hash table in the middle of
a rehash, then it will miss entries.

This patch replaces the manual walk by using the rhashtable walk
interface.

Fixes: 88ffbf3e ("GFS2: Use resizable hash table for glocks")
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6a254780

11 2月, 2017 1 次提交

Btrfs: fix btrfs_decompress_buf2page() · 6e78b3f7

由 Omar Sandoval 提交于 2月 10, 2017

If btrfs_decompress_buf2page() is handed a bio with its page in the
middle of the working buffer, then we adjust the offset into the working
buffer. After we copy into the bio, we advance the iterator by the
number of bytes we copied. Then, we have some logic to handle the case
of discontiguous pages and adjust the offset into the working buffer
again. However, if we didn't advance the bio to a new page, we may enter
this case in error, essentially repeating the adjustment that we already
made when we entered the function. The end result is bogus data in the
bio.

Previously, we only checked for this case when we advanced to a new
page, but the conversion to bio iterators changed that. This restores
the old, correct behavior.

A case I saw when testing with zlib was:

    buf_start = 42769
    total_out = 46865
    working_bytes = total_out - buf_start = 4096
    start_byte = 45056

The condition (total_out > start_byte && buf_start < start_byte) is
true, so we adjust the offset:

    buf_offset = start_byte - buf_start = 2287
    working_bytes -= buf_offset = 1809
    current_buf_start = buf_start = 42769

Then, we copy

    bytes = min(bvec.bv_len, PAGE_SIZE - buf_offset, working_bytes) = 1809
    buf_offset += bytes = 4096
    working_bytes -= bytes = 0
    current_buf_start += bytes = 44578

After bio_advance(), we are still in the same page, so start_byte is the
same. Then, we check (total_out > start_byte && current_buf_start < start_byte),
which is true! So, we adjust the values again:

    buf_offset = start_byte - buf_start = 2287
    working_bytes = total_out - start_byte = 1809
    current_buf_start = buf_start + buf_offset = 45056

But note that working_bytes was already zero before this, so we should
have stopped copying.

Fixes: 974b1adc ("btrfs: use bio iterators for the decompression handlers")
Reported-by: NPat Erley <pat-lkml@erley.org>
Reviewed-by: NChris Mason <clm@fb.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NChris Mason <clm@fb.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Tested-by: NLiu Bo <bo.li.liu@oracle.com>

6e78b3f7

10 2月, 2017 5 次提交

timerfd: Protect the might cancel mechanism proper · 1e38da30

由 Thomas Gleixner 提交于 1月 31, 2017

The handling of the might_cancel queueing is not properly protected, so
parallel operations on the file descriptor can race with each other and
lead to list corruptions or use after free.

Protect the context for these operations with a seperate lock.

The wait queue lock cannot be reused for this because that would create a
lock inversion scenario vs. the cancel lock. Replacing might_cancel with an
atomic (atomic_t or atomic bit) does not help either because it still can
race vs. the actual list operation.
Reported-by: NDmitry Vyukov <dvyukov@google.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: "linux-fsdevel@vger.kernel.org"
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701311521430.3457@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

1e38da30

ext4: do not use stripe_width if it is not set · 5469d7c3

由 Jan Kara 提交于 2月 10, 2017

Avoid using stripe_width for sbi->s_stripe value if it is not actually
set. It prevents using the stride for sbi->s_stripe.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

5469d7c3

ext4: fix stripe-unaligned allocations · d9b22cf9

由 Jan Kara 提交于 2月 10, 2017

When a filesystem is created using:

	mkfs.ext4 -b 4096 -E stride=512 <dev>

and we try to allocate 64MB extent, we will end up directly in
ext4_mb_complex_scan_group(). This is because the request is detected
as power-of-two allocation (so we start in ext4_mb_regular_allocator()
with ac_criteria == 0) however the check before
ext4_mb_simple_scan_group() refuses the direct buddy scan because the
allocation request is too large. Since cr == 0, the check whether we
should use ext4_mb_scan_aligned() fails as well and we fall back to
ext4_mb_complex_scan_group().

Fix the problem by checking for upper limit on power-of-two requests
directly when detecting them.
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

d9b22cf9

nfsd: Revert "nfsd: special case truncates some more" · 0839ffb8

由 J. Bruce Fields 提交于 2月 09, 2017

This patch incorrectly attempted nested mnt_want_write, and incorrectly
disabled nfsd's owner override for truncate.  We'll fix those problems
and make another attempt soon, for the moment I think the safest is to
revert.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

0839ffb8

pstore: don't OOPS when there are no ftrace zones · 8672aed7

由 Brian Norris 提交于 2月 08, 2017

We'll OOPS in ramoops_get_next_prz() if the platform didn't ask for any
ftrace zones (i.e., cxt->fprzs will be NULL). Let's just skip this
entire FTRACE section if there's no 'fprzs'.

Regression seen on a coreboot/depthcharge-based Chromebook.

Fixes: 2fbea82b ("pstore: Merge per-CPU ftrace records into one")
Cc: Joel Fernandes <joelaf@google.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NBrian Norris <briannorris@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>

8672aed7

09 2月, 2017 3 次提交

dax: assert that i_rwsem is held exclusive for writes · 168316db

由 Christoph Hellwig 提交于 2月 08, 2017

Make sure all callers follow the same locking protocol, given that DAX
transparantly replaced the normal buffered I/O path.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

168316db

ext4: fix DAX write locking · ff5462e3

由 Christoph Hellwig 提交于 2月 08, 2017

Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.

This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion.  Fix up ext4 as well.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NJan Kara <jack@suse.cz>

ff5462e3

btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls · 2a362249

由 Jeff Mahoney 提交于 2月 06, 2017

Commit 4c63c245 incorrectly assumed that returning -ENOIOCTLCMD would
cause the native ioctl to be called.  The ->compat_ioctl callback is
expected to handle all ioctls, not just compat variants.  As a result,
when using 32-bit userspace on 64-bit kernels, everything except those
three ioctls would return -ENOTTY.

Fixes: 4c63c245 ("btrfs: bugfix: handle FS_IOC32_{GETFLAGS,SETFLAGS,GETVERSION} in btrfs_ioctl")
Cc: stable@vger.kernel.org
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2a362249

08 2月, 2017 2 次提交

fscrypt: constify struct fscrypt_operations · 6f69f0ed

由 Eric Biggers 提交于 2月 07, 2017

Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Reviewed-by: NRichard Weinberger <richard@nod.at>

6f69f0ed

mm: fix KPF_SWAPCACHE in /proc/kpageflags · b6789123

由 Hugh Dickins 提交于 2月 07, 2017

Commit 6326fec1 ("mm: Use owner_priv bit for PageSwapCache, valid
when PageSwapBacked") aliased PG_swapcache to PG_owner_priv_1 (and
depending on PageSwapBacked being true).

As a result, the KPF_SWAPCACHE bit in '/proc/kpageflags' should now be
synthesized, instead of being shown on unrelated pages which just happen
to have PG_owner_priv_1 set.
Signed-off-by: NHugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6789123

07 2月, 2017 3 次提交

fscrypt: properly declare on-stack completion · b14c8e6a

由 Richard Weinberger 提交于 2月 01, 2017

When a completion is declared on-stack we have to use
COMPLETION_INITIALIZER_ONSTACK().

Fixes: 0b81d077 ("fs crypto: move per-file encryption from f2fs
tree to fs/crypto")
Signed-off-by: NRichard Weinberger <richard@nod.at>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

b14c8e6a

fscrypt: split supp and notsupp declarations into their own headers · 46f47e48

由 Eric Biggers 提交于 1月 24, 2017

Previously, each filesystem configured without encryption support would
define all the public fscrypt functions to their notsupp_* stubs. This
list of #defines had to be updated in every filesystem whenever a change
was made to the public fscrypt functions. To make things more
maintainable now that we have three filesystems using fscrypt, split the
old header fscrypto.h into several new headers. fscrypt_supp.h contains
the real declarations and is included by filesystems when configured
with encryption support, whereas fscrypt_notsupp.h contains the inline
stubs and is included by filesystems when configured without encryption
support. fscrypt_common.h contains common declarations needed by both.
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

46f47e48

fscrypt: remove redundant assignment of res · 02680b31

由 Colin Ian King 提交于 1月 10, 2017

res is assigned to sizeof(ctx), however, this is unused and res
is updated later on without that assigned value to res ever being
used.  Remove this redundant assignment.

Fixes CoverityScan CID#1395546 "Unused value"
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

02680b31

06 2月, 2017 1 次提交

ext4: add EXT4_IOC_GOINGDOWN ioctl · 783d9485

由 Theodore Ts'o 提交于 2月 05, 2017

This ioctl is modeled after the xfs's XFS_IOC_GOINGDOWN ioctl.  (In
fact, it uses the same code points.)
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

783d9485

05 2月, 2017 6 次提交

ext4: add shutdown bit and check for it · 0db1ff22

由 Theodore Ts'o 提交于 2月 05, 2017

Add a shutdown bit that will cause ext4 processing to fail immediately
with EIO.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

0db1ff22

ext4: rename s_resize_flags to s_ext4_flags · 9549a168

由 Theodore Ts'o 提交于 2月 05, 2017

We are currently using one bit in s_resize_flags; rename it in order
to allow more of the bits in that unsigned long for other purposes.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

9549a168

ext4: return EROFS if device is r/o and journal replay is needed · 4753d8a2

由 Theodore Ts'o 提交于 2月 05, 2017

If the file system requires journal recovery, and the device is
read-ony, return EROFS to the mount system call.  This allows xfstests
generic/050 to pass.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

4753d8a2

ext4: preserve the needs_recovery flag when the journal is aborted · 97abd7d4

由 Theodore Ts'o 提交于 2月 04, 2017

If the journal is aborted, the needs_recovery feature flag should not
be removed.  Otherwise, it's the journal might not get replayed and
this could lead to more data getting lost.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

97abd7d4

jbd2: don't leak modified metadata buffers on an aborted journal · e112666b

由 Theodore Ts'o 提交于 2月 04, 2017

If the journal has been aborted, we shouldn't mark the underlying
buffer head as dirty, since that will cause the metadata block to get
modified.  And if the journal has been aborted, we shouldn't allow
this since it will almost certainly lead to a corrupted file system.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

e112666b

ext4: fix inline data error paths · eb5efbcb

由 Theodore Ts'o 提交于 2月 04, 2017

The write_end() function must always unlock the page and drop its ref
count, even on an error.
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org

eb5efbcb

04 2月, 2017 1 次提交

fs: break out of iomap_file_buffered_write on fatal signals · d1908f52

由 Michal Hocko 提交于 2月 03, 2017

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion.  He has tracked
this down to the following path

	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier.  But iomap_file_buffered_write and other
callers of iomap_apply loop to complete the full request.  We need to
check for fatal signals and back off with a short write instead.

As the iomap_apply delegates all the work down to the actor we have to
hook into those.  All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there.  dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly.  Other callers like iomap_page_mkwrite work on a
single page or iomap_fiemap_actor do not allocate memory based on the
given len.

Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path")
Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>	[4.8+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d1908f52

03 2月, 2017 1 次提交

ext4: move halfmd4 into hash.c directly · 1c83a9aa

由 Jason A. Donenfeld 提交于 2月 02, 2017

The "half md4" transform should not be used by any new code. And
fortunately, it's only used now by ext4. Since ext4 supports several
hashing methods, at some point it might be desirable to move to
something like SipHash. As an intermediate step, remove half md4 from
cryptohash.h and lib, and make it just a local function in ext4's
hash.c. There's precedent for doing this; the other function ext can use
for its hashes -- TEA -- is also implemented in the same place. Also, by
being a local function, this might allow gcc to perform some additional
optimizations.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

1c83a9aa

02 2月, 2017 2 次提交

ext4: fix use-after-iput when fscrypt contexts are inconsistent · dd01b690

由 Eric Biggers 提交于 2月 01, 2017

In the case where the child's encryption context was inconsistent with
its parent directory, we were using inode->i_sb and inode->i_ino after
the inode had already been iput().  Fix this by doing the iput() in the
correct places.

Note: only ext4 had this bug, not f2fs and ubifs.

Fixes: d9cdc903 ("ext4 crypto: enforce context consistency")
Cc: stable@vger.kernel.org
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

dd01b690

jbd2: fix use after free in kjournald2() · dbfcef6b

由 Sahitya Tummala 提交于 2月 01, 2017

Below is the synchronization issue between unmount and kjournald2
contexts, which results into use after free issue in kjournald2().
Fix this issue by using journal->j_state_lock to synchronize the
wait_event() done in journal_kill_thread() and the wake_up() done
in kjournald2().

TASK 1:
umount cmd:
   |--jbd2_journal_destroy() {
       |--journal_kill_thread() {
            write_lock(&journal->j_state_lock);
	    journal->j_flags |= JBD2_UNMOUNT;
	    ...
	    write_unlock(&journal->j_state_lock);
	    wake_up(&journal->j_wait_commit);	   TASK 2 wakes up here:
	    					   kjournald2() {
						     ...
						     checks JBD2_UNMOUNT flag and calls goto end-loop;
						     ...
						     end_loop:
						       write_unlock(&journal->j_state_lock);
						       journal->j_task = NULL; --> If this thread gets
						       pre-empted here, then TASK 1 wait_event will
						       exit even before this thread is completely
						       done.
	    wait_event(journal->j_wait_done_commit, journal->j_task == NULL);
	    ...
	    write_lock(&journal->j_state_lock);
	    write_unlock(&journal->j_state_lock);
	  }
       |--kfree(journal);
     }
}
						       wake_up(&journal->j_wait_done_commit); --> this step
						       now results into use after free issue.
						   }
Signed-off-by: NSahitya Tummala <stummala@codeaurora.org>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>

dbfcef6b

01 2月, 2017 6 次提交

s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs · 42b425b3

由 Frederic Weisbecker 提交于 1月 31, 2017

This way we don't need to deal with cputime_t details from the core code.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-32-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

42b425b3

fs/binfmt: Convert obsolete cputime type to nsecs · cd19c364

由 Frederic Weisbecker 提交于 1月 31, 2017

Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-12-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

cd19c364

sched/cputime: Convert task/group cputime to nsecs · 5613fda9

由 Frederic Weisbecker 提交于 1月 31, 2017

Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

5613fda9

sched/cputime: Introduce special task_cputime_t() API to return old-typed cputime · a1cecf2b

由 Frederic Weisbecker 提交于 1月 31, 2017

This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a1cecf2b

sched/cputime: Convert guest time accounting to nsecs (u64) · 16a6d9be

由 Frederic Weisbecker 提交于 1月 31, 2017

cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

16a6d9be

sched/cputime: Convert kcpustat to nsecs · 7fb1327e

由 Frederic Weisbecker 提交于 1月 31, 2017

Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.

Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

7fb1327e

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功