提交 · 6c60d2b5746cf23025ffe71bd7ff9075048fc90c · OpenHarmony / kernel_linux

27 7月, 2016 9 次提交

fs/fs-writeback.c: add a new writeback list for sync · 6c60d2b5

由 Dave Chinner 提交于 7月 26, 2016

wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback. wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback. Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do. For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.comSigned-off-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Tested-by: NHolger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6c60d2b5

ocfs2/cluster: clean up unnecessary assignment for 'ret' · 7d65b274

由 piaojun 提交于 7月 26, 2016

Clean up unnecessary assignment for 'ret'.

Link: http://lkml.kernel.org/r/578C61F6.4080403@huawei.comSigned-off-by: NJun Piao <piaojun@huawei.com>
Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7d65b274

ocfs2: remove obscure BUG_ON in dlmglue · e81f1c5c

由 Joseph Qi 提交于 7月 26, 2016

These BUG_ON(!inode) are obscure because we have already used inode to
get osb.  And actually we can guarantee here inode is valid in the
context.  So we can safely remove them.

Link: http://lkml.kernel.org/r/5776336A.6030104@huawei.comSigned-off-by: NJoseph Qi <joseph.qi@huawei.com>
Reviewed-by: NEric Ren <zren@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e81f1c5c

ocfs2: cleanup implemented prototypes · 698d44b4

由 Joseph Qi 提交于 7月 26, 2016

Several prototypes in inode.h are just defined but not actually
implemented and used, so remove them.

Link: http://lkml.kernel.org/r/57763787.4020706@huawei.comSigned-off-by: NJoseph Qi <joseph.qi@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

698d44b4

ocfs2/dlm: fix memory leak of dlm_debug_ctxt · 8ec7b17a

由 Joseph Qi 提交于 7月 26, 2016

dlm_debug_ctxt->debug_refcnt is initialized to 1 and then increased to 2
by dlm_debug_get in dlm_debug_init.  But dlm_debug_put is called only
once in dlm_debug_shutdown during unregister dlm, which leads to
dlm_debug_ctxt leaked.

Link: http://lkml.kernel.org/r/577BB755.4030900@huawei.comSigned-off-by: NJoseph Qi <joseph.qi@huawei.com>
Reviewed-by: NJiufei Xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8ec7b17a

ocfs2: cleanup unneeded goto in ocfs2_create_new_inode_locks · a8f24f1b

由 Joseph Qi 提交于 7月 26, 2016

The last goto is unneeded, so remove it.

Link: http://lkml.kernel.org/r/576213D3.6080002@huawei.comSigned-off-by: NJoseph Qi <joseph.qi@huawei.com>
Reviewed-by: NMark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a8f24f1b

ocfs2: improve recovery performance · 0b492f68

由 Junxiao Bi 提交于 7月 26, 2016

Journal replay will be run when performing recovery for a dead node.  To
avoid the stale cache impact, all blocks of dead node's journal inode
were reloaded from disk.  This hurts the performance.  Check whether one
block is cached before reloading it can improve performance a lot.  In
my test env, the time doing recovery was improved from 120s to 1s.

[akpm@linux-foundation.org: clean up the for loop p_blkno handling]
Link: http://lkml.kernel.org/r/1466155682-24656-1-git-send-email-junxiao.bi@oracle.comSigned-off-by: NJunxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
Cc: "Gang He" <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0b492f68

ocfs2: fix a redundant re-initialization · 191df2b5

由 Eric Ren 提交于 7月 26, 2016

Obviously, memset() has zeroed the whole struct locking_max_version.
So, it's no need to zero its two fields individually.

Link: http://lkml.kernel.org/r/1463970605-18354-1-git-send-email-zren@suse.comSigned-off-by: NEric Ren <zren@suse.com>
Reviewed-by: NJoseph Qi <joseph.qi@huawei.com>
Reviewed-by: NGang He <ghe@suse.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

191df2b5

dax: remote unused fault wrappers · 6b524995

由 Ross Zwisler 提交于 7月 26, 2016

Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.

The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).

However, the following commits:

   5726b27b ("ext2: Add locking for DAX faults")
   ea3d7209 ("ext4: fix races between page faults and hole punching")

added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.

XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:

   6b698ede ("xfs: add DAX file operations support")

Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6b524995

22 7月, 2016 2 次提交

ovl: verify upper dentry in ovl_remove_and_whiteout() · cfc9fde0

由 Maxim Patlasov 提交于 7月 21, 2016

The upper dentry may become stale before we call ovl_lock_rename_workdir.
For example, someone could (mistakenly or maliciously) manually unlink(2)
it directly from upperdir.

To ensure it is not stale, let's lookup it after ovl_lock_rename_workdir
and and check if it matches the upper dentry.

Essentially, it is the same problem and similar solution as in
commit 11f37104 ("ovl: verify upper dentry before unlink and rename").
Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>

cfc9fde0

GFS2: Fix gfs2_replay_incr_blk for multiple journal sizes · e1cb6be9

由 Bob Peterson 提交于 7月 21, 2016

Before this patch, if you used gfs2_jadd to add new journals of a
size smaller than the existing journals, replaying those new journals
would withdraw. That's because function gfs2_replay_incr_blk was
using the number of journal blocks (jd_block) from the superblock's
journal pointer. In other words, "My journal's max size" rather than
"the journal we're replaying's size." This patch changes the function
to use the size of the pertinent journal rather than always using the
journal we happen to be using.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

e1cb6be9

16 7月, 2016 1 次提交

xfs: fix type confusion in xfs_ioc_swapext · 3e0a3965

由 Jann Horn 提交于 9月 11, 2015

Without this check, the following XFS_I invocations would return bad
pointers when used on non-XFS inodes (perhaps pointers into preceding
allocator chunks).

This could be used by an attacker to trick xfs_swap_extents into
performing locking operations on attacker-chosen structures in kernel
memory, potentially leading to code execution in the kernel.  (I have
not investigated how likely this is to be usable for an attack in
practice.)
Signed-off-by: NJann Horn <jann@thejh.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3e0a3965

15 7月, 2016 1 次提交

x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2 · 3ebfd81f

由 H.J. Lu 提交于 7月 14, 2016

Don't use the same syscall numbers for 2 different syscalls:

 534	x32	preadv			compat_sys_preadv64
 535	x32	pwritev			compat_sys_pwritev64
 534	x32	preadv2			compat_sys_preadv2
 535	x32	pwritev2		compat_sys_pwritev2

Add compat_sys_preadv64v2() and compat_sys_pwritev64v2() so that 64-bit offset
is passed in one 64-bit register on x32, similar to compat_sys_preadv64()
and compat_sys_pwritev64().
Signed-off-by: NH.J. Lu <hjl.tools@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOovCMf-RQfx_n1U_Tu_DX1BYkjtFr%3DQ4-_PFVSj9BCzUA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

3ebfd81f

14 7月, 2016 1 次提交

chardev: add missing line break in pr_warn · 077e2642

由 Fengguang Wu 提交于 7月 12, 2016

To fix super long dmesg error lines like

After fix, it should look like

CHRDEV "dummy_stm.0" major number 224 goes below the dynamic allocation range
CHRDEV "dummy_stm.1" major number 223 goes below the dynamic allocation range
swapper: page allocation failure: order:8, mode:0x26040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK)
Reported-by: NPhilip Li <philip.li@intel.com>
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

077e2642

13 7月, 2016 1 次提交

GFS2: Check rs_free with rd_rsspin protection · 44f52122

由 Bob Peterson 提交于 7月 06, 2016

For the last process to close a file opened for write, function
gfs2_rsqa_delete was deleting the file's inode's block reservation
out of the rgrp reservations tree. Then it was checking to make sure
rs_free was 0, but it was performing the check outside the protection
of rd_rsspin spin_lock. The rd_rsspin spin_lock protection is needed
to prevent a race between the process freeing the reservation and
another who is allocating a new set of blocks inside the same rgrp
for the same inode, thus changing its value.
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

44f52122

08 7月, 2016 2 次提交

ecryptfs: don't allow mmap when the lower fs doesn't support it · f0fe970d

由 Jeff Mahoney 提交于 7月 05, 2016

There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs.  We shouldn't emulate mmap support on file systems
that don't offer support natively.

CVE-2016-1583
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
[tyhicks: clean up f_op check by using ecryptfs_file_to_lower()]
Signed-off-by: NTyler Hicks <tyhicks@canonical.com>

f0fe970d

Revert "ecryptfs: forbid opening files without mmap handler" · 78c4e172

由 Jeff Mahoney 提交于 7月 05, 2016

This reverts commit 2f36db71.

It fixed a local root exploit but also introduced a dependency on
the lower file system implementing an mmap operation just to open a file,
which is a bit of a heavy hammer.  The right fix is to have mmap depend
on the existence of the mmap handler instead.
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Cc: stable@vger.kernel.org
Signed-off-by: NTyler Hicks <tyhicks@canonical.com>

78c4e172

06 7月, 2016 2 次提交

A
nfs_atomic_open(): prevent parallel nfs_lookup() on a negative hashed · c94c0953
由 Al Viro 提交于 7月 05, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
c94c0953

Use the right predicate in ->atomic_open() instances · 00699ad8

由 Al Viro 提交于 7月 05, 2016

->atomic_open() can be given an in-lookup dentry *or* a negative one
found in dcache.  Use d_in_lookup() to tell one from another, rather
than d_unhashed().
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

00699ad8

04 7月, 2016 2 次提交

ovl: Copy up underlying inode's ->i_mode to overlay inode · 07a2daab

由 Vivek Goyal 提交于 7月 01, 2016

Right now when a new overlay inode is created, we initialize overlay
inode's ->i_mode from underlying inode ->i_mode but we retain only
file type bits (S_IFMT) and discard permission bits.

This patch changes it and retains permission bits too. This should allow
overlay to do permission checks on overlay inode itself in task context.

[SzM] It also fixes clearing suid/sgid bits on write.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Reported-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>

07a2daab

ovl: handle ATTR_KILL* · b99c2d91

由 Miklos Szeredi 提交于 7月 04, 2016

Before 4bacc9c9 ("overlayfs: Make f_path...") file->f_path pointed to
the underlying file, hence suid/sgid removal on write worked fine.

After that patch file->f_path pointed to the overlay file, and the file
mode bits weren't copied to overlay_inode->i_mode.  So the suid/sgid
removal simply stopped working.

The fix is to copy the mode bits, but then ovl_setattr() needs to clear
ATTR_MODE to avoid the BUG() in notify_change().  So do this first, then in
the next patch copy the mode.
Reported-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>

b99c2d91

03 7月, 2016 1 次提交

ovl: warn instead of error if d_type is not supported · e7c0b599

由 Vivek Goyal 提交于 7月 01, 2016

overlay needs underlying fs to support d_type. Recently I put in a
patch in to detect this condition and started failing mount if
underlying fs did not support d_type.

But this breaks existing configurations over kernel upgrade. Those who
are running docker (partially broken configuration) with xfs not
supporting d_type, are surprised that after kernel upgrade docker does
not run anymore.

https://github.com/docker/docker/issues/22937#issuecomment-229881315

So instead of erroring out, detect broken configuration and warn
about it. This should allow existing docker setups to continue
working after kernel upgrade.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: 45aebeaf ("ovl: Ensure upper filesystem supports d_type")
Cc: <stable@vger.kernel.org> 4.6

e7c0b599

01 7月, 2016 5 次提交

locks: use file_inode() · 6343a212

由 Miklos Szeredi 提交于 7月 01, 2016

(Another one for the f_path debacle.)

ltp fcntl33 testcase caused an Oops in selinux_file_send_sigiotask.

The reason is that generic_add_lease() used filp->f_path.dentry->inode
while all the others use file_inode().  This makes a difference for files
opened on overlayfs since the former will point to the overlay inode the
latter to the underlying inode.

So generic_add_lease() added the lease to the overlay inode and
generic_delete_lease() removed it from the underlying inode.  When the file
was released the lease remained on the overlay inode's lock list, resulting
in use after free.
Reported-by: NEryu Guan <eguan@redhat.com>
Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

6343a212

namespace: update event counter when umounting a deleted dentry · e06b933e

由 Andrey Ulanov 提交于 4月 15, 2016

- m_start() in fs/namespace.c expects that ns->event is incremented each
  time a mount added or removed from ns->list.
- umount_tree() removes items from the list but does not increment event
  counter, expecting that it's done before the function is called.
- There are some codepaths that call umount_tree() without updating
  "event" counter. e.g. from __detach_mounts().
- When this happens m_start may reuse a cached mount structure that no
  longer belongs to ns->list (i.e. use after free which usually leads
  to infinite loop).

This change fixes the above problem by incrementing global event counter
before invoking umount_tree().

Change-Id: I622c8e84dcb9fb63542372c5dbf0178ee86bb589
Cc: stable@vger.kernel.org
Signed-off-by: NAndrey Ulanov <andreyu@google.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e06b933e

9p: use file_dentry() · b403f0e3

由 Miklos Szeredi 提交于 6月 29, 2016

v9fs may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.  In this case it's a NULL pointer dereference in
p9_fid_create().

Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.
Reported-by: NAlessio Igor Bogani <alessioigorbogani@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Tested-by: NAlessio Igor Bogani <alessioigorbogani@gmail.com>
Fixes: 4bacc9c9 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Cc: <stable@vger.kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b403f0e3

lockd: unregister notifier blocks if the service fails to come up completely · cb7d224f

由 Scott Mayhew 提交于 6月 30, 2016

If the lockd service fails to start up then we need to be sure that the
notifier blocks are not registered, otherwise a subsequent start of the
service could cause the same notifier to be registered twice, leading to
soft lockups.
Signed-off-by: NScott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 0751ddf7 "lockd: Register callbacks on the inetaddr_chain..."
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

cb7d224f

writeback: inode cgroup wb switch should not call ihold() · 74524955

由 Tahsin Erdogan 提交于 6月 16, 2016

Asynchronous wb switching of inodes takes an additional ref count on an
inode to make sure inode remains valid until switchover is completed.

However, anyone calling ihold() must already have a ref count on inode,
but in this case inode->i_count may already be zero:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 917 at fs/inode.c:397 ihold+0x2b/0x30
CPU: 1 PID: 917 Comm: kworker/u4:5 Not tainted 4.7.0-rc2+ #49
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Workqueue: writeback wb_workfn (flush-8:16)
 0000000000000000 ffff88007ca0fb58 ffffffff805990af 0000000000000000
 0000000000000000 ffff88007ca0fb98 ffffffff80268702 0000018d000004e2
 ffff88007cef40e8 ffff88007c9b89a8 ffff880079e3a740 0000000000000003
Call Trace:
 [<ffffffff805990af>] dump_stack+0x4d/0x6e
 [<ffffffff80268702>] __warn+0xc2/0xe0
 [<ffffffff802687d8>] warn_slowpath_null+0x18/0x20
 [<ffffffff8035b4ab>] ihold+0x2b/0x30
 [<ffffffff80367ecc>] inode_switch_wbs+0x11c/0x180
 [<ffffffff80369110>] wbc_detach_inode+0x170/0x1a0
 [<ffffffff80369abc>] writeback_sb_inodes+0x21c/0x530
 [<ffffffff80369f7e>] wb_writeback+0xee/0x1e0
 [<ffffffff8036a147>] wb_workfn+0xd7/0x280
 [<ffffffff80287531>] ? try_to_wake_up+0x1b1/0x2b0
 [<ffffffff8027bb09>] process_one_work+0x129/0x300
 [<ffffffff8027be06>] worker_thread+0x126/0x480
 [<ffffffff8098cde7>] ? __schedule+0x1c7/0x561
 [<ffffffff8027bce0>] ? process_one_work+0x300/0x300
 [<ffffffff80280ff4>] kthread+0xc4/0xe0
 [<ffffffff80335578>] ? kfree+0xc8/0x100
 [<ffffffff809903cf>] ret_from_fork+0x1f/0x40
 [<ffffffff80280f30>] ? __kthread_parkme+0x70/0x70
---[ end trace aaefd2fd9f306bc4 ]---
Signed-off-by: NTahsin Erdogan <tahsin@google.com>
Acked-by: NTejun Heo <tj@kernel.org>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NJens Axboe <axboe@fb.com>

74524955

30 6月, 2016 2 次提交

fuse: serialize dirops by default · 5c672ab3

由 Miklos Szeredi 提交于 6月 30, 2016

Negotiate with userspace filesystems whether they support parallel readdir
and lookup.  Disable parallelism by default for fear of breaking fuse
filesystems.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: 9902af79 ("parallel lookups: actual switch to rwsem")
Fixes: d9b3dbdc ("fuse: switch to ->iterate_shared()")

5c672ab3

configfs: Remove ppos increment in configfs_write_bin_file · f8608985

由 Marek Vasut 提交于 5月 18, 2016

The simple_write_to_buffer() already increments the @ppos on success,
see fs/libfs.c simple_write_to_buffer() comment:

"
On success, the number of bytes written is returned and the offset @ppos
advanced by this number, or negative value is returned on error.
"

If the configfs_write_bin_file() is invoked with @count smaller than the
total length of the written binary file, it will be invoked multiple times.
Since configfs_write_bin_file() increments @ppos on success, after calling
simple_write_to_buffer(), the @ppos is incremented twice.

Subsequent invocation of configfs_write_bin_file() will result in the next
piece of data being written to the offset twice as long as the length of
the previous write, thus creating buffer with "holes" in it.

The simple testcase using DTO follows:
  $ mkdir /sys/kernel/config/device-tree/overlays/1
  $ dd bs=1 if=foo.dtbo of=/sys/kernel/config/device-tree/overlays/1/dtbo
Without this patch, the testcase will result in twice as big buffer in the
kernel, which is then passed to the cfs_overlay_item_dtbo_write() .
Signed-off-by: NMarek Vasut <marex@denx.de>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>

f8608985

29 6月, 2016 3 次提交

ovl: get_write_access() in truncate · 03bea604

由 Miklos Szeredi 提交于 6月 29, 2016

When truncating a file we should check write access on the underlying
inode.  And we should do so on the lower file as well (before copy-up) for
consistency.

Original patch and test case by Aihua Zhang.

 - - >o >o - - test.c - - >o >o - -
#include <stdio.h>
#include <errno.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	int ret;

	ret = truncate(argv[0], 4096);
	if (ret != -1) {
		fprintf(stderr, "truncate(argv[0]) should have failed\n");
		return 1;
	}
	if (errno != ETXTBSY) {
		perror("truncate(argv[0])");
		return 1;
	}

	return 0;
}
 - - >o >o - - >o >o - - >o >o - -
Reported-by: NAihua Zhang <zhangaihua1@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org>

03bea604

ovl: fix dentry leak for default_permissions · a4859d75

由 Miklos Szeredi 提交于 6月 29, 2016

When using the 'default_permissions' mount option, ovl_permission() on
non-directories was missing a dput(alias), resulting in "BUG Dentry still
in use".
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Fixes: 8d3095f4 ("ovl: default permissions")
Cc: <stable@vger.kernel.org> # v4.5+

a4859d75

NFS: Fix another OPEN_DOWNGRADE bug · e547f262

由 Trond Myklebust 提交于 6月 25, 2016

Olga Kornievskaia reports that the following test fails to trigger
an OPEN_DOWNGRADE on the wire, and only triggers the final CLOSE.

	fd0 = open(foo, RDRW)   -- should be open on the wire for "both"
	fd1 = open(foo, RDONLY)  -- should be open on the wire for "read"
	close(fd0) -- should trigger an open_downgrade
	read(fd1)
	close(fd1)

The issue is that we're missing a check for whether or not the current
state transitioned from an O_RDWR state as opposed to having transitioned
from a combination of O_RDONLY and O_WRONLY.
Reported-by: NOlga Kornievskaia <aglo@umich.edu>
Fixes: cd9288ff ("NFSv4: Fix another bug in the close/open_downgrade code")
Cc: stable@vger.kernel.org # 2.6.33+
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

e547f262

28 6月, 2016 1 次提交

dax: fix offset overflow in dax_io · 02395435

由 Eric Sandeen 提交于 6月 23, 2016

This isn't functionally apparent for some reason, but
when we test io at extreme offsets at the end of the loff_t
rang, such as in fstests xfs/071, the calculation of
"max" in dax_io() can be wrong due to pos + size overflowing.

For example,

# xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file

enters dax_io with:

start 0x7ffffffffffff000
end   0x7ffffffffffff200

and the rounded up "size" variable is 0x1000.  This yields:

pos + size 0x8000000000000000 (overflows loff_t)
       end 0x7ffffffffffff200

Due to the overflow, the min() function picks the wrong
value for the "max" variable, and when we send (max - pos)
into i.e. copy_from_iter_pmem() it is also the wrong value.

This somehow(tm) gets magically absorbed without incident,
probably because iter->count is correct.  But it seems best
to fix it up properly by comparing the two values as
unsigned.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

02395435

27 6月, 2016 7 次提交

gfs2: writeout truncated pages · fd4c5748

由 Benjamin Marzinski 提交于 6月 27, 2016

When gfs2 attempts to write a page to a file that is being truncated,
and notices that the page is completely outside of the file size, it
tries to invalidate it. However, this may require a transaction for
journaled data files to revoke any buffers from the page on the active
items list. Unfortunately, this can happen inside a log flush, where a
transaction cannot be started. Also, gfs2 may need to be able to remove
the buffer from the ail1 list before it can finish the log flush.

To deal with this, when writing a page of a file with data journalling
enabled gfs2 now skips the check to see if the write is outside the file
size, and simply writes it anyway. This situation can only occur when
the truncate code still has the file locked exclusively, and hasn't
marked this block as free in the metadata (which happens later in
truc_dealloc). After gfs2 writes this page out, the truncation code
will shortly invalidate it and write out any revokes if necessary.

To do this, gfs2 now implements its own version of block_write_full_page
without the check, and calls the newly exported __block_write_full_page.
It also no longer calls gfs2_writepage_common from gfs2_jdata_writepage.
Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

fd4c5748

fs: export __block_write_full_page · b4bba389

由 Benjamin Marzinski 提交于 6月 27, 2016

gfs2 needs to be able to skip the check to see if a page is outside of
the file size when writing it out. gfs2 can get into a situation where
it needs to flush its in-memory log to disk while a truncate is in
progress. If the file being trucated has data journaling enabled, it is
possible that there are data blocks in the log that are past the end of
the file. gfs can't finish the log flush without either writing these
blocks out or revoking them. Otherwise, if the node crashed, it could
overwrite subsequent changes made by other nodes in the cluster when
it's journal was replayed.

Unfortunately, there is no way to add log entries to the log during a
flush. So gfs2 simply writes out the page instead. This situation can
only occur when the truncate code still has the file locked exclusively,
and hasn't marked this block as free in the metadata (which happens
later in truc_dealloc). After gfs2 writes this page out, the truncation
code will shortly invalidate it and write out any revokes if necessary.

In order to make this work, gfs2 needs to be able to skip the check for
writes outside the file size. Since the check exists in
block_write_full_page, this patch exports __block_write_full_page, which
doesn't have the check.
Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

b4bba389

gfs2: Lock holder cleanup · 6df9f9a2

由 Andreas Gruenbacher 提交于 6月 17, 2016

Make the code more readable by cleaning up the different ways of
initializing lock holders and checking for initialized lock holders:
mark lock holders as uninitialized by setting the holder's glock to NULL
(gfs2_holder_mark_uninitialized) instead of zeroing out the entire
object or using a separate flag.  Recognize initialized holders by their
non-NULL glock (gfs2_holder_initialized).  Don't zero out holder objects
which are immeditiately initialized via gfs2_holder_init or
gfs2_glock_nq_init.
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

6df9f9a2

gfs2: Large-filesystem fix for 32-bit systems · cda9dd42

由 Andreas Gruenbacher 提交于 6月 14, 2016

Commit ff34245d switched from iget5_locked to iget_locked among other
things, but iget_locked doesn't work for filesystems larger than 2^32
blocks on 32-bit systems.  Switch back to iget5_locked.  Filesystems
larger than 2^32 blocks are unrealistic to work well on 32-bit systems,
so this is mostly a code cleanliness fix.
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

cda9dd42

gfs2: Get rid of gfs2_ilookup · ec5ec66b

由 Andreas Gruenbacher 提交于 6月 14, 2016

Now that gfs2_lookup_by_inum only takes the inode glock for new inodes
(and not for cached inodes anymore), there no longer is a need to
optimize the cached-inode case in gfs2_get_dentry or delete_work_func,
and gfs2_ilookup can be removed.

In addition, gfs2_get_dentry wasn't checking the GFS2_DIF_SYSTEM flag in
i_diskflags in the gfs2_ilookup case (see gfs2_lookup_by_inum); this
inconsistency goes away as well.
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

ec5ec66b

gfs2: Fix gfs2_lookup_by_inum lock inversion · 3ce37b2c

由 Andreas Gruenbacher 提交于 6月 14, 2016

The current gfs2_lookup_by_inum takes the glock of a presumed inode
identified by block number, verifies that the block is indeed an inode,
and then instantiates and reads the new inode via gfs2_inode_lookup.

However, instantiating a new inode may block on freeing a previous
instance of that inode (__wait_on_freeing_inode), and freeing an inode
requires to take the glock already held, leading to lock inversion and
deadlock.

Fix this by first instantiating the new inode, then verifying that the
block is an inode (if required), and then reading in the new inode, all
in gfs2_inode_lookup.

If the block we are looking for is not an inode, we discard the new
inode via iget_failed, which marks inodes as bad and unhashes them.
Other tasks waiting on that inode will get back a bad inode back from
ilookup or iget_locked; in that case, retry the lookup.
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NBob Peterson <rpeterso@redhat.com>

3ce37b2c

make nfs_atomic_open() call d_drop() on all ->open_context() errors. · d20cb71d

由 Al Viro 提交于 6月 20, 2016

In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code"
unconditional d_drop() after the ->open_context() had been removed. It had
been correct for success cases (there ->open_context() itself had been doing
dcache manipulations), but not for error ones. Only one of those (ENOENT)
got a compensatory d_drop() added in that commit, but in fact it should've
been done for all errors. As it is, the case of O_CREAT non-exclusive open
on a hashed negative dentry racing with e.g. symlink creation from another
client ended up with ->open_context() getting an error and proceeding to
call nfs_lookup(). On a hashed dentry, which would've instantly triggered
BUG_ON() in d_materialise_unique() (or, these days, its equivalent in
d_splice_alias()).

Cc: stable@vger.kernel.org # v3.10+
Tested-by: NOleg Drokin <green@linuxhacker.ru>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

d20cb71d

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年