提交 · bb176f67090ca54869fc1262c913aa69d2ede070 · openanolis / cloud-kernel

20 10月, 2017 1 次提交

membarrier: Provide register expedited private command · a961e409

由 Mathieu Desnoyers 提交于 10月 19, 2017

This introduces a "register private expedited" membarrier command which
allows eventual removal of important memory barrier constraints on the
scheduler fast-paths. It changes how the "private expedited" membarrier
command (new to 4.14) is used from user-space.

This new command allows processes to register their intent to use the
private expedited command.  This affects how the expedited private
command introduced in 4.14-rc is meant to be used, and should be merged
before 4.14 final.

Processes are now required to register before using
MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

This fixes a problem that arose when designing requested extensions to
sys_membarrier() to allow JITs to efficiently flush old code from
instruction caches.  Several potential algorithms are much less painful
if the user register intent to use this functionality early on, for
example, before the process spawns the second thread.  Registering at
this time removes the need to interrupt each and every thread in that
process at the first expedited sys_membarrier() system call.
Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a961e409

19 10月, 2017 1 次提交

Convert fs/*/* to SB_I_VERSION · 357fdad0

由 Matthew Garrett 提交于 10月 18, 2017

[AV: in addition to the fix in previous commit]
Signed-off-by: NMatthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

357fdad0

17 10月, 2017 6 次提交

fs: Avoid invalidation in interrupt context in dio_complete() · ffe51f01

由 Lukas Czerner 提交于 10月 17, 2017

Currently we try to defer completion of async DIO to the process context
in case there are any mapped pages associated with the inode so that we
can invalidate the pages when the IO completes. However the check is racy
and the pages can be mapped afterwards. If this happens we might end up
calling invalidate_inode_pages2_range() in dio_complete() in interrupt
context which could sleep. This can be reproduced by generic/451.

Fix this by passing the information whether we can or can't invalidate
to the dio_complete(). Thanks Eryu Guan for reporting this and Jan Kara
for suggesting a fix.

Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Reported-by: NEryu Guan <eguan@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Tested-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ffe51f01

vfs: fix mounting a filesystem with i_version · 917086ff

由 Mimi Zohar 提交于 10月 08, 2017

The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec50 "VFS: Differentiate mount flags (MS_*) from internal
       superblock flags"
Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

917086ff

xfs: move two more RT specific functions into CONFIG_XFS_RT · 785545c8

由 Arnd Bergmann 提交于 10月 13, 2017

The last cleanup introduced two harmless warnings:

fs/xfs/xfs_fsmap.c:480:1: warning: '__xfs_getfsmap_rtdev' defined but not used
fs/xfs/xfs_fsmap.c:372:1: warning: 'xfs_getfsmap_rtdev_rtbitmap_helper' defined but not used

This moves those two functions as well.

Fixes: bb9c2e54 ("xfs: move more RT specific code under CONFIG_XFS_RT")
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

785545c8

xfs: trim writepage mapping to within eof · 40214d12

由 Brian Foster 提交于 10月 13, 2017

The writeback rework in commit fbcc0256 ("xfs: Introduce
writeback context for writepages") introduced a subtle change in
behavior with regard to the block mapping used across the
->writepages() sequence. The previous xfs_cluster_write() code would
only flush pages up to EOF at the time of the writepage, thus
ensuring that any pages due to file-extending writes would be
handled on a separate cycle and with a new, updated block mapping.

The updated code establishes a block mapping in xfs_writepage_map()
that could extend beyond EOF if the file has post-eof preallocation.
Because we now use the generic writeback infrastructure and pass the
cached mapping to each writepage call, there is no implicit EOF
limit in place. If eofblocks trimming occurs during ->writepages(),
any post-eof portion of the cached mapping becomes invalid. The
eofblocks code has no means to serialize against writeback because
there are no pages associated with post-eof blocks. Therefore if an
eofblocks trim occurs and is followed by a file-extending buffered
write, not only has the mapping become invalid, but we could end up
writing a page to disk based on the invalid mapping.

Consider the following sequence of events:

- A buffered write creates a delalloc extent and post-eof
  speculative preallocation.
- Writeback starts and on the first writepage cycle, the delalloc
  extent is converted to real blocks (including the post-eof blocks)
  and the mapping is cached.
- The file is closed and xfs_release() trims post-eof blocks. The
  cached writeback mapping is now invalid.
- Another buffered write appends the file with a delalloc extent.
- The concurrent writeback cycle picks up the just written page
  because the writeback range end is LLONG_MAX. xfs_writepage_map()
  attributes it to the (now invalid) cached mapping and writes the
  data to an incorrect location on disk (and where the file offset is
  still backed by a delalloc extent).

This problem is reproduced by xfstests test generic/464, which
triggers racing writes, appends, open/closes and writeback requests.

To address this problem, trim the mapping used during writeback to
within EOF when the mapping is validated. This ensures the mapping
is revalidated for any pages encountered beyond EOF as of the time
the current mapping was cached or last validated.
Reported-by: NEryu Guan <eguan@redhat.com>
Diagnosed-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

40214d12

fs: invalidate page cache after end_io() in dio completion · 5e25c269

由 Eryu Guan 提交于 10月 13, 2017

Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
buffered and AIO DIO") moved page cache invalidation from
iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
path, but before the dio->end_io() call, and it re-introdued the bug
fixed by commit c771c14b ("iomap: invalidate page caches should
be after iomap_dio_complete() in direct write").

I found this because fstests generic/418 started failing on XFS with
v4.14-rc3 kernel, which is the regression test for this specific
bug.

So similarly, fix it by moving dio->end_io() (which does the
unwritten extent conversion) before page cache invalidation, to make
sure next buffer read reads the final real allocations not unwritten
extents. I also add some comments about why should end_io() go first
in case we get it wrong again in the future.

Note that, there's no such problem in the non-iomap based direct
write path, because we didn't remove the page cache invalidation
after the ->direct_IO() in generic_file_direct_write() call, but I
decided to fix dio_complete() too so we don't leave a landmine
there, also be consistent with iomap_dio_complete().

Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
Signed-off-by: NEryu Guan <eguan@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NLukas Czerner <lczerner@redhat.com>

5e25c269

xfs: cancel dirty pages on invalidation · 793d7dbe

由 Dave Chinner 提交于 10月 13, 2017

Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".

In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().

To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().

Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

793d7dbe

14 10月, 2017 2 次提交

fs/binfmt_misc.c: node could be NULL when evicting inode · 7e866006

由 Eryu Guan 提交于 10月 13, 2017

inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode.  e.g.  mount binfmt_misc filesystem then
umount it immediately:

  mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
  umount /proc/sys/fs/binfmt_misc

will result in

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
  IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
  ...
  Call Trace:
   evict+0xd3/0x1a0
   iput+0x17d/0x1d0
   dentry_unlink_inode+0xb9/0xf0
   __dentry_kill+0xc7/0x170
   shrink_dentry_list+0x122/0x280
   shrink_dcache_parent+0x39/0x90
   do_one_tree+0x12/0x40
   shrink_dcache_for_umount+0x2d/0x90
   generic_shutdown_super+0x1f/0x120
   kill_litter_super+0x29/0x40
   deactivate_locked_super+0x43/0x70
   deactivate_super+0x45/0x60
   cleanup_mnt+0x3f/0x70
   __cleanup_mnt+0x12/0x20
   task_work_run+0x86/0xa0
   exit_to_usermode_loop+0x6d/0x99
   syscall_return_slowpath+0xba/0xf0
   entry_SYSCALL_64_fastpath+0xa3/0xa

Fix it by making sure Node (e) is not NULL.

Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: NEryu Guan <eguan@redhat.com>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7e866006

fs/mpage.c: fix mpage_writepage() for pages with buffers · f892760a

由 Matthew Wilcox 提交于 10月 13, 2017

When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
call clean_buffers() after unlocking the page we've written.  Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().

[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reported-by: NToshi Kani <toshi.kani@hpe.com>
Reported-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: NToshi Kani <toshi.kani@hpe.com>
Acked-by: NJohannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f892760a

13 10月, 2017 3 次提交

ecryptfs: fix dereference of NULL user_key_payload · f66665c0

由 Eric Biggers 提交于 10月 09, 2017

In eCryptfs, we failed to verify that the authentication token keys are
not revoked before dereferencing their payloads, which is problematic
because the payload of a revoked key is NULL.  request_key() *does* skip
revoked keys, but there is still a window where the key can be revoked
before we acquire the key semaphore.

Fix it by updating ecryptfs_get_key_payload_data() to return
-EKEYREVOKED if the key payload is NULL.  For completeness we check this
for "encrypted" keys as well as "user" keys, although encrypted keys
cannot be revoked currently.

Alternatively we could use key_validate(), but since we'll also need to
fix ecryptfs_get_key_payload_data() to validate the payload length, it
seems appropriate to just check the payload pointer.

Fixes: 237fead6 ("[PATCH] ecryptfs: fs/Makefile and fs/Kconfig")
Reviewed-by: NJames Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org>    [v2.6.19+]
Cc: Michael Halcrow <mhalcrow@google.com>
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>

f66665c0

fscrypt: fix dereference of NULL user_key_payload · d60b5b78

由 Eric Biggers 提交于 10月 09, 2017

When an fscrypt-encrypted file is opened, we request the file's master
key from the keyrings service as a logon key, then access its payload.
However, a revoked key has a NULL payload, and we failed to check for
this.  request_key() *does* skip revoked keys, but there is still a
window where the key can be revoked before we acquire its semaphore.

Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.

Fixes: 88bd6ccd ("ext4 crypto: add encryption key management facilities")
Reviewed-by: NJames Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org>    [v4.1+]
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>

d60b5b78

FS-Cache: fix dereference of NULL user_key_payload · d124b2c5

由 Eric Biggers 提交于 10月 09, 2017

When the file /proc/fs/fscache/objects (available with
CONFIG_FSCACHE_OBJECT_LIST=y) is opened, we request a user key with
description "fscache:objlist", then access its payload.  However, a
revoked key has a NULL payload, and we failed to check for this.
request_key() *does* skip revoked keys, but there is still a window
where the key can be revoked before we access its payload.

Fix it by checking for a NULL payload, treating it like a key which was
already revoked at the time it was requested.

Fixes: 4fbf4291 ("FS-Cache: Allow the current state of all objects to be dumped")
Reviewed-by: NJames Morris <james.l.morris@oracle.com>
Cc: <stable@vger.kernel.org>    [v2.6.32+]
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>

d124b2c5

12 10月, 2017 7 次提交

xfs: handle error if xfs_btree_get_bufs fails · 93e8befc

由 Eric Sandeen 提交于 10月 09, 2017

Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:

XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000

_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8

We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.
Reported-by: NJason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

93e8befc

xfs: reinit btree pointer on attr tree inactivation walk · f35c5e10

由 Brian Foster 提交于 10月 09, 2017

xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.

The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.

Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.

To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

f35c5e10

xfs: Fix bool initialization/comparison · 749f24f3

由 Thomas Meyer 提交于 10月 09, 2017

Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: NThomas Meyer <thomas@m3y3r.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

749f24f3

xfs: don't change inode mode if ACL update fails · 67f2ffe3

由 Dave Chinner 提交于 10月 09, 2017

If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.

Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.

This fixes xfstests generic/449.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

67f2ffe3

xfs: move more RT specific code under CONFIG_XFS_RT · bb9c2e54

由 Dave Chinner 提交于 10月 09, 2017

Various utility functions and interfaces that iterate internal
devices try to reference the realtime device even when RT support is
not compiled into the kernel.

Make sure this code is excluded from the CONFIG_XFS_RT=n build,
and where appropriate stub functions to return fatal errors if
they ever get called when RT support is not present.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

bb9c2e54

xfs: Don't log uninitialised fields in inode structures · 20413e37

由 Dave Chinner 提交于 10月 09, 2017

Prevent kmemcheck from throwing warnings about reading uninitialised
memory when formatting inodes into the incore log buffer. There are
several issues here - we don't always log all the fields in the
inode log format item, and we never log the inode the
di_next_unlinked field.

In the case of the inode log format item, this is exacerbated
by the old xfs_inode_log_format structure padding issue. Hence make
the padded, 64 bit aligned version of the structure the one we always
use for formatting the log and get rid of the 64 bit variant. This
means we'll always log the 64-bit version and so recovery only needs
to convert from the unpadded 32 bit version from older 32 bit
kernels.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

20413e37

9p: set page uptodate when required in write_end() · 56ae414e

由 Alexander Levin 提交于 4月 10, 2017

Commit 77469c3f prevented setting the page as uptodate when we wrote
the right amount of data, fix that.

Fixes: 77469c3f ("9p: saner ->write_end() on failing copy into non-uptodate page")
Reviewed-by: NJan Kara <jack@suse.com>
Signed-off-by: NAlexander Levin <alexander.levin@verizon.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56ae414e

11 10月, 2017 1 次提交

direct-io: Prevent NULL pointer access in submit_page_section · 899f0429

由 Andreas Gruenbacher 提交于 10月 09, 2017

In the code added to function submit_page_section by commit b1058b98,
sdio->bio can currently be NULL when calling dio_bio_submit.  This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.

Fixes xfstest generic/250 on gfs2.

Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

899f0429

10 10月, 2017 1 次提交

quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations · ac3d7939

由 Jan Kara 提交于 10月 10, 2017

Eryu has reported that since commit 7b9ca4c6 "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.
Reported-and-tested-by: NEryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c6Signed-off-by: NJan Kara <jack@suse.cz>

ac3d7939

06 10月, 2017 1 次提交

nfsd4: define nfsd4_secinfo_no_name_release() · ec572b9e

由 Eryu Guan 提交于 9月 29, 2017

Commit 34b1744c ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.

But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.

I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:

 # mount a local-fs device at /mnt/test, and export it via NFS with
 # fsid=0 export option (this is required)
 mount /dev/sda5 /mnt/test
 echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
 service nfs restart

 # locally mount the nfs export with all default, note that I have
 # nfsv4.1 configured as the default nfs version, because of the
 # fsid export option, v4 mount would fail and fall back to v3
 mount localhost:/mnt/test /mnt/nfs

 # try to umount the underlying device, but got EBUSY
 umount /mnt/nfs
 service nfs stop
 umount /mnt/test <=== EBUSY here

Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.

Fixes: 34b1744c ("nfsd4: define ->op_release for compound ops")
Signed-off-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

ec572b9e

05 10月, 2017 7 次提交

ovl: fix regression caused by exclusive upper/work dir protection · 85fdee1e

由 Amir Goldstein 提交于 9月 29, 2017

Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.

Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:

Terminal 1:

  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/

Terminal 2:

  unshare -m

Terminal 1:

  umount merged
  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/
  mount: /root/overlay-testing/merged: none already mounted or mount point
         busy

To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.

Documentation was updated to reflect this change.

Fixes: 2cac0c00 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: NEuan Kemp <euank@euank.com>
Reported-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

85fdee1e

ovl: fix missing unlock_rename() in ovl_do_copy_up() · 5820dc08

由 Amir Goldstein 提交于 9月 25, 2017

Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: ("fd210b7d ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

5820dc08

ovl: fix dentry leak in ovl_indexdir_cleanup() · dc7ab677

由 Amir Goldstein 提交于 9月 24, 2017

index dentry was not released when breaking out of the loop
due to index verification error.

Fixes: 415543d5 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

dc7ab677

ovl: fix dput() of ERR_PTR in ovl_cleanup_index() · 9f4ec904

由 Amir Goldstein 提交于 9月 24, 2017

Fixes: caf70cb2 ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

9f4ec904

ovl: fix error value printed in ovl_lookup_index() · e0082a0f

由 Amir Goldstein 提交于 9月 24, 2017

Fixes: 359f392c ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

e0082a0f

ovl: fix may_write_real() for overlayfs directories · 954c736f

由 Amir Goldstein 提交于 9月 18, 2017

Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

954c736f

NFSv4/pnfs: Fix an infinite layoutget loop · e8fa33a6

由 Trond Myklebust 提交于 10月 04, 2017

Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.

This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.

Fixes: 70d2f7b1 ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

e8fa33a6

04 10月, 2017 10 次提交

Btrfs: fix overlap of fs_info::flags values · 69ad5976

由 Tsutomu Itoh 提交于 10月 04, 2017

Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

  commit 171938e5 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

  commit f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.

Fixes: f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

69ad5976

btrfs: avoid overflow when sector_t is 32 bit · 2d8ce70a

由 Goffredo Baroncelli 提交于 10月 03, 2017

Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: NGoffredo Baroncelli <kreijack@inwind.it>
Tested-by: NJean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2d8ce70a

lsm: fix smack_inode_removexattr and xattr_getsecurity memleak · 57e7ba04

由 Casey Schaufler 提交于 9月 19, 2017

security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.

The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.

Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.
Signed-off-by: NCasey Schaufler <casey@schaufler-ca.com>
Reported-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: NJames Morris <james.l.morris@oracle.com>

57e7ba04

xfs: handle racy AIO in xfs_reflink_end_cow · e12199f8

由 Christoph Hellwig 提交于 10月 03, 2017

If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

e12199f8

xfs: always swap the cow forks when swapping extents · 52bfcdd7

由 Darrick J. Wong 提交于 9月 18, 2017

Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext.  We also need to
swap the extent counts and reset the cowblocks tags.
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

52bfcdd7

exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array · 50097f74