提交 · 793d7dbe6d82a50b9d14bf992b9eaacb70a11ce6 · openanolis / cloud-kernel

17 10月, 2017 1 次提交

xfs: cancel dirty pages on invalidation · 793d7dbe

由 Dave Chinner 提交于 10月 13, 2017

Recently we've had warnings arise from the vm handing us pages
without bufferheads attached to them. This should not ever occur
in XFS, but we don't defend against it properly if it does. The only
place where we remove bufferheads from a page is in
xfs_vm_releasepage(), but we can't tell the difference here between
"page is dirty so don't release" and "page is dirty but is being
invalidated so release it".

In some places that are invalidating pages ask for pages to be
released and follow up afterward calling ->releasepage by checking
whether the page was dirty and then aborting the invalidation. This
is a possible vector for releasing buffers from a page but then
leaving it in the mapping, so we really do need to avoid dirty pages
in xfs_vm_releasepage().

To differentiate between invalidated pages and normal pages, we need
to clear the page dirty flag when invalidating the pages. This can
be done through xfs_vm_invalidatepage(), and will result
xfs_vm_releasepage() seeing the page as clean which matches the
bufferhead state on the page after calling block_invalidatepage().

Hence we can re-add the page dirty check in xfs_vm_releasepage to
catch the case where we might be releasing a page that is actually
dirty and so should not have the bufferheads on it removed. This
will remove one possible vector of "dirty page with no bufferheads"
and so help narrow down the search for the root cause of that
problem.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

793d7dbe

14 10月, 2017 2 次提交

fs/binfmt_misc.c: node could be NULL when evicting inode · 7e866006

由 Eryu Guan 提交于 10月 13, 2017

inode->i_private is assigned by a Node pointer only after registering a
new binary format, so it could be NULL if inode was created by
bm_fill_super() (or iput() was called by the error path in
bm_register_write()), and this could result in NULL pointer dereference
when evicting such an inode.  e.g.  mount binfmt_misc filesystem then
umount it immediately:

  mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
  umount /proc/sys/fs/binfmt_misc

will result in

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
  IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
  ...
  Call Trace:
   evict+0xd3/0x1a0
   iput+0x17d/0x1d0
   dentry_unlink_inode+0xb9/0xf0
   __dentry_kill+0xc7/0x170
   shrink_dentry_list+0x122/0x280
   shrink_dcache_parent+0x39/0x90
   do_one_tree+0x12/0x40
   shrink_dcache_for_umount+0x2d/0x90
   generic_shutdown_super+0x1f/0x120
   kill_litter_super+0x29/0x40
   deactivate_locked_super+0x43/0x70
   deactivate_super+0x45/0x60
   cleanup_mnt+0x3f/0x70
   __cleanup_mnt+0x12/0x20
   task_work_run+0x86/0xa0
   exit_to_usermode_loop+0x6d/0x99
   syscall_return_slowpath+0xba/0xf0
   entry_SYSCALL_64_fastpath+0xa3/0xa

Fix it by making sure Node (e) is not NULL.

Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
Fixes: 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
Signed-off-by: NEryu Guan <eguan@redhat.com>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7e866006

fs/mpage.c: fix mpage_writepage() for pages with buffers · f892760a

由 Matthew Wilcox 提交于 10月 13, 2017

When using FAT on a block device which supports rw_page, we can hit
BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
call clean_buffers() after unlocking the page we've written.  Introduce
a new clean_page_buffers() which cleans all buffers associated with a
page and call it from within bdev_write_page().

[akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.orgSigned-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
Reported-by: NToshi Kani <toshi.kani@hpe.com>
Reported-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Tested-by: NToshi Kani <toshi.kani@hpe.com>
Acked-by: NJohannes Thumshirn <jthumshirn@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f892760a

12 10月, 2017 7 次提交

xfs: handle error if xfs_btree_get_bufs fails · 93e8befc

由 Eric Sandeen 提交于 10月 09, 2017

Jason reported that a corrupted filesystem failed to replay
the log with a metadata block out of bounds warning:

XFS (dm-2): _xfs_buf_find: Block out of range: block 0x80270fff8, EOFS 0x9c40000

_xfs_buf_find() and xfs_btree_get_bufs() return NULL if
that happens, and then when xfs_alloc_fix_freelist() calls
xfs_trans_binval() on that NULL bp, we oops with:

BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8

We don't handle _xfs_buf_find errors very well, every
caller higher up the stack gets to guess at why it failed.
But we should at least handle it somehow, so return
EFSCORRUPTED here.
Reported-by: NJason L Tibbitts III <tibbs@math.uh.edu>
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

93e8befc

xfs: reinit btree pointer on attr tree inactivation walk · f35c5e10

由 Brian Foster 提交于 10月 09, 2017

xfs_attr3_root_inactive() walks the attr fork tree to invalidate the
associated blocks. xfs_attr3_node_inactive() recursively descends
from internal blocks to leaf blocks, caching block address values
along the way to revisit parent blocks, locate the next entry and
descend down that branch of the tree.

The code that attempts to reread the parent block is unsafe because
it assumes that the local xfs_da_node_entry pointer remains valid
after an xfs_trans_brelse() and re-read of the parent buffer. Under
heavy memory pressure, it is possible that the buffer has been
reclaimed and reallocated by the time the parent block is reread.
This means that 'btree' can point to an invalid memory address, lead
to a random/garbage value for child_fsb and cause the subsequent
read of the attr fork to go off the rails and return a NULL buffer
for an attr fork offset that is most likely not allocated.

Note that this problem can be manufactured by setting
XFS_ATTR_BTREE_REF to 0 to prevent LRU caching of attr buffers,
creating a file with a multi-level attr fork and removing it to
trigger inactivation.

To address this problem, reinit the node/btree pointers to the
parent buffer after it has been re-read. This ensures btree points
to a valid record and allows the walk to proceed.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

f35c5e10

xfs: Fix bool initialization/comparison · 749f24f3

由 Thomas Meyer 提交于 10月 09, 2017

Bool initializations should use true and false. Bool tests don't need
comparisons.
Signed-off-by: NThomas Meyer <thomas@m3y3r.de>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

749f24f3

xfs: don't change inode mode if ACL update fails · 67f2ffe3

由 Dave Chinner 提交于 10月 09, 2017

If we get ENOSPC half way through setting the ACL, the inode mode
can still be changed even though the ACL does not exist. Reorder the
operation to only change the mode of the inode if the ACL is set
correctly.

Whilst this does not fix the problem with crash consistency (that requires
attribute addition to be a deferred op) it does prevent ENOSPC and other
non-fatal errors setting an xattr to be handled sanely.

This fixes xfstests generic/449.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

67f2ffe3

xfs: move more RT specific code under CONFIG_XFS_RT · bb9c2e54

由 Dave Chinner 提交于 10月 09, 2017

Various utility functions and interfaces that iterate internal
devices try to reference the realtime device even when RT support is
not compiled into the kernel.

Make sure this code is excluded from the CONFIG_XFS_RT=n build,
and where appropriate stub functions to return fatal errors if
they ever get called when RT support is not present.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

bb9c2e54

xfs: Don't log uninitialised fields in inode structures · 20413e37

由 Dave Chinner 提交于 10月 09, 2017

Prevent kmemcheck from throwing warnings about reading uninitialised
memory when formatting inodes into the incore log buffer. There are
several issues here - we don't always log all the fields in the
inode log format item, and we never log the inode the
di_next_unlinked field.

In the case of the inode log format item, this is exacerbated
by the old xfs_inode_log_format structure padding issue. Hence make
the padded, 64 bit aligned version of the structure the one we always
use for formatting the log and get rid of the 64 bit variant. This
means we'll always log the 64-bit version and so recovery only needs
to convert from the unpadded 32 bit version from older 32 bit
kernels.
Signed-Off-By: NDave Chinner <dchinner@redhat.com>
Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

20413e37

9p: set page uptodate when required in write_end() · 56ae414e

由 Alexander Levin 提交于 4月 10, 2017

Commit 77469c3f prevented setting the page as uptodate when we wrote
the right amount of data, fix that.

Fixes: 77469c3f ("9p: saner ->write_end() on failing copy into non-uptodate page")
Reviewed-by: NJan Kara <jack@suse.com>
Signed-off-by: NAlexander Levin <alexander.levin@verizon.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

56ae414e

11 10月, 2017 1 次提交

direct-io: Prevent NULL pointer access in submit_page_section · 899f0429

由 Andreas Gruenbacher 提交于 10月 09, 2017

In the code added to function submit_page_section by commit b1058b98,
sdio->bio can currently be NULL when calling dio_bio_submit.  This then
leads to a NULL pointer access in dio_bio_submit, so check for a NULL
bio in submit_page_section before trying to submit it instead.

Fixes xfstest generic/250 on gfs2.

Cc: stable@vger.kernel.org # v3.10+
Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

899f0429

10 10月, 2017 1 次提交

quota: Generate warnings for DQUOT_SPACE_NOFAIL allocations · ac3d7939

由 Jan Kara 提交于 10月 10, 2017

Eryu has reported that since commit 7b9ca4c6 "quota: Reduce
contention on dq_data_lock" test generic/233 occasionally fails. This is
caused by the fact that since that commit we don't generate warning and
set grace time for quota allocations that have DQUOT_SPACE_NOFAIL set
(these are for example some metadata allocations in ext4). We need these
allocations to behave regularly wrt warning generation and grace time
setting so fix the code to return to the original behavior.
Reported-and-tested-by: NEryu Guan <eguan@redhat.com>
CC: stable@vger.kernel.org
Fixes: 7b9ca4c6Signed-off-by: NJan Kara <jack@suse.cz>

ac3d7939

06 10月, 2017 1 次提交

nfsd4: define nfsd4_secinfo_no_name_release() · ec572b9e

由 Eryu Guan 提交于 9月 29, 2017

Commit 34b1744c ("nfsd4: define ->op_release for compound ops")
defined a couple ->op_release functions and run them if necessary.

But there's a problem with that is that it reused
nfsd4_secinfo_release() as the op_release of OP_SECINFO_NO_NAME, and
caused a leak on struct nfsd4_secinfo_no_name in
nfsd4_encode_secinfo_no_name(), because there's no .si_exp field in
struct nfsd4_secinfo_no_name.

I found this because I was unable to umount an ext4 partition after
exporting it via NFS & run fsstress on the nfs mount. A simplified
reproducer would be:

 # mount a local-fs device at /mnt/test, and export it via NFS with
 # fsid=0 export option (this is required)
 mount /dev/sda5 /mnt/test
 echo "/mnt/test *(rw,no_root_squash,fsid=0)" >> /etc/exports
 service nfs restart

 # locally mount the nfs export with all default, note that I have
 # nfsv4.1 configured as the default nfs version, because of the
 # fsid export option, v4 mount would fail and fall back to v3
 mount localhost:/mnt/test /mnt/nfs

 # try to umount the underlying device, but got EBUSY
 umount /mnt/nfs
 service nfs stop
 umount /mnt/test <=== EBUSY here

Fixed it by defining a separate nfsd4_secinfo_no_name_release()
function as the op_release method of OP_SECINFO_NO_NAME that
releases the correct nfsd4_secinfo_no_name structure.

Fixes: 34b1744c ("nfsd4: define ->op_release for compound ops")
Signed-off-by: NEryu Guan <eguan@redhat.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

ec572b9e

05 10月, 2017 7 次提交

ovl: fix regression caused by exclusive upper/work dir protection · 85fdee1e

由 Amir Goldstein 提交于 9月 29, 2017

Enforcing exclusive ownership on upper/work dirs caused a docker
regression: https://github.com/moby/moby/issues/34672.

Euan spotted the regression and pointed to the offending commit.
Vivek has brought the regression to my attention and provided this
reproducer:

Terminal 1:

  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/

Terminal 2:

  unshare -m

Terminal 1:

  umount merged
  mount -t overlay -o workdir=work,lowerdir=lower,upperdir=upper none
        merged/
  mount: /root/overlay-testing/merged: none already mounted or mount point
         busy

To fix the regression, I replaced the error with an alarming warning.
With index feature enabled, mount does fail, but logs a suggestion to
override exclusive dir protection by disabling index.
Note that index=off mount does take the inuse locks, so a concurrent
index=off will issue the warning and a concurrent index=on mount will fail.

Documentation was updated to reflect this change.

Fixes: 2cac0c00 ("ovl: get exclusive ownership on upper/work dirs")
Cc: <stable@vger.kernel.org> # v4.13
Reported-by: NEuan Kemp <euank@euank.com>
Reported-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

85fdee1e

ovl: fix missing unlock_rename() in ovl_do_copy_up() · 5820dc08

由 Amir Goldstein 提交于 9月 25, 2017

Use the ovl_lock_rename_workdir() helper which requires
unlock_rename() only on lock success.

Fixes: ("fd210b7d ovl: move copy up lock out")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

5820dc08

ovl: fix dentry leak in ovl_indexdir_cleanup() · dc7ab677

由 Amir Goldstein 提交于 9月 24, 2017

index dentry was not released when breaking out of the loop
due to index verification error.

Fixes: 415543d5 ("ovl: cleanup bad and stale index entries on mount")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

dc7ab677

ovl: fix dput() of ERR_PTR in ovl_cleanup_index() · 9f4ec904

由 Amir Goldstein 提交于 9月 24, 2017

Fixes: caf70cb2 ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

9f4ec904

ovl: fix error value printed in ovl_lookup_index() · e0082a0f

由 Amir Goldstein 提交于 9月 24, 2017

Fixes: 359f392c ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

e0082a0f

ovl: fix may_write_real() for overlayfs directories · 954c736f

由 Amir Goldstein 提交于 9月 18, 2017

Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

954c736f

NFSv4/pnfs: Fix an infinite layoutget loop · e8fa33a6

由 Trond Myklebust 提交于 10月 04, 2017

Since we can now use a lock stateid or a delegation stateid, that
differs from the context stateid, we need to change the test in
nfs4_layoutget_handle_exception() to take this into account.

This fixes an infinite layoutget loop in the NFS client whereby
it keeps retrying the initial layoutget using the same broken
stateid.

Fixes: 70d2f7b1 ("pNFS: Use the standard I/O stateid when...")
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

e8fa33a6

04 10月, 2017 12 次提交

Btrfs: fix overlap of fs_info::flags values · 69ad5976

由 Tsutomu Itoh 提交于 10月 04, 2017

Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

  commit 171938e5 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

  commit f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.

Fixes: f29efe29 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: NDavid Sterba <dsterba@suse.com>

69ad5976

btrfs: avoid overflow when sector_t is 32 bit · 2d8ce70a

由 Goffredo Baroncelli 提交于 10月 03, 2017

Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: NGoffredo Baroncelli <kreijack@inwind.it>
Tested-by: NJean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

2d8ce70a

lsm: fix smack_inode_removexattr and xattr_getsecurity memleak · 57e7ba04

由 Casey Schaufler 提交于 9月 19, 2017

security_inode_getsecurity() provides the text string value
of a security attribute. It does not provide a "secctx".
The code in xattr_getsecurity() that calls security_inode_getsecurity()
and then calls security_release_secctx() happened to work because
SElinux and Smack treat the attribute and the secctx the same way.
It fails for cap_inode_getsecurity(), because that module has no
secctx that ever needs releasing. It turns out that Smack is the
one that's doing things wrong by not allocating memory when instructed
to do so by the "alloc" parameter.

The fix is simple enough. Change the security_release_secctx() to
kfree() because it isn't a secctx being returned by
security_inode_getsecurity(). Change Smack to allocate the string when
told to do so.

Note: this also fixes memory leaks for LSMs which implement
inode_getsecurity but not release_secctx, such as capabilities.
Signed-off-by: NCasey Schaufler <casey@schaufler-ca.com>
Reported-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: stable@vger.kernel.org
Signed-off-by: NJames Morris <james.l.morris@oracle.com>

57e7ba04

xfs: handle racy AIO in xfs_reflink_end_cow · e12199f8

由 Christoph Hellwig 提交于 10月 03, 2017

If we got two AIO writes into a COW area the second one might not have any
COW extents left to convert. Handle that case gracefully instead of
triggering an assert or accessing beyond the bounds of the extent list.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

e12199f8

xfs: always swap the cow forks when swapping extents · 52bfcdd7

由 Darrick J. Wong 提交于 9月 18, 2017

Since the CoW fork exists as a secondary data structure to the data
fork, we must always swap cow forks during swapext.  We also need to
swap the extent counts and reset the cowblocks tags.
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

52bfcdd7

exec: binfmt_misc: kill the onstack iname[BINPRM_BUF_SIZE] array · 50097f74