提交 · 040c52c0aa7c1736522676078ece0483c8596daf · openeuler / Kernel

26 9月, 2016 5 次提交

xfs: don't warn on buffers not being recovered due to LSN · 040c52c0

由 Brian Foster 提交于 9月 26, 2016

The log recovery buffer validation function is invoked in cases where a
buffer update may be skipped due to LSN ordering. If the validation
function happens to come across directory conversion situations (e.g., a
dir3 block to data conversion), it may warn about seeing a buffer log
format of one type and a buffer with a magic number of another.

This warning is not valid as the buffer update is ultimately skipped.
This is indicated by a current_lsn of NULLCOMMITLSN provided by the
caller. As such, update xlog_recover_validate_buf_type() to only warn in
such cases when a buffer update is expected.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

040c52c0

xfs: pass current lsn to log recovery buffer validation · 22db9af2

由 Brian Foster 提交于 9月 26, 2016

The current LSN must be available to the buffer validation function to
provide the ability to update the metadata LSN of the buffer. Pass the
current_lsn value down to xlog_recover_validate_buf_type() in
preparation.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

22db9af2

xfs: rework log recovery to submit buffers on LSN boundaries · 12818d24

由 Brian Foster 提交于 9月 26, 2016

The fix to log recovery to update the metadata LSN in recovered buffers
introduces the requirement that a buffer is submitted only once per
current LSN. Log recovery currently submits buffers on transaction
boundaries. This is not sufficient as the abstraction between log
records and transactions allows for various scenarios where multiple
transactions can share the same current LSN. If independent transactions
share an LSN and both modify the same buffer, log recovery can
incorrectly skip updates and leave the filesystem in an inconsisent
state.

In preparation for proper metadata LSN updates during log recovery,
update log recovery to submit buffers for write on LSN change boundaries
rather than transaction boundaries. Explicitly track the current LSN in
a new struct xlog field to handle the various corner cases of when the
current LSN may or may not change.
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

12818d24

xfs: quiesce the filesystem after recovery on readonly mount · ddeb14f4

由 Dave Chinner 提交于 9月 26, 2016

Recently we've had a number of reports where log recovery on a v5
filesystem has reported corruptions that looked to be caused by
recovery being re-run over the top of an already-recovered
metadata. This has uncovered a bug in recovery (fixed elsewhere)
but the vector that caused this was largely unknown.

A kdump test started tripping over this problem - the system
would be crashed, the kdump kernel and environment would boot and
dump the kernel core image, and then the system would reboot. After
reboot, the root filesystem was triggering log recovery and
corruptions were being detected. The metadumps indicated the above
log recovery issue.

What is happening is that the kdump kernel and environment is
mounting the root device read-only to find the binaries needed to do
it's work. The result of this is that it is running log recovery.
However, because there were unlinked files and EFIs to be processed
by recovery, the completion of phase 1 of log recovery could not
mark the log clean. And because it's a read-only mount, the unmount
process does not write records to the log to mark it clean, either.
Hence on the next mount of the filesystem, log recovery was run
again across all the metadata that had already been recovered and
this is what triggered corruption warnings.

To avoid this problem, we need to ensure that a read-only mount
always updates the log when it completes the second phase of
recovery. We already handle this sort of issue with rw->ro remount
transitions, so the solution is as simple as quiescing the
filesystem at the appropriate time during the mount process. This
results in the log being marked clean so the mount behaviour
recorded in the logs on repeated RO mounts will change (i.e. log
recovery will no longer be run on every mount until a RW mount is
done). This is a user visible change in behaviour, but it is
harmless.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ddeb14f4

xfs: remote attribute blocks aren't really userdata · 292378ed

由 Dave Chinner 提交于 9月 26, 2016

When adding a new remote attribute, we write the attribute to the
new extent before the allocation transaction is committed. This
means we cannot reuse busy extents as that violates crash
consistency semantics. Hence we currently treat remote attribute
extent allocation like userdata because it has the same overwrite
ordering constraints as userdata.

Unfortunately, this also allows the allocator to incorrectly apply
extent size hints to the remote attribute extent allocation. This
results in interesting failures, such as transaction block
reservation overruns and in-memory inode attribute fork corruption.

To fix this, we need to separate the busy extent reuse configuration
from the userdata configuration. This changes the definition of
XFS_BMAPI_METADATA slightly - it now means that allocation is
metadata and reuse of busy extents is acceptible due to the metadata
ordering semantics of the journal. If this flag is not set, it
means the allocation is that has unordered data writeback, and hence
busy extent reuse is not allowed. It no longer implies the
allocation is for user data, just that the data write will not be
strictly ordered. This matches the semantics for both user data
and remote attribute block allocation.

As such, This patch changes the "userdata" field to a "datatype"
field, and adds a "no busy reuse" flag to the field.
When we detect an unordered data extent allocation, we immediately set
the no reuse flag. We then set the "user data" flags based on the
inode fork we are allocating the extent to. Hence we only set
userdata flags on data fork allocations now and consider attribute
fork remote extents to be an unordered metadata extent.

The result is that remote attribute extents now have the expected
allocation semantics, and the data fork allocation behaviour is
completely unchanged.

It should be noted that there may be other ways to fix this (e.g.
use ordered metadata buffers for the remote attribute extent data
write) but they are more invasive and difficult to validate both
from a design and implementation POV. Hence this patch takes the
simple, obvious route to fixing the problem...
Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

292378ed

30 8月, 2016 1 次提交

xfs: track log done items directly in the deferred pending work item · ea78d808

由 Darrick J. Wong 提交于 8月 30, 2016

Christoph reports slab corruption when a deferred refcount update
aborts during _defer_finish().  The cause of this was broken log item
state tracking in xfs_defer_pending -- upon an abort,
_defer_trans_abort() will call abort_intent on all intent items,
including the ones that have already had a done item attached.

This is incorrect because each intent item has 2 refcount: the first
is released when the intent item is committed to the log; and the
second is released when the _done_ item is committed to the log, or
by the intent creator if there is no done item.  In other words, once
we log the done item, responsibility for releasing the intent item's
second refcount is transferred to the done item and /must not/ be
performed by anything else.

The dfp_committed flag should have been tracking whether or not we had
a done item so that _defer_trans_abort could decide if it needs to
abort the intent item, but due to a thinko this was not the case.  Rip
it out and track the done item directly so that we do the right thing
w.r.t. intent item freeing.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reported-by: NChristoph Hellwig <hch@infradead.org>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ea78d808

29 8月, 2016 1 次提交

iomap: don't set FIEMAP_EXTENT_MERGED for extent based filesystems · 17de0a9f

由 Christoph Hellwig 提交于 8月 29, 2016

Filesystems like XFS that use extents should not set the
FIEMAP_EXTENT_MERGED flag in the fiemap extent structures.  To allow
for both behaviors for the upcoming gfs2 usage split the iomap
type field into type and flags, and only set FIEMAP_EXTENT_MERGED if
the IOMAP_F_MERGED flag is set.  The flags field will also come in
handy for future features such as shared extents on reflink-enabled
file systems.
Reported-by: NAndreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

17de0a9f

26 8月, 2016 7 次提交

xfs: prevent dropping ioend completions during buftarg wait · 800b2694

由 Brian Foster 提交于 8月 26, 2016

xfs_wait_buftarg() waits for all pending I/O, drains the ioend
completion workqueue and walks the LRU until all buffers in the cache
have been released. This is traditionally an unmount operation` but the
mechanism is also reused during filesystem freeze.

xfs_wait_buftarg() invokes drain_workqueue() as part of the quiesce,
which is intended more for a shutdown sequence in that it indicates to
the queue that new operations are not expected once the drain has begun.
New work jobs after this point result in a WARN_ON_ONCE() and are
otherwise dropped.

With filesystem freeze, however, read operations are allowed and can
proceed during or after the workqueue drain. If such a read occurs
during the drain sequence, the workqueue infrastructure complains about
the queued ioend completion work item and drops it on the floor. As a
result, the buffer remains on the LRU and the freeze never completes.

Despite the fact that the overall buffer cache cleanup is not necessary
during freeze, fix up this operation such that it is safe to invoke
during non-unmount quiesce operations. Replace the drain_workqueue()
call with flush_workqueue(), which runs a similar serialization on
pending workqueue jobs without causing new jobs to be dropped. This is
safe for unmount as unmount independently locks out new operations by
the time xfs_wait_buftarg() is invoked.

cc: <stable@vger.kernel.org>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

800b2694

xfs: fix superblock inprogress check · f3d7ebde

由 Dave Chinner 提交于 8月 26, 2016

From inspection, the superblock sb_inprogress check is done in the
verifier and triggered only for the primary superblock via a
"bp->b_bn == XFS_SB_DADDR" check.

Unfortunately, the primary superblock is an uncached buffer, and
hence it is configured by xfs_buf_read_uncached() with:

	bp->b_bn = XFS_BUF_DADDR_NULL;  /* always null for uncached buffers */

And so this check never triggers. Fix it.

cc: <stable@vger.kernel.org>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

f3d7ebde

xfs: simple btree query range should look right if LE lookup fails · 5b5c2dbd

由 Darrick J. Wong 提交于 8月 26, 2016

If the initial LOOKUP_LE in the simple query range fails to find
anything, we should attempt to increment the btree cursor to see
if there actually /are/ records for what we're trying to find.
Without this patch, a bnobt range query of (0, $agsize) returns
no results because the leftmost record never has a startblock
of zero.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

5b5c2dbd

xfs: fix some key handling problems in _btree_simple_query_range · 72227899

由 Darrick J. Wong 提交于 8月 26, 2016

We only need the record's high key for the first record that we look
at; for all records, we /definitely/ need the regular record key.
Therefore, fix how the simple range query function gets its keys.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

72227899

xfs: don't log the entire end of the AGF · da1f039d

由 Darrick J. Wong 提交于 8月 26, 2016

When we're logging the last non-spare field in the AGF, we don't
need to log the spare fields, so plumb in a new AGF logging flag
to help us avoid that.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

da1f039d

xfs: disallow mounting of realtime + rmap filesystems · 738f57c1

由 Darrick J. Wong 提交于 8月 26, 2016

Since the kernel doesn't currently support the realtime rmapbt,
don't allow such filesystems to be mounted.  Support will appear
in a future release.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

738f57c1

xfs: don't perform lookups on zero-height btrees · ed150e1a

由 Darrick J. Wong 提交于 8月 26, 2016

If the caller passes in a cursor to a zero-height btree (which is
impossible), we never set block to anything but NULL, which causes the
later dereference of it to crash.  Instead, just return -EFSCORRUPTED.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ed150e1a

17 8月, 2016 12 次提交

xfs: remove OWN_AG rmap when allocating a block from the AGFL · a03f1a66

由 Darrick J. Wong 提交于 8月 17, 2016

When we're really tight on space, xfs_alloc_ag_vextent_small() can
allocate a block from the AGFL and give it to the caller.  Since the
caller is never the AGFL-fixing method, we must remove the OWN_AG
reverse mapping because it will clash with whatever rmap the caller
wants to set up.  This bug was discovered by running generic/299
repeatedly.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

a03f1a66

xfs: (re-)implement FIEMAP_FLAG_XATTR · 1d4795e7

由 Christoph Hellwig 提交于 8月 17, 2016

Use a special read-only iomap_ops implementation to support fiemap on
the attr fork.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

1d4795e7

xfs: simplify xfs_file_iomap_begin · b95a2127

由 Christoph Hellwig 提交于 8月 17, 2016

We'll never get nimap == 0 for a successful return from xfs_bmapi_read,
so don't try to handle it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

b95a2127

iomap: mark ->iomap_end as optional · f20ac7ab

由 Christoph Hellwig 提交于 8月 17, 2016

No need to implement it for read-only mappings.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

f20ac7ab

iomap: prepare iomap_fiemap for attribute mappings · ac2dc058

由 Dave Chinner 提交于 8月 17, 2016

By bassing through an -ENOENT, similar to the old XFS implementation of
FIEMAP_FLAG_XATTR.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
[hch: split from a larger patch]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

ac2dc058

iomap: fiemap should honor the FIEMAP_FLAG_SYNC flag · 8896b8f6

由 Dave Chinner 提交于 8月 17, 2016

The flag is checked as supported, but then we do an unconditional
sync of the file, regardless of whether the flag is set or not. Make
the sync conditional on having the FIEMAP_FLAG_SYNC flag set.
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

8896b8f6

iomap: remove superflous pagefault_disable from iomap_write_actor · 274c8874

由 Christoph Hellwig 提交于 8月 17, 2016

iov_iter_copy_from_user_atomic disables page faults internally, no need to
do it around the call.  This also brings the iomap code in line with
the original filemap version.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

274c8874

iomap: remove superflous mark_page_accessed from iomap_write_actor · 97dd8c9e

由 Christoph Hellwig 提交于 8月 17, 2016

This catches up with commit  2457ae ("mm: non-atomically mark page
accessed during page cache allocation where possible"), which
moved the initial access marking into the pagecache allocator.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

97dd8c9e

xfs: store rmapbt block count in the AGF · f32866fd

由 Darrick J. Wong 提交于 8月 17, 2016

Track the number of blocks used for the rmapbt in the AGF.  When we
get to the AG reservation code we need this counter to quickly
make our reservation during mount.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

f32866fd

xfs: don't invalidate whole file on DAX read/write · 8b2180b3

由 Dave Chinner 提交于 8月 17, 2016

When we do DAX IO, we try to invalidate the entire page cache held
on the file. This is incorrect as it will trash the entire mapping
tree that now tracks dirty state in exceptional entries in the radix
tree slots.

What we are trying to do is remove cached pages (e.g from reads
into holes) that sit in the radix tree over the range we are about
to write to. Hence we should just limit the invalidation to the
range we are about to overwrite.
Reported-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDave Chinner <david@fromorbit.com>

8b2180b3

xfs: fix bogus space reservation in xfs_iomap_write_allocate · 0af32fb4

由 Christoph Hellwig 提交于 8月 17, 2016

The space reservations was without an explaination in commit

    "Add error reporting calls in error paths that return EFSCORRUPTED"

back in 2003.  There is no reason to reserve disk blocks in the
transaction when allocating blocks for delalloc space as we already
reserved the space when creating the delalloc extent.

With this fix we stop running out of the reserved pool in
generic/229, which has happened for long time with small blocksize
file systems, and has increased in severity with the new buffered
write path.

[ dchinner: we still need to pass the block reservation into
  xfs_bmapi_write() to ensure we don't deadlock during AG selection.
  See commit dbd5c8c9 ("xfs: pass total block res. as total
  xfs_bmapi_write() parameter") for more details on why this is
  necessary. ]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

0af32fb4

xfs: don't assert fail on non-async buffers on ioacct decrement · 4dd3fd71

由 Brian Foster 提交于 8月 17, 2016

The buffer I/O accounting mechanism tracks async buffers under I/O. As
an optimization, the buffer I/O count is incremented only once on the
first async I/O for a given hold cycle of a buffer and decremented once
the buffer is released to the LRU (or freed).

xfs_buf_ioacct_dec() has an ASSERT() check for an XBF_ASYNC buffer, but
we have one or two corner cases where a buffer can be submitted for I/O
multiple times via different methods in a single hold cycle. If an async
I/O occurs first, the I/O count is incremented. If a sync I/O occurs
before the hold count drops, XBF_ASYNC is cleared by the time the I/O
count is decremented.

Remove the async assert check from xfs_buf_ioacct_dec() as this is a
perfectly valid scenario. For the purposes of I/O accounting, we really
only care about the buffer async state at I/O submission time.
Discovered-and-analyzed-by: NDave Chinner <david@fromorbit.com>
Signed-off-by: NBrian Foster <bfoster@redhat.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

4dd3fd71

13 8月, 2016 1 次提交

nfsd: don't return an unhashed lock stateid after taking mutex · dd257933

由 Jeff Layton 提交于 8月 11, 2016

nfsd4_lock will take the st_mutex before working with the stateid it
gets, but between the time when we drop the cl_lock and take the mutex,
the stateid could become unhashed (a'la FREE_STATEID). If that happens
the lock stateid returned to the client will be forgotten.

Fix this by first moving the st_mutex acquisition into
lookup_or_create_lock_state. Then, have it check to see if the lock
stateid is still hashed after taking the mutex. If it's not, then put
the stateid and try the find/create again.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Tested-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
Cc: stable@vger.kernel.org # feb9dad5 nfsd: Always lock state exclusively.
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

dd257933

12 8月, 2016 2 次提交

proc, meminfo: use correct helpers for calculating LRU sizes in meminfo · 2f95ff90

由 Mel Gorman 提交于 8月 11, 2016

meminfo_proc_show() and si_mem_available() are using the wrong helpers
for calculating the size of the LRUs. The user-visible impact is that
there appears to be an abnormally high number of unevictable pages.

Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2f95ff90

nfsd: Fix race between FREE_STATEID and LOCK · 42691398

由 Chuck Lever 提交于 8月 11, 2016

When running LTP's nfslock01 test, the Linux client can send a LOCK
and a FREE_STATEID request at the same time. The outcome is:

Frame 324    R OPEN stateid [2,O]

Frame 115004 C LOCK lockowner_is_new stateid [2,O] offset 672000 len 64
Frame 115008 R LOCK stateid [1,L]
Frame 115012 C WRITE stateid [0,L] offset 672000 len 64
Frame 115016 R WRITE NFS4_OK
Frame 115019 C LOCKU stateid [1,L] offset 672000 len 64
Frame 115022 R LOCKU NFS4_OK
Frame 115025 C FREE_STATEID stateid [2,L]
Frame 115026 C LOCK lockowner_is_new stateid [2,O] offset 672128 len 64
Frame 115029 R FREE_STATEID NFS4_OK
Frame 115030 R LOCK stateid [3,L]
Frame 115034 C WRITE stateid [0,L] offset 672128 len 64
Frame 115038 R WRITE NFS4ERR_BAD_STATEID

In other words, the server returns stateid L in a successful LOCK
reply, but it has already released it. Subsequent uses of stateid L
fail.

To address this, protect the generation check in nfsd4_free_stateid
with the st_mutex. This should guarantee that only one of two
outcomes occurs: either LOCK returns a fresh valid stateid, or
FREE_STATEID returns NFS4ERR_LOCKS_HELD.
Reported-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
Fix-suggested-by: NJeff Layton <jlayton@redhat.com>
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

42691398

11 8月, 2016 1 次提交

nfsd: fix dentry refcounting on create · 502aa0a5

由 Josef Bacik 提交于 8月 10, 2016

b44061d0 introduced a dentry ref counting bug.  Previously we were
grabbing one ref to dchild in nfsd_create(), but with the creation of
nfsd_create_locked() we have a ref for dchild from the lookup in
nfsd_create(), and then another ref in nfsd_create_locked().  The ref
from the lookup in nfsd_create() is never dropped and results in
dentries still in use at unmount.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Fixes: b44061d0 "nfsd: reorganize nfsd_create"
Reported-by: Nkernel test robot <xiaolong.ye@intel.com>
Reviewed-by: NJeff Layton <jlayton@redhat.com>
Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

502aa0a5

10 8月, 2016 2 次提交

mm, writeback: flush plugged IO in wakeup_flusher_threads() · 51350ea0

由 Konstantin Khlebnikov 提交于 8月 04, 2016

I've found funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() task holds on to
its plug bio which blocks the barrier in raid10. Its memory cgroup have
no free memory thus the task goes into reclaimer but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.

Common flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.

Lock is 'live' because changing memory limit or killing tasks which
holds that stuck bio unblock whole progress.

That was what happened in 3.18.x but I see no difference in upstream
logic.  Theoretically this might happen even without memory cgroup.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: NJens Axboe <axboe@fb.com>

51350ea0

mm: memcontrol: only mark charged pages with PageKmemcg · c4159a75

由 Vladimir Davydov 提交于 8月 08, 2016

To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true.  The latter is set iff there are kmem-enabled memory cgroups
(online or offline).  The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148a ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c4159a75

09 8月, 2016 2 次提交
- I
  ceph: initialize pathbase in the !dentry case in encode_caps_cb() · 4eacd4cb
  由 Ilya Dryomov 提交于 8月 09, 2016
```
pathbase is the base inode; set it to 0 if we've got no path.

Coverity-id: 146348
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NAlex Elder <elder@linaro.org>
```
  4eacd4cb
- Y
  ceph: fix null pointer dereference in ceph_flush_snaps() · e4d2b16a
  由 Yan, Zheng 提交于 8月 04, 2016
```
Signed-off-by: NYan, Zheng <zyan@redhat.com>
```
  e4d2b16a
08 8月, 2016 2 次提交

block: rename bio bi_rw to bi_opf · 1eff9d32

由 Jens Axboe 提交于 8月 05, 2016

Since commit 63a4cc24, bio->bi_rw contains flags in the lower
portion and the op code in the higher portions. This means that
old code that relies on manually setting bi_rw is most likely
going to be broken. Instead of letting that brokeness linger,
rename the member, to force old and out-of-tree code to break
at compile time instead of at runtime.

No intended functional changes in this commit.
Signed-off-by: NJens Axboe <axboe@fb.com>

1eff9d32

block/mm: make bdev_ops->rw_page() take a bool for read/write · c11f0c0b

由 Jens Axboe 提交于 8月 05, 2016

Commit abf54548 changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.
Reviewed-by: NMike Christie <mchristi@redhat.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c11f0c0b

07 8月, 2016 1 次提交

fs: return EPERM on immutable inode · 337684a1

由 Eryu Guan 提交于 8月 02, 2016

In most cases, EPERM is returned on immutable inode, and there're only a
few places returning EACCES. I noticed this when running LTP on
overlayfs, setxattr03 failed due to unexpected EACCES on immutable
inode.

So converting all EACCES to EPERM on immutable inode.
Acked-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NEryu Guan <guaneryu@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

337684a1

06 8月, 2016 3 次提交

NFSv4: Cap the transport reconnection timer at 1/2 lease period · 8d480326

由 Trond Myklebust 提交于 8月 05, 2016

We don't want to miss a lease period renewal due to the TCP connection
failing to reconnect in a timely fashion. To ensure this doesn't happen,
cap the reconnection timer so that we retry the connection attempt
at least every 1/2 lease period.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

8d480326

NFSv4: Cleanup the setting of the nfs4 lease period · fb10fb67

由 Trond Myklebust 提交于 8月 05, 2016

Make a helper function nfs4_set_lease_period() and have
nfs41_setup_state_renewal() and nfs4_do_fsinfo() use it.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

fb10fb67

ramoops: use persistent_ram_free() instead of kfree() for freeing prz · e976e564

由 Hiraku Toyooka 提交于 7月 25, 2016

persistent_ram_zone(=prz) structures are allocated by persistent_ram_new(),
which includes vmap() or ioremap(). But they are currently freed by
kfree(). This uses persistent_ram_free() for correct this asymmetry usage.
Signed-off-by: NHiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Signed-off-by: NNobuhiro Iwamatsu <nobuhiro.iwamatsu.kw@hitachi.com>
Cc: Mark Salyzyn <salyzyn@android.com>
Cc: Seiji Aguchi <seiji.aguchi.tr@hitachi.com>
Signed-off-by: NKees Cook <keescook@chromium.org>

e976e564

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功