Commits · 5636ec4eb6b804cd7e67e3a896f1624609dfb427 · openanolis / cloud-kernel

09 Aug, 2018 3 commits

NFSv4: Detect nlink changes on cross-directory renames too · 5636ec4e

Committed by Trond Myklebust on Jul 31, 2018

If the object being renamed from one directory to another is also
a directory, then 'nlink' will change for both directories.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFSv4: bump/drop the nlink count on the parent dir when we mkdir/rmdir · 3c591175

Committed by Trond Myklebust on Jul 31, 2018

Ensure that we always bump or drop the nlink count on the parent directory
when we do a mkdir or a rmdir(). This needs to be done by hand as we don't
have pre/post op attributes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

pnfs: Fix handling of NFS4ERR_OLD_STATEID replies to layoutreturn · c16467dc

Committed by Trond Myklebust on Jul 29, 2018

If the server tells us that our layoutreturn raced with another layout
update, then we must ensure that the new layout segments are not in use
before we resend with an updated layout stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

01 Aug, 2018 4 commits

NFSv4 client live hangs after live data migration recovery · 0f90be13

Committed by Bill Baker on Jun 19, 2018

After a live data migration event at the NFS server, the client may send
I/O requests to the wrong server, causing a live hang due to repeated
recovery events.  On the wire, this will appear as an I/O request failing
with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly.
NFS4ERR_BADSESSION is returned because the session ID being used was
issued by the other server and is not valid at the old server.

The failure is caused by async worker threads having cached the transport
(xprt) in the rpc_task structure.  After the migration recovery completes,
the task is redispatched and the task resends the request to the wrong
server based on the old value still present in tk_xprt.

The solution is to recompute the tk_xprt field of the rpc_task structure
so that the request goes to the correct server.
Signed-off-by: Bill Baker <bill.baker@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Fixes: fb43d172 ("SUNRPC: Use the multipath iterator to assign a ...")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFSv4.0 fix client reference leak in callback · 32cd3ee5

Committed by Olga Kornievskaia on Jul 26, 2018

If there is an error during processing of a callback message, it leads
to a reference leak on the client structure and eventually an unclean
superblock.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

NFS: silence a harmless uninitialized variable warning · 379ebf07

Committed by Dan Carpenter on Jul 12, 2018

kstrtoul() can return -ERANGE so Smatch complains that "num" can be
uninitialized.  We check that it's within bounds so it's not a huge
deal.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

sunrpc: Change rpc_print_iostats to rpc_clnt_show_stats and handle rpc_clnt clones · 016583d7

Committed by Dave Wysochanski on Jul 31, 2018

The existing rpc_print_iostats has a few shortcomings. First, the naming
is not consistent with other functions in the kernel that display stats.
Second, it is really displaying stats for an rpc_clnt structure as it
displays both xprt stats and per-op stats. Third, it does not handle
rpc_clnt clones, which is important for the one in-kernel tree caller
of this function, the NFS client's nfs_show_stats function.

Fix all of the above by renaming the rpc_print_iostats to
rpc_clnt_show_stats and looping through any rpc_clnt clones via
cl_parent.
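
The clone walk described above can be pictured with a small user-space model. This is only a sketch: `struct clnt_model`, its `ops_sent` field and `clnt_total_ops()` are hypothetical names, and the convention that a root client's parent pointer refers to itself is an assumption made for the illustration, not a quote of the sunrpc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an rpc_clnt: one counter plus a parent link. */
struct clnt_model {
    struct clnt_model *parent;   /* assumed: the root points to itself */
    unsigned long ops_sent;      /* illustrative per-client op count */
};

/* Sum the counter over a client and every ancestor it was cloned from. */
unsigned long clnt_total_ops(const struct clnt_model *clnt)
{
    unsigned long total = 0;

    assert(clnt != NULL);
    while (clnt != NULL) {
        total += clnt->ops_sent;
        if (clnt->parent == clnt)   /* reached the root of the clone chain */
            break;
        clnt = clnt->parent;
    }
    return total;
}
```

Calling `clnt_total_ops()` on a clone reports the clone's own count plus its parent's, which is the kind of combined view the renamed function is described as printing.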

Once this interface is fixed, this addresses a problem with NFSv4.
Before this patch, the /proc/self/mountstats always showed incorrect
counts for NFSv4 lease and session related opcodes such as SEQUENCE,
RENEW, SETCLIENTID, CREATE_SESSION, etc. These counts were always 0
even though many ops would go over the wire. The reason for this is
there are multiple rpc_clnt structures allocated for any given NFSv4
mount, and inside nfs_show_stats() we called into rpc_print_iostats()
which only handled one of them, nfs_server->client. Fix these counts
by calling sunrpc's new rpc_clnt_show_stats() function, which handles
cloned rpc_clnt structs and prints the stats together.

Note that one side-effect of the above is that multiple mounts from
the same NFS server will show identical counts in the above ops due
to the fact the one rpc_clnt (representing the NFSv4 client state)
is shared across mounts.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

31 Jul, 2018 5 commits

pnfs/blocklayout: off by one in bl_map_stripe() · 0914bb96

Committed by Dan Carpenter on Jul 04, 2018

"dev->nr_children" is the number of children which were parsed
successfully in bl_parse_stripe().  It could be all of them and then, in
that case, it is equal to v->stripe.volumes_count.  Either way, the >
should be >= so that we don't go beyond the end of what we're supposed
to.
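
The general shape of this kind of off-by-one can be shown in isolation. The helper below is a hypothetical illustration, not the body of bl_map_stripe(): with `nr_children` entries the valid indices run from 0 to `nr_children - 1`, so an index equal to `nr_children` has to be rejected as well.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lookup: return the child at idx, or -1 if idx is out of range. */
int child_at(const int *children, size_t nr_children, size_t idx)
{
    assert(children != NULL);
    if (idx >= nr_children)   /* a '>' here would let idx == nr_children through */
        return -1;
    return children[idx];
}
```

With four children, `child_at(children, 4, 4)` returns -1 instead of reading one element past the end of the array.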

Fixes: 5c83746a ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

nfs: Referrals not inheriting proto setting from parent · 23a88ade

Committed by Calum Mackay on Jul 05, 2018

Commit 530ea421 ("nfs: Referrals should use the same proto setting
as their parent") encloses the fix with #ifdef CONFIG_SUNRPC_XPRT_RDMA.

CONFIG_SUNRPC_XPRT_RDMA is a tristate option, so it should be tested
with #if IS_ENABLED().
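
A user-space sketch of why the plain #ifdef misses the modular case. It assumes the option is set to "m", in which case kconfig defines only the `_MODULE` variant of the symbol. The `RDMA_ENABLED` macro is a simplified stand-in written for this example, not the kernel's IS_ENABLED() implementation, and both function names are hypothetical.

```c
#include <assert.h>

/* Assumption for this sketch: the option is built as a module (=m). */
#define CONFIG_SUNRPC_XPRT_RDMA_MODULE 1

/* What a plain #ifdef on the base symbol sees. */
int ifdef_sees_rdma(void)
{
#ifdef CONFIG_SUNRPC_XPRT_RDMA
    return 1;
#else
    return 0;
#endif
}

/* Simplified stand-in for IS_ENABLED(): true for built-in (=y) or module (=m). */
#if defined(CONFIG_SUNRPC_XPRT_RDMA) || defined(CONFIG_SUNRPC_XPRT_RDMA_MODULE)
#define RDMA_ENABLED 1
#else
#define RDMA_ENABLED 0
#endif

static_assert(RDMA_ENABLED == 1, "a modular build must count as enabled");

int is_enabled_sees_rdma(void)
{
    return RDMA_ENABLED;
}
```

Under the stated assumption, `ifdef_sees_rdma()` returns 0 while `is_enabled_sees_rdma()` returns 1, so code guarded by the #ifdef would be compiled out of a modular build.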

Fixes: 530ea421 ("nfs: Referrals should use the same proto setting as their parent")
Reported-by: Helen Chao <helen.chao@oracle.com>
Tested-by: Helen Chao <helen.chao@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Bill Baker <bill.baker@oracle.com>
Signed-off-by: Calum Mackay <calum.mackay@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

nfs: initiate returning delegation when reclaiming one that's been recalled · 8b199e58

Committed by Jeff Layton on Jul 05, 2018

When reclaiming a delegation via CLAIM_PREVIOUS open, the server can
indicate that the delegation has been recalled since it was issued by
setting the "recalled" flag in the delegation.

Ensure that we respect the flag by initiating a delegation return when
it is set.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

fs: nfs: Adding new return type vm_fault_t · 01a36844

Committed by Souptick Joarder on Jul 02, 2018

Use new return type vm_fault_t for fault handler
in struct vm_operations_struct. For now, this is
just documenting that the function returns a
VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become
a distinct type.

see commit 1c8f4220 ("mm: change return type to
vm_fault_t") for reference.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

nfs: add error check in nfs_idmap_prepare_message() · 12b289cf

Committed by Chengguang Xu on Jun 28, 2018

Even though the caller of nfs_idmap_prepare_message() checks the return
code on its side, it's better to add an error check for match_int()
so that we can avoid unnecessary operations when a bad int arg is
detected.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

27 Jul, 2018 12 commits

Fix error code in nfs_lookup_verify_inode() · a61246c9

Committed by Lance Shelton on Jul 16, 2018

Return -ESTALE to force a lookup when the file has no more links.
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: More excessive attribute revalidation in nfs_execute_ok() · 3825827e

Committed by Trond Myklebust on Jul 24, 2018

execute_ok() will only check the mode bits if the object is not a
directory, so we don't need to revalidate the attributes in that case.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Fix excessive attribute revalidation in nfs_execute_ok() · cf834027

Committed by Trond Myklebust on Jul 20, 2018

When nfs_update_inode() sets NFS_INO_INVALID_ACCESS it is a sign that
we want to revalidate the access cache, not the inode attributes.
In fact we only want to revalidate here if we see that the mode bits
are invalid, so check for NFS_INO_INVALID_OTHER instead.
Reported-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Ensure we immediately start writeback on rescheduled writes · 7be7b3ca

Committed by Trond Myklebust on Jul 04, 2018

If the writes are being rescheduled due to a pNFS error, then we really
want to immediately start a new flush. The O_DIRECT code already does
this, so we only need to worry about buffered writes.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFSv4.1: Fix a potential layoutget/layoutrecall deadlock · bd3d16a8

Committed by Trond Myklebust on Jul 12, 2018

If the client is sending a layoutget, but the server issues a callback
to recall what it thinks may be an outstanding layout, then we may find
an uninitialised layout attached to the inode due to the layoutget.
In that case, it is appropriate to return NFS4ERR_NOMATCHING_LAYOUT
rather than NFS4ERR_DELAY, as the latter can end up deadlocking.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Parse the results of layoutget on open even if permissions checks fail · af9b6d75

Committed by Trond Myklebust on Jun 29, 2018

Even if the results of the permissions checks failed, we should parse
the results of the layout on open call so that we can return the
layout if required.
Note that we also want to ignore the sequence counter for whether or not
a layout recall occurred. If the recall pertained to our OPEN, then the
callback will know, and will attempt to wait for us to finish processing
anyway.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

NFS: Allow optimisation of lseek(fd, SEEK_CUR, 0) on directories · b2b1ff3d

Committed by Trond Myklebust on Jun 27, 2018

There should be no need to grab the inode lock if we're only reading
the file offset.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout() · 411ae722

Committed by Trond Myklebust on Jun 23, 2018

If the old layout was recalled, and we returned NFS4ERR_NOMATCHING_LAYOUT
then we need to wait for all outstanding layoutget calls to complete
before we can send a new one.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS/flexfiles: Ensure we always return a layout if it has layoutstats · 056f9ad6

Committed by Trond Myklebust on Jun 23, 2018

If a layout segment is carrying layoutstats or layout error information,
then we always want to return it rather than using a forgetful model.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Ignore non-recalled layouts in pnfs_layout_need_return() · f0b42981

Committed by Trond Myklebust on Jun 23, 2018

If a layout has been recalled, then we should fire off a layoutreturn as
soon as all the layout segments that match the recall have been retired.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Don't update the stateid when replying NFS4ERR_DELAY to a layout recall · 00bcbe11

Committed by Trond Myklebust on Jun 23, 2018

RFC5661 doesn't state directly that the client should update the layout
stateid if it returns NFS4ERR_NOMATCHING_LAYOUT in response to a recall,
however it does state that this error will "cleanly indicate completion"
on par with returning the layout. For this reason, we assume that the
client should update the layout stateid. The Linux pNFS server definitely
does expect this behaviour.

However, if the client replies NFS4ERR_DELAY, then it is stating that
the recall was not processed, so it would be very wrong to update the
layout stateid.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

pNFS: Don't discard layout segments that are marked for return · e0b7d420

Committed by Trond Myklebust on Jun 23, 2018

If there are layout segments that are marked for return, then we need
to ensure that pnfs_mark_matching_lsegs_return() does not just
silently discard them, but it should tell the caller that there is a
layoutreturn scheduled.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

25 Jul, 2018 5 commits

cachefiles: Wait rather than BUG'ing on "Unexpected object collision" · c2412ac4

Committed by Kiran Kumar Modukuri on Jun 21, 2018

If we meet a conflicting object that is marked FSCACHE_OBJECT_IS_LIVE in
the active object tree, we have been emitting a BUG after logging
information about it and the new object.

Instead, we should wait for the CACHEFILES_OBJECT_ACTIVE flag to be cleared
on the old object (or return an error). The ACTIVE flag should be cleared
after it has been removed from the active object tree. A timeout of 60s is
used in the wait, so we shouldn't be able to get stuck there.

Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>

cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag · 5ce83d4b

Committed by Kiran Kumar Modukuri on Jun 21, 2018

In cachefiles_mark_object_active(), the new object is marked active and
then we try to add it to the active object tree.  If a conflicting object
is already present, we want to wait for that to go away.  After the wait,
we go round again and try to re-mark the object as being active - but it's
already marked active from the first time we went through and a BUG is
issued.

Fix this by clearing the CACHEFILES_OBJECT_ACTIVE flag before we try again.

Analysis from Kiran Kumar Modukuri:

[Impact]
Oops during heavy NFS + FSCache + Cachefiles

CacheFiles: Error: Overlong wait for old active object to go away.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000002

CacheFiles: Error: Object already active kernel BUG at
fs/cachefiles/namei.c:163!

[Cause]
In a heavily loaded system with big files being read and truncated, an
fscache object for a cookie is being dropped and a new object is being
looked up. The new object being looked up has to wait for the old object
to go away before the new object is moved to active state.

[Fix]
Clear the flag 'CACHEFILES_OBJECT_ACTIVE' for the new object when
retrying the object lookup.

[Testcase]
Have run ~100 hours of NFS stress tests and have not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>

fscache: Fix reference overput in fscache_attach_object() error handling · f29507ce

Committed by Kiran Kumar Modukuri on Jun 21, 2018

When a cookie is allocated that causes fscache_object structs to be
allocated, those objects are initialised with the cookie pointer, but
aren't blessed with a ref on that cookie unless the attachment is
successfully completed in fscache_attach_object().

If attachment fails because the parent object was dying or there was a
collision, fscache_attach_object() returns without incrementing the cookie
counter - but upon failure of this function, the object is released which
then puts the cookie, whether or not a ref was taken on the cookie.

Fix this by taking a ref on the cookie when it is assigned in
fscache_object_init(), even when we're creating a root object.


Analysis from Kiran Kumar:

This bug has been seen in 4.4.0-124-generic #148-Ubuntu kernel

BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1776277

fscache cookie ref count updated incorrectly during fscache object
allocation resulting in following Oops.

kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/internal.h:321!
kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!

[Cause]
Two threads are trying to operate on a cookie and two objects.

(1) One thread tries to unmount the filesystem and in the process goes over a
    huge list of objects marking them dead and deleting the objects.
    cookie->usage is also decremented in the following path:

      nfs_fscache_release_super_cookie
       -> __fscache_relinquish_cookie
        ->__fscache_cookie_put
        ->BUG_ON(atomic_read(&cookie->usage) <= 0);

(2) A second thread tries to look up an object for reading data in the following
    path:

    fscache_alloc_object
    1) cachefiles_alloc_object
        -> fscache_object_init
           -> assign cookie, but usage not bumped.
    2) fscache_attach_object -> fails in cant_attach_object because the
         cookie's backing object or cookie's->parent object are going away
    3) fscache_put_object
        -> cachefiles_put_object
          ->fscache_object_destroy
            ->fscache_cookie_put
               ->BUG_ON(atomic_read(&cookie->usage) <= 0);

[NOTE from dhowells] It's unclear as to the circumstances in which (2) can
take place, given that thread (1) is in nfs_kill_super(), however a
conflicting NFS mount with slightly different parameters that creates a
different superblock would do it.  A backtrace from Kiran seems to show
that this is a possibility:

    kernel BUG at /build/linux-Y09MKI/linux-4.4.0/fs/fscache/cookie.c:639!
    ...
    RIP: __fscache_cookie_put+0x3a/0x40 [fscache]
    Call Trace:
     __fscache_relinquish_cookie+0x87/0x120 [fscache]
     nfs_fscache_release_super_cookie+0x2d/0xb0 [nfs]
     nfs_kill_super+0x29/0x40 [nfs]
     deactivate_locked_super+0x48/0x80
     deactivate_super+0x5c/0x60
     cleanup_mnt+0x3f/0x90
     __cleanup_mnt+0x12/0x20
     task_work_run+0x86/0xb0
     exit_to_usermode_loop+0xc2/0xd0
     syscall_return_slowpath+0x4e/0x60
     int_ret_from_sys_call+0x25/0x9f

[Fix] Bump up the cookie usage in fscache_object_init, when it is first
being assigned a cookie atomically such that the cookie is added and bumped
up if its refcount is not zero.  Remove the assignment in
fscache_attach_object().
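
The idea of pairing the pointer assignment with a reference, and taking that reference only while the count is still non-zero, can be sketched in user-space C11. All names below (`cookie_model`, `object_model`, `cookie_get_unless_zero()` and so on) are hypothetical, and the sketch models the refcounting rule only, not the fscache object state machine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct cookie_model { atomic_int usage; };
struct object_model { struct cookie_model *cookie; };

/* Take a reference only if the count has not already dropped to zero. */
bool cookie_get_unless_zero(struct cookie_model *c)
{
    int old = atomic_load(&c->usage);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&c->usage, &old, old + 1))
            return true;
        /* 'old' was refreshed by the failed exchange; retry. */
    }
    return false;
}

/* Init assigns the cookie and takes the matching reference in one step. */
bool object_init(struct object_model *obj, struct cookie_model *c)
{
    if (!cookie_get_unless_zero(c)) {
        obj->cookie = NULL;
        return false;
    }
    obj->cookie = c;
    return true;
}

/* Destroy can then drop the reference whenever a cookie is attached. */
void object_destroy(struct object_model *obj)
{
    if (obj->cookie != NULL) {
        int prev = atomic_fetch_sub(&obj->cookie->usage, 1);

        assert(prev > 0);   /* an overput would trip here */
        (void)prev;
        obj->cookie = NULL;
    }
}
```

Because every attached cookie pointer is backed by exactly one reference, the release path no longer needs to know whether a later attach step succeeded.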

[Testcase]
I have run ~100 hours of NFS stress tests and not seen this bug recur.

[Regression Potential]
 - Limited to fscache/cachefiles.

Fixes: ccc4fc3d ("FS-Cache: Implement the cookie management part of the netfs API")
Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>

cachefiles: Fix refcounting bug in backing-file read monitoring · 934140ab

Committed by Kiran Kumar Modukuri on Jul 18, 2017

cachefiles_read_waiter() has the right to access a 'monitor' object by
virtue of being called under the waitqueue lock for one of the pages in its
purview.  However, it has no ref on that monitor object or on the
associated operation.

What it is allowed to do is to move the monitor object to the operation's
to_do list, but once it drops the work_lock, it's actually no longer
permitted to access that object.  However, it is trying to enqueue the
retrieval operation for processing - but it can only do this via a pointer
in the monitor object, something it shouldn't be doing.

If it doesn't enqueue the operation, the operation may not get processed.
If the order is flipped so that the enqueue is first, then it's possible
for the work processor to look at the to_do list before the monitor is
enqueued upon it.

Fix this by getting a ref on the operation so that we can trust that it
will still be there once we've added the monitor to the to_do list and
dropped the work_lock.  The op can then be enqueued after the lock is
dropped.

The bug can manifest in one of a couple of ways.  The first manifestation
looks like:

 FS-Cache:
 FS-Cache: Assertion failed
 FS-Cache: 6 == 5 is false
 ------------[ cut here ]------------
 kernel BUG at fs/fscache/operation.c:494!
 RIP: 0010:fscache_put_operation+0x1e3/0x1f0
 ...
 fscache_op_work_func+0x26/0x50
 process_one_work+0x131/0x290
 worker_thread+0x45/0x360
 kthread+0xf8/0x130
 ? create_worker+0x190/0x190
 ? kthread_cancel_work_sync+0x10/0x10
 ret_from_fork+0x1f/0x30

This is due to the operation being in the DEAD state (6) rather than
INITIALISED, COMPLETE or CANCELLED (5) because it's already passed through
fscache_put_operation().

The bug can also manifest like the following:

 kernel BUG at fs/fscache/operation.c:69!
 ...
    [exception RIP: fscache_enqueue_operation+246]
 ...
 #7 [ffff883fff083c10] fscache_enqueue_operation at ffffffffa0b793c6
 #8 [ffff883fff083c28] cachefiles_read_waiter at ffffffffa0b15a48
 #9 [ffff883fff083c48] __wake_up_common at ffffffff810af028

I'm not entirely certain as to which is line 69 in Lei's kernel, so I'm not
entirely clear which assertion failed.

Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Lei Xue <carmark.dlut@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Reported-by: Anthony DeRobertis <aderobertis@metrics.net>
Reported-by: NeilBrown <neilb@suse.com>
Reported-by: Daniel Axtens <dja@axtens.net>
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Daniel Axtens <dja@axtens.net>

fscache: Allow cancelled operations to be enqueued · d0eb06af

Committed by Kiran Kumar Modukuri on Jul 25, 2018

Alter the state-check assertion in fscache_enqueue_operation() to allow
cancelled operations to be given processing time so they can be cleaned up.

Also fix a debugging statement that was requiring such operations to have
an object assigned.

Fixes: 9ae326a6 ("CacheFiles: A cache that backs onto a mounted filesystem")
Reported-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>

22 Jul, 2018 3 commits

mm: make vm_area_alloc() initialize core fields · 490fc053

Committed by Linus Torvalds on Jul 21, 2018

Like vm_area_dup(), it initializes the anon_vma_chain head, and the
basic mm pointer.

The rest of the fields end up being different for different users,
although the plan is to also initialize the 'vm_ops' field to a dummy
entry.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5

Committed by Linus Torvalds on Jul 21, 2018

The vm_area_struct is one of the most fundamental memory management
objects, but the management of it is entirely open-coded everywhere,
ranging from allocation and freeing (using kmem_cache_[z]alloc and
kmem_cache_free) to initializing all the fields.

We want to unify this in order to end up having some unified
initialization of the vmas, and the first step to this is to at least
have basic allocation functions.

Right now those functions are literally just wrappers around the
kmem_cache_*() calls.  This is a purely mechanical conversion:

    # new vma:
    kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()

    # copy old vma
    kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)

    # free vma
    kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)

to the point where the old vma passed in to the vm_area_dup() function
isn't even used yet (because I've left all the old manual initialization
alone).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fat: fix memory allocation failure handling of match_strdup() · 35033ab9

Committed by OGAWA Hirofumi on Jul 20, 2018

In parse_options(), if match_strdup() failed, parse_options() leaves
opts->iocharset in an unexpected state (i.e. still pointing to the freed
string). This can cause a double free.

To fix this, always reinitialize opts->iocharset when freeing.
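
A user-space sketch of the "reset the pointer whenever it is freed" pattern. The struct, the `default_charset` value and both function names are hypothetical, and `malloc()` plus `strcpy()` stand in for match_strdup(); the point is only that the field holds a valid pointer on every path, including the allocation-failure path.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static char default_charset[] = "utf8";   /* hypothetical default */

struct opts_model { char *iocharset; };

/* Free a heap-allocated charset and fall back to the static default,
 * so the field never keeps pointing at freed memory. */
void reset_iocharset(struct opts_model *o)
{
    if (o->iocharset != default_charset) {
        free(o->iocharset);            /* free(NULL) is a no-op */
        o->iocharset = default_charset;
    }
}

/* Replace the charset; on allocation failure the field is still valid. */
int set_iocharset(struct opts_model *o, const char *name)
{
    char *copy;

    assert(name != NULL);
    reset_iocharset(o);
    copy = malloc(strlen(name) + 1);
    if (copy == NULL)
        return -1;                     /* o->iocharset == default_charset */
    strcpy(copy, name);
    o->iocharset = copy;
    return 0;
}
```

Calling `reset_iocharset()` twice in a row is harmless here, which is the property a later cleanup path relies on.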

Link: http://lkml.kernel.org/r/8736wp9dzc.fsf@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+90b8e10515ae88228a92@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

19 Jul, 2018 1 commit

Btrfs: fix file data corruption after cloning a range and fsync · bd3599a0

Committed by Filipe Manana on Jul 12, 2018

When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
latter ends up calling try_release_extent_mapping(), which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allowing blocking, the range not being locked (which
is the case during the clone operation) and the extent map not being flagged
as pinned (also the case for cloning).

The following example, turned into a test for fstests, reproduces the
issue:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
  $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

  $ xfs_io -c "fsync" /mnt/bar
  # reflink destination offset corresponds to the size of file bar,
  # 2728Kb minus 4Kb.
  $ xfs_io -c "reflink /mnt/foo 0 2724K 15908K" /mnt/bar
  $ xfs_io -c "fsync" /mnt/bar

  $ md5sum /mnt/bar
  95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar

  <power fail>

  $ mount /dev/sdb /mnt
  $ md5sum /mnt/bar
  207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
  # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
  # power failure

In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.

Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger than 16Mb or gfp flags do not allow blocking).

CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

18 Jul, 2018 1 commit

aio: don't expose __aio_sigset in uapi · 9ba546c0

Committed by Christoph Hellwig on Jul 11, 2018

glibc uses a different definition of sigset_t than the kernel does,
and the current version would pull in both.  To fix this just do not
expose the type at all - this somewhat mirrors pselect() where we
do not even have a type for the magic sigmask argument, but just
use pointer arithmetic.

Fixes: 7a074e96 ("aio: implement io_pgetevents")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Adrian Reber <adrian@lisas.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

17 Jul, 2018 1 commit

btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() · 665d4953

Committed by Qu Wenruo on Jul 11, 2018

In commit ac0b4145 ("btrfs: scrub: Don't use inode pages for device
replace") we removed the branch of copy_nocow_pages() to avoid
corruption for compressed nodatasum extents.

However, the above commit only solves the problem in scrub_extent(). If
we fail to read some pages during scrub_pages(),
sctx->no_io_error_seen will be non-zero and we go to the fixup function
scrub_handle_errored_block().

In scrub_handle_errored_block(), for an sctx without csum (no matter if
we're doing replace or scrub) we go to the scrub_fixup_nodatasum() routine,
which does much the same thing as copy_nocow_pages(), but without the
extra check that copy_nocow_pages() performs.

So for test cases like btrfs/100, where we emulate read errors during
replace/scrub, we could corrupt compressed extent data again.

This patch fixes it by avoiding any "optimization" for nodatasum and
falling back to the normal fixup routine, which tries to read from
any good copy.

This also solves the WARN_ON() or deadlock caused by lame backref iteration
in the scrub_fixup_nodatasum() routine.

The deadlock or WARN_ON() couldn't be triggered before commit ac0b4145
("btrfs: scrub: Don't use inode pages for device replace") since
copy_nocow_pages() has better locking and an extra check for data extents,
and it already does the fixup work by trying to read data from any good
copy, so it wouldn't go to scrub_fixup_nodatasum() anyway.

This patch disables the faulty code, which will be removed completely in a
follow-up patch.

Fixes: ac0b4145 ("btrfs: scrub: Don't use inode pages for device replace")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

15 Jul, 2018 4 commits

reiserfs: fix buffer overflow with long warning messages · fe10e398

Committed by Eric Biggers on Jul 13, 2018

ReiserFS prepares log messages into a 1024-byte buffer with no bounds
checks.  Long messages, such as the "unknown mount option" warning when
userspace passes a crafted mount options string, overflow this buffer.
This causes KASAN to report a global-out-of-bounds write.

Fix it by truncating messages to the buffer size.
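
Truncating to the buffer size can be illustrated in user space with vsnprintf(), which never writes more than the size it is given. This is a sketch of the idea only: `prepare_message()` and `msg_buf` are hypothetical names, and the actual ReiserFS code also handles its own format specifiers, which are not modelled here.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define MSG_BUF_SIZE 1024   /* buffer size given in the commit message */

static char msg_buf[MSG_BUF_SIZE];

/* Format a message into the fixed buffer. vsnprintf() writes at most
 * sizeof(msg_buf) bytes, terminator included, so overlong output is
 * truncated instead of overflowing. Returns the untruncated length. */
int prepare_message(const char *fmt, ...)
{
    va_list args;
    int len;

    assert(fmt != NULL);
    va_start(args, fmt);
    len = vsnprintf(msg_buf, sizeof(msg_buf), fmt, args);
    va_end(args);
    return len;
}
```

A caller can compare the return value with `MSG_BUF_SIZE` to detect that a message was cut short.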

Link: http://lkml.kernel.org/r/20180707203621.30922-1-ebiggers3@gmail.com
Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b890b3335a4d8c608963@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs, elf: make sure to page align bss in load_elf_library · 24962af7

Committed by Oscar Salvador on Jul 13, 2018

The current code does not make sure to page align bss before calling
vm_brk(), and this can lead to a VM_BUG_ON() in __mm_populate() due to
the requested length not being correctly aligned.

Let us make sure to align it properly.
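
Page alignment itself is a small piece of arithmetic. The sketch below assumes a 4096-byte page and uses hypothetical names (`PAGE_SIZE_MODEL`, `page_align_up()`); it shows the usual round-up-to-a-power-of-two idiom rather than the exact macro used in load_elf_library().

```c
#include <assert.h>

#define PAGE_SIZE_MODEL 4096UL   /* assumed page size for this sketch */

/* Round addr up to the next page boundary. */
unsigned long page_align_up(unsigned long addr)
{
    static_assert((PAGE_SIZE_MODEL & (PAGE_SIZE_MODEL - 1)) == 0,
                  "page size must be a power of two");
    return (addr + PAGE_SIZE_MODEL - 1) & ~(PAGE_SIZE_MODEL - 1);
}
```

An address that is already aligned is returned unchanged, so applying the helper to both ends of the bss range yields a length that is a whole number of pages.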

Kees: only applicable to CONFIG_USELIB kernels: 32-bit and configured
for libc5.

Link: http://lkml.kernel.org/r/20180705145539.9627-1-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

autofs: fix slab out of bounds read in getname_kernel() · 02f51d45

Committed by Tomas Bortoli on Jul 13, 2018

The autofs subsystem does not check that the "path" parameter is present
for all cases where it is required when it is passed in via the "param"
struct.

In particular it isn't checked for the AUTOFS_DEV_IOCTL_OPENMOUNT_CMD
ioctl command.

To solve it, modify the validate_dev_ioctl() function to check that a path has
been provided for ioctl commands that require it.

Link: http://lkml.kernel.org/r/153060031527.26631.18306637892746301555.stgit@pluto.themaw.net
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reported-by: syzbot+60c837b428dc84e83a93@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/proc/task_mmu.c: fix Locked field in /proc/pid/smaps* · e70cc2bd

Committed by Vlastimil Babka on Jul 13, 2018

Thomas reports:
 "While looking around in /proc on my v4.14.52 system I noticed that all
  processes got a lot of "Locked" memory in /proc/*/smaps. A lot more
  memory than a regular user can usually lock with mlock().

  Commit 493b0e9d (in v4.14-rc1) seems to have changed the behavior
  of "Locked".

  Before that commit the code was like this. Notice the VM_LOCKED check.

           (vma->vm_flags & VM_LOCKED) ?
                (unsigned long)(mss.pss >> (10 + PSS_SHIFT)) : 0);

  After that commit Locked is now the same as Pss:

	  (unsigned long)(mss->pss >> (10 + PSS_SHIFT)));

  This looks like a mistake."

Indeed, the commit has added mss->pss_locked with the correct value that
depends on VM_LOCKED, but forgot to actually use it.  Fix it.
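
The difference between the two accumulators can be modelled in a few lines of user-space C. The struct and function names are hypothetical, and the `_MODEL` constants (a shift of 12 and a flag bit of 0x2000) are values assumed for the sketch rather than taken from this commit.

```c
#include <assert.h>
#include <stddef.h>

#define PSS_SHIFT_MODEL 12          /* assumed fixed-point shift */
#define VM_LOCKED_MODEL 0x2000UL    /* assumed flag bit */

struct mss_model {
    unsigned long long pss;         /* all mappings */
    unsigned long long pss_locked;  /* locked mappings only */
};

/* Accumulate one VMA's share; only locked VMAs add to pss_locked. */
void account_vma(struct mss_model *mss, unsigned long vm_flags,
                 unsigned long long vma_pss)
{
    assert(mss != NULL);
    mss->pss += vma_pss;
    if (vm_flags & VM_LOCKED_MODEL)
        mss->pss_locked += vma_pss;
}

/* "Pss:" value in kB. */
unsigned long pss_kb(const struct mss_model *mss)
{
    return (unsigned long)(mss->pss >> (10 + PSS_SHIFT_MODEL));
}

/* "Locked:" value in kB, taken from pss_locked rather than pss. */
unsigned long locked_kb(const struct mss_model *mss)
{
    return (unsigned long)(mss->pss_locked >> (10 + PSS_SHIFT_MODEL));
}
```

Reporting `locked_kb()` instead of `pss_kb()` for the "Locked" field is the whole of the fix in this model: an unlocked mapping then contributes nothing to it.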

Link: http://lkml.kernel.org/r/ebf6c7fb-fec3-6a26-544f-710ed193c154@suse.cz
Fixes: 493b0e9d ("mm: add /proc/pid/smaps_rollup")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

13 Jul, 2018 1 commit

btrfs: fix use-after-free of cmp workspace pages · 97b19170

Committed by Naohiro Aota on Jul 13, 2018

btrfs_cmp_data_free() puts cmp's src_pages and dst_pages, but leaves
their page address intact. Now, if you hit "goto again" in
btrfs_extent_same_range() and hit some error in
btrfs_cmp_data_prepare(), you'll try to unlock/put already put pages.

This is a simple fix to reset the address to avoid use-after-free.
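
A user-space analogue of resetting the slots after release. `struct cmp_model`, `NR_PAGES_MODEL` and `cmp_free()` are hypothetical, and `free()` stands in for dropping a page reference; the sketch only shows why clearing each slot makes a second cleanup pass harmless.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_PAGES_MODEL 4   /* illustrative slot count */

struct cmp_model { void *pages[NR_PAGES_MODEL]; };

/* Release every page and clear its slot so nothing stale is left behind. */
void cmp_free(struct cmp_model *cmp)
{
    assert(cmp != NULL);
    for (size_t i = 0; i < NR_PAGES_MODEL; i++) {
        free(cmp->pages[i]);     /* free(NULL) is a no-op */
        cmp->pages[i] = NULL;    /* the reset that prevents reuse */
    }
}
```

If a retry path fails before repopulating the array, calling `cmp_free()` again touches only NULL slots.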

Fixes: 67b07bd4 ("Btrfs: reuse cmp workspace in EXTENT_SAME ioctl")
Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
