- 25 2月, 2017 3 次提交
-
-
由 Jeff Layton 提交于
NFSv4 requires a transport "that is specified to avoid network congestion" (RFC 7530, section 3.1, paragraph 2). In practical terms, that means that you should not run NFSv4 over UDP. The server has never enforced that requirement, however. This patchset fixes this by adding a new flag to the svc_version that states that it has these transport requirements. With that, we can check that the transport has XPT_CONG_CTRL set before processing an RPC. If it doesn't we reject it with RPC_PROG_MISMATCH. Signed-off-by: NJeff Layton <jlayton@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Jeff Layton 提交于
It's just simpler to read this way, IMO. Also, no need to explicitly set vs_hidden to false in the nfsacl ones. Signed-off-by: NJeff Layton <jlayton@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Rasmus Villemoes 提交于
dprintk already provides a KERN_* prefix; this KERN_INFO just shows up as some odd characters in the output. Simplify the message a bit while we're there. Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 21 2月, 2017 2 次提交
-
-
由 Christoph Hellwig 提交于
Both the NFS protocols and the Linux VFS use a setattr operation with a bitmap of attributes to set to set various file attributes including the file size and the uid/gid. The Linux syscalls never mix size updates with unrelated updates like the uid/gid, and some file systems like XFS and GFS2 rely on the fact that truncates don't update random other attributes, and many other file systems handle the case but do not update the other attributes in the same transaction. NFSD on the other hand passes the attributes it gets on the wire more or less directly through to the VFS, leading to updates the file systems don't expect. XFS at least has an assert on the allowed attributes, which caught an unusual NFS client setting the size and group at the same time. To handle this issue properly this splits the notify_change call in nfsd_setattr into two separate ones. Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Tested-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Christoph Hellwig 提交于
Simplify exit paths, size_change use. Signed-off-by: NChristoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 18 2月, 2017 6 次提交
-
-
由 Kinglong Mee 提交于
After tightening the OP_LOCKT reply size estimate, we can get warnings like: [11512.783519] RPC request reserved 124 but used 152 [11512.813624] RPC request reserved 108 but used 136 Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
NFSD usess PAGE_SIZE as the reply size estimate for RPCs which don't support op_rsize_bop(), A PAGE_SIZE (4096) is larger than many real response sizes, eg, access (op_encode_hdr_size + 2), seek (op_encode_hdr_size + 3). This patch just adds op_rsize_bop() for all RPCs getting response size. An overestimate is generally safe but the tighter estimates are probably better. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
The callback tag is NULL, and hdr->nops is unused too right now, but. But if we were to ever test with a nonzero callback tag, nops will get a bad value. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
The rpccred gotten from rpc_lookup_machine_cred() should be put when state is shutdown. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
Tigran Mkrtchyan's new pynfs testcases for zero length principals fail: SATT16 st_setattr.testEmptyPrincipal : FAILURE Setting empty owner should return NFS4ERR_INVAL, instead got NFS4ERR_BADOWNER SATT17 st_setattr.testEmptyGroupPrincipal : FAILURE Setting empty owner_group should return NFS4ERR_INVAL, instead got NFS4ERR_BADOWNER This patch checks the principal and returns nfserr_inval directly. It could check after decoding in nfs4xdr.c, but it's simpler to do it in nfsd_map_xxxx. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 10 2月, 2017 1 次提交
-
-
由 J. Bruce Fields 提交于
This patch incorrectly attempted nested mnt_want_write, and incorrectly disabled nfsd's owner override for truncate. We'll fix those problems and make another attempt soon, for the moment I think the safest is to revert. Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 07 2月, 2017 1 次提交
-
-
由 NeilBrown 提交于
If you change the set of filesystems that are exported, then the contents of various directories in the NFSv4 pseudo-root is likely to change. However the change-id of those directories is currently tied to the underlying directory, so the client may not see the changes in a timely fashion. This patch changes the change-id number to be derived from the "flush_time" of the export cache. Whenever any changes are made to the set of exported filesystems, this flush_time is updated. The result is that clients see changes to the set of exported filesystems much more quickly, often immediately. Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 04 2月, 2017 1 次提交
-
-
由 Michal Hocko 提交于
Tetsuo has noticed that an OOM stress test which performs large write requests can cause the full memory reserves depletion. He has tracked this down to the following path __alloc_pages_nodemask+0x436/0x4d0 alloc_pages_current+0x97/0x1b0 __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 iomap_write_begin+0x50/0xd0 fs/iomap.c:118 iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190 ? iomap_write_end+0x80/0x80 fs/iomap.c:150 iomap_apply+0xb3/0x130 fs/iomap.c:79 iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 ? iomap_write_end+0x80/0x80 xfs_file_buffered_aio_write+0x132/0x390 [xfs] ? remove_wait_queue+0x59/0x60 xfs_file_write_iter+0x90/0x130 [xfs] __vfs_write+0xe5/0x140 vfs_write+0xc7/0x1f0 ? syscall_trace_enter+0x1d0/0x380 SyS_write+0x58/0xc0 do_syscall_64+0x6c/0x200 entry_SYSCALL64_slow_path+0x25/0x25 the oom victim has access to all memory reserves to make a forward progress to exit easier. But iomap_file_buffered_write and other callers of iomap_apply loop to complete the full request. We need to check for fatal signals and back off with a short write instead. As the iomap_apply delegates all the work down to the actor we have to hook into those. All callers that work with the page cache are calling iomap_write_begin so we will check for signals there. dax_iomap_actor has to handle the situation explicitly because it copies data to the userspace directly. Other callers like iomap_page_mkwrite work on a single page or iomap_fiemap_actor do not allocate memory based on the given len. Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path") Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 2月, 2017 13 次提交
-
-
由 David Howells 提交于
Under some circumstances, an fscache object can become queued such that it fscache_object_work_func() can be called once the object is in the OBJECT_DEAD state. This results in the kernel oopsing when it tries to invoke the handler for the state (which is hard coded to 0x2). The way this comes about is something like the following: (1) The object dispatcher is processing a work state for an object. This is done in workqueue context. (2) An out-of-band event comes in that isn't masked, causing the object to be queued, say EV_KILL. (3) The object dispatcher finishes processing the current work state on that object and then sees there's another event to process, so, without returning to the workqueue core, it processes that event too. It then follows the chain of events that initiates until we reach OBJECT_DEAD without going through a wait state (such as WAIT_FOR_CLEARANCE). At this point, object->events may be 0, object->event_mask will be 0 and oob_event_mask will be 0. (4) The object dispatcher returns to the workqueue processor, and in due course, this sees that the object's work item is still queued and invokes it again. (5) The current state is a work state (OBJECT_DEAD), so the dispatcher jumps to it - resulting in an OOPS. When I'm seeing this, the work state in (1) appears to have been either LOOK_UP_OBJECT or CREATE_OBJECT (object->oob_table is fscache_osm_lookup_oob). The window for (2) is very small: (A) object->event_mask is cleared whilst the event dispatch process is underway - though there's no memory barrier to force this to the top of the function. The window, therefore is from the time the object was selected by the workqueue processor and made requeueable to the time the mask was cleared. (B) fscache_raise_event() will only queue the object if it manages to set the event bit and the corresponding event_mask bit was set. The enqueuement is then deferred slightly whilst we get a ref on the object and get the per-CPU variable for workqueue congestion. This slight deferral slightly increases the probability by allowing extra time for the workqueue to make the item requeueable. Handle this by giving the dead state a processor function and checking the for the dead state address rather than seeing if the processor function is address 0x2. The dead state processor function can then set a flag to indicate that it's occurred and give a warning if it occurs more than once per object. If this race occurs, an oops similar to the following is seen (note the RIP value): BUG: unable to handle kernel NULL pointer dereference at 0000000000000002 IP: [<0000000000000002>] 0x1 PGD 0 Oops: 0010 [#1] SMP Modules linked in: ... CPU: 17 PID: 16077 Comm: kworker/u48:9 Not tainted 3.10.0-327.18.2.el7.x86_64 #1 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 12/27/2015 Workqueue: fscache_object fscache_object_work_func [fscache] task: ffff880302b63980 ti: ffff880717544000 task.ti: ffff880717544000 RIP: 0010:[<0000000000000002>] [<0000000000000002>] 0x1 RSP: 0018:ffff880717547df8 EFLAGS: 00010202 RAX: ffffffffa0368640 RBX: ffff880edf7a4480 RCX: dead000000200200 RDX: 0000000000000002 RSI: 00000000ffffffff RDI: ffff880edf7a4480 RBP: ffff880717547e18 R08: 0000000000000000 R09: dfc40a25cb3a4510 R10: dfc40a25cb3a4510 R11: 0000000000000400 R12: 0000000000000000 R13: ffff880edf7a4510 R14: ffff8817f6153400 R15: 0000000000000600 FS: 0000000000000000(0000) GS:ffff88181f420000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 000000000194a000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa0363695 ffff880edf7a4510 ffff88093f16f900 ffff8817faa4ec00 ffff880717547e60 ffffffff8109d5db 00000000faa4ec18 0000000000000000 ffff8817faa4ec18 ffff88093f16f930 ffff880302b63980 ffff88093f16f900 Call Trace: [<ffffffffa0363695>] ? fscache_object_work_func+0xa5/0x200 [fscache] [<ffffffff8109d5db>] process_one_work+0x17b/0x470 [<ffffffff8109e4ac>] worker_thread+0x21c/0x400 [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400 [<ffffffff810a5acf>] kthread+0xcf/0xe0 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 [<ffffffff816460d8>] ret_from_fork+0x58/0x90 [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140 Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NJeremy McNicoll <jeremymc@redhat.com> Tested-by: NFrank Sorenson <sorenson@redhat.com> Tested-by: NBenjamin Coddington <bcodding@redhat.com> Reviewed-by: NBenjamin Coddington <bcodding@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
fscache_disable_cookie() needs to clear the outstanding writes on the cookie it's disabling because they cannot be completed after. Without this, fscache_nfs_open_file() gets stuck because it disables the cookie when the file is opened for writing but can't uncache the pages till afterwards - otherwise there's a race between the open routine and anyone who already has it open R/O and is still reading from it. Looking in /proc/pid/stack of the offending process shows: [<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache] [<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache] [<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs] [<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4] [<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7 [<ffffffff811743ac>] vfs_open+0x5c/0x65 [<ffffffff81184185>] path_openat+0x785/0x8fb [<ffffffff81184343>] do_filp_open+0x48/0x9e [<ffffffff81174710>] do_sys_open+0x13b/0x1cb [<ffffffff811747b9>] SyS_open+0x19/0x1b [<ffffffff81001c44>] do_syscall_64+0x80/0x17a [<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: NJianhong Yin <jiyin@redhat.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NJeff Layton <jlayton@redhat.com> Acked-by: NSteve Dickson <steved@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 David Howells 提交于
Initialise the stores_lock in fscache netfs cookies. Technically, it shouldn't be necessary, since the netfs cookie is an index and stores no data, but initialising it anyway adds insignificant overhead. Signed-off-by: NDavid Howells <dhowells@redhat.com> Reviewed-by: NJeff Layton <jlayton@redhat.com> Acked-by: NSteve Dickson <steved@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
Currently turning on NFSv4.2 results in 4.2 clients suddenly seeing the individual file labels as they're set on the server. This is not what they've previously seen, and not appropriate in may cases. (In particular, if clients have heterogenous security policies then one client's labels may not even make sense to another.) Labeled NFS should be opted in only in those cases when the administrator knows it makes sense. It's helpful to be able to turn 4.2 on by default, and otherwise the protocol upgrade seems free of regressions. So, default labeled NFS to off and provide an export flag to reenable it. Users wanting labeled NFS support on an export will henceforth need to: - make sure 4.2 support is enabled on client and server (as before), and - upgrade the server nfs-utils to a version supporting the new "security_label" export flag. - set that "security_label" flag on the export. This is commit may be seen as a regression to anyone currently depending on security labels. We believe those cases are currently rare. Reported-by: tibbs@math.uh.edu Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 J. Bruce Fields 提交于
To keep me from accidentally writing to this again.... Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Scott Mayhew 提交于
I noticed this was missing when I was testing with link local addresses. Signed-off-by: NScott Mayhew <smayhew@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Scott Mayhew 提交于
I noticed this was missing when I was testing with link local addresses. Signed-off-by: NScott Mayhew <smayhew@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
This is just cleanup, no change in functionality. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
This is just cleanup, no change in functionality. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
After fae5096a "nfsd: assume writeable exportabled filesystems have f_sync" we no longer modify this argument. This is just cleanup, no change in functionality. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 NeilBrown 提交于
Writing to /proc/fs/nfsd/versions allows individual major versions and NFSv4 minor versions to be enabled or disabled. However NFSv4.0 cannot currently be disabled, thought there is no good reason. Also the minor number is parsed as a 'long' but used as an 'int' so '4294967297' will be incorrectly treated as '1'. This patch removes the test on 'minor == 0' and switches to kstrtouint() to get correct range checking. When reading from /proc/fs/nfsd/versions, 4.0 is current not reported. To allow the disabling for v4.0 to be visible, while maintaining backward compatibility, change code to report "-4.0" if appropriate, but not "+4.0". Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Christoph Hellwig 提交于
Both the NFS protocols and the Linux VFS use a setattr operation with a bitmap of attributs to set to set various file attributes including the file size and the uid/gid. The Linux syscalls never mixes size updates with unrelated updates like the uid/gid, and some file systems like XFS and GFS2 rely on the fact that truncates might not update random other attributes, and many other file systems handle the case but do not update the different attributes in the same transaction. NFSD on the other hand passes the attributes it gets on the wire more or less directly through to the VFS, leading to updates the file systems don't expect. XFS at least has an assert on the allowed attributes, which caught an unusual NFS client setting the size and group at the same time. To handle this issue properly this switches nfsd to call vfs_truncate for size changes, and then handle all other attributes through notify_change. As a side effect this also means less boilerplace code around the size change as we can now reuse the VFS code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
由 Kinglong Mee 提交于
nfsd assigns the nfs4_free_lock_stateid to .sc_free in init_lock_stateid(). If nfsd doesn't go through init_lock_stateid() and put stateid at end, there is a NULL reference to .sc_free when calling nfs4_put_stid(ns). This patch let the nfs4_stid.sc_free assignment to nfs4_alloc_stid(). Cc: stable@vger.kernel.org Fixes: 356a95ec "nfsd: clean up races in lock stateid searching..." Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Reviewed-by: NJeff Layton <jlayton@redhat.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 28 1月, 2017 1 次提交
-
-
由 Brian Foster 提交于
Quotacheck runs at mount time in situations where quota accounting must be recalculated. In doing so, it uses bulkstat to visit every inode in the filesystem. Historically, every inode processed during quotacheck was released and immediately tagged for reclaim because quotacheck runs before the superblock is marked active by the VFS. In other words, the final iput() lead to an immediate ->destroy_inode() call, which allowed the XFS background reclaim worker to start reclaiming inodes. Commit 17c12bcd ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") marks the XFS superblock active sooner as part of the mount process to support caching inodes processed during log recovery. This occurs before quotacheck and thus means all inodes processed by quotacheck are inserted to the LRU on release. The s_umount lock is held until the mount has completed and thus prevents the shrinkers from operating on the sb. This means that quotacheck can excessively populate the inode LRU and lead to OOM conditions on systems without sufficient RAM. Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on inodes processed by quotacheck. This causes ->drop_inode() to return 1 and in turn causes iput_final() to evict the inode. This preserves the original quotacheck behavior and prevents it from overloading the LRU and running out of memory. CC: stable@vger.kernel.org # v4.9 Reported-by: NMartin Svec <martin.svec@zoner.cz> Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 27 1月, 2017 6 次提交
-
-
由 Omar Sandoval 提交于
Subvolume directory inodes can't have ACLs. Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
When you snapshot a subvolume containing a subvolume, you get a placeholder directory where the subvolume would be. These directory inodes have ->i_ops set to btrfs_dir_ro_inode_operations. Previously, these i_ops didn't include the xattr operation callbacks. The conversion to xattr_handlers missed this case, leading to bogus attempts to set xattrs on these inodes. This manifested itself as failures when running delayed inodes. To fix this, clear IOP_XATTR in ->i_opflags on these inodes. Fixes: 6c6ef9f2 ("xattr: Stop calling {get,set,remove}xattr inode operations") Cc: Andreas Gruenbacher <agruenba@redhat.com> Reported-by: NChris Murphy <lists@colorremedies.com> Tested-by: NChris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Omar Sandoval 提交于
As Jeff explained in c2951f32 ("btrfs: remove old tree_root dirent processing in btrfs_real_readdir()"), supporting this old format is no longer necessary since the Btrfs magic number has been updated since we changed to the current format. There are other places where we still handle this old format, but since this is part of a fix that is going to stable, I'm only removing this one for now. Cc: <stable@vger.kernel.org> # 4.9.x Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Trond Myklebust 提交于
IF NFS_LAYOUT_RETURN_REQUESTED is not set, then we currently exit without freeing the list of invalidated layout segments, leading to a reference leak. Reported-by: NOlga Kornievskaia <aglo@umich.edu> Fixes: 24408f52 ("pNFS: Fix bugs in _pnfs_return_layout") Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Chuck Lever 提交于
Lock sequence IDs are bumped in decode_lock by calling nfs_increment_seqid(). nfs_increment_sequid() does not use the seqid_mutating_err() function fixed in commit 059aa734 ("Don't increment lock sequence ID after NFS4ERR_MOVED"). Fixes: 059aa734 ("Don't increment lock sequence ID after ...") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NXuan Qi <xuan.qi@oracle.com> Cc: stable@vger.kernel.org # v3.7+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Darrick J. Wong 提交于
In a bmapx call, bmv_count is the total size of the array, including the zeroth element that userspace uses to supply the search key. The output array starts at offset 1 so that we can set up the user for the next invocation. Since we now can split an extent into multiple bmap records due to shared/unshared status, we have to be careful that we don't overflow the output array. In the original patch f86f4037 ("xfs: teach get_bmapx about shared extents and the CoW fork") I used cur_ext (the output index) to check for overflows, albeit with an off-by-one error. Since nexleft no longer describes the number of unfilled slots in the output, we can rip all that out and use cur_ext for the overflow check directly. Failure to do this causes heap corruption in bmapx callers such as xfs_io and xfs_scrub. xfs/328 can reproduce this problem. Reviewed-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 26 1月, 2017 2 次提交
-
-
由 Darrick J. Wong 提交于
If we try to allocate memory pages to back an xfs_buf that we're trying to read, it's possible that we'll be so short on memory that the page allocation fails. For a blocking read we'll just wait, but for readahead we simply dump all the pages we've collected so far. Unfortunately, after dumping the pages we neglect to clear the _XBF_PAGES state, which means that the subsequent call to xfs_buf_free thinks that b_pages still points to pages we own. It then double-frees the b_pages pages. This results in screaming about negative page refcounts from the memory manager, which xfs oughtn't be triggering. To reproduce this case, mount a filesystem where the size of the inodes far outweighs the availalble memory (a ~500M inode filesystem on a VM with 300MB memory did the trick here) and run bulkstat in parallel with other memory eating processes to put a huge load on the system. The "check summary" phase of xfs_scrub also works for this purpose. Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com>
-
由 Christoph Hellwig 提交于
With COW files they are the hotpath, just like for files with the extent size hint attribute. We really shouldn't micro-manage anything but failure cases with unlikely. Additionally Arnd Bergmann recently reported that one of these two unlikely annotations causes link failures together with an upcoming kernel instrumentation patch, so let's get rid of it ASAP. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reported-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 25 1月, 2017 4 次提交
-
-
由 Brian Foster 提交于
xfs_attr_[get|remove]() have unlocked attribute fork checks to optimize away a lock cycle in cases where the fork does not exist or is otherwise empty. This check is not safe, however, because an attribute fork short form to extent format conversion includes a transient state that causes the xfs_inode_hasattr() check to fail. Specifically, xfs_attr_shortform_to_leaf() creates an empty extent format attribute fork and then adds the existing shortform attributes to it. This means that lookup of an existing xattr can spuriously return -ENOATTR when racing against a setxattr that causes the associated format conversion. This was originally reproduced by an untar on a particularly configured glusterfs volume, but can also be reproduced on demand with properly crafted xattr requests. The format conversion occurs under the exclusive ilock. xfs_attr_get() and xfs_attr_remove() already have the proper locking and checks further down in the functions to handle this situation correctly. Drop the unlocked checks to avoid the spurious failure and rely on the existing logic. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Currently we try to rely on the global reserved block pool for block allocations for the free inode btree, but I have customer reports (fairly complex workload, need to find an easier reproducer) where that is not enough as the AG where we free an inode that requires a new finobt block is entirely full. This causes us to cancel a dirty transaction and thus a file system shutdown. I think the right way to guard against this is to treat the finot the same way as the refcount btree and have a per-AG reservations for the possible worst case size of it, and the patch below implements that. Note that this could increase mount times with large finobt trees. In an ideal world we would have added a field for the number of finobt fields to the AGI, similar to what we did for the refcount blocks. We should do add it next time we rev the AGI or AGF format by adding new fields. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
Try to reserve the blocks first and only then update the fields in or hanging off the mount structure. This way we can call __xfs_ag_resv_init again after a previous failure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Coly Li 提交于
Commit 8a59f5d2 ("fs/romfs: return f_fsid for statfs(2)") generates a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev will triger an oops. Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y, both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined. Therefore when calling huge_encode_dev() to generate a 64bit id, I use the follow order to choose parameter, - CONFIG_ROMFS_ON_BLOCK defined use sb->s_bdev->bd_dev - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined use sb->s_dev when, - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined leave id as 0 When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index, otherwise sb->s_dev is 0. This is a try-best effort to generate a uniq file system ID, if all the above conditions are not meet, f_fsid of this romfs instance will be 0. Generally only one romfs can be built on single MTD block device, this method is enough to identify multiple romfs instances in a computer. Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.deSigned-off-by: NColy Li <colyli@suse.de> Reported-by: NNong Li <nongli1031@gmail.com> Tested-by: NNong Li <nongli1031@gmail.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-