- 12 4月, 2018 18 次提交
-
-
由 Amir Goldstein 提交于
With mount option "xino=on", mounter declares that there are enough free high bits in underlying fs to hold the layer fsid. If overlayfs does encounter underlying inodes using the high xino bits reserved for layer fsid, a warning will be emitted and the original inode number will be used. The mount option name "xino" goes after a similar meaning mount option of aufs, but in overlayfs case, the mapping is stateless. An example for a use case of "xino=on" is when upper/lower is on an xfs filesystem. xfs uses 64bit inode numbers, but it currently never uses the upper 8bit for inode numbers exposed via stat(2) and that is not likely to change in the future without user opting-in for a new xfs feature. The actual number of unused upper bit is much larger and determined by the xfs filesystem geometry (64 - agno_log - agblklog - inopblog). That means that for all practical purpose, there are enough unused bits in xfs inode numbers for more than OVL_MAX_STACK unique fsid's. Another use case of "xino=on" is when upper/lower is on tmpfs. tmpfs inode numbers are allocated sequentially since boot, so they will practially never use the high inode number bits. For compatibility with applications that expect 32bit inodes, the feature can be disabled with "xino=off". The option "xino=auto" automatically detects underlying filesystem that use 32bit inodes and enables the feature. The Kconfig option OVERLAY_FS_XINO_AUTO and module parameter of the same name, determine if the default mode for overlayfs mount is "xino=auto" or "xino=off". Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, relax non-samefs constraint for consistent d_ino and always iterate non-merge dir using ovl_fill_real() actor so we can remap lower inode numbers to unique lower fs range. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
When overlay layers are not all on the same fs, but all inode numbers of underlying fs do not use the high 'xino' bits, overlay st_ino values are constant and persistent. In that case, set i_ino value to the same value as st_ino for nfsd readdirplus validator. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
On 64bit systems, when overlay layers are not all on the same fs, but all inode numbers of underlying fs are not using the high bits, use the high bits to partition the overlay st_ino address space. The high bits hold the fsid (upper fsid is 0). This way overlay inode numbers are unique and all inodes use overlay st_dev. Inode numbers are also persistent for a given layer configuration. Currently, our only indication for available high ino bits is from a filesystem that supports file handles and uses the default encode_fh() operation, which encodes a 32bit inode number. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
Instead of allocating an anonymous bdev per lower layer, allocate one anonymous bdev per every unique lower fs that is different than upper fs. Every unique lower fs is assigned an fsid > 0 and the number of unique lower fs are stored in ofs->numlowerfs. The assigned fsid is stored in the lower layer struct and will be used also for inode number multiplexing. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
A helper for ovl_getattr() to map the values of st_dev and st_ino according to constant st_ino rules. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Miklos Szeredi 提交于
No need to mess with an alias, the upperdentry can be retrieved directly from the overlay inode. Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Miklos Szeredi 提交于
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Vivek Goyal 提交于
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Vivek Goyal 提交于
Certain properties in ovl_lookup_data should be set only for the last element of the path. IOW, if we are calling ovl_lookup_single() for an absolute redirect, then d->is_dir and d->opaque do not make much sense for intermediate path elements. Instead set them only if dentry being lookup is last path element. As of now we do not seem to be making use of d->opaque if it is set for a path/dentry in lower. But just define the semantics so that future code can make use of this assumption. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Vivek Goyal 提交于
If we are looking in last layer, then there should not be any need to process redirect. redirect information is used only for lookup in next lower layer and there is no more lower layer to look into. So no need to process redirects. IOW, ignore redirects on lowest layer. Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
When decoding a lower file handle, we need to check if lower file was copied up and indexed and if it has a whiteout index, we need to check if this is an unlinked but open non-dir before returning -ESTALE. To find out if this is an unlinked but open non-dir we need to lookup an overlay inode in inode cache by lower inode and that requires decoding the lower file handle before looking in inode cache. Before this change, if the lower inode turned out to be a directory, we may have paid an expensive cost to reconnect that lower directory for nothing. After this change, we start by decoding a disconnected lower dentry and using the lower inode for looking up an overlay inode in inode cache. If we find overlay inode and dentry in cache, we avoid the index lookup overhead. If we don't find an overlay inode and dentry in cache, then we only need to decode a connected lower dentry in case the lower dentry is a non-indexed directory. The xfstests group overlay/exportfs tests decoding overlayfs file handles after drop_caches with different states of the file at encode and decode time. Overall the tests in the group call ovl_lower_fh_to_d() 89 times to decode a lower file handle. Before this change, the tests called ovl_get_index_fh() 75 times and reconnect_one() 61 times. After this change, the tests call ovl_get_index_fh() 70 times and reconnect_one() 59 times. The 2 cases where reconnect_one() was avoided are cases where a non-upper directory file handle was encoded, then the directory removed and then file handle was decoded. To demonstrate the affect on decoding file handles with hot inode/dentry cache, the drop_caches call in the tests was disabled. Without drop_caches, there are no reconnect_one() calls at all before or after the change. Before the change, there are 75 calls to ovl_get_index_fh(), exactly as the case with drop_caches. After the change, there are only 10 calls to ovl_get_index_fh(). Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
On lookup of non directory, we try to decode the origin file handle stored in upper inode. The origin file handle is supposed to be decoded to a disconnected non-dir dentry, which is fine, because we only need the lower inode of a copy up origin. However, if the origin file handle somehow turns out to be a directory we pay the expensive cost of reconnecting the directory dentry, only to get a mismatch file type and drop the dentry. Optimize this case by explicitly opting out of reconnecting the dentry. Opting-out of reconnect is done by passing a NULL acceptable callback to exportfs_decode_fh(). While the case described above is a strange corner case that does not really need to be optimized, the API added for this optimization will be used by a following patch to optimize a more common case of decoding an overlayfs file handle. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
Rename ovl_encode_fh() to ovl_encode_real_fh() to differentiate from the exportfs function ovl_encode_inode_fh() and change the latter to ovl_encode_fh() to match the exportfs method name. Rename ovl_decode_fh() to ovl_decode_real_fh() for consistency. Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
For broken hardlinks, we do not return lower st_ino, so we should also not return lower pseudo st_dev. Fixes: a0c5ad30 ("ovl: relax same fs constraint for constant st_ino") Cc: <stable@vger.kernel.org> #v4.15 Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Amir Goldstein 提交于
As of now if we encounter an opaque dir while looking for a dentry, we set d->last=true. This means that there is no need to look further in any of the lower layers. This works fine as long as there are no redirets or relative redircts. But what if there is an absolute redirect on the children dentry of opaque directory. We still need to continue to look into next lower layer. This patch fixes it. Here is an example to demonstrate the issue. Say you have following setup. upper: /redirect (redirect=/a/b/c) lower1: /a/[b]/c ([b] is opaque) (c has absolute redirect=/a/b/d/) lower0: /a/b/d/foo Now "redirect" dir should merge with lower1:/a/b/c/ and lower0:/a/b/d. Note, despite the fact lower1:/a/[b] is opaque, we need to continue to look into lower0 because children c has an absolute redirect. Following is a reproducer. Watch me make foo disappear: $ mkdir lower middle upper work work2 merged $ mkdir lower/origin $ touch lower/origin/foo $ mount -t overlay none merged/ \ -olowerdir=lower,upperdir=middle,workdir=work2 $ mkdir merged/pure $ mv merged/origin merged/pure/redirect $ umount merged $ mount -t overlay none merged/ \ -olowerdir=middle:lower,upperdir=upper,workdir=work $ mv merged/pure/redirect merged/redirect Now you see foo inside a twice redirected merged dir: $ ls merged/redirect foo $ umount merged $ mount -t overlay none merged/ \ -olowerdir=middle:lower,upperdir=upper,workdir=work After mount cycle you don't see foo inside the same dir: $ ls merged/redirect During middle layer lookup, the opaqueness of middle/pure is left in the lookup state and then middle/pure/redirect is wrongly treated as opaque. Fixes: 02b69b28 ("ovl: lookup redirects") Cc: <stable@vger.kernel.org> #v4.10 Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
由 Vivek Goyal 提交于
d->last signifies that this is the last layer we are looking into and there is no more. And that means this allows for some optimzation opportunities during lookup. For example, in ovl_lookup_single() we don't have to check for opaque xattr of a directory is this is the last layer we are looking into (d->last = true). But knowing for sure whether we are looking into last layer can be very tricky. If redirects are not enabled, then we can look at poe->numlower and figure out if the lookup we are about to is last layer or not. But if redircts are enabled then it is possible poe->numlower suggests that we are looking in last layer, but there is an absolute redirect present in found element and that redirects us to a layer in root and that means lookup will continue in lower layers further. For example, consider following. /upperdir/pure (opaque=y) /upperdir/pure/foo (opaque=y,redirect=/bar) /lowerdir/bar In this case pure is "pure upper". When we look for "foo", that time poe->numlower=0. But that alone does not mean that we will not search for a merge candidate in /lowerdir. Absolute redirect changes that. IOW, d->last should not be set just based on poe->numlower if redirects are enabled. That can lead to setting d->last while it should not have and that means we will not check for opaque xattr while we should have. So do this. - If redirects are not enabled, then continue to rely on poe->numlower information to determine if it is last layer or not. - If redirects are enabled, then set d->last = true only if this is the last layer in root ovl_entry (roe). Suggested-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NVivek Goyal <vgoyal@redhat.com> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Fixes: 02b69b28 ("ovl: lookup redirects") Cc: <stable@vger.kernel.org> #v4.10
-
由 Amir Goldstein 提交于
Eddie Horng reported that readdir of an overlayfs directory that was exported via NFSv3 returns entries with d_type set to DT_UNKNOWN. The reason is that while preparing the response for readdirplus, nfsd checks inside encode_entryplus_baggage() that a child dentry's inode number matches the value of d_ino returns by overlayfs readdir iterator. Because the overlayfs inodes use arbitrary inode numbers that are not correlated with the values of st_ino/d_ino, NFSv3 falls back to not encoding d_type. Although this is an allowed behavior, we can fix it for the case of all overlayfs layers on the same underlying filesystem. When NFS export is enabled and d_ino is consistent with st_ino (samefs), set the same value also to i_ino in ovl_fill_inode() for all overlayfs inodes, nfsd readdirplus sanity checks will pass. ovl_fill_inode() may be called from ovl_new_inode(), before real inode was created with ino arg 0. In that case, i_ino will be updated to real upper inode i_ino on ovl_inode_init() or ovl_inode_update(). Reported-by: NEddie Horng <eddiehorng.tw@gmail.com> Tested-by: NEddie Horng <eddiehorng.tw@gmail.com> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Fixes: 8383f174 ("ovl: wire up NFS export operations") Cc: <stable@vger.kernel.org> #v4.16 Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
- 23 3月, 2018 1 次提交
-
-
由 Mike Kravetz 提交于
A vma with vm_pgoff large enough to overflow a loff_t type when converted to a byte offset can be passed via the remap_file_pages system call. The hugetlbfs mmap routine uses the byte offset to calculate reservations and file size. A sequence such as: mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0); remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0); will result in the following when task exits/file closed, kernel BUG at mm/hugetlb.c:749! Call Trace: hugetlbfs_evict_inode+0x2f/0x40 evict+0xcb/0x190 __dentry_kill+0xcb/0x150 __fput+0x164/0x1e0 task_work_run+0x84/0xa0 exit_to_usermode_loop+0x7d/0x80 do_syscall_64+0x18b/0x190 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 The overflowed pgoff value causes hugetlbfs to try to set up a mapping with a negative range (end < start) that leaves invalid state which causes the BUG. The previous overflow fix to this code was incomplete and did not take the remap_file_pages system call into account. [mike.kravetz@oracle.com: v3] Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com [akpm@linux-foundation.org: include mmdebug.h] [akpm@linux-foundation.org: fix -ve left shift count on sh] Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com Fixes: 045c7a3f ("hugetlbfs: fix offset overflow in hugetlbfs mmap") Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com> Reported-by: NNic Losby <blurbdust@gmail.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 3月, 2018 2 次提交
-
-
由 Grygorii Strashko 提交于
The sysfs_create_link_nowarn() is going to be used in phylib framework in subsequent patch which can be built as module. Hence, export sysfs_create_link_nowarn() to avoid build errors. Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: Andrew Lunn <andrew@lunn.ch> Fixes: a3995460 ("net: phy: Relax error checking on sysfs_create_link()") Signed-off-by: NGrygorii Strashko <grygorii.strashko@ti.com> Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jeff Layton 提交于
We had some reports of panics in nfsd4_lm_notify, and that showed a nfs4_lockowner that had outlived its so_client. Ensure that we walk any leftover lockowners after tearing down all of the stateids, and remove any blocked locks that they hold. With this change, we also don't need to walk the nbl_lru on nfsd_net shutdown, as that will happen naturally when we tear down the clients. Fixes: 76d348fa (nfsd: have nfsd4_lock use blocking locks for v4.1+ locks) Reported-by: NFrank Sorenson <fsorenso@redhat.com> Signed-off-by: NJeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org # 4.9 Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 16 3月, 2018 2 次提交
-
-
由 David Sterba 提交于
This reverts commit 3c181c12. The offending patch was merged in 4.16-rc4 and was promptly applied to stable kernels 4.14.25 and 4.15.8. The patch causes a corruption in several superblock items on big-endian machines because of messed up endianity conversions. The damage is manually repairable. A filesystem cannot be mounted again after it has been unmounted once. We do a full revert and not a fixup so stable can pick that patch ASAP. Fixes: 3c181c12 ("btrfs: use proper endianness accessors for super_copy") Link: https://lkml.kernel.org/r/1521139304@msgid.manchmal.in-ulm.de CC: stable@vger.kernel.org # 4.14+ Reported-by: NChristoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Eric W. Biederman 提交于
On nfsv2 and nfsv3 the nfs server can export subsets of the same filesystem and report the same filesystem identifier, so that the nfs client can know they are the same filesystem. The subsets can be from disjoint directory trees. The nfsv2 and nfsv3 filesystems provides no way to find the common root of all directory trees exported form the server with the same filesystem identifier. The practical result is that in struct super s_root for nfs s_root is not necessarily the root of the filesystem. The nfs mount code sets s_root to the root of the first subset of the nfs filesystem that the kernel mounts. This effects the dcache invalidation code in generic_shutdown_super currently called shrunk_dcache_for_umount and that code for years has gone through an additional list of dentries that might be dentry trees that need to be freed to accomodate nfs. When I wrote path_connected I did not realize nfs was so special, and it's hueristic for avoiding calling is_subdir can fail. The practical case where this fails is when there is a move of a directory from the subtree exposed by one nfs mount to the subtree exposed by another nfs mount. This move can happen either locally or remotely. With the remote case requiring that the move directory be cached before the move and that after the move someone walks the path to where the move directory now exists and in so doing causes the already cached directory to be moved in the dcache through the magic of d_splice_alias. If someone whose working directory is in the move directory or a subdirectory and now starts calling .. from the initial mount of nfs (where s_root == mnt_root), then path_connected as a heuristic will not bother with the is_subdir check. As s_root really is not the root of the nfs filesystem this heuristic is wrong, and the path may actually not be connected and path_connected can fail. The is_subdir function might be cheap enough that we can call it unconditionally. Verifying that will take some benchmarking and the result may not be the same on all kernels this fix needs to be backported to. So I am avoiding that for now. Filesystems with snapshots such as nilfs and btrfs do something similar. But as the directory tree of the snapshots are disjoint from one another and from the main directory tree rename won't move things between them and this problem will not occur. Cc: stable@vger.kernel.org Reported-by: NAl Viro <viro@ZenIV.linux.org.uk> Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root") Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 15 3月, 2018 4 次提交
-
-
由 Edmund Nadolski 提交于
This patch addresses an issue that causes fiemap to falsely report a shared extent. The test case is as follows: xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5 sync xfs_io -c "fiemap -v" /media/scratch/file5 which gives the resulting output: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x2001 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 This is because btrfs_check_shared calls find_parent_nodes repeatedly in a loop, passing a share_check struct to report the count of shared extent. But btrfs_check_shared does not re-initialize the count value to zero for subsequent calls from the loop, resulting in a false share count value. This is a regressive behavior from 4.13. With proper re-initialization the test result is as follows: wrote 65536/65536 bytes at offset 0 64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec) /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 /media/scratch/file5: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..127]: 24576..24703 128 0x1 which corrects the regression. Fixes: 3ec4d323 ("btrfs: allow backref search checks for shared extents") Signed-off-by: NEdmund Nadolski <enadolski@suse.com> [ add text from cover letter to changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Dmitriy Gorokh 提交于
On detaching of a disk which is a part of a RAID6 filesystem, the following kernel OOPS may happen: [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0 [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0 [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0 [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080 [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 [63122.971202] Oops: 0000 [#1] SMP [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8 [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000 [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287 [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690 [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000 [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600 [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500 [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004 [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000 [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0 [63123.009969] Call Trace: [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] [63123.010251] bio_endio+0xa1/0x120 [63123.010378] generic_make_request+0x218/0x270 [63123.010921] submit_bio+0x66/0x130 [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] [63123.011759] ? ___cache_free+0x1c5/0x300 [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] [63123.012656] process_one_work+0x14d/0x350 [63123.012888] worker_thread+0x4d/0x3a0 [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 [63123.013192] kthread+0x109/0x140 [63123.013315] ? process_scheduled_works+0x40/0x40 [63123.013472] ? kthread_stop+0x110/0x110 [63123.013610] ret_from_fork+0x25/0x30 [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8 [63123.014678] CR2: 0000000000000080 [63123.016590] ---[ end trace a295ea7259c17880 ]— This is reproducible in a cycle, where a series of writes is followed by SCSI device delete command. The test may take up to few minutes. Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index") [ no signed-off-by provided ] Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Tejun Heo 提交于
While converting ioctx index from a list to a table, db446a08 ("aio: convert the ioctx list to table lookup v3") missed tagging kioctx_table->table[] as an array of RCU pointers and using the appropriate RCU accessors. This introduces a small window in the lookup path where init and access may race. Mark kioctx_table->table[] with __rcu and use the approriate RCU accessors when using the field. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NJann Horn <jannh@google.com> Fixes: db446a08 ("aio: convert the ioctx list to table lookup v3") Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.12+
-
由 Tejun Heo 提交于
While fixing refcounting, e34ecee2 ("aio: Fix a trinity splat") incorrectly removed explicit RCU grace period before freeing kioctx. The intention seems to be depending on the internal RCU grace periods of percpu_ref; however, percpu_ref uses a different flavor of RCU, sched-RCU. This can lead to kioctx being freed while RCU read protected dereferences are still in progress. Fix it by updating free_ioctx() to go through call_rcu() explicitly. v2: Comment added to explain double bouncing. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NJann Horn <jannh@google.com> Fixes: e34ecee2 ("aio: Fix a trinity splat") Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org # v3.13+
-
- 09 3月, 2018 3 次提交
-
-
由 Trond Myklebust 提交于
We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to ensure that all outstanding COMMIT requests to the inode in question are complete. Currently we may exit early from both nfs_commit_inode() and nfs_write_inode() even if there are COMMIT requests in flight, or unstable writes on the commit list. In order to get the right semantics w.r.t. sync_inode(), we don't need to have nfs_commit_inode() reset the inode dirty flags when called from nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that nfs_write_inode() leaves them in the right state if there are outstanding commits, or stable pages. Reported-by: NScott Mayhew <smayhew@redhat.com> Fixes: dc4fd9ab ("nfs: don't wait on commit in nfs_commit_inode()...") Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Trond Myklebust 提交于
Ensure that we hold a reference to the layout header when processing the pNFS return-on-close so that the refcount value does not inadvertently go to zero. Reported-by: NTigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com> Cc: stable@vger.kernel.org # v4.10+ Tested-by: NTigran Mkrtchyan <tigran.mkrtchyan@desy.de>
-
由 Trond Myklebust 提交于
The start offset needs to be of type loff_t. Fixed: 5fadeb47 ("nfs: count DIO good bytes correctly with mirroring") Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
- 08 3月, 2018 1 次提交
-
-
由 Andreas Gruenbacher 提交于
It turns out that commit 3229c18c0d6b2 'Fixes to "Implement iomap for block_map"' introduced another bug in gfs2_iomap_begin that can cause gfs2_block_map to set bh->b_size of an actual buffer to 0. This can lead to arbitrary incorrect behavior including crashes or disk corruption. Revert the incorrect part of that commit. Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com> Signed-off-by: NBob Peterson <rpeterso@redhat.com>
-
- 07 3月, 2018 1 次提交
-
-
由 Miklos Szeredi 提交于
Add some hints about overlayfs kernel config options. Enabling NFS export by default is especially recommended against, as it incurs a performance penalty even if the filesystem is not actually exported. Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
- 02 3月, 2018 3 次提交
-
-
由 Christoph Hellwig 提交于
Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed, so that we don't block io_submit callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
There is no reason to take the ilock exclusively at the start of xfs_file_iomap_begin for direct I/O, given that it will be demoted just before calling xfs_iomap_write_direct anyway. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
由 Christoph Hellwig 提交于
The iomap zeroing interface is smart enough to skip zeroing holes or unwritten extents. Don't subvert this logic for reflink files. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com> Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
-
- 01 3月, 2018 5 次提交
-
-
由 Chengguang Xu 提交于
There is lack of cache destroy operation for ceph_file_cachep when failing from fscache register. Signed-off-by: NChengguang Xu <cgxu519@icloud.com> Reviewed-by: NIlya Dryomov <idryomov@gmail.com> Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
-
由 Filipe Manana 提交于
If we have a file with 2 (or more) hard links in the same directory, remove one of the hard links, create a new file (or link an existing file) in the same directory with the name of the removed hard link, and then finally fsync the new file, we end up with a log that fails to replay, causing a mount failure. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ touch /mnt/testdir/foo $ ln /mnt/testdir/foo /mnt/testdir/bar $ sync $ unlink /mnt/testdir/bar $ touch /mnt/testdir/bar $ xfs_io -c "fsync" /mnt/testdir/bar <power failure> $ mount /dev/sdb /mnt mount: mount(2) failed: /mnt: No such file or directory When replaying the log, for that example, we also see the following in dmesg/syslog: [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257 [71813.674204] ------------[ cut here ]------------ [71813.675694] BTRFS: Transaction aborted (error -2) [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs] [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G W 4.15.0-rc9-btrfs-next-56+ #1 [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs] [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286 [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001 [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001 [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8 [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101 [71813.679669] FS: 00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 [71813.679669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0 [71813.679669] Call Trace: [71813.679669] btrfs_unlink_inode+0x17/0x41 [btrfs] [71813.679669] drop_one_dir_item+0xfa/0x131 [btrfs] [71813.679669] add_inode_ref+0x71e/0x851 [btrfs] [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] ? replay_one_buffer+0x53/0x53a [btrfs] [71813.679669] replay_one_buffer+0x4a4/0x53a [btrfs] [71813.679669] ? rcu_read_unlock+0x3a/0x57 [71813.679669] ? __lock_is_held+0x39/0x71 [71813.679669] walk_up_log_tree+0x101/0x1d2 [btrfs] [71813.679669] walk_log_tree+0xad/0x188 [btrfs] [71813.679669] btrfs_recover_log_trees+0x1fa/0x31e [btrfs] [71813.679669] ? replay_one_extent+0x544/0x544 [btrfs] [71813.679669] open_ctree+0x1cf6/0x2209 [btrfs] [71813.679669] btrfs_mount_root+0x368/0x482 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] btrfs_mount+0x13e/0x772 [btrfs] [71813.679669] ? trace_hardirqs_on_caller+0x14c/0x1a6 [71813.679669] ? __lockdep_init_map+0x176/0x1c2 [71813.679669] ? mount_fs+0x64/0x10b [71813.679669] mount_fs+0x64/0x10b [71813.679669] vfs_kern_mount+0x68/0xce [71813.679669] do_mount+0x6e5/0x973 [71813.679669] ? memdup_user+0x3e/0x5c [71813.679669] SyS_mount+0x72/0x98 [71813.679669] entry_SYSCALL_64_fastpath+0x1e/0x8b [71813.679669] RIP: 0033:0x7f7cedf150ba [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206 [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c [71813.679669] ---[ end trace 83bd473fc5b4663b ]--- [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree) [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30 [71814.128078] BTRFS error (device dm-0): open_ctree failed This happens because the log has inode reference items for both inode 258 (the first file we created) and inode 259 (the second file created), and when processing the reference item for inode 258, we replace the corresponding item in the subvolume tree (which has two names, "foo" and "bar") witht he one in the log (which only has one name, "foo") without removing the corresponding dir index keys from the parent directory. Later, when processing the inode reference item for inode 259, which has a name of "bar" associated to it, we notice that dir index entries exist for that name and for a different inode, so we attempt to unlink that name, which fails because the inode reference item for inode 258 no longer has the name "bar" associated to it, making a call to btrfs_unlink_inode() fail with a -ENOENT error. Fix this by unlinking all the names in an inode reference item from a subvolume tree that are not present in the inode reference item found in the log tree, before overwriting it with the item from the log tree. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
If in the same transaction we rename a special file (fifo, character/block device or symbolic link), create a hard link for it having its old name then sync the log, we will end up with a log that can not be replayed and at when attempting to replay it, an EEXIST error is returned and mounting the filesystem fails. Example scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ mkfifo /mnt/testdir/foo # Make sure everything done so far is durably persisted. $ sync # Create some unrelated file and fsync it, this is just to create a log # tree. The file must be in the same directory as our special file. $ touch /mnt/testdir/f1 $ xfs_io -c "fsync" /mnt/testdir/f1 # Rename our special file and then create a hard link with its old name. $ mv /mnt/testdir/foo /mnt/testdir/bar $ ln /mnt/testdir/bar /mnt/testdir/foo # Create some other unrelated file and fsync it, this is just to persist # the log tree which was modified by the previous rename and link # operations. Alternatively we could have modified file f1 and fsync it. $ touch /mnt/f2 $ xfs_io -c "fsync" /mnt/f2 <power failure> $ mount /dev/sdc /mnt mount: mount /dev/sdc on /mnt failed: File exists This happens because when both the log tree and the subvolume's tree have an entry in the directory "testdir" with the same name, that is, there is one key (258 INODE_REF 257) in the subvolume tree and another one in the log tree (where 258 is the inode number of our special file and 257 is the inode for directory "testdir"). Only the data of those two keys differs, in the subvolume tree the index field for inode reference has a value of 3 while the log tree it has a value of 5. Because the same key exists in both trees, but have different index, the log replay fails with an -EEXIST error when attempting to replay the inode reference from the log tree. Fix this by setting the last_unlink_trans field of the inode (our special file) to the current transaction id when a hard link is created, as this forces logging the parent directory inode, solving the conflict at log replay time. A new generic test case for fstests was also submitted. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When doing an incremental send of a filesystem with the no-holes feature enabled, we end up issuing a write operation when using the no data mode send flag, instead of issuing an update extent operation. Fix this by issuing the update extent operation instead. Trivial reproducer: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdc /mnt/sdc $ mount /dev/sdd /mnt/sdd $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1 $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2 $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \ | btrfs receive -vv /mnt/sdd Before this change the output of the second receive command is: receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447... utimes write foobar, offset 8192, len 8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-... After this change it is: receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... utimes update_extent foobar: offset=8192, len=8192 utimes foobar BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64... Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
The fs_info::super_copy is a byte copy of the on-disk structure and all members must use the accessor macros/functions to obtain the right value. This was missing in update_super_roots and in sysfs readers. Moving between opposite endianness hosts will report bogus numbers in sysfs, and mount may fail as the root will not be restored correctly. If the filesystem is always used on a same endian host, this will not be a problem. Fix this by using the btrfs_set_super...() functions to set fs_info::super_copy values, and for the sysfs, use the cached fs_info::nodesize/sectorsize values. CC: stable@vger.kernel.org Fixes: df93589a ("btrfs: export more from FS_INFO to sysfs") Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-