- 16 8月, 2017 18 次提交
-
-
由 Aleksa Sarai 提交于
Several distributions mount the "proper root" as ro during initrd and then remount it as rw before pivot_root(2). Thus, if a rescan had been aborted by a previous shutdown, the rescan would never be resumed. This issue would manifest itself as several btrfs ioctl(2)s causing the entire machine to hang when btrfs_qgroup_wait_for_completion was hit (due to the fs_info->qgroup_rescan_running flag being set but the rescan itself not being resumed). Notably, Docker's btrfs storage driver makes regular use of BTRFS_QUOTA_CTL_DISABLE and BTRFS_IOC_QUOTA_RESCAN_WAIT (causing this problem to be manifested on boot for some machines). Cc: <stable@vger.kernel.org> # v3.11+ Cc: Jeff Mahoney <jeffm@suse.com> Fixes: b382a324 ("Btrfs: fix qgroup rescan resume on mount") Signed-off-by: NAleksa Sarai <asarai@suse.de> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
Repeating the same computation in multiple places is not necessary. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
When called with a struct share_check, find_parent_nodes() will detect a shared extent and immediately return with BACKREF_SHARED_FOUND. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
Since backref resolution is CPU-intensive, the cond_resched calls should help alleviate soft lockup occurences. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
This patch adds a tracepoint event for prelim_ref insertion and merging. For each, the ref being inserted or merged and the count of tree nodes is issued. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
This patch adds counters to each of the rbtrees so that we can tell how large they are growing for a given workload. These counters will be exported by tracepoints in the next patch. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
It's been known for a while that the use of multiple lists that are periodically merged was an algorithmic problem within btrfs. There are several workloads that don't complete in any reasonable amount of time (e.g. btrfs/130) and others that cause soft lockups. The solution is to use a set of rbtrees that do insertion merging for both indirect and direct refs, with the former converting refs into the latter. The result is a btrfs/130 workload that used to take several hours now takes about half of that. This runtime still isn't acceptable and a future patch will address that by moving the rbtrees higher in the stack so the lookups can be shared across multiple calls to find_parent_nodes. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
Commit afce772e ("btrfs: fix check_shared for fiemap ioctl") added the ref_tree code in backref.c to reduce backref searching for shared extents under the FIEMAP ioctl. This code will not be compatible with the upcoming rbtree changes for improved backref searching, so this patch removes the ref_tree code. The rbtree changes will provide the equivalent functionality for FIEMAP. The above commit also introduced transaction semantics around calls to btrfs_check_shared() in order to accurately account for delayed refs. This functionality needs to be retained, so a complete revert of the above commit is not desirable. This patch therefore removes the ref_tree portion of the commit as above, however it does not remove the transaction portion. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Edmund Nadolski 提交于
Commit afce772e ("btrfs: fix check_shared for fiemap ioctl") added transaction semantics around calls to btrfs_check_shared() in order to provide accurate accounting of delayed refs. The transaction management should be done inside btrfs_check_shared(), so that callers do not need to manage transactions individually. Signed-off-by: NEdmund Nadolski <enadolski@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
We typically use __ to indicate a helper routine that shouldn't be called directly without understanding the proper context required to do so. We use static functions to indicate that a function is private to a particular C file. The backref code uses static function and __ prefixes on nearly everything, which makes the code difficult to read and establishes a pattern for future code that shouldn't be followed. This patch drops all the unnecessary prefixes. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
Replacing the double cast and ternary conditional with a helper makes the code easier on the eyes. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
This constifies a few buffers used in the backref code. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
We have reader helpers for most of the on-disk structures that use an extent_buffer and pointer as offset into the buffer that are read-only. We should mark them as const and, in turn, allow consumers of these interfaces to mark the buffers const as well. No impact on code, but serves as documentation that a buffer is intended not to be modified. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
The sectorsize member of btrfs_block_group_cache is unused. So remove it, this reduces the number of holes in the struct. With patch: /* size: 856, cachelines: 14, members: 40 */ /* sum members: 837, holes: 4, sum holes: 19 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 24 bytes */ Without patch: /* size: 864, cachelines: 14, members: 41 */ /* sum members: 841, holes: 5, sum holes: 23 */ /* bit holes: 1, sum bit holes: 29 bits */ /* last cacheline: 32 bytes */ Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
__btrfs_alloc_chunk contains code which boils down to: ndevs = min(ndevs, devs_max) It's conditional upon devs_max not being 0. However, it cannot really be 0 since it's always set to either BTRFS_MAX_DEVS_SYS_CHUNK or BTRFS_MAX_DEVS(fs_info->chunk_root). So eliminate the condition check and use min explicitly. This has no functional changes. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
No functional changes. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
No functional changes, just make the loop a bit more readable Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 24 7月, 2017 3 次提交
-
-
由 Nikolay Borisov 提交于
Further testing showed that the fix introduced in 7dfb8be1 ("btrfs: Round down values which are written for total_bytes_size") was insufficient and it could still lead to discrepancies between the total_bytes in the super block and the device total bytes. So this patch also ensures that the difference between old/new sizes when shrinking/growing is also rounded down. This ensure that we won't be subtracting/adding a non-sectorsize multiples to the superblock/device total sizees. Fixes: 7dfb8be1 ("btrfs: Round down values which are written for total_bytes_size") Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Omar Sandoval 提交于
If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.) Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs. Fixes: 957780eb ("Btrfs: introduce ticketed enospc infrastructure") Tested-by: NChristoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: stable@vger.kernel.org Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
If we have a block group that is all of the following: 1) uncached in memory 2) is read-only 3) has a disk cache state that indicates we need to recreate the cache AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored; AND have a single CPU core; AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU; We can end up with a lockup. The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This *should* present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check. During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups. Then we move to LOOP_CACHING_WAIT state. During the this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever. There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again. Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 20 7月, 2017 1 次提交
-
-
由 Filipe Manana 提交于
We were passing an incorrect slot number to the function that validates directory items when we are replaying xattr deletes from a log tree. The correct slot is stored at variable 'i' and not at 'path->slots[0]', so the call to the validation function was only correct for the first iteration of the loop, when 'i == path->slots[0]'. After this fix, the fstest generic/066 passes again. Fixes: 8ee8c2d6 ("btrfs: Verify dir_item in replay_xattr_deletes") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 15 7月, 2017 3 次提交
-
-
由 Liu Bo 提交于
With blk_status_t conversion (that are now present in master), bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set %ret if it runs without problems. This fixes that unexpected return value by changing btrfs_check_repairable() to return a bool instead of updating %ret, and patch is applicable to both codebases with and without blk_status_t. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
As the function uses the non-failing bio allocation, we can remove error handling from the callers as well. Signed-off-by: NDavid Sterba <dsterba@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 14 7月, 2017 1 次提交
-
-
由 Filipe Manana 提交于
The recent changes to make bio cloning faster (added in the 4.13 merge window) by using the bio_clone_fast() API introduced a regression on raid5/6 modes, because cloned bios have an invalid bi_vcnt field (therefore it can not be used) and the raid5/6 code uses the bio_for_each_segment_all() API to iterate the segments of a bio, and this API uses a bio's bi_vcnt field. The issue is very simple to trigger by doing for example a direct IO write against a raid5 or raid6 filesystem and then attempting to read what we wrote before: $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf $ mount /dev/sdc /mnt $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar $ od -t x1 /mnt/foobar od: /mnt/foobar: read error: Input/output error For that example, the following is also reported in dmesg/syslog: [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2 [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3 [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1 [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1 A scrub will also fail correcting bad copies, mentioning the following in dmesg/syslog: [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $ [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$ [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$ [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$ [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$ [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0 [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$ [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0 [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$ [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0 [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0 [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $ [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0 [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$ [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0 [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde So fix this by using the bio_for_each_segment() API and setting before the bio's bi_iter field to the value of the corresponding btrfs bio container's saved iterator if we are processing a cloned bio in the raid5/6 code (the same code processes both cloned and non-cloned bios). This incorrect iteration of cloned bios was also causing some occasional BUG_ONs when running fstest btrfs/064, which have a trace like the following: [ 6674.416156] ------------[ cut here ]------------ [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897! [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1 [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs] [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000 [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs] [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217 [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001 [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004 [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001 [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002 [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000 [ 6674.416235] FS: 0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000 [ 6674.416235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0 [ 6674.416239] Call Trace: [ 6674.416259] __raid56_parity_recover+0xfc/0x16e [btrfs] [ 6674.416276] raid56_parity_recover+0x157/0x16b [btrfs] [ 6674.416293] btrfs_map_bio+0xe0/0x259 [btrfs] [ 6674.416310] btrfs_submit_bio_hook+0xbf/0x147 [btrfs] [ 6674.416327] end_bio_extent_readpage+0x27b/0x4a0 [btrfs] [ 6674.416331] bio_endio+0x17d/0x1b3 [ 6674.416346] end_workqueue_fn+0x3c/0x3f [btrfs] [ 6674.416362] btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs] [ 6674.416379] btrfs_endio_helper+0xe/0x10 [btrfs] [ 6674.416381] process_one_work+0x276/0x4b6 [ 6674.416384] worker_thread+0x1ac/0x266 [ 6674.416386] ? rescuer_thread+0x278/0x278 [ 6674.416387] kthread+0x106/0x10e [ 6674.416389] ? __list_del_entry+0x22/0x22 [ 6674.416391] ret_from_fork+0x27/0x40 [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97 [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90 [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]--- Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
-
- 10 7月, 2017 1 次提交
-
-
由 Goldwyn Rodrigues 提交于
Assigning pos for usage early messes up in append mode, where the pos is re-assigned in generic_write_checks(). Assign pos later to get the correct position to write from iocb->ki_pos. Since check_can_nocow also uses the value of pos, we shift generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT are present in generic_write_checks(), so checking for IOCB_NOWAIT is enough. Also, put locking sequence in the fast path. This fixes a user visible bug, as reported: "apparently breaks several shell related features on my system. In zsh history stopped working, because no new entries are added anymore. I fist noticed the issue when I tried to build mplayer. It uses a shell script to generate a help_mp.h file: [...] Here is a simple testcase: % echo "foo" >> test % echo "foo" >> test % cat test foo % " Fixes: edf064e7 ("btrfs: nowait aio support") CC: Jens Axboe <axboe@kernel.dk> Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Link: https://lkml.kernel.org/r/20170704042306.GA274@x4Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 07 7月, 2017 2 次提交
-
-
由 Filipe Manana 提交于
When doing an incremental send, while processing an extent that changed between the parent and send snapshots and that extent was an inline extent in the parent snapshot, it's possible to access a memory region beyond the end of leaf if the inline extent is very small and it is the first item in a leaf. An example scenario is described below. The send snapshot has the following leaf: leaf 33865728 items 33 free space 773 generation 46 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f (...) item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53 generation 36 type 1 (regular) extent data disk byte 12791808 nr 4096 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53 generation 36 type 1 (regular) extent data disk byte 138170368 nr 225280 extent data offset 0 nr 225280 ram 225280 extent compression 0 (none) (...) And the parent snapshot has the following leaf: leaf 31272960 items 17 free space 17 generation 31 owner 5 fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7 chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44 generation 31 type 0 (inline) inline extent data size 23 ram_bytes 613 compression 1 (zlib) (...) When computing the send stream, it is detected that the extent of inode 335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we grab the leaf from the parent snapshot and access the inline extent item. However, before jumping to the 'out' label, we access the 'offset' and 'disk_bytenr' fields of the extent item, which should not be done for inline extents since the inlined data starts at the offset of the 'disk_bytenr' field and can be very small. For example accessing the 'offset' field of the file extent item results in the following trace: [ 599.705368] general protection fault: 0000 [#1] PREEMPT SMP [ 599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$ [ 599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1 [ 599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000 [ 599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs] [ 599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286 [ 599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001 [ 599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f [ 599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098 [ 599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7 [ 599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088 [ 599.709340] FS: 00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000 [ 599.709340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0 [ 599.709340] Call Trace: [ 599.709340] btrfs_get_token_64+0x93/0xce [btrfs] [ 599.709340] ? printk+0x48/0x50 [ 599.709340] btrfs_get_64+0xb/0xd [btrfs] [ 599.709340] process_extent+0x3a1/0x1106 [btrfs] [ 599.709340] ? btree_read_extent_buffer_pages+0x5/0xef [btrfs] [ 599.709340] changed_cb+0xb03/0xb3d [btrfs] [ 599.709340] ? btrfs_get_token_32+0x7a/0xcc [btrfs] [ 599.709340] btrfs_compare_trees+0x432/0x53d [btrfs] [ 599.709340] ? process_extent+0x1106/0x1106 [btrfs] [ 599.709340] btrfs_ioctl_send+0x960/0xe26 [btrfs] [ 599.709340] btrfs_ioctl+0x181b/0x1fed [btrfs] [ 599.709340] ? trace_hardirqs_on_caller+0x150/0x1ac [ 599.709340] vfs_ioctl+0x21/0x38 [ 599.709340] ? vfs_ioctl+0x21/0x38 [ 599.709340] do_vfs_ioctl+0x611/0x645 [ 599.709340] ? rcu_read_unlock+0x5b/0x5d [ 599.709340] ? __fget+0x6d/0x79 [ 599.709340] SyS_ioctl+0x57/0x7b [ 599.709340] entry_SYSCALL_64_fastpath+0x18/0xad [ 599.709340] RIP: 0033:0x7f51945eec47 [ 599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47 [ 599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004 [ 599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700 [ 599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046 [ 599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040 [ 599.709340] ? trace_hardirqs_off_caller+0x43/0xb1 [ 599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$ [ 599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00 [ 599.762057] ---[ end trace fe00d7af61b9f49e ]--- This is because the 'offset' field starts at an offset of 37 bytes (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8 bytes and therefore attemping to read it causes a 1 byte access beyond the end of the leaf, as the first item's content in a leaf is located at the tail of the leaf, the item size is 44 bytes and the offset of that field plus its length (37 + 8 = 45) goes beyond the item's size by 1 byte. So fix this by accessing the 'offset' and 'disk_bytenr' fields after jumping to the 'out' label if we are processing an inline extent. We move the reading operation of the 'disk_bytenr' field too because we have the same problem as for the 'offset' field explained above when the inline data is less then 8 bytes. The access to the 'generation' field is also moved but just for the sake of grouping access to all the fields. Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent") Cc: <stable@vger.kernel.org> # v4.12+ Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
由 Filipe Manana 提交于
In some scenarios an incremental send stream can contain link commands with an invalid target path. Such scenarios happen after moving some directory inode A, renaming a regular file inode B into the old name of inode A and finally creating a new hard link for inode B at directory inode A. Consider the following example scenario where this issue happens. Parent snapshot: . (ino 256) | |--- dir1/ (ino 257) | |--- dir2/ (ino 258) | |--- dir3/ (ino 259) | |--- file1 (ino 261) | |--- dir4/ (ino 262) | |--- dir5/ (ino 260) Send snapshot: . (ino 256) | |--- dir1/ (ino 257) |--- dir2/ (ino 258) | |--- dir3/ (ino 259) | |--- dir4 (ino 261) | |--- dir6/ (ino 263) |--- dir44/ (ino 262) |--- file11 (ino 261) |--- dir55/ (ino 260) When attempting to apply the corresponding incremental send stream, a link command contains an invalid target path which makes the receiver fail. The following is the verbose output of the btrfs receive command: receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...) utimes utimes dir1 utimes dir1/dir2/dir3 utimes rename dir1/dir2/dir3/dir4 -> o262-7-0 link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1 link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory The following steps happen during the computation of the incremental send stream the lead to this issue: 1) When processing inode 261, we orphanize inode 262 due to a name/location collision with one of the new hard links for inode 261 (created in the second step below). 2) We create one of the 2 new hard links for inode 261, the one whose location is at "dir1/dir2/dir3/dir4". 3) We then attempt to create the other new hard link for inode 261, which has inode 262 as its parent directory. Because the path for this new hard link was computed before we started processing the new references (hard links), it reflects the old name/location of inode 262, that is, it does not account for the orphanization step that happened when we started processing the new references for inode 261, whence it is no longer valid, causing the receiver to fail. So fix this issue by recomputing the full path of new references if we ended up orphanizing other inodes which are directories. A test case for fstests follows soon. Signed-off-by: NFilipe Manana <fdmanana@suse.com>
-
- 06 7月, 2017 2 次提交
-
-
由 Jeff Layton 提交于
Just check and advance the errseq_t in the file before returning, and use an errseq_t based check for writeback errors. Other internal callers of filemap_* functions are left as-is. Signed-off-by: NJeff Layton <jlayton@redhat.com>
-
由 David Howells 提交于
btrfs, debugfs, reiserfs and tracefs call save_mount_options() and reiserfs calls replace_mount_options(), but they then implement their own ->show_options() methods and don't touch s_options, rendering the saved options unnecessary. I'm trying to eliminate s_options to make it easier to implement a context-based mount where the mount options can be passed individually over a file descriptor. Remove the calls to save/replace_mount_options() call in these cases. Signed-off-by: NDavid Howells <dhowells@redhat.com> cc: Chris Mason <clm@fb.com> cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> cc: Steven Rostedt <rostedt@goodmis.org> cc: linux-btrfs@vger.kernel.org cc: reiserfs-devel@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 30 6月, 2017 9 次提交
-
-
由 Qu Wenruo 提交于
Commit 4751832d ("btrfs: fiemap: Cache and merge fiemap extent before submit it to user") introduced a warning to catch unemitted cached fiemap extent. However such warning doesn't take the following case into consideration: 0 4K 8K |<---- fiemap range --->| |<----------- On-disk extent ------------------>| In this case, the whole 0~8K is cached, and since it's larger than fiemap range, it break the fiemap extent emit loop. This leaves the fiemap extent cached but not emitted, and caught by the final fiemap extent sanity check, causing kernel warning. This patch removes the kernel warning and renames the sanity check to emit_last_fiemap_cache() since it's possible and valid to have cached fiemap extent. Reported-by: NDavid Sterba <dsterba@suse.cz> Reported-by: NAdam Borowski <kilobyte@angband.pl> Fixes: 4751832d ("btrfs: fiemap: Cache and merge fiemap extent ...") Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jan Kara 提交于
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 07393101 CC: stable@vger.kernel.org CC: linux-btrfs@vger.kernel.org CC: David Sterba <dsterba@suse.com> Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Chris Mason 提交于
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af24 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: NDave Jones <davej@codemonkey.org.uk> Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a helper but wrongly sets up the target device. Incidentally there's a local variable with the same name as a parameter in the previous function, so this got caught during runtime as crash in test btrfs/027. Reported-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled [BUG] Under the following case, we can underflow qgroup reserved space. Task A | Task B --------------------------------------------------------------- Quota disabled | Buffered write | |- btrfs_check_data_free_space() | | *NO* qgroup space is reserved | | since quota is *DISABLED* | |- All pages are copied to page | cache | | Enable quota | Quota scan finished | | Sync_fs | |- run_delalloc_range | |- Write pages | |- btrfs_finish_ordered_io | |- insert_reserved_file_extent | |- btrfs_qgroup_release_data() | Since no qgroup space is reserved in Task A, we underflow qgroup reserved space This can be detected by fstest btrfs/104. [CAUSE] In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes size of qgroup reserved_space in all cases. And btrfs_qgroup_release_data() will check if quotas are enabled. However in the above case, the buffered write happens before quota is enabled, so we don't have the reserved space for that range. [FIX] In insert_reserved_file_extent(), we tell qgroup to release the acctual byte number it released. In the above case, since we don't have the reserved space, we tell qgroups to release 0 byte, so the problem can be fixed. And thanks to the @reserved parameter introduced by the qgroup rework, and previous patch to return released bytes, the fix can be as small as 10 lines. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ changelog updates ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
btrfs_qgroup_release/free_data() only returns 0 or a negative error number (ENOMEM is the only possible error). This is normally good enough, but sometimes we need the exact byte count it freed/released. Change it to return actually released/freed bytenr number instead of 0 for success. And slightly modify related extent_changeset structure, since in btrfs one no-hole data extent won't be larger than 128M, so "unsigned int" is large enough for the use case. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Quite a lot of qgroup corruption happens due to wrong time of calling btrfs_qgroup_prepare_account_extents(). Since the safest time is to call it just before btrfs_qgroup_account_extents(), there is no need to separate these 2 functions. Merging them will make code cleaner and less bug prone. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ changelog and comment adjustments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-