- 30 11月, 2016 2 次提交
-
-
由 David Sterba 提交于
'start' is not used since "btrfs: reada: Pass reada_extent into __readahead_hook directly" (6e39dbe8). Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
We can't touch the eb directly in case the function is called with a non-zero error, so we can read the eb level when needed. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 29 11月, 2016 1 次提交
-
-
由 Christoph Hellwig 提交于
btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 9月, 2016 3 次提交
-
-
由 Jeff Mahoney 提交于
For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 5月, 2016 1 次提交
-
-
由 Filipe Manana 提交于
The list of devices is protected by the device_list_mutex and the device replace code, in its finishing phase correctly takes that mutex before removing the source device from that list. However the readahead code was iterating that list without acquiring the respective mutex leading to crashes later on due to invalid memory accesses: [125671.831036] general protection fault: 0000 [#1] PREEMPT SMP [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1 [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs] [125671.834973] task: ffff8801ac520540 ti: ffff8801ac918000 task.ti: ffff8801ac918000 [125671.834973] RIP: 0010:[<ffffffff81270479>] [<ffffffff81270479>] __radix_tree_lookup+0x6a/0x105 [125671.834973] RSP: 0018:ffff8801ac91bc28 EFLAGS: 00010206 [125671.834973] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6a RCX: 0000000000000000 [125671.834973] RDX: 0000000000000000 RSI: 00000000000c1bff RDI: ffff88002ebd62a8 [125671.834973] RBP: ffff8801ac91bc70 R08: 0000000000000001 R09: 0000000000000000 [125671.834973] R10: ffff8801ac91bc70 R11: 0000000000000000 R12: ffff88002ebd62a8 [125671.834973] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000c1bff [125671.834973] FS: 0000000000000000(0000) GS:ffff88023fd40000(0000) knlGS:0000000000000000 [125671.834973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [125671.834973] CR2: 000000000073cae4 CR3: 00000000b7723000 CR4: 00000000000006e0 [125671.834973] Stack: [125671.834973] 0000000000000000 ffff8801422d5600 ffff8802286bbc00 0000000000000000 [125671.834973] 0000000000000001 ffff8802286bbc00 00000000000c1bff 0000000000000000 [125671.834973] ffff88002e639eb8 ffff8801ac91bc80 ffffffff81270541 ffff8801ac91bcb0 [125671.834973] Call Trace: [125671.834973] [<ffffffff81270541>] radix_tree_lookup+0xd/0xf [125671.834973] [<ffffffffa04ae6a6>] reada_peer_zones_set_lock+0x3e/0x60 [btrfs] [125671.834973] [<ffffffffa04ae8b9>] reada_pick_zone+0x29/0x103 [btrfs] [125671.834973] [<ffffffffa04af42f>] reada_start_machine_worker+0x129/0x2d3 [btrfs] [125671.834973] [<ffffffffa04880be>] btrfs_scrubparity_helper+0x185/0x3aa [btrfs] [125671.834973] [<ffffffffa0488341>] btrfs_readahead_helper+0xe/0x10 [btrfs] [125671.834973] [<ffffffff81069691>] process_one_work+0x271/0x4e9 [125671.834973] [<ffffffff81069dda>] worker_thread+0x1eb/0x2c9 [125671.834973] [<ffffffff81069bef>] ? rescuer_thread+0x2b3/0x2b3 [125671.834973] [<ffffffff8106f403>] kthread+0xd4/0xdc [125671.834973] [<ffffffff8149e242>] ret_from_fork+0x22/0x40 [125671.834973] [<ffffffff8106f32f>] ? kthread_stop+0x286/0x286 So fix this by taking the device_list_mutex in the readahead code. We can't use here the lighter approach of using a rcu_read_lock() and rcu_read_unlock() pair together with a list_for_each_entry_rcu() call because we end up doing calls to sleeping functions (kzalloc()) in the respective code path. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
- 05 4月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2016 1 次提交
-
-
由 Liu Bo 提交于
Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 18 2月, 2016 13 次提交
-
-
由 Zhao Lei 提交于
For a non-existent device, old code bypasses adding it in dev's reada queue. And to solve problem of unfinished waitting in raid5/6, commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") adding an exception for the first stripe, in short, the first stripe will always be processed whether the device exists or not. Actually we have a better way for the above request: just bypass creation of the reada_extent for non-existent device, it will make code simple and effective. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Reada background works is not designed to finish all jobs completely, it will break in following case: 1: When a device reaches workload limit (MAX_IN_FLIGHT) 2: Total reads reach max limit (10000) 3: All devices don't have queued more jobs, often happened in DUP case And if all background works exit with remaining jobs, btrfs_reada_wait() will wait indefinetelly. Above problem is rarely happened in old code, because: 1: Every work queues 2x new works So many works reduced chances of undone jobs. 2: One work will continue 10000 times loop in case of no-jobs It reduced no-thread window time. But after we fixed above case, the "undone reada extents" frequently happened. Fix: Check to ensure we have at least one thread if there are undone jobs in btrfs_reada_wait(). Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Reada creates 2 works for each level of tree recursively. In case of a tree having many levels, the number of created works is 2^level_of_tree. Actually we don't need so many works in parallel, this patch limits max works to BTRFS_MAX_MIRRORS * 2. The per-fs works_counter will be also used for btrfs_reada_wait() to check is there are background workers. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
No need to decrease dev->reada_in_flight in __readahead_hook()'s internal and reada_extent_put(). reada_extent_put() have no chance to decrease dev->reada_in_flight in free operation, because reada_extent have additional refcnt when scheduled to a dev. We can put inc and dec operation for dev->reada_in_flight to one place instead to make logic simple and safe, and move useless reada_extent->scheduled_for to a bool flag instead. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Remove one copy of loop to fix the typo of iterate zones. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Current code set nritems to 0 to make for_loop useless to bypass it, and set generation's value which is not necessary. Jump into cleanup directly is better choise. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
What __readahead_hook() need exactly is fs_info, no need to convert fs_info to root in caller and convert back in __readahead_hook() Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
reada_start_machine_dev() already have reada_extent pointer, pass it into __readahead_hook() directly instead of search radix_tree will make code run faster. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
We can't release reada_extent earlier than __readahead_hook(), because __readahead_hook() still need to use it, it is necessary to hode a refcnt to avoid it be freed. Actually it is not a problem after my patch named: Avoid many times of empty loop It make reada_extent in above line include at least one reada_extctl, which keeps additional one refcnt for reada_extent. But we still need this patch to make the code in pretty logic. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
level is not used in severial functions, remove them from arguments, and remove relative code for get its value. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
When failed adding all dev_zones for a reada_extent, the extent will have no chance to be selected to run, and keep in memory for ever. We should bypass this extent to avoid above case. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
If some device is not reachable, we should bypass and continus addingb next, instead of break on bad device. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Move is_need_to_readahead contition earlier to avoid useless loop to get relative data for readahead. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 2月, 2016 4 次提交
-
-
由 Zhao Lei 提交于
We can see following loop(10000 times) in trace_log: [ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 ...(10000 times) [ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 Reason: If more than one user trigger reada in same extent, the first task finished setting of reada data struct and call reada_start_machine() to start, and the second task only add a ref_count but have not add reada_extctl struct completely, the reada_extent can not finished all jobs, and will be selected in __reada_start_machine() for 10000 times(total times in __reada_start_machine()). Fix: For a reada_extent without job, we don't need to run it, just return 0 to let caller break. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
In rechecking zone-in-tree, we still need to check zone include our logical address. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
We can avoid additional locking-acquirment and one pair of kref_get/put by combine two condition. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
reada_zone->end is end pos of segment: end = start + cache->key.offset - 1; So we need to use "<=" in condition to judge is a pos in the segment. The problem happened rearly, because logical pos rarely pointed to last 4k of a blockgroup, but we need to fix it to make code right in logic. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 2月, 2016 1 次提交
-
-
由 David Sterba 提交于
The readahead framework is not on the critical writeback path we don't need to use GFP_NOFS for allocations. All error paths are handled and the readahead failures are not fatal. The actual users (scrub, dev-replace) will trigger reads if the blocks are not found in cache. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 10月, 2015 1 次提交
-
-
由 Luis de Bethencourt 提交于
reada is using -1 instead of the -ENOMEM defined macro to specify that a buffer allocation failed. Since the error number is propagated, the caller will get a -EPERM which is the wrong error condition. Also, updating the caller to return the exact value from reada_add_block. Smatch tool warning: reada_add_block() warn: returning -1 instead of -ENOMEM is sloppy Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NLuis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 09 8月, 2015 1 次提交
-
-
由 Omar Sandoval 提交于
Commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") fixed a problem where we would skip a missing device when we shouldn't have because there are no other mirrors to read from in RAID 5/6. After commit 2c8cdd6e ("Btrfs, replace: write dirty pages into the replace target device"), the fix doesn't work when we're doing a missing device replace on RAID 5/6 because the replace device is counted as a mirror so we're tricked into thinking we can safely skip the missing device. The fix is to count only the real stripes and decide based on that. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 22 1月, 2015 1 次提交
-
-
由 Zhao Lei 提交于
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 13 12月, 2014 2 次提交
-
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Replace with global nodesize instead. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
- 18 9月, 2014 1 次提交
-
-
由 David Sterba 提交于
The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
- 24 8月, 2014 1 次提交
-
-
由 Liu Bo 提交于
This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 14 6月, 2014 1 次提交
-
-
由 Wang Shilong 提交于
Steps to reproduce: # mkfs.btrfs -f /dev/sd[b-f] -m raid5 -d raid5 # mkfs.ext4 /dev/sdc --->corrupt one of btrfs device # mount /dev/sdb /mnt -o degraded # btrfs scrub start -BRd /mnt This is because readahead would skip missing device, this is not true for RAID5/6, because REQ_GET_READ_MIRRORS return 1 for RAID5/6 block mapping. If expected data locates in missing device, readahead thread would not call __readahead_hook() which makes event @rc->elems=0 wait forever. Fix this problem by checking return value of btrfs_map_block(),we can only skip missing device safely if there are several mirrors. Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 11 3月, 2014 2 次提交
-
-
由 Qu Wenruo 提交于
Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NJosef Bacik <jbacik@fb.com>
-
由 Qu Wenruo 提交于
Replace the fs_info->readahead_workers with the newly created btrfs_workqueue. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NJosef Bacik <jbacik@fb.com>
-
- 29 1月, 2014 1 次提交
-
-
由 Frank Holton 提交于
Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: NFrank Holton <fholton@gmail.com> Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 07 5月, 2013 1 次提交
-
-
由 Vincent 提交于
This fixes the following errors: fs/btrfs/reada.c: In function ‘btrfs_reada_wait’: fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’) fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’) Signed-off-by: NVincent Stehlé <vincent.stehle@laposte.net> Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
- 13 12月, 2012 1 次提交
-
-
由 Stefan Behrens 提交于
Before this commit, btrfs_map_block() was called with REQ_WRITE in order to retrieve the list of mirrors for a disk block. This needs to be changed for the device replace procedure since it makes a difference whether you are asking for read mirrors or for locations to write to. GET_READ_MIRRORS is introduced as a new interface to call btrfs_map_block(). In the current commit, the functionality is not yet changed, only the interface for GET_READ_MIRRORS is introduced and all the places that should use this new interface are adapted. The reason that REQ_WRITE cannot be abused anymore to retrieve a list of read mirrors is that during a running dev replace operation all write requests to the live filesystem are duplicated to also write to the target drive. Keep in mind that the target disk is only partially a valid copy of the source disk while the operation is ongoing. All writes go to the target disk, but not all reads would return valid data on the target disk. Therefore it is not possible anymore to abuse a REQ_WRITE interface to find valid mirrors for a REQ_READ. Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-