- 23 2月, 2016 2 次提交
-
-
由 Liu Bo 提交于
Xfstests btrfs/011 complains about a deadlock warning, [ 1226.649039] ========================================================= [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ] [ 1226.649039] 4.1.0+ #270 Not tainted [ 1226.649039] --------------------------------------------------------- [ 1226.652955] kswapd0/46 just changed the state of lock: [ 1226.652955] (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1226.652955] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1226.652955] other info that might help us debug this: [ 1226.652955] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1226.652955] Possible interrupt unsafe locking scenario: [ 1226.652955] CPU0 CPU1 [ 1226.652955] ---- ---- [ 1226.652955] lock(&fs_info->dev_replace.lock); [ 1226.652955] local_irq_disable(); [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] lock(&found->groups_sem); [ 1226.652955] <Interrupt> [ 1226.652955] lock(&delayed_node->mutex); [ 1226.652955] *** DEADLOCK *** Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried to fix a similar one that has the exactly same warning, but with that, we still run to this. The above lock chain comes from btrfs_commit_transaction ->btrfs_run_delayed_items ... ->__btrfs_update_delayed_inode ... ->__btrfs_cow_block ... ->find_free_extent ->cache_block_group ->load_free_space_cache ->btrfs_readpages ->submit_one_bio ... ->__btrfs_map_block ->btrfs_dev_replace_lock However, with high memory pressure, tasks which hold dev_replace.lock can be interrupted by kswapd and then kswapd is intended to release memory occupied by superblock, inodes and dentries, where we may call evict_inode, and it comes to [ 1226.652955] [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0 [ 1226.652955] [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30 [ 1226.652955] [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700 delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads to a ABBA deadlock. To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but things are simpler here since we only needs read's spinlock to blocking lock. With this, btrfs/011 no more produces warnings in dmesg. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Arnd Bergmann 提交于
With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides to partially inline the get_state_failrec() function but cannot figure out that means the failrec pointer is always valid if the function returns success, which causes a harmless warning: fs/btrfs/extent_io.c: In function 'clean_io_failure': fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized] This marks get_state_failrec() and set_state_failrec() both as 'noinline', which avoids the warning in all cases for me, and seems less ugly than adding a fake initialization. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Fixes: 47dc196a ("btrfs: use proper type for failrec in extent_state") Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 18 2月, 2016 24 次提交
-
-
由 Kinglong Mee 提交于
Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: NKinglong Mee <kinglongmee@gmail.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Sudip Mukherjee 提交于
We were getting build warning about: fs/btrfs/extent-tree.c:7021:34: warning: ‘used_bg’ may be used uninitialized in this function It is not a valid warning as used_bg is never used uninitilized since locked is initially false so we can never be in the section where 'used_bg' is used. But gcc is not able to understand that and we can initialize it while declaring to silence the warning. Signed-off-by: NSudip Mukherjee <sudip@vectorindia.org> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
We use the private member of extent_state to store the failrec and play pointless pointer games. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Deepa Dinamani 提交于
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_fs_time() instead. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Dave Jones 提交于
The kernel provides a swap() that does the same thing as this code. Signed-off-by: NDave Jones <dsj@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Byongho Lee 提交于
While running btrfs_mksubvol(), d_really_is_positive() is called twice. First in btrfs_mksubvol() and second inside btrfs_may_create(). So I remove the first one. Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Byongho Lee 提交于
Simplify expression in btrfs_calc_trans_metadata_size(). Signed-off-by: NByongho Lee <bhlee.kernel@gmail.com> Reviewed-by: NStefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We will sometimes start background flushing the various enospc related things (delayed nodes, delalloc, etc) if we are getting close to reserving all of our available space. We don't want to do this however when we are actually using this space as it causes unneeded thrashing. We currently try to do this by checking bytes_used >= thresh, but bytes_used is only part of the equation, we need to use bytes_reserved as well as this represents space that is very likely to become bytes_used in the future. My tracing tool will keep count of the number of times we kick off the async flusher, the following are counts for the entire run of generic/027 No Patch Patch avg: 5385 5009 median: 5500 4916 We skewed lower than the average with my patch and higher than the average with the patch, overall it cuts the flushing from anywhere from 5-10%, which in the case of actual ENOSPC is quite helpful. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
There are a few places where we add to trans->bytes_reserved but don't have the corresponding trace point. With these added my tool no longer sees transaction leaks. Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
truncate_space_check is using btrfs_csum_bytes_to_leaves() but forgetting to multiply by nodesize so we get an actual byte count. We need a tracepoint here so that we have the matching reserve for the release that will come later. Also add a comment to make clear what the intent of truncate_space_check is. Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
I'm writing a tool to visualize the enospc system in order to help debug enospc bugs and I found weird data and ran it down to when we update the global block rsv. We add all of the remaining free space to the block rsv, do a trace event, then remove the extra and do another trace event. This makes my visualization look silly and is unintuitive code as well. Fix this stuff to only add the amount we are missing, or free the amount we are missing. This is less clean to read but more explicit in what it is doing, as well as only emitting events for values that make sense. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
For a non-existent device, old code bypasses adding it in dev's reada queue. And to solve problem of unfinished waitting in raid5/6, commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6 degraded mounting") adding an exception for the first stripe, in short, the first stripe will always be processed whether the device exists or not. Actually we have a better way for the above request: just bypass creation of the reada_extent for non-existent device, it will make code simple and effective. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Reada background works is not designed to finish all jobs completely, it will break in following case: 1: When a device reaches workload limit (MAX_IN_FLIGHT) 2: Total reads reach max limit (10000) 3: All devices don't have queued more jobs, often happened in DUP case And if all background works exit with remaining jobs, btrfs_reada_wait() will wait indefinetelly. Above problem is rarely happened in old code, because: 1: Every work queues 2x new works So many works reduced chances of undone jobs. 2: One work will continue 10000 times loop in case of no-jobs It reduced no-thread window time. But after we fixed above case, the "undone reada extents" frequently happened. Fix: Check to ensure we have at least one thread if there are undone jobs in btrfs_reada_wait(). Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Reada creates 2 works for each level of tree recursively. In case of a tree having many levels, the number of created works is 2^level_of_tree. Actually we don't need so many works in parallel, this patch limits max works to BTRFS_MAX_MIRRORS * 2. The per-fs works_counter will be also used for btrfs_reada_wait() to check is there are background workers. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
No need to decrease dev->reada_in_flight in __readahead_hook()'s internal and reada_extent_put(). reada_extent_put() have no chance to decrease dev->reada_in_flight in free operation, because reada_extent have additional refcnt when scheduled to a dev. We can put inc and dec operation for dev->reada_in_flight to one place instead to make logic simple and safe, and move useless reada_extent->scheduled_for to a bool flag instead. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Remove one copy of loop to fix the typo of iterate zones. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Current code set nritems to 0 to make for_loop useless to bypass it, and set generation's value which is not necessary. Jump into cleanup directly is better choise. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
What __readahead_hook() need exactly is fs_info, no need to convert fs_info to root in caller and convert back in __readahead_hook() Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
reada_start_machine_dev() already have reada_extent pointer, pass it into __readahead_hook() directly instead of search radix_tree will make code run faster. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
We can't release reada_extent earlier than __readahead_hook(), because __readahead_hook() still need to use it, it is necessary to hode a refcnt to avoid it be freed. Actually it is not a problem after my patch named: Avoid many times of empty loop It make reada_extent in above line include at least one reada_extctl, which keeps additional one refcnt for reada_extent. But we still need this patch to make the code in pretty logic. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
level is not used in severial functions, remove them from arguments, and remove relative code for get its value. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
When failed adding all dev_zones for a reada_extent, the extent will have no chance to be selected to run, and keep in memory for ever. We should bypass this extent to avoid above case. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
If some device is not reachable, we should bypass and continus addingb next, instead of break on bad device. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
Move is_need_to_readahead contition earlier to avoid useless loop to get relative data for readahead. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 2月, 2016 4 次提交
-
-
由 Zhao Lei 提交于
We can see following loop(10000 times) in trace_log: [ 75.416137] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.417413] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.418611] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.419793] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.421016] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.422324] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 75.423661] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 75.424882] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 ...(10000 times) [ 124.101672] ZL_DEBUG: reada_start_machine_dev:730: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.102850] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 [ 124.104008] ZL_DEBUG: __readahead_hook:129: pid=771 comm=kworker/u2:3 re->ref_cnt ffff88003741e0c0 1 -> 2 [ 124.105121] ZL_DEBUG: reada_extent_put:524: pid=771 comm=kworker/u2:3 re = ffff88003741e0c0, refcnt = 2 -> 1 Reason: If more than one user trigger reada in same extent, the first task finished setting of reada data struct and call reada_start_machine() to start, and the second task only add a ref_count but have not add reada_extctl struct completely, the reada_extent can not finished all jobs, and will be selected in __reada_start_machine() for 10000 times(total times in __reada_start_machine()). Fix: For a reada_extent without job, we don't need to run it, just return 0 to let caller break. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
In rechecking zone-in-tree, we still need to check zone include our logical address. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
We can avoid additional locking-acquirment and one pair of kref_get/put by combine two condition. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Zhao Lei 提交于
reada_zone->end is end pos of segment: end = start + cache->key.offset - 1; So we need to use "<=" in condition to judge is a pos in the segment. The problem happened rearly, because logical pos rarely pointed to last 4k of a blockgroup, but we need to fix it to make code right in logic. Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 12 2月, 2016 3 次提交
-
-
由 Qu Wenruo 提交于
Introduce new mount option alias "norecovery" for nologreplay, to keep "norecovery" behavior the same with other filesystems. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Introduce a new mount option "nologreplay" to co-operate with "ro" mount option to get real readonly mount, like "norecovery" in ext* and xfs. Since the new parse_options() need to check new flags at remount time, so add a new parameter for parse_options(). Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: NAustin S. Hemmelgarn <ahferroin7@gmail.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Current "recovery" mount option will only try to use backup root. However the word "recovery" is too generic and may be confusing for some users. Here introduce a new and more specific mount option, "usebackuproot" to replace "recovery" mount option. "Recovery" will be kept for compatibility reason, but will be deprecated. Also, since "usebackuproot" will only affect mount behavior and after open_ctree() it has nothing to do with the filesystem, so clear the flag after mount succeeded. This provides the basis for later unified "norecovery" mount option. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> [ dropped usebackuproot from show_mount, added note about 'recovery' to docs ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 2月, 2016 7 次提交
-
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The number of distinct key types is not that big that we could waste one for something new we want to store in the tree. Similar to the temporary items, we'll introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_DEV_STATS_KEY (248). The device stats are an example of a permanent item. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
No visible change. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The number of distinct key types is not that big that we could waste one for something new we want to store in the tree. We'll introduce a new name for an existing key value and use the objectid for further extension. The victim is the BTRFS_BALANCE_ITEM_KEY (248). The nature of the balance status item is a good example of the temporary item. It exists from beginning of the balance, keeps the status until it finishes. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The value of ctx->pos in the last readdir call is supposed to be set to INT_MAX due to 32bit compatibility, unless 'pos' is intentially set to a larger value, then it's LLONG_MAX. There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++" overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a 64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before the increment. We can get to that situation like that: * emit all regular readdir entries * still in the same call to readdir, bump the last pos to INT_MAX * next call to readdir will not emit any entries, but will reach the bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX Normally this is not a problem, but if we call readdir again, we'll find 'pos' set to LLONG_MAX and the unconditional increment will overflow. The report from Victor at (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging print shows that pattern: Overflow: e Overflow: 7fffffff Overflow: 7fffffffffffffff PAX: size overflow detected in function btrfs_real_readdir fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0; context: dir_context; CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1 Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015 ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48 ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78 ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8 Call Trace: [<ffffffff81742f0f>] dump_stack+0x4c/0x7f [<ffffffff811cb706>] report_size_overflow+0x36/0x40 [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0 [<ffffffff811dafc8>] iterate_dir+0xa8/0x150 [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70 [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0 Overflow: 1a [<ffffffff811db070>] ? iterate_dir+0x150/0x150 [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83 The jump from 7fffffff to 7fffffffffffffff happens when new dir entries are not yet synced and are processed from the delayed list. Then the code could go to the bump section again even though it might not emit any new dir entries from the delayed list. The fix avoids entering the "bump" section again once we've finished emitting the entries, both for synced and delayed entries. References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284Reported-by: NVictor <services@swwu.com> CC: stable@vger.kernel.org Signed-off-by: NDavid Sterba <dsterba@suse.com> Tested-by: NHolger Hoffstätte <holger.hoffstaette@googlemail.com> Signed-off-by: NChris Mason <clm@fb.com>
-