- 03 12月, 2014 1 次提交
-
-
由 Filipe Manana 提交于
Our fs trim operation, which is completely transactionless (doesn't start or joins an existing transaction) consists of visiting all block groups and then for each one to iterate its free space entries and perform a discard operation against the space range represented by the free space entries. However before performing a discard, the corresponding free space entry is removed from the free space rbtree, and when the discard completes it is added back to the free space rbtree. If a block group remove operation happens while the discard is ongoing (or before it starts and after a free space entry is hidden), we end up not waiting for the discard to complete, remove the extent map that maps logical address to physical addresses and the corresponding chunk metadata from the the chunk and device trees. After that and before the discard completes, the current running transaction can finish and a new one start, allowing for new block groups that map to the same physical addresses to be allocated and written to. So fix this by keeping the extent map in memory until the discard completes so that the same physical addresses aren't reused before it completes. If the physical locations that are under a discard operation end up being used for a new metadata block group for example, and dirty metadata extents are written before the discard finishes (the VM might call writepages() of our btree inode's i_mapping for example, or an fsync log commit happens) we end up overwriting metadata with zeroes, which leads to errors from fsck like the following: checking extents Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block owner ref check failed [833912832 16384] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 Check tree block failed, want=833912832, have=0 read block failed check_tree_block root 5 root dir 256 error root 5 inode 260 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref root 5 inode 262 errors 2001, no inode item, link count wrong unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref root 5 inode 263 errors 2001, no inode item, link count wrong (...) Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 25 11月, 2014 1 次提交
-
-
由 Qu Wenruo 提交于
The following lockdep warning is triggered during xfstests: [ 1702.980872] ========================================================= [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ] [ 1702.981482] 3.18.0-rc1 #27 Not tainted [ 1702.981781] --------------------------------------------------------- [ 1702.982095] kswapd0/77 just changed the state of lock: [ 1702.982415] (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs] [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 1702.983160] (&fs_info->dev_replace.lock){+.+.+.} and interrupts could create inverse lock ordering between them. [ 1702.984675] other info that might help us debug this: [ 1702.985524] Chain exists of: &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock [ 1702.986799] Possible interrupt unsafe locking scenario: [ 1702.987681] CPU0 CPU1 [ 1702.988137] ---- ---- [ 1702.988598] lock(&fs_info->dev_replace.lock); [ 1702.989069] local_irq_disable(); [ 1702.989534] lock(&delayed_node->mutex); [ 1702.990038] lock(&found->groups_sem); [ 1702.990494] <Interrupt> [ 1702.990938] lock(&delayed_node->mutex); [ 1702.991407] *** DEADLOCK *** It is because the btrfs_kobj_{add/rm}_device() will call memory allocation with GFP_KERNEL, which may flush fs page cache to free space, waiting for it self to do the commit, causing the deadlock. To solve the problem, move btrfs_kobj_{add/rm}_device() out of the dev_replace lock range, also involing split the btrfs_rm_dev_replace_srcdev() function into remove and free parts. Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are called out of the lock range. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 04 10月, 2014 1 次提交
-
-
由 Fabian Frederick 提交于
bi_sector and bi_size moved to bi_iter since commit 4f024f37 ("block: Abstract out bvec iterator") Signed-off-by: NFabian Frederick <fabf@skynet.be> Signed-off-by: NChris Mason <clm@fb.com>
-
- 23 9月, 2014 1 次提交
-
-
由 Josef Bacik 提交于
One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 18 9月, 2014 30 次提交
-
-
由 Chris Mason 提交于
This reverts commit b96de000. This commit is triggering failures to mount by subvolume id in some configurations. The main problem is how many different ways this scanning function is used, both for scanning while mounted and unmounted. A proper cleanup is too big for late rcs. For now, just revert the commit and we'll put a better fix into a later merge window. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We need real mirror number for RAID0/5/6 when reading data, or if read error happens, we would pass 0 as the number of the mirror on which the io error happens. It is wrong and would cause the filesystem read the data from the corrupted mirror again. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
rw_devices counter is often used to tune the profile when doing chunk allocation, so we should modify it under the chunk_mutex context to avoid getting wrong chunk profile. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
For a missing device, we don't know it belong to which fs before we read its fsid from the chunk tree. So we add them into the current fs device list at first. When we get its fsid, we should move them to their own fs device list. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
When we open a seed filesystem, if the degraded mount option is set, we continue to mount the fs if we don't find some devices in the seed filesystem. But we should stop mounting if other errors happen. Fix it Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The problem is: Task0(device scan task) Task1(device replace task) scan_one_device() mutex_lock(&uuid_mutex) device = find_device() mutex_lock(&device_list_mutex) lock_chunk() rm_and_free_source_device unlock_chunk() mutex_unlock(&device_list_mutex) check device Destroying the target device if device replace fails also has the same problem. We fix this problem by locking uuid_mutex during destroying source device or target device, just like the device remove operation. It is a temporary solution, we can fix this problem and make the code more clear by atomic counter in the future. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We can build a new filesystem based a seed filesystem, and we need clone the fs devices when we open the new filesystem. But someone might clear the seed flag of the seed filesystem, then mount that filesystem and remove some device. If we mount the new filesystem, we might access a device list which was being changed when we clone the fs devices. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
There were several problems about chunk mutex usage: - Lock chunk mutex when updating metadata. It would cause the nested deadlock because updating metadata might need allocate new chunks that need acquire chunk mutex. We remove chunk mutex at this case, because b-tree lock and other lock mechanism can help us. - ABBA deadlock occured between device_list_mutex and chunk_mutex. When we update device status, we must acquire device_list_mutex at the beginning, and then we might get chunk_mutex during the device status update because we need allocate new chunks for metadata COW. But at most place, we acquire chunk_mutex at first and then acquire device list mutex. We need change the lock order. - Some place we needn't acquire chunk_mutex. For example we needn't get chunk_mutex when we free a empty seed fs_devices structure. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We didn't protect the system chunk array when we added a new system chunk into it, it would cause the array be corrupted if someone remove/add some system chunk into array at the same time. Fix it by chunk lock. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
->total_bytes,->disk_total_bytes,->bytes_used is protected by chunk lock when we change them, but sometimes we read them without any lock, and we might get unexpected value. We fix this problem like inode's i_size. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We should update free_chunk_space in time when we allocate a new chunk, not when we deal with the pending device update and block group insertion, because we need the real free_chunk_space data to calculate the reserved space, if we don't update it in time, we would consider the disk space which has be allocated as free space, and would use it to do overcommit reservation. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We should update device->bytes_used in the lock context of chunk_mutex, or we would get wrong data. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
During removing a device, we have modified free_chunk_space when we shrink the device, so we needn't assign a new value to it after the device shrink. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
device->bytes_used will be changed when allocating a new chunk, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Though it is not big problem because we don't use it now, but anyway it is better that we make it be consistent with the common metadata, maybe we will use it in the future. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
total_size will be changed when resizing a device, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We didn't protect the assignment of the target device, it might cause the problem that the super block update was skipped because we might find wrong size of the target device during the assignment. Fix it by moving the assignment sentences into the initialization function of the target device. And there is another merit that we can check if the target device is suitable more early. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The member variants - num_can_discard - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
we are assigning number_devices to the total_bytes, that's very confusing for a moment Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
seed fs devices don't participate as rw_device, so don't increment rw_devices when the device being handled belongs to a seed fs. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
When we replace all the seed device in the system there is no point in just keeping the btrfs_fs_devices with out any device Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
We are not updating sprout fs seed pointer when all seed device is replaced. This patch will check if all seed device has been replaced and then update the sprout pointer accordingly. Same reproducer as in the previous patch would apply here. And notice that btrfs_close_device will check if seed fs is present and spits out the error with out this patch. int btrfs_close_devices(struct btrfs_fs_devices *fs_devices) { :: seed_devices = fs_devices->seed; :: while (seed_devices) { fs_devices = seed_devices; seed_devices = fs_devices->seed; __btrfs_close_devices(fs_devices); free_fs_devices(fs_devices); } Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
reproducer: mount /dev/sdb /btrfs btrfs dev add /dev/sdc /btrfs btrfs rep start -B /dev/sdb /dev/sdd /btrfs umount /btrfs WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]() :: __btrfs_close_devices() :: WARN_ON(fs_devices->open_devices); After the seed device has been replaced the new target device is no more a seed device. So we need to update the device numbers in the fs_devices as pointed by the fs_info. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
There is no logical change in this patch, just a preparatory patch, so that changes can be easily reasoned. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
None of the uses of btrfs_search_forward() need to have the path nodes (level >= 1) read locked, only the leaf needs to be locked while the caller processes it. Therefore make it return a path with all nodes unlocked, except for the leaf. This change is motivated by the observation that during a file fsync we repeatdly call btrfs_search_forward() and process the returned leaf while upper nodes of the returned path (level >= 1) are read locked, which unnecessarily blocks other tasks that want to write to the same fs/subvol btree. Therefore instead of modifying the fsync code to unlock all nodes with level >= 1 immediately after calling btrfs_search_forward(), change btrfs_search_forward() to do it, so that it benefits all callers. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The member variants - latest_devid and latest_trans - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The io error might happen during writing out the device stats, and the device stats information and dirty flag would be update at that time, but the current code didn't consider this case, just clear the dirty flag, it would cause that we forgot to write out the new device stats information. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 HIMANGI SARAOGI 提交于
Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); // </smpl> Signed-off-by: NHimangi Saraogi <himangi774@gmail.com> Acked-by: NJulia Lawall <julia.lawall@lip6.fr> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
If we mounted a seed filesystem with degraded option, and then added a new device into the seed filesystem, then we found adding device failed because of the IO failure. Steps to reproduce: # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1> # btrfstune -S 1 <dev0> # mount <dev0> -o degraded <mnt> # btrfs device add -f <dev2> <mnt> It is because the original didn't set the chunk on the seed device to be read-only if the degraded flag was set. It was introduced by patch f48b9075, which fixed the problem the raid1 filesystem became read-only after one device of it was missing. But this fix method was not right, we should set the read-only flag according to the number of the missing devices, not the degraded mount option, if the number of the missing devices is less than the max error number that the profile of the chunk tolerates, we don't set it to be read-only. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
- 24 8月, 2014 1 次提交
-
-
由 Liu Bo 提交于
This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 19 8月, 2014 5 次提交
-
-
由 Miao Xie 提交于
total_bytes of device is just a in-memory variant which is used to record the size of the device, and it might be changed before we resize a device, if the resize operation fails, it will be fallbacked. But some code used it to update on-disk metadata of the device, it would cause the problem that on-disk metadata of the devices was not consistent. We should use the other variant named disk_total_bytes to update the on-disk metadata of device, because that variant is updated only when the resize operation is successful. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The seed filesystem was destroyed by the device replace, the reproduce method is: # mkfs.btrfs -f <dev0> # btrfstune -S 1 <dev0> # mount <dev0> <mnt> # btrfs device add <dev1> <mnt> # umount <mnt> # mount <dev1> <mnt> # btrfs replace start -f <dev0> <dev2> <mnt> # umount <mnt> # mount <dev0> <mnt> It is because we erase the super block on the seed device. It is wrong, we should not change anything on the seed device. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The missing devices are accounted by its own fs device, for example the missing devices in seed filesystem will be accounted by the fs device of the seed filesystem, not by the new filesystem which is based on the seed filesystem, so when we remove the missing device in the seed filesystem, we should decrease the counter of its own fs device. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
We forgot to zero some members in fs_devices when we create new fs_devices from the one of the seed fs. It would cause the problem that we got wrong chunk profile when allocating chunks. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Anand Jain 提交于
When FS in unmounted we need to check generation number as well since devid+uuid combination could match with the missing replaced disk when it reappears, and without this patch it might pair with the replaced disk again. device_list_add() function is called in the following threads, mount device option mount argument ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan) ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>) they have been unit tested to work fine with this patch. If the user knows what he is doing and really want to pair with replaced disk (which is not a standard operation), then he should first clear the kernel btrfs device list in the memory by doing the module unload/load and followed with the mount -o device option. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-