- 04 10月, 2014 2 次提交
-
-
由 David Sterba 提交于
Populate btrfs_check_super_valid() with checks that try to verify consistency of superblock by additional conditions that may arise from corrupted devices or bitflips. Some of tests are only hints and issue warnings instead of failing the mount, basically when the checks are derived from the data found in the superblock. Tested on a broken image provided by Qu. Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
While we have a transaction ongoing, the VM might decide at any time to call btree_inode->i_mapping->a_ops->writepages(), which will start writeback of dirty pages belonging to btree nodes/leafs. This call might return an error or the writeback might finish with an error before we attempt to commit the running transaction. If this happens, we might have no way of knowing that such error happened when we are committing the transaction - because the pages might no longer be marked dirty nor tagged for writeback (if a subsequent modification to the extent buffer didn't happen before the transaction commit) which makes filemap_fdata[write|wait]_range unable to find such pages (even if they're marked with SetPageError). So if this happens we must abort the transaction, otherwise we commit a super block with btree roots that point to btree nodes/leafs whose content on disk is invalid - either garbage or the content of some node/leaf from a past generation that got cowed or deleted and is no longer valid (for this later case we end up getting error messages like "parent transid verify failed on 10826481664 wanted 25748 found 29562" when reading btree nodes/leafs from disk). Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's i_mapping would not be enough because we need to distinguish between log tree extents (not fatal) vs non-log tree extents (fatal) and because the next call to filemap_fdatawait_range() will catch and clear such errors in the mapping - and that call might be from a log sync and not from a transaction commit, which means we would not know about the error at transaction commit time. Also, checking for the eb flag EXTENT_BUFFER_IOERR at transaction commit time isn't done and would not be completely reliable, as the eb might be removed from memory and read back when trying to get it, which clears that flag right before reading the eb's pages from disk, making us not know about the previous write error. Using the new 3 flags for the btree inode also makes us achieve the goal of AS_EIO/AS_ENOSPC when writepages() returns success, started writeback for all dirty pages and before filemap_fdatawait_range() is called, the writeback for all dirty pages had already finished with errors - because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success, as it could not know that writeback errors happened (the pages were no longer tagged for writeback). Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 02 10月, 2014 10 次提交
-
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
The structure is frequently reused. Rename it according to the slab name. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
The enum exists but is not consistently used. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
8MiB is way too large and likely set by mistake. This is not a significant issue as in practice the max amount of data added to an inline extent is also limited by the page cache and btree leaf sizes. Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Rename to btrfs_alloc_tree_block as it fits to the alloc/find/free + _tree_block family. The parameter blocksize was set to the metadata block size, directly or indirectly. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
We know the tree block size, no need to pass it around. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Errors in readahead are not fatal and ignored elsewhere in the code. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
The parent_transid parameter has been unused since its introduction in ca7a79ad ("Pass down the expected generation number when reading tree blocks"). In reada_tree_block, it was even wrongly set to leafsize. Transid check is done in the proper read and readahead ignores errors. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
- 23 9月, 2014 1 次提交
-
-
由 Josef Bacik 提交于
One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything. This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting or on mount notice that a block group has a used count of 0 then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 18 9月, 2014 9 次提交
-
-
由 Miao Xie 提交于
This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
device->bytes_used will be changed when allocating a new chunk, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Though it is not big problem because we don't use it now, but anyway it is better that we make it be consistent with the common metadata, maybe we will use it in the future. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
total_size will be changed when resizing a device, and disk_total_size will be changed if resizing is successful. Meanwhile, the on-disk super blocks of the previous transaction might not be updated. Considering the consistency of the metadata in the previous transaction, We should use the size in the previous transaction to check if the super block is beyond the boundary of the device. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Li RongQing 提交于
This comments became wrong after c3c532[bdi: add helper function for doing init and register of a bdi for a file system], so remove them. Signed-off-by: NLi RongQing <roy.qing.li@gmail.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Andrey Utkin 提交于
The issue was introduced in a79b7d4b, adding allocation of extent_workers, so this stray check is surely not meant to be a check of something else. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021Reported-by: NMaks Naumov <maksqwe1@ukr.net> Signed-off-by: NAndrey Utkin <andrey.krieger.utkin@gmail.com> Reviewed-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Wang Shilong 提交于
Marc argued that if there are several btrfs filesystems mounted, while users even don't know which filesystem hit the corrupted errors something like generation verification failure. Since @extent_buffer structure has a member @fs_info, let's output btrfs device info. Reported-by: NMarc MERLIN <marc@merlins.org> Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
There's no user of the return value and we can get rid of the comment in put_super. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The naming is confusing, generic yet used for a specific cache. Add a prefix 'ino_' or rename appropriately. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
- 24 8月, 2014 1 次提交
-
-
由 Liu Bo 提交于
This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 19 8月, 2014 1 次提交
-
-
由 Miao Xie 提交于
total_bytes of device is just a in-memory variant which is used to record the size of the device, and it might be changed before we resize a device, if the resize operation fails, it will be fallbacked. But some code used it to update on-disk metadata of the device, it would cause the problem that on-disk metadata of the devices was not consistent. We should use the other variant named disk_total_bytes to update the on-disk metadata of device, because that variant is updated only when the resize operation is successful. Fix it. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 15 8月, 2014 1 次提交
-
-
由 Chris Mason 提交于
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: NChris Mason <clm@fb.com>
-
- 03 7月, 2014 1 次提交
-
-
由 Wang Shilong 提交于
Balance recovery is called when RW mounting or remounting from RO to RW, it is called to finish roots merging. When doing balance recovery, relocation root's corresponding fs root(whose root refs is 0) might be destroyed by cleaner thread, this will make btrfs fail to mount. Fix this problem by holding @cleaner_mutex when doing balance recovery. Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 29 6月, 2014 1 次提交
-
-
由 Josef Bacik 提交于
This is a regression from my patch a26e8c9f, we need to only unlock the block if we were the one who locked it. Otherwise this will trip BUG_ON()'s in locking.c Thanks, cc: stable@vger.kernel.org Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 10 6月, 2014 9 次提交
-
-
由 Chris Mason 提交于
Delayed extent operations are triggered during transaction commits. The goal is to queue up a healthly batch of changes to the extent allocation tree and run through them in bulk. This farms them off to async helper threads. The goal is to have the bulk of the delayed operations being done in the background, but this is also important to limit our stack footprint. Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself, hopefully this will be usefull in the future. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
To ease finding bugs during development related to modifying btree leaves in such a way that it makes its items not sorted by key anymore. Since this is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY is set, which isn't meant to be enabled for regular users. Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Wang Shilong 提交于
In close_ctree(), after we have stopped all workers,there maybe still some read requests(for example readahead) to submit and this *maybe* trigger an oops that user reported before: kernel BUG at fs/btrfs/async-thread.c:619! By hacking codes, i can reproduce this problem with one cpu available. We fix this potential problem by invalidating all btree inode pages before stopping all workers. Thanks to Miao for pointing out this problem. Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Tsutomu Itoh 提交于
In btrfs_create_tree(), if btrfs_insert_root() fails, we should free root->commit_root. Reported-by: NAlex Lyakas <alex@zadarastorage.com> Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Qu Wenruo 提交于
Current btrfs_orphan_cleanup will also cleanup roots which is already in fs_info->dead_roots without protection. This will have conditional race with fs_info->cleaner_kthread. This patch will use refs in root->root_item to detect roots in dead_roots and avoid conflicts. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
Before applying this patch, the task had to reclaim the metadata space by itself if the metadata space was not enough. And When the task started the space reclamation, all the other tasks which wanted to reserve the metadata space were blocked. At some cases, they would be blocked for a long time, it made the performance fluctuate wildly. So we introduce the background metadata space reclamation, when the space is about to be exhausted, we insert a reclaim work into the workqueue, the worker of the workqueue helps us to reclaim the reserved space at the background. By this way, the tasks needn't reclaim the space by themselves at most cases, and even if the tasks have to reclaim the space or are blocked for the space reclamation, they will get enough space more quickly. Here is my test result(Tested by compilebench): Memory: 2GB CPU: 2Cores * 1CPU Partition: 40GB(SSD) Test command: # compilebench -D <mnt> -m Without this patch: intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s) compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s) read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s) delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s) With this patch: intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s) compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s) read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s) delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s) Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 25 4月, 2014 1 次提交
-
-
由 Wang Shilong 提交于
Fix possible memory leaks in the following error handling paths: read_tree_block() btrfs_recover_log_trees btrfs_commit_super() btrfs_find_orphan_roots() btrfs_cleanup_fs_roots() Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 07 4月, 2014 2 次提交
-
-
由 Josef Bacik 提交于
Lets try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, Reportedy-by: NHugo Mills <hugo@carfax.org.uk> Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Josef Bacik 提交于
So I have an awful exercise script that will run snapshot, balance and send/receive in parallel. This sometimes would crash spectacularly and when it came back up the fs would be completely hosed. Turns out this is because of a bad interaction of balance and send/receive. Send will hold onto its entire path for the whole send, but its blocks could get relocated out from underneath it, and because it doesn't old tree locks theres nothing to keep this from happening. So it will go to read in a slot with an old transid, and we could have re-allocated this block for something else and it could have a completely different transid. But because we think it is invalid we clear uptodate and re-read in the block. If we do this before we actually write out the new block we could write back stale data to the fs, and boom we're screwed. Now we definitely need to fix this disconnect between send and balance, but we really really need to not allow ourselves to accidently read in stale data over new data. So make sure we check if the extent buffer is not under io before clearing uptodate, this will kick back EIO to the caller instead of reading in stale data and keep us from corrupting the fs. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 11 3月, 2014 1 次提交
-
-
由 Miao Xie 提交于
We didn't have a lock to protect the access to the delalloc inodes list, that is we might access a empty delalloc inodes list if someone start flushing delalloc inodes because the delalloc inodes were moved into a other list temporarily. Fix it by wrapping the access with a lock. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fb.com>
-