- 06 1月, 2009 2 次提交
-
-
由 Chris Mason 提交于
There were many, most are fixed now. struct-funcs.c generates some warnings but these are bogus. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Liu Hui 提交于
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD. We can improve FTL's performance by telling it which sectors are freed by file system. But if we don't tell FTL the information of free sectors in proper time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not roll back the previous transaction under the power loss condition. There are some problems in the old implementation: 1, In __free_extent(), the pinned down extents should not be discarded. 2, In free_extents(), the free extents are all pinned, so they need to be discarded in transaction committing time instead of free_extents(). 3, The reserved extent used by log tree should be discard too. This patch change discard behavior as follows: 1, For the extents which need to be free at once, we discard them in update_block_group(). 2, Delay discarding the pinned extent in btrfs_finish_extent_commit() when committing transaction. 3, Remove discarding from free_extents() and __free_extent() 4, Add discard interface into btrfs_free_reserved_extent() 5, Discard sectors before updating the free space cache, otherwise, FTL will destroy file system data.
-
- 19 12月, 2008 2 次提交
-
-
由 Yan Zheng 提交于
There is a race in relocate_inode_pages, it happens when find_delalloc_range finds the delalloc extent before the boundary bit is set. Thank you, Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This adds the missing block accounting code to finish_current_insert and makes block accounting for root item properly protected by the delalloc spin lock. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 17 12月, 2008 1 次提交
-
-
由 Chris Mason 提交于
Btrfs maintains a cache of blocks available for allocation in ram. The code that frees extents was marking the extents free and then deleting the checksum items. This meant it was possible the extent would be reallocated before the checksum item was actually deleted, leading to races and other problems as the checksums were updated for the newly allocated extent. The fix is to delete the checksum before marking the extent free. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 16 12月, 2008 1 次提交
-
-
由 Chris Mason 提交于
The delalloc lock doesn't need to have irqs disabled, nobody that changes the number of delalloc bytes in the FS is running with irqs off. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 12 12月, 2008 3 次提交
-
-
由 Yan Zheng 提交于
Checksums on data can be disabled by mount option, so it's possible some data extents don't have checksums or have invalid checksums. This causes trouble for data relocation. This patch contains following things to make data relocation work. 1) make nodatasum/nodatacow mount option only affects new files. Checksums and COW on data are only controlled by the inode flags. 2) check the existence of checksum in the nodatacow checker. If checksums exist, force COW the data extent. This ensure that checksum for a given block is either valid or does not exist. 3) update data relocation code to properly handle the case of checksum missing. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch makes seed device possible to be shared by multiple mounted file systems. The sharing is achieved by cloning seed device's btrfs_fs_devices structure. Thanks you, Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
The block group structs are referenced in many different places, and it's not safe to free while balancing. So, those block group structs were simply leaked instead. This patch replaces the block group pointer in the inode with the starting byte offset of the block group and adds reference counting to the block group struct. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 11 12月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
This updates the space balancing code for the new checksum format. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 10 12月, 2008 1 次提交
-
-
由 Chris Mason 提交于
This finishes off the new checksumming code by removing csum items for extents that are no longer in use. The trick is doing it without racing because a single csum item may hold csums for more than one extent. Extra checks are added to btrfs_csum_file_blocks to make sure that we are using the correct csum item after dropping locks. A new btrfs_split_item is added to split a single csum item so it can be split without dropping the leaf lock. This is used to remove csum bytes from the middle of an item. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 09 12月, 2008 1 次提交
-
-
由 Yan Zheng 提交于
This patch implements superblock duplication. Superblocks are stored at offset 16K, 64M and 256G on every devices. Spaces used by superblocks are preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 02 12月, 2008 1 次提交
-
-
由 Christoph Hellwig 提交于
Shut up various sparse warnings about symbols that should be either static or have their declarations in scope. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 21 11月, 2008 1 次提交
-
-
由 Josef Bacik 提交于
This the lockdep complaint by having a different mutex to gaurd caching the block group, so you don't end up with this backwards dependancy. Thank you, Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
- 20 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
The btrfs git kernel trees is used to build a standalone tree for compiling against older kernels. This commit makes the standalone tree work with 2.6.27 Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
* open/close_bdev_excl -> open/close_bdev_exclusive * blkdev_issue_discard takes a GFP mask now * Fix blkdev_issue_discard usage now that it is enabled Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
This patch fixes what I hope is the last early ENOSPC bug left. I did not know that pinned extents would merge into one big extent when inserted on to the pinned extent tree, so I was adding free space to a block group that could possibly span multiple block groups. This is a big issue because first that space doesn't exist in that block group, and second we won't actually use that space because there are a bunch of other checks to make sure we're allocating within the constraints of the block group. This patch fixes the problem by adding the btrfs_add_free_space to btrfs_update_pinned_extents which makes sure we are adding the appropriate amount of free space to the appropriate block group. Thanks much to Lee Trager for running my myriad of debug patches to help me track this problem down. Thank you, Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
- 19 11月, 2008 1 次提交
-
-
由 Liu Hui 提交于
In insert_extents(), when ret==1 and last is not zero, it should check if the current inserted item is the last item in this batching inserts. If so, it should just break from loop. If not, 'cur = insert_list->next' will make no sense because the list is empty now, and 'op' will point to an unexpectable place. There are also some trivial fixs in this patch including one comment typo error and deleting two redundant lines. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 11月, 2008 3 次提交
-
-
由 Josef Bacik 提交于
Some people are still reporting problems with early enospc. This will help narrown down the cause. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
In my batch delete/update/insert patch I introduced a free space leak. The extent that we do the original search on in free_extents is never pinned, so we always update the block saying that it has free space, but the free space never actually gets added to the free space tree, since op->del will always be 0 and it's never actually added to the pinned extents tree. This patch fixes this problem by making sure we call pin_down_bytes on the pending extent op and set op->del to the return value of pin_down_bytes so update_block_group is called with the right value. This seems to fix the case where we were getting ENOSPC when there was plenty of space available. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Yan Zheng 提交于
Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 13 11月, 2008 3 次提交
-
-
由 Yan Zheng 提交于
This patch adds mount ro and remount support. The main changes in patch are: adding btrfs_remount and related helper function; splitting the transaction related code out of close_ctree into btrfs_commit_super; updating allocator to properly handle read only block group. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Josef Bacik 提交于
While profiling the allocator I noticed a good amount of time was being spent in finish_current_insert and del_pending_extents, and as the filesystem filled up more and more time was being spent in those functions. This patch aims to try and reduce that problem. This happens two ways 1) track if we tried to delete an extent that we are going to update or insert. Once we get into finish_current_insert we discard any of the extents that were marked for deletion. This saves us from doing unnecessary work almost every time finish_current_insert runs. 2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for each individual extent and doing the needed operation, we instead keep the leaf around and see if there is anything else we can do on that leaf. On the insert case I introduced a btrfs_insert_some_items, which will take an array of keys with an array of data_sizes and try and squeeze in as many of those keys as possible, and then return how many keys it was able to insert. In the update case we search for an extent ref, update the ref and then loop through the leaf to see if any of the other refs we are looking to update are on that leaf, and then once we are done we release the path and search for the next ref we need to update. And finally for the deletion we try and delete the extent+ref in pairs, so we will try to find extent+ref pairs next to the extent we are trying to free and free them in bulk if possible. This along with the other cluster fix that Chris pushed out a bit ago helps make the allocator preform more uniformly as it fills up the disk. There is still a slight drop as we fill up the disk since we start having to stick new blocks in odd places which results in more COW's than on a empty fs, but the drop is not nearly as severe as it was before. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Chris Mason 提交于
When we fail to allocate a new block group, we should still do the checks to make sure allocations try again with the minimum requested allocation size. This also fixes a deadlock that come from a missed down_read in the chunk allocation failure handling. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 11 11月, 2008 2 次提交
-
-
由 Chris Mason 提交于
The allocator wasn't catching all of the cases where it needed to do extra loops because the check to enforce them wasn't happening early enough. When the allocator decided to increase the size of the allocation for metadata clustering, it wasn't always setting the empty_size to include the extra (optional) bytes. This also fixes the empty_size field to be correct. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The loop searching for free space would exit out too soon when metadata clustering was trying to allocate a large extent. This makes sure a full scan of the free space is done searching for only the minimum extent size requested by the higher layers. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 11月, 2008 1 次提交
-
-
由 Chris Mason 提交于
When metadata allocation clustering has to fall back to unclustered allocs because large free areas could not be found, it was sometimes substracting too much from the total bytes to allocate. This would make it wrap below zero. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 08 11月, 2008 1 次提交
-
-
由 Chris Mason 提交于
In comes cases the empty cluster was added twice to the total number of bytes the allocator was trying to find. With empty clustering on, the hint byte was sometimes outside of the block group. Add an extra goto to find the correct block group. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 07 11月, 2008 3 次提交
-
-
由 Chris Mason 提交于
This lowers the empty cluster target for metadata allocations. The lower target makes it easier to do allocations and still seems to perform well. It also fixes the allocator loop to drop the empty cluster when things start getting difficult, avoiding false enospc warnings. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The allocator uses the last allocation as a starting point for metadata allocations, and tries to allocate in clusters of at least 256k. If the search for a free block fails to find the expected block, this patch forces a new cluster to be found in the free list. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
When reading compressed extents, try to put pages into the page cache for any pages covered by the compressed extent that readpages didn't already preload. Add an async work queue to handle transformations at delayed allocation processing time. Right now this is just compression. The workflow is: 1) Find offsets in the file marked for delayed allocation 2) Lock the pages 3) Lock the state bits 4) Call the async delalloc code The async delalloc code clears the state lock bits and delalloc bits. It is important this happens before the range goes into the work queue because otherwise it might deadlock with other work queue items that try to lock those extent bits. The file pages are compressed, and if the compression doesn't work the pages are written back directly. An ordered work queue is used to make sure the inodes are written in the same order that pdflush or writepages sent them down. This changes extent_write_cache_pages to let the writepage function update the wbc nr_written count. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 31 10月, 2008 3 次提交
-
-
由 Yan Zheng 提交于
This patch updates btrfs-progs for fallocate support. fallocate is a little different in Btrfs because we need to tell the COW system that a given preallocated extent doesn't need to be cow'd as long as there are no snapshots of it. This leverages the -o nodatacow checks. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
This patch simplifies the nodatacow checker. If all references were created after the latest snapshot, then we can avoid COW safely. This patch also updates run_delalloc_nocow to do more fine-grained checking. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Yan Zheng 提交于
When dropping middle part of an extent, btrfs_drop_extents truncates the extent at first, then inserts a bookend extent. Since truncation and insertion can't be done atomically, there is a small period that the bookend extent isn't in the tree. This causes problem for functions that search the tree for file extent item. The way to fix this is lock the range of the bookend extent before truncation. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
- 30 10月, 2008 6 次提交
-
-
由 Chris Mason 提交于
finish_current_insert and del_pending_extents process extent tree modifications that build up while we are changing the extent tree. It is a confusing bit of code that prevents recursion. Both functions run through a list of pending operations and both funcs add to the list of pending operations. If you have two procs in either one of them, they can end up looping forever making more work for each other. This patch makes them walk forward through the list of pending changes instead of always trying to process the entire list. At transaction commit time, we catch any changes that were left over. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
This patch adds transaction IDs to root tree pointers. Transaction IDs in tree pointers are compared with the generation numbers in block headers when reading root blocks of trees. This can detect some types of IO errors. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Josef Bacik 提交于
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch of little locks. There is now a pinned_mutex, which is used when messing with the pinned_extents extent io tree, and the extent_ins_mutex which is used with the pending_del and extent_ins extent io trees. The locking for the extent tree stuff was inspired by a patch that Yan Zheng wrote to fix a race condition, I cleaned it up some and changed the locking around a little bit, but the idea remains the same. Basically instead of holding the extent_ins_mutex throughout the processing of an extent on the extent_ins or pending_del trees, we just hold it while we're searching and when we clear the bits on those trees, and lock the extent for the duration of the operations on the extent. Also to keep from getting hung up waiting to lock an extent, I've added a try_lock_extent so if we cannot lock the extent, move on to the next one in the tree and we'll come back to that one. I have tested this heavily and it does not appear to break anything. This has to be applied on top of my find_free_extent redo patch. I tested this patch on top of Yan's space reblancing code and it worked fine. The only thing that has changed since the last version is I pulled out all my debugging stuff, apparently I forgot to run guilt refresh before I sent the last patch out. Thank you, Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Josef Bacik 提交于
So there is an odd case where we can possibly return -ENOSPC when there is in fact space to be had. It only happens with Metadata writes, and happens _very_ infrequently. What has to happen is we have to allocate have allocated out of the first logical byte on the disk, which would set last_alloc to first_logical_byte(root, 0), so search_start == orig_search_start. We then need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start, block_group_bits() won't match and we'll go to choose another block group. However because search_start matches orig_search_start we go to see if we can allocate a chunk. If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC. This is kind of a big flaw of the way find_free_extent works, as it along with find_free_space loop through _all_ of the block groups, not just the ones that we want to allocate out of. This patch completely kills find_free_space and rolls it into find_free_extent. I've introduced a sort of state machine into this, which will make it easier to get cache miss information out of the allocator, and will work well with my locking changes. The basic flow is this: We have the variable loop which is 0, meaning we are in the hint phase. We lookup the block group for the hint, and lookup the space_info for what we want to allocate out of. If the block group we were pointed at by the hint either isn't of the correct type, or just doesn't have the space we need, we set head to space_info->block_groups, so we start at the beginning of the block groups for this particular space info, and loop through. This is also where we add the empty_cluster to total_needed. At this point loop is set to 1 and we just loop through all of the block groups for this particular space_info looking for the space we need, just as find_free_space would have done, except we only hit the block groups we want and not _all_ of the block groups. If we come full circle we see if we can allocate a chunk. If we cannot of course we exit with -ENOSPC and we are good. If not we start over at space_info->block_groups and loop through again, with loop == 2. If we come full circle and haven't found what we need then we exit with -ENOSPC. I've been running this for a couple of days now and it seems stable, and I haven't yet hit a -ENOSPC when there was plenty of space left. Also I've made a groups_sem to handle the group list for the space_info. This is part of my locking changes, but is relatively safe and seems better than holding the space_info spinlock over that entire search time. Thanks, Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Yan Zheng 提交于
This patch improves the space balancing code to keep more sharing of tree blocks. The only case that breaks sharing of tree blocks is data extents get fragmented during balancing. The main changes in this patch are: Add a 'drop sub-tree' function. This solves the problem in old code that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block. Remove relocation mapping tree. Relocation mappings are stored in struct btrfs_ref_path and updated dynamically during walking up/down the reference path. This reduces CPU usage and simplifies code. This patch also fixes a bug. Root items for reloc trees should be updated in btrfs_free_reloc_root. Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
-
由 Chris Mason 提交于
This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-