- 29 9月, 2009 1 次提交
-
-
由 Josef Bacik 提交于
At the start of a transaction we do a btrfs_reserve_metadata_space() and specify how many items we plan on modifying. Then once we've done our modifications and such, just call btrfs_unreserve_metadata_space() for the same number of items we reserved. For keeping track of metadata needed for data I've had to add an extent_io op for when we merge extents. This lets us track space properly when we are doing sequential writes, so we don't end up reserving way more metadata space than what we need. The only place where the metadata space accounting is not done is in the relocation code. This is because Yan is going to be reworking that code in the near future, so running btrfs-vol -b could still possibly result in a ENOSPC related panic. This patch also turns off the metadata_ratio stuff in order to allow users to more efficiently use their disk space. This patch makes it so we track how much metadata we need for an inode's delayed allocation extents by tracking how many extents are currently waiting for allocation. It introduces two new callbacks for the extent_io tree's, merge_extent_hook and split_extent_hook. These help us keep track of when we merge delalloc extents together and split them up. Reservations are handled prior to any actually dirty'ing occurs, and then we unreserve after we dirty. btrfs_unreserve_metadata_for_delalloc() will make the appropriate unreservations as needed based on the number of reservations we currently have and the number of extents we currently have. Doing the reservation outside of doing any of the actual dirty'ing lets us do things like filemap_flush() the inode to try and force delalloc to happen, or as a last resort actually start allocation on all delalloc inodes in the fs. This has survived dbench, fs_mark and an fsx torture test. Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 23 9月, 2009 2 次提交
-
-
由 Chris Mason 提交于
We now do extra checks before a balance to make sure there is room for the balance to take place. One of the checks was testing to see if we were trying to balance away the last block group of a given type. If there is no space available for new chunks, we should not try and balance away the last block group of a give type. But, the code wasn't checking for available chunk space, and so it was exiting too soon. The fix here is to combine some of the checks and make sure we try to allocate new chunks when we're balancing the last block group. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
After a balance it is briefly possible for the space info field in the inode to be NULL. This adds some checks to make sure things properly deal with the NULL value. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 22 9月, 2009 6 次提交
-
-
由 Josef Bacik 提交于
As we get closer to proper -ENOSPC handling in btrfs, we need more accurate space accounting for the space info's. Currently we exclude the free space for the super mirrors, but the space they take up isn't accounted for in any of the counters. This patch introduces bytes_super, which keeps track of the amount of bytes used for a super mirror in the block group cache and space info. This makes sure that our free space caclucations will be completely accurate. Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
This patch removes a bunch of dead code from the snapshot removal stuff. It was confusing me when doing the metadata ENOSPC stuff so I killed it. Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
The box can get locked up in the allocator if we happen upon a block group under these conditions: 1) During a commit, so caching threads cannot make progress 2) Our block group currently is in the middle of being cached 3) Our block group currently has plenty of free space in it 4) Our block group is so fragmented that it ends up having no free space chunks larger than min_bytes calculated by btrfs_find_space_cluster. What happens is we try and do btrfs_find_space_cluster, which fails because it is unable to find enough free space chunks that are large than min_bytes and are close enough together. Since the block group is not cached we do a wait_block_group_cache_progress, which waits for the number of bytes we need, except the block group already has _plenty_ of free space, its just severely fragmented, so we loop and try again, ad infinitum. This patch keeps us from waiting on the block group to finish caching if we failed to find a free space cluster before. It also makes sure that we don't even try to find a free space cluster if we are on our last loop in the allocator, since we will have tried everything at this point at it is futile. Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan, Zheng 提交于
This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being used and doesn't contains links to other subvolumes can be destroyed. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan, Zheng 提交于
This patch contains two changes to avoid unnecessary tree block reads during snapshot dropping. First, check tree block's reference count and flags before reading the tree block. if reference count > 1 and there is no need to update backrefs, we can avoid reading the tree block. Second, save when snapshot was created in root_key.offset. we can compare block pointer's generation with snapshot's creation generation during updating backrefs. If a given block was created before snapshot was created, the snapshot can't be the tree block's owner. So we can avoid reading the block. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 18 9月, 2009 1 次提交
-
-
由 Yan Zheng 提交于
This patch gets rid of two limitations of async block group caching. The old code delays handling pinned extents when block group is in caching. To allocate logged file extents, the old code need wait until block group is fully cached. To get rid of the limitations, This patch introduces a data structure to track the progress of caching. Base on the caching progress, we know which extents should be added to the free space cache when handling the pinned extents. The logged file extents are also handled in a similar way. This patch also changes how pinned extents are tracked. The old code uses one tree to track pinned extents, and copy the pinned extents tree at transaction commit time. This patch makes it use two trees to track pinned extents. One tree for extents that are pinned in the running transaction, one tree for extents that can be unpinned. At transaction commit time, we swap the two trees. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 12 9月, 2009 1 次提交
-
-
由 Chris Mason 提交于
There are two main users of the extent_map tree. The first is regular file inodes, where it is evenly spread between readers and writers. The second is the chunk allocation tree, which maps blocks from logical addresses to phyiscal ones, and it is 99.99% reads. The mapping tree is a point of lock contention during heavy IO workloads, so this commit switches things to a rw lock. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 01 8月, 2009 1 次提交
-
-
由 Chris Mason 提交于
The async caching thread can end up looping forever if a given search puts it at the last key in a leaf. It will end up calling btrfs_next_leaf and then checking if it needs to politely drop the read semaphore. Most of the time this looping isn't noticed because it is able to make progress the next time around. But, during log replay, we wait on the async caching thread to finish, and the async thread is waiting on the commit, and no progress is really made. The fix used here is to copy the key out of the next leaf, that way our search lands there properly. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 30 7月, 2009 2 次提交
-
-
由 Chris Mason 提交于
The semaphore used by the async caching threads can prevent a transaction commit, which can make the FS appear to stall. This releases the semaphore more often when a transaction commit is in progress. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
The async block group caching code uses the commit_root pointer to get a stable version of the extent allocation tree for scanning. This copy of the tree root isn't going to change and it significantly reduces the complexity of the scanning code. During a commit, we have a loop where we update the extent allocation tree root. We need to loop because updating the root pointer in the tree of tree roots may allocate blocks which may change the extent allocation tree. Right now the commit_root pointer is changed inside this loop. It is more correct to change the commit_root pointer only after all the looping is done. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 28 7月, 2009 2 次提交
-
-
由 Yan Zheng 提交于
- don't stop the caching thread until btrfs_commit_super return. - if caching is interrupted by umount, set last to (u64)-1. otherwise the un-scanned range of block group will be considered as free extent. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
We are racy with async block caching and unpinning extents. This patch makes things much less complicated by only unpinning the extent if the block group is cached. We check the block_group->cached var under the block_group->lock spin lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters, and unpin the extent and add the free space back. If it is not set to this, we start the caching of the block group so the next time we unpin extents we can unpin the extent. This keeps us from racing with the async caching threads, lets us kill the fs wide async thread counter, and keeps us from having to set DELALLOC bits for every extent we hit if there are caching kthreads going. One thing that needed to be changed was btrfs_free_super_mirror_extents. Now instead of just looking for LOCKED extents, we also look for DIRTY extents, since we could have left some extents pinned in the previous transaction that will never get freed now that we are unmounting, which would cause us to leak memory. So btrfs_free_super_mirror_extents has been changed to btrfs_free_pinned_extents, and it will clear the extents locked for the super mirror, and any remaining pinned extents that may be present. Thank you, Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 25 7月, 2009 1 次提交
-
-
由 Chris Mason 提交于
Btrfs allocates individual extents from block groups, and each block group has a specific type. It may hold metadata, data mirrored or striped etc. When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r) we free block groups. Once a block group is freed, the space it was using on the device may be available for use by new block groups. btrfs_remove_block_group was clearing the flag that said 'our devices are full, don't even try to allocate new block groups', but it was only clearing that flag for a specific type of block group. This commit clears the full flag for all of the types of block groups, making it much more likely that we'll be able to balance space when the drive is close to full. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 24 7月, 2009 2 次提交
-
-
由 Josef Bacik 提交于
This patch moves the caching of the block group off to a kthread in order to allow people to allocate sooner. Instead of blocking up behind the caching mutex, we instead kick of the caching kthread, and then attempt to make an allocation. If we cannot, we wait on the block groups caching waitqueue, which the caching kthread will wake the waiting threads up everytime it finds 2 meg worth of space, and then again when its finished caching. This is how I tested the speedup from this mkfs the disk mount the disk fill the disk up with fs_mark unmount the disk mount the disk time touch /mnt/foo Without my changes this took 11 seconds on my box, with these changes it now takes 1 second. Another change thats been put in place is we lock the super mirror's in the pinned extent map in order to keep us from adding that stuff as free space when caching the block group. This doesn't really change anything else as far as the pinned extent map is concerned, since for actual pinned extents we use EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock those extents to keep from leaking memory. I've also added a check where when we are reading block groups from disk, if the amount of space used == the size of the block group, we go ahead and mark the block group as cached. This drastically reduces the amount of time it takes to cache the block groups. Using the same test as above, except doing a dd to a file and then unmounting, it used to take 33 seconds to umount, now it takes 3 seconds. This version uses the commit_root in the caching kthread, and then keeps track of how many async caching threads are running at any given time so if one of the async threads is still running as we cross transactions we can wait until its finished before handling the pinned extents. Thank you, Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply tracking free space. As free space gets fragmented, we end up with thousands of entries on an rb-tree per block group, which usually spans 1 gig of area. Since we currently don't ever flush free space cache back to disk this gets to be a bit unweildly on large fs's with lots of fragmentation. This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free space cache. Initially we calculate a threshold of extent entries we can handle, which is however many extent entries we can cram into 16k of ram. The maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace will be 32k of RAM, which scales much better than we did before. Once we pass the extent threshold, we start adding bitmaps and using those instead for tracking the free space. This patch also makes it so that any free space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is nice since we try and allocate out of the front of a block group, so if the front of a block group is heavily fragmented and then has a huge chunk of free space at the end, we go ahead and add the fragmented areas to bitmaps and use a normal extent entry to track the big chunk at the back of the block group. I've also taken the opportunity to revamp how we search for free space. Previously we indexed free space via an offset indexed rb tree and a bytes indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset indexed rb tree. This cuts the number of tree operations we were doing previously down by half, and gives us a little bit of a better allocation pattern since we will always start from a specific offset and search forward from there, instead of searching for the size we need and try and get it as close as possible to the offset we want. I've given this a healthy amount of testing pre-new format stuff, as well as post-new format stuff. I've booted up my fedora box which is installed on btrfs with this patch and ran with it for a few days without issues. I've not seen any performance regressions in any of my tests. Since the last patch Yan Zheng fixed a problem where we could have overlapping entries, so updating their offset inline would cause problems. Thanks, Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 22 7月, 2009 1 次提交
-
-
由 Yan Zheng 提交于
Write dirty block groups may allocate new block, and so may add new delayed back ref. btrfs_run_delayed_refs may make some block groups dirty. commit_cowonly_roots does not handle the recursion properly, and some dirty blocks can be left unwritten at commit time. This patch moves btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes the code not break out of the loop until there are no dirty block groups or delayed back refs. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 03 7月, 2009 2 次提交
-
-
由 Hu Tao 提交于
Make an error msg look nicer by inserting a space between number and word. Signed-off-by: NHu Tao <hu.taoo@gmail.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
The new backref format has restriction on type of backref item. If a tree block isn't referenced by its owner tree, full backrefs must be used for the pointers in it. When a tree block loses its owner tree's reference, backrefs for the pointers in it should be updated to full backrefs. Current btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for general use. This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a problem in the restricted form btrfs_drop_snapshot is used today, but for general snapshot deletion this update is required. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 11 6月, 2009 1 次提交
-
-
由 Yan Zheng 提交于
lookup_inline_extent_backref only checks for duplicate backref for data extents. It assumes backrefs for tree block never conflict. This patch makes lookup_inline_extent_backref check for duplicate backrefs for both data and tree block, so that we can detect potential bug earlier. This is a safety check, strictly speaking it is not required. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 10 6月, 2009 3 次提交
-
-
由 David Woodhouse 提交于
There's no need to preserve this abstraction; it used to let us use hardware crc32c support directly, but libcrc32c is already doing that for us through the crypto API -- so we're already using the Intel crc32c acceleration where appropriate. Signed-off-by: NDavid Woodhouse <David.Woodhouse@intel.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
Some SSDs perform best when reusing block numbers often, while others perform much better when clustering strictly allocates big chunks of unused space. The default mount -o ssd will find rough groupings of blocks where there are a bunch of free blocks that might have some allocated blocks mixed in. mount -o ssd_spread will make sure there are no allocated blocks mixed in. It should perform better on lower end SSDs. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan Zheng 提交于
This commit introduces a new kind of back reference for btrfs metadata. Once a filesystem has been mounted with this commit, IT WILL NO LONGER BE MOUNTABLE BY OLDER KERNELS. When a tree block in subvolume tree is cow'd, the reference counts of all extents it points to are increased by one. At transaction commit time, the old root of the subvolume is recorded in a "dead root" data structure, and the btree it points to is later walked, dropping reference counts and freeing any blocks where the reference count goes to 0. The increments done during cow and decrements done after commit cancel out, and the walk is a very expensive way to go about freeing the blocks that are no longer referenced by the new btree root. This commit reduces the transaction overhead by avoiding the need for dead root records. When a non-shared tree block is cow'd, we free the old block at once, and the new block inherits old block's references. When a tree block with reference count > 1 is cow'd, we increase the reference counts of all extents the new block points to by one, and decrease the old block's reference count by one. This dead tree avoidance code removes the need to modify the reference counts of lower level extents when a non-shared tree block is cow'd. But we still need to update back ref for all pointers in the block. This is because the location of the block is recorded in the back ref item. We can solve this by introducing a new type of back ref. The new back ref provides information about pointer's key, level and in which tree the pointer lives. This information allow us to find the pointer by searching the tree. The shortcoming of the new back ref is that it only works for pointers in tree blocks referenced by their owner trees. This is mostly a problem for snapshots, where resolving one of these fuzzy back references would be O(number_of_snapshots) and quite slow. The solution used here is to use the fuzzy back references in the common case where a given tree block is only referenced by one root, and use the full back references when multiple roots have a reference on a given block. This commit adds per subvolume red-black tree to keep trace of cached inodes. The red-black tree helps the balancing code to find cached inodes whose inode numbers within a given range. This commit improves the balancing code by introducing several data structures to keep the state of balancing. The most important one is the back ref cache. It caches how the upper level tree blocks are referenced. This greatly reduce the overhead of checking back ref. The improved balancing code scales significantly better with a large number of snapshots. This is a very large commit and was written in a number of pieces. But, they depend heavily on the disk format change and were squashed together to make sure git bisect didn't end up in a bad state wrt space balancing or the format change. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 05 6月, 2009 1 次提交
-
-
由 Chris Mason 提交于
The btrfs allocator uses list_for_each to walk the available block groups when searching for free blocks. It starts off with a hint to help find the best block group for a given allocation. The hint is resolved into a block group, but we don't properly check to make sure the block group we find isn't in the middle of being freed due to filesystem shrinking or balancing. If it is being freed, the list pointers in it are bogus and can't be trusted. But, the code happily goes along and uses them in the list_for_each loop, leading to all kinds of fun. The fix used here is to check to make sure the block group we find really is on the list before we use it. list_del_init is used when removing it from the list, so we can do a proper check. The allocation clustering code has a similar bug where it will trust the block group in the current free space cluster. If our allocation flags have changed (going from single spindle dup to raid1 for example) because the drives in the FS have changed, we're not allowed to use the old block group any more. The fix used here is to check the current cluster against the current allocation flags. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 15 5月, 2009 1 次提交
-
-
由 Sankar P 提交于
Signed-off-by: NSankar P <sankar.curiosity@gmail.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 27 4月, 2009 1 次提交
-
-
由 Joel Becker 提交于
Just happened to notice a bunch of %llu vs u64 warnings. Here's a patch to cast them all. Signed-off-by: NJoel Becker <joel.becker@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 25 4月, 2009 1 次提交
-
-
由 Josef Bacik 提交于
This patch makes the chunk allocator keep a good ratio of metadata vs data block groups. By default for every 8 data block groups, we'll allocate 1 metadata chunk, or about 12% of the disk will be allocated for metadata. This can be changed by specifying the metadata_ratio mount option. This is simply the number of data block groups that have to be allocated to force a metadata chunk allocation. By making sure we allocate metadata chunks more often, we are less likely to get into situations where the whole disk has been allocated as data block groups. Signed-off-by: NJosef Bacik <jbacik@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 03 4月, 2009 5 次提交
-
-
由 Chris Mason 提交于
Because btrfs is copy-on-write, we end up picking new locations for blocks very often. This makes it fairly difficult to maintain perfect read patterns over time, but we can at least do some optimizations for writes. This is done today by remembering the last place we allocated and trying to find a free space hole big enough to hold more than just one allocation. The end result is that we tend to write sequentially to the drive. This happens all the time for metadata and it happens for data when mounted -o ssd. But, the way we record it is fairly racey and it tends to fragment the free space over time because we are trying to allocate fairly large areas at once. This commit gets rid of the races by adding a free space cluster object with dedicated locking to make sure that only one process at a time is out replacing the cluster. The free space fragmentation is somewhat solved by allowing a cluster to be comprised of smaller free space extents. This part definitely adds some CPU time to the cluster allocations, but it allows the allocator to consume the small holes left behind by cow. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
This patch removes the pinned_mutex. The extent io map has an internal tree lock that protects the tree itself, and since we only copy the extent io map when we are committing the transaction we don't need it there. We also don't need it when caching the block group since searching through the tree is also protected by the internal map spin lock. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Josef Bacik 提交于
This patch removes the block group alloc mutex used to protect the free space tree for allocations and replaces it with a spin lock which is used only to protect the free space rb tree. This means we only take the lock when we are directly manipulating the tree, which makes us a touch faster with multi-threaded workloads. This patch also gets rid of btrfs_find_free_space and replaces it with btrfs_find_space_for_alloc, which takes the number of bytes you want to allocate, and empty_size, which is used to indicate how much free space should be at the end of the allocation. It will return an offset for the allocator to use. If we don't end up using it we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to kill the alloc_mutex, since we need to make sure nobody else comes along and takes our space. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Josef Bacik 提交于
I've replaced the strange looping constructs with a list_for_each_entry on space_info->block_groups. If we have a hint we just jump into the loop with the block group and start looking for space. If we don't find anything we start at the beginning and start looking. We never come out of the loop with a ref on the block_group _unless_ we found space to use, then we drop it after we set the trans block_group. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
由 Josef Bacik 提交于
This patch cleans up the free space cache code a bit. It better documents the idiosyncrasies of tree_search_offset and makes the code make a bit more sense. I took out the info allocation at the start of __btrfs_add_free_space and put it where it makes more sense. This was left over cruft from when alloc_mutex existed. Also all of the re-searches we do to make sure we inserted properly. Signed-off-by: NJosef Bacik <jbacik@redhat.com>
-
- 01 4月, 2009 1 次提交
-
-
由 Chris Mason 提交于
COW means we cycle though blocks fairly quickly, and once we free an extent on disk, it doesn't make much sense to keep the pages around. This commit tries to immediately free the page when we free the extent, which lowers our memory footprint significantly. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 25 3月, 2009 4 次提交
-
-
由 Chris Mason 提交于
The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
To avoid deadlocks and reduce latencies during some critical operations, some transaction writers are allowed to jump into the running transaction and make it run a little longer, while others sit around and wait for the commit to finish. This is a bit unfair, especially when the callers that jump in do a bunch of IO that makes all the others procs on the box wait. This commit reduces the stalls this produces by pre-reading file extent pointers during btrfs_finish_ordered_io before the transaction is joined. It also tunes the drop_snapshot code to politely wait for transactions that have started writing out their delayed refs to finish. This avoids new delayed refs being flooded into the queue while we're trying to close off the transaction. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Chris Mason 提交于
The delayed reference queue maintains pending operations that need to be done to the extent allocation tree. These are processed by finding records in the tree that are not currently being processed one at a time. This is slow because it uses lots of time searching through the rbtree and because it creates lock contention on the extent allocation tree when lots of different procs are running delayed refs at the same time. This commit changes things to grab a cluster of refs for processing, using a cursor into the rbtree as the starting point of the next search. This way we walk smoothly through the rbtree. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-