- 15 8月, 2014 1 次提交
-
-
由 Chris Mason 提交于
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: NChris Mason <clm@fb.com>
-
- 10 6月, 2014 1 次提交
-
-
由 Alex Gartrell 提交于
In these instances, we are trying to determine if a page has been accessed since we began the operation for the sake of retry. This is easily accomplished by doing a gang lookup in the page mapping radix tree, and it saves us the dependency on the flag (so that we might eventually delete it). btrfs_page_exists_in_range borrows heavily from find_get_page, replacing the radix tree look up with a gang lookup of 1, so that we can find the next highest page >= index and see if it falls into our lock range. Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NAlex Gartrell <agartrell@fb.com>
-
- 18 4月, 2014 1 次提交
-
-
由 Peter Zijlstra 提交于
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 11 3月, 2014 1 次提交
-
-
由 Miao Xie 提交于
The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fb.com>
-
- 29 1月, 2014 2 次提交
-
-
This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: *) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; *) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Miao Xie 提交于
The inode reference item is close to inode item, so we insert it simultaneously with the inode item insertion when we create a file/directory.. In fact, we also can handle the inode reference deletion by the same way. So we made this patch to introduce the delayed inode reference deletion for the single link inode(At most case, the file doesn't has hard link, so we don't take the hard link into account). This function is based on the delayed inode mechanism. After applying this patch, we can reduce the time of the file/directory deletion by ~10%. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 12 11月, 2013 1 次提交
-
-
Currently the hash value used for adding an inode to the VFS's inode hash table consists of the plain inode number, which is a 64 bits integer. This results in hash table buckets (hlist_head lists) with too many elements for at least 2 important scenarios: 1) When we have many subvolumes. Each subvolume has its own btree where its files and directories are added to, and each has its own objectid (inode number) namespace. This means that if we have N subvolumes, and all have inode number X associated to a file or directory, the corresponding inodes all map to the same hash table entry, resulting in a bucket (hlist_head list) with N elements; 2) On 32 bits machines. Th VFS hash values are unsigned longs, which are 32 bits wide on 32 bits machines, and the inode (objectid) numbers are 64 bits unsigned integers. We simply cast the inode numbers to hash values, which means that for all inodes with the same 32 bits lower half, the same hash bucket is used for all of them. For example, all inodes with a number (objectid) between 0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in the same hash table bucket. This change ensures the inode's hash value depends both on the objectid (inode number) and its subvolume's (btree root) objectid. For 32 bits machines, this change gives better entropy by making the hash value depend on both the upper and lower 32 bits of the 64 bits hash previously computed. Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
- 21 9月, 2013 1 次提交
-
-
由 Josef Bacik 提交于
Liu introduced a local copy of the last log commit for an inode to make sure we actually log an inode even if a log commit has already taken place. In order to make sure we didn't relog the same inode multiple times he set this local copy to the current trans when we log the inode, because usually we log the inode and then sync the log. The exception to this is during rename, we will relog an inode if the name changed and it is already in the log. The problem with this is then we go to sync the inode, and our check to see if the inode has already been logged is tripped and we don't sync the log. To fix this we need to _also_ check against the roots last log commit, because it could be less than what is in our local copy of the log commit. This fixes a bug where we rename a file into a directory and then fsync the directory and then on remount the directory is no longer there. Thanks, Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
- 01 9月, 2013 1 次提交
-
-
由 Miao Xie 提交于
Before applying this patch, we cached the csum value into the extent state tree when reading some data from the disk, this operation increased the lock contention of the state tree. Now, we just store the csum value into the bio structure or other unshared structure, so we can reduce the lock contention. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
- 07 5月, 2013 1 次提交
-
-
由 Nathaniel Yazdani 提交于
In the comment describing the sync_writers field of the btrfs_inode struct, "fsyncing" was misspelled "fsycing." Signed-off-by: NNathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
- 21 2月, 2013 2 次提交
-
-
由 Miao Xie 提交于
Currently, we can do unlocked dio reads, but the following race is possible: dio_read_task truncate_task ->btrfs_setattr() ->btrfs_direct_IO ->__blockdev_direct_IO ->btrfs_get_block ->btrfs_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to avoid this problem, we must serialize unlocked dio reads with truncate. There are two approaches: - use extent lock to protect the extent that we truncate - use inode_dio_wait() to make sure the truncating task will wait for the read DIO. If we use the 1st one, we will meet the endless truncation problem due to the nonlocked read DIO after we implement the nonlocked write DIO. It is because we still need invoke inode_dio_wait() avoid the race between write DIO and truncation. By that time, we have to introduce btrfs_inode_{block, resume}_nolock_dio() again. That is we have to implement this patch again, so I choose the 2nd way to fix the problem. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
由 Miao Xie 提交于
We need not use a global lock to protect the delalloc_bytes of the inode, just use its own lock. In this way, we can reduce the lock contention and ->delalloc_lock will just protect delalloc inode list. Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
- 17 12月, 2012 2 次提交
-
-
由 Josef Bacik 提交于
The tree logging stuff needs the csums to be on the ordered extents in order to log them properly, so mark that we're sync and inline the csum creation so we don't have to wait on the csumming to be done when logging extents that are still in flight. Thanks, Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
由 Josef Bacik 提交于
Currently we copy all the file information into the log, inode item, the refs, xattrs etc. Except most of this doesn't change from fsync to fsync, just the inode item changes. So set a flag if an xattr changes or a link is added, and otherwise only log the inode item. Thanks, Signed-off-by: NJosef Bacik <jbacik@fusionio.com> Signed-off-by: NChris Mason <chris.mason@fusionio.com>
-
- 02 10月, 2012 2 次提交
-
-
由 Liu Bo 提交于
This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
-
由 Josef Bacik 提交于
At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
- 24 7月, 2012 3 次提交
-
-
由 Li Zefan 提交于
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: NLi Zefan <lizefan@huawei.com>
-
由 Liu Bo 提交于
Since root can be fetched via BTRFS_I macro directly, we can save an args for btrfs_is_free_space_inode(). Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
由 Liu Bo 提交于
For btree inode, its root is also 'tree root', so btree inode can be misunderstood as a free space inode. We should add one more check for btree inode. Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
-
- 15 6月, 2012 1 次提交
-
-
由 Josef Bacik 提交于
I removed this in an earlier commit and I was wrong. Because compression can return from filemap_fdatawrite() without having actually set any of it's pages as writeback() it can make filemap_fdatawait() do essentially nothing, and then we won't find any ordered extents because they may not have been created yet. So not only does this make fsync() completely useless, but it will also screw up if you truncate on a non-page aligned offset since we zero out the end and then wait on ordered extents and then call drop caches. We can drop the cache before the io completes and then we try to unpin the extent we just wrote we won't find it and everything goes sideways. So fix this by putting it back and put a giant comment there to keep me from trying to remove it in the future. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 30 5月, 2012 4 次提交
-
-
由 Josef Bacik 提交于
We have this check down in the actual logging code, but this is after we start a transaction and all that good stuff. So move the helper inode_in_log() out so we can call it in fsync() and avoid starting a transaction altogether and just exit if we've already fsync()'ed this file recently. You would notice this issue if you fsync()'ed a file over and over again until the transaction committed. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
由 Josef Bacik 提交于
Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
由 Josef Bacik 提交于
Miao pointed this out while I was working on an orphan problem that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever where we have different parts of the bit field protected by either no lock at all or different locks which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on that are all atomic so we can keep having our no/different locking for the different flags and then make force_compress it's own thing so it can be treated normally. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
由 Josef Bacik 提交于
We've been keeping around the inode sequence number in hopes that somebody would use it, but nobody uses it and people actually use i_version which serves the same purpose, so use i_version where we used the incore inode's sequence number and that way the sequence is updated properly across the board, and not just in file write. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 17 1月, 2012 1 次提交
-
-
由 Josef Bacik 提交于
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing that and theres no real way to get rid of those, so just stop using i_mutex to protect delalloc metadata reservations and use a delalloc mutex instead. This shouldn't be contended often at all, only if you are writing and mmap writing to the file at the same time. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 09 11月, 2011 1 次提交
-
-
由 Josef Bacik 提交于
People have been reporting ENOSPC crashes in finish_ordered_io. This is because we try to steal from the delalloc block rsv to satisfy a reservation to update the inode. The problem with this is we don't explicitly save space for updating the inode when doing delalloc. This is kind of a problem and we've gotten away with this because way back when we just stole from the delalloc reserve without any questions, and this worked out fine because generally speaking the leaf had been modified either by the mtime update when we did the original write or because we just updated the leaf when we inserted the file extent item, only on rare occasions had the leaf not actually been modified, and that was still ok because we'd just use a block or two out of the over-reservation that is delalloc. Then came the delayed inode stuff. This is amazing, except it wants a full reservation for updating the inode since it may do it at some point down the road after we've written the blocks and we have to recow everything again. This worked out because the delayed inode stuff just stole from the global reserve, that is until recently when I changed that because it caused other problems. So here we are, we're doing everything right and being screwed for it. So take an extra reservation for the inode at delalloc reservation time and carry it through the life of the delalloc reservation. If we need it we can steal it in the delayed inode stuff. If we have already stolen it try and do a normal metadata reservation. If that fails try to steal from the delalloc reservation. If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to solve this and in the meantime we'll steal from the global reserve. With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see any problems. Signed-off-by: NJosef Bacik <josef@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 20 10月, 2011 3 次提交
-
-
由 Josef Bacik 提交于
We have not been reserving enough space for checksums. We were just reserving bytes for the checksum items themselves, we were not taking into account having to cow the tree and such. This patch adds a csum_bytes counter to the inode for keeping track of the number of bytes outstanding we have for checksums. Then we calculate how many leaves would be required for the checksums we are given and use that to reserve space. This adds a significant amount of bytes to our reservations, but we will handle this later. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
由 Josef Bacik 提交于
reserved_bytes is not used for anything in the inode, remove it. Signed-off-by: NJosef Bacik <josef@redhat.com>
-
由 Josef Bacik 提交于
Moving things around to give us better packing in the btrfs_inode. This reduces the size of our inode by 8 bytes. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 11 9月, 2011 1 次提交
-
-
由 Liu Bo 提交于
We can reproduce this oops via the following steps: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs $ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done $ rm -fr /mnt/btrfs/* $ rm -fr /mnt/btrfs/* then we'll get ------------[ cut here ]------------ kernel BUG at fs/btrfs/inode.c:2264! [...] Call Trace: [<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs] [<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0 [<ffffffff81153cc3>] do_rmdir+0x123/0x140 [<ffffffff81145ac7>] ? fput+0x197/0x260 [<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0 [<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40 [<ffffffff8147896b>] system_call_fastpath+0x16/0x1b RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs] When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID, while the snapshot's location.objectid remains unchanged. However, btrfs_ino() does not take this into account, and returns a wrong ino, and causes the oops. Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 28 7月, 2011 2 次提交
-
-
由 Chris Mason 提交于
Now that we are using regular file crcs for the free space cache, we can deadlock if we try to read the free_space_inode while we are updating the crc tree. This commit fixes things by using the commit_root to read the crcs. This is safe because we the free space cache file would already be loaded if that block group had been changed in the current transaction. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Josef Bacik 提交于
So I had this brilliant idea to use atomic counters for outstanding and reserved extents, but this turned out to be a bad idea. Consider this where we have 1 outstanding extent and 1 reserved extent Reserver Releaser atomic_dec(outstanding) now 0 atomic_read(outstanding)+1 get 1 atomic_read(reserved) get 1 don't actually reserve anything because they are the same atomic_cmpxchg(reserved, 1, 0) atomic_inc(outstanding) atomic_add(0, reserved) free reserved space for 1 extent Then the reserver now has no actual space reserved for it, and when it goes to finish the ordered IO it won't have enough space to do it's allocation and you get those lovely warnings. Signed-off-by: NJosef Bacik <josef@redhat.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 27 5月, 2011 1 次提交
-
-
由 Chris Mason 提交于
This will detect small random writes into files and queue the up for an auto defrag process. It isn't well suited to database workloads yet, but works for smaller files such as rpm, sqlite or bdb databases. Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 24 5月, 2011 1 次提交
-
-
由 Josef Bacik 提交于
Originally this was going to be used as a way to give hints to the allocator, but frankly we can get much better hints elsewhere and it's not even used at all for anything usefull. In addition to be completely useless, when we initialize an inode we try and find a freeish block group to set as the inodes block group, and with a completely full 40gb fs this takes _forever_, so I imagine with say 1tb fs this is just unbearable. So just axe the thing altoghether, we don't need it and it saves us 8 bytes in the inode and saves us 500 microseconds per inode lookup in my testcase. Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 21 5月, 2011 1 次提交
-
-
由 Miao Xie 提交于
Changelog V5 -> V6: - Fix oom when the memory load is high, by storing the delayed nodes into the root's radix tree, and letting btrfs inodes go. Changelog V4 -> V5: - Fix the race on adding the delayed node to the inode, which is spotted by Chris Mason. - Merge Chris Mason's incremental patch into this patch. - Fix deadlock between readdir() and memory fault, which is reported by Itaru Kitayama. Changelog V3 -> V4: - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache inode in time. Changelog V2 -> V3: - Fix the race between the delayed worker and the task which does delayed items balance, which is reported by Tsutomu Itoh. - Modify the patch address David Sterba's comment. - Fix the bug of the cpu recursion spinlock, reported by Chris Mason Changelog V1 -> V2: - break up the global rb-tree, use a list to manage the delayed nodes, which is created for every directory and file, and used to manage the delayed directory name index items and the delayed inode item. - introduce a worker to deal with the delayed nodes. Compare with Ext3/4, the performance of file creation and deletion on btrfs is very poor. the reason is that btrfs must do a lot of b+ tree insertions, such as inode item, directory name item, directory name index and so on. If we can do some delayed b+ tree insertion or deletion, we can improve the performance, so we made this patch which implemented delayed directory name index insertion/deletion and delayed inode update. Implementation: - introduce a delayed root object into the filesystem, that use two lists to manage the delayed nodes which are created for every file/directory. One is used to manage all the delayed nodes that have delayed items. And the other is used to manage the delayed nodes which is waiting to be dealt with by the work thread. - Every delayed node has two rb-tree, one is used to manage the directory name index which is going to be inserted into b+ tree, and the other is used to manage the directory name index which is going to be deleted from b+ tree. - introduce a worker to deal with the delayed operation. This worker is used to deal with the works of the delayed directory name index items insertion and deletion and the delayed inode update. When the delayed items is beyond the lower limit, we create works for some delayed nodes and insert them into the work queue of the worker, and then go back. When the delayed items is beyond the upper bound, we create works for all the delayed nodes that haven't been dealt with, and insert them into the work queue of the worker, and then wait for that the untreated items is below some threshold value. - When we want to insert a directory name index into b+ tree, we just add the information into the delayed inserting rb-tree. And then we check the number of the delayed items and do delayed items balance. (The balance policy is above.) - When we want to delete a directory name index from the b+ tree, we search it in the inserting rb-tree at first. If we look it up, just drop it. If not, add the key of it into the delayed deleting rb-tree. Similar to the delayed inserting rb-tree, we also check the number of the delayed items and do delayed items balance. (The same to inserting manipulation) - When we want to update the metadata of some inode, we cached the data of the inode into the delayed node. the worker will flush it into the b+ tree after dealing with the delayed insertion and deletion. - We will move the delayed node to the tail of the list after we access the delayed node, By this way, we can cache more delayed items and merge more inode updates. - If we want to commit transaction, we will deal with all the delayed node. - the delayed node will be freed when we free the btrfs inode. - Before we log the inode items, we commit all the directory name index items and the delayed inode update. I did a quick test by the benchmark tool[1] and found we can improve the performance of file creation by ~15%, and file deletion by ~20%. Before applying this patch: Create files: Total files: 50000 Total time: 1.096108 Average time: 0.000022 Delete files: Total files: 50000 Total time: 1.510403 Average time: 0.000030 After applying this patch: Create files: Total files: 50000 Total time: 0.932899 Average time: 0.000019 Delete files: Total files: 50000 Total time: 1.215732 Average time: 0.000024 [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3 Many thanks for Kitayama-san's help! Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Reviewed-by: NDavid Sterba <dave@jikos.cz> Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com> Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
- 25 4月, 2011 1 次提交
-
-
由 Li Zefan 提交于
There's a potential problem in 32bit system when we exhaust 32bit inode numbers and start to allocate big inode numbers, because btrfs uses inode->i_ino in many places. So here we always use BTRFS_I(inode)->location.objectid, which is an u64 variable. There are 2 exceptions that BTRFS_I(inode)->location.objectid != inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2), and inode->i_ino will be used in those cases. Another reason to make this change is I'm going to use a special inode to save free ino cache, and the inode number must be > (u64)-256. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
- 18 3月, 2011 1 次提交
-
-
由 Josef Bacik 提交于
We track delayed allocation per inodes via 2 counters, one is outstanding_extents and reserved_extents. Outstanding_extents is already an atomic_t, but reserved_extents is not and is protected by a spinlock. So convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg when releasing delalloc bytes. This makes our inode 72 bytes smaller, and reduces locking overhead (albiet it was minimal to begin with). Thanks, Signed-off-by: NJosef Bacik <josef@redhat.com>
-
- 22 12月, 2010 1 次提交
-
-
由 Li Zefan 提交于
Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
-
- 25 5月, 2010 2 次提交
-
-
由 Yan, Zheng 提交于
reserve metadata space for handling orphan inodes Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-
由 Yan, Zheng 提交于
Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: NYan Zheng <zheng.yan@oracle.com> Signed-off-by: NChris Mason <chris.mason@oracle.com>
-