- 15 Feb 2015, 1 commit
-
-
Submitted by Josef Bacik
We have this weird dance where we always inc outstanding_extents when we do an O_DIRECT write, even if we allocate the entire range. To get around this we also drop the metadata space if we successfully write. This is an unnecessary dance: we only need to jack up outstanding_extents if we don't satisfy the entire range request in get_blocks_direct, otherwise we are good using our original reservation. So drop the unconditional inc and the corresponding drop of the metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
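A minimal sketch of the conditional accounting the commit describes; the variable names and surrounding context are assumptions, not the exact patch:

    /* Hypothetical shape of the fix: only account an extra outstanding
     * extent when the allocation did not cover the whole requested range,
     * so the original per-extent reservation keeps covering the rest. */
    if (len < orig_len) {
            spin_lock(&BTRFS_I(inode)->lock);
            BTRFS_I(inode)->outstanding_extents++;
            spin_unlock(&BTRFS_I(inode)->lock);
            /* no btrfs_delalloc_release_metadata() on success anymore */
    }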
-
- 03 Feb 2015, 2 commits
-
-
Submitted by chandan r
This patch adds a new member to the 'struct btrfs_inode' structure to hold the file creation time. Signed-off-by: chandan <chandanrmail@gmail.com> [refreshed, removed btrfs_inode_otime] Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
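A sketch of the addition, assuming the member is the i_otime timespec referenced by the "[refreshed, removed btrfs_inode_otime]" note; the surrounding fields are elided:

    struct btrfs_inode {
            /* ... existing members ... */
            struct timespec i_otime;    /* file creation time */
            /* ... */
    };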
-
Submitted by David Sterba
They just opencode taking the address of the timespec member. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
- 22 Jan 2015, 4 commits
-
-
Submitted by Zhao Lei
So we can check raid56 with:

    (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)

instead of the longer:

    (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
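The equivalence quoted above implies the mask is simply the OR of the two raid56 profile bits; a sketch of the definition and a use site:

    #define BTRFS_BLOCK_GROUP_RAID56_MASK  (BTRFS_BLOCK_GROUP_RAID5 | \
                                            BTRFS_BLOCK_GROUP_RAID6)

    if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
            /* raid5/6 specific handling */
    }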
-
Submitted by David Sterba
Currently there's a 4B hole in the structure between refs and state, and only 16 bits of state are used, so we can make it unsigned. This gets better packing and may save some stack space for local variables. The size of extent_state is reduced by 8B, and there are usually a lot of slab objects.

    struct extent_state {
            u64                start;      /*  0   8 */
            u64                end;        /*  8   8 */
            struct rb_node     rb_node;    /* 16  24 */
            wait_queue_head_t  wq;         /* 40  24 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            atomic_t           refs;       /* 64   4 */

            /* XXX 4 bytes hole, try to pack */

            long unsigned int  state;      /* 72   8 */
            u64                private;    /* 80   8 */

            /* size: 88, cachelines: 2, members: 7 */
            /* sum members: 84, holes: 1, sum holes: 4 */
            /* last cacheline: 24 bytes */
    };

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
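A sketch of the layout after the change, assuming the patch simply narrows state to a 32-bit unsigned so it packs into the 4-byte slot after refs (offsets recomputed to match the 8-byte saving stated above):

    struct extent_state {
            u64                start;      /*  0   8 */
            u64                end;        /*  8   8 */
            struct rb_node     rb_node;    /* 16  24 */
            wait_queue_head_t  wq;         /* 40  24 */
            atomic_t           refs;       /* 64   4 */
            unsigned           state;      /* 68   4 (was unsigned long) */
            u64                private;    /* 72   8 */

            /* size: 80, holes: 0 (8 bytes smaller) */
    };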
-
Submitted by David Sterba
The errors are worth noting and might get missed at INFO level. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by David Sterba
Several messages point to some internal problem; level INFO is wrong for them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
- 15 Jan 2015, 1 commit
-
-
Submitted by David Sterba
If the found_key is NULL, then btrfs_find_item becomes a verbose wrapper for a simple btrfs_search_slot. After we've removed all such callers, passing a NULL key is not valid anymore. Signed-off-by: David Sterba <dsterba@suse.cz>
-
- 03 Jan 2015, 1 commit
-
-
Submitted by Wang Shilong
In btrfs_mkdir(), if we fail to create the dir, we should clean up the items that were already inserted and set the inode's link count correctly, to make sure the inode can be cleaned up properly. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 25 Nov 2014, 1 commit
-
-
Submitted by Filipe Manana
If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree:

    item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
        inode generation 146 transid 7 size 90123 block group 0
        mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
    item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
        inode ref index 282 namelen 10 name: foobar
    item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
        extent data disk byte 1104855040 nr 32768
        extent data offset 0 nr 32768 ram 32768
        extent compression 0
    item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
        extent data disk byte 0 nr 0
        extent data offset 0 nr 40960 ram 40960
        extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[, that no file extent item covers. This is because the file write and file truncate operations both happened right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. This made it possible to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file that either:

1) is empty
2) contains only the first write
3) contains both writes
4) contains both writes and reflects the truncation

But it should never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
-
- 22 Nov 2014, 1 commit
-
-
Submitted by Filipe Manana
When doing a fsync with the fast path we have a time window where we can miss the fact that writeback of some file data failed, and therefore we end up returning success (0) from fsync when we should return an error. The steps that lead to this are the following:

1) We start all ordered extents by calling filemap_fdatawrite_range();

2) We do some other work, like locking the inode's i_mutex, starting a transaction, starting a log transaction, etc;

3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the ordered extents from the inode's ordered tree into a list;

4) But by the time we do the ordered extent collection, some ordered extents we started at step 1) might have already completed with an error, and therefore we didn't find them in the ordered tree and had no idea they finished with an error. This makes our fsync return success (0) to userspace, but it has no bad effects on the log, like for example the insertion of file extent items into the log that point to unwritten extents, because the invalid extent maps were removed before the ordered extent completed (in inode.c:btrfs_finish_ordered_io).

So after collecting the ordered extents, just check if the inode's i_mapping has any error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever writeback fails for a page of an ordered extent, we call mapping_set_error (done in extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage), which sets one of those error flags in the inode's i_mapping flags.

This change also has the side effect of fixing the issue where, for fast fsyncs, we never checked/cleared the error flags from the inode's i_mapping flags, which means that a full fsync performed after a fast fsync could get errors that belonged to the fast fsync: the full fsync calls btrfs_wait_ordered_range(), which calls filemap_fdatawait_range(), and the latter checks for and clears those flags, while for fast fsyncs we never call filemap_fdatawait_range() or anything else that checks for and clears the error flags from the inode's i_mapping.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
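A minimal sketch of the added check, assuming it runs right after the ordered extent collection; the exact placement and flag-clearing behavior are assumptions:

    /* mapping_set_error() (via end_extent_writepage) records failed
     * page writeback in these flags, so testing them catches ordered
     * extents that completed with an error before we collected them. */
    if (test_and_clear_bit(AS_EIO, &inode->i_mapping->flags))
            ret = -EIO;
    if (test_and_clear_bit(AS_ENOSPC, &inode->i_mapping->flags))
            ret = -ENOSPC;
    if (ret)
            goto out_unlock;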
-
- 21 Nov 2014, 10 commits
-
-
Submitted by Filipe Manana
If cow_file_range_inline() failed, when called from compress_file_range(), we were tagging the locked page for writeback, ending its writeback and unlocking it, but not marking it with an error nor setting AS_EIO in the inode's mapping flags. This made it impossible for a caller of filemap_fdatawrite_range (writepages) or filemap_fdatawait_range() to know that an error happened. And the return value of compress_file_range() is useless because it's returned to a workqueue task and not to the task calling filemap_fdatawrite_range (writepages). This change applies on top of the previous patchset starting at the patch titled "[1/5] Btrfs: set page and mapping error on compressed write failure", which changed extent_clear_unlock_delalloc() to use SetPageError and mapping_set_error(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
To avoid duplicating this double filemap_fdatawrite_range() call for inodes with async extents (compressed writes) so often. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
For compressed writes, after doing the first filemap_fdatawrite_range() we don't get the pages tagged for writeback immediately. Instead we create a workqueue task, which is run by another kthread, and keep the pages locked. That other kthread compresses data, creates the respective ordered extent/s, tags the pages for writeback and unlocks them. Therefore we need a second call to filemap_fdatawrite_range() if we have compressed writes, as this second call will wait for the pages to become unlocked, then see they became tagged for writeback and finally wait for the writeback to finish. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
Its return value is useless: its single caller ignores it and can't do anything with it anyway, since it's a workqueue task and not the task calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range(). Failure is communicated to such functions via the start and end of writeback, with the respective pages tagged with an error and the AS_EIO flag set in the inode's i_mapping. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Shilong Wang
Steps to reproduce:

    # mkfs.btrfs -f /dev/sdb
    # mount -t btrfs /dev/sdb /mnt -o compress=lzo
    # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1

After the previous steps, the inode will be detected as having a bad compression ratio, and the NOCOMPRESS flag will be set for it. The reason is that compression has a maximum number of pages per pass (128K): if a 132K write comes in, it will be split into two writes (128K+4K). This bug is a leftover from commit 68bb462d (Btrfs: don't compress for a small write). Fix this problem by checking every time before compression: if it is a small write (<=blocksize), we bail out and fall into no-compression directly. Signed-off-by: Wang Shilong <wangshilong1991@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
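A sketch of the per-chunk bail-out described above; the variable names and the exact inline-extent exclusion follow common compress_file_range() code but are assumptions here:

    /* A range no larger than one block that can't become an inline
     * extent saves no space when compressed, so take the uncompressed
     * path instead of branding the inode with a bad ratio. */
    if ((end - start + 1) <= blocksize &&
        (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
            goto cleanup_and_bail_uncompressed;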
-
Submitted by Filipe Manana
Its return value is completely ignored by its single caller, and it's useless anyway, since errors are indicated through SetPageError and the AS_EIO bit set in the flags of the inode's mapping. The caller can't do anything with the value, as it's invoked from a workqueue task and not by the task calling filemap_fdatawrite_range (which calls the writepages address space callback, which in turn calls the inode's fill_delalloc callback). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
If we had an error when processing one of the async extents from our list, we were not processing the remaining async extents, meaning we would leak those async_extent structs, never release the pages with the compressed data and never unlock and clear the dirty flag from the inode's pages (those that correspond to the uncompressed content). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
In inode.c:submit_compressed_extents(), if we fail before calling btrfs_submit_compressed_write(), or when that function fails, we were freeing the async_extent structure without releasing its pages and freeing the pages array. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write() we start writeback for all pages, clear their dirty flag, unlock them, etc, but if btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM), we never end the writeback on the pages, so any filemap_fdatawait_range() call will hang forever. We were also not calling the writepage end io hook, which means the corresponding ordered extent will never complete and all its waiters will block forever, such as a full fsync (via btrfs_wait_ordered_range()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(), we start and end the writeback for the pages (clear their dirty flag, unlock them, etc) but we don't tag the pages, nor the inode's mapping, with an error. This makes it impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit for example) to know that there was an error. Note that the return value of submit_compressed_extents() is useless, as that function is executed by a workqueue task and not directly by the fill_delalloc callback. This means the writepage/s callbacks of the inode's address space operations don't get that return value. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 20 Nov 2014, 1 commit
-
-
Submitted by Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 17 Oct 2014, 1 commit
-
-
Submitted by Chris Mason
This reverts commit 9c3b306e. Switching only one commit root during a transaction is wrong because it leads the fs into an inconsistent state. All commit roots should be switched at once, at transaction commit time, otherwise backref walking can often miss important references that were only accessible through the old commit root. Plus, the root item for the snapshot's root wasn't getting updated, preventing the next transaction commit from doing it. This made several users run into random corruption issues after creation of readonly snapshots. A regression test for xfstests will follow soon. Cc: stable@vger.kernel.org # 3.17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
-
- 04 Oct 2014, 1 commit
-
-
Submitted by Qu Wenruo
The previous commit, "btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map", used the wrong condition to judge whether the range is a subset of an existing extent map. This may cause a bug in btrfs no-holes mode. This patch corrects the judgment and fixes the bug. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
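The commit doesn't quote the corrected condition, but a range-subset judgment for an extent map generally looks like this sketch (extent_map_end() is the usual end-of-map helper):

    /* [start, start + len) is a subset of 'existing' only when it is
     * covered on both ends: */
    if (existing->start <= start &&
        extent_map_end(existing) >= start + len) {
            /* the existing extent map fully covers the range */
    }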
-
- 02 Oct 2014, 3 commits
-
-
Submitted by David Sterba
The enum exists but is not consistently used. Signed-off-by: David Sterba <dsterba@suse.cz>
-
Submitted by David Sterba
There are branch hints that obviously depend on the data being processed; the CPU predictor will do a better job according to the actual load. It also does not make sense to use the hints in slow paths that do a lot of other operations like locking, waiting or IO. Signed-off-by: David Sterba <dsterba@suse.cz>
-
Submitted by David Sterba
Unlikely is implicit for NULL checks of pointers. Signed-off-by: David Sterba <dsterba@suse.cz>
-
- 23 Sep 2014, 1 commit
-
-
Submitted by Josef Bacik
When doing log replay we may have to update inodes, which traditionally goes through our delayed inode stuff. This will try to move space over from the trans handle, but we don't reserve space in our trans handle on replay since we don't know how much we will need, so instead we try to flush. But because we have a trans handle open we won't flush anything, so if we are out of reserve space we will simply return ENOSPC. Since we know that if an operation made it into the log then we definitely had space before the box bought the farm, we don't need to worry about doing this space reservation. Use the fs_info->log_root_recovering flag to skip the delayed inode stuff and update the item directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
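A sketch of the dispatch the last sentence describes, assuming it lives in btrfs_update_inode(); the real condition in the patch may carry extra checks:

    /* During log replay, skip the delayed inode machinery (it would
     * try to move reservation we never made) and write the inode item
     * directly into the subvolume tree. */
    if (!root->fs_info->log_root_recovering)
            return btrfs_delayed_update_inode(trans, root, inode);

    return btrfs_update_inode_item(trans, root, inode);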
-
- 18 Sep 2014, 12 commits
-
-
Submitted by Qu Wenruo
The following commit enhanced merge_extent_mapping() to reduce fragmentation in the extent map tree, but it can't handle the case where the existing extent lies before map_start:

    51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When the existing extent map's start is before map_start, em->len goes negative, which corrupts the extent map and makes the insertion of the new extent map fail. This happens when someone gets a large extent map, but by the time it is going to be inserted into the extent map tree, someone else has already committed some writes and split the huge extent into small parts.

[REPRODUCER]
It is very easy to trigger using filebench with the randomrw personality. It reproduces close to 100% of the time in a 60s randomrw test with an 8G preallocated file.

[FIX]
This patch can now handle any existing extent position. Since it does not directly use existing->start, it now finds the previous and next extents around map_start, so the old existing->start < map_start bug can never happen again.

[ENHANCE]
This patch inserts the best fitted extent map into the extent map tree, rather than the oldest [map_start, map_start + sectorsize) or the relatively newer but not perfect [map_start, existing->start). The patch first searches for an existing extent that does not intersect the desired map range [map_start, map_start + len). That existing extent will be either before or behind map_start, and based on it we can find the previous and next extents around map_start. So the best fitted extent is [prev->end, next->start), with em->start being prev->end and em->end being next->start.

With this patch, fragmentation in the extent map tree should be reduced much more than with the 51f39 commit, and an unneeded extent map tree search is avoided.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
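A sketch of the clamping described under [ENHANCE]; the handling of a missing prev/next neighbour is an assumption:

    /* Clamp the new extent map to the free gap around map_start,
     * i.e. [prev_end, next_start); a missing neighbour leaves the
     * corresponding bound as the map's own bound. */
    u64 prev_end = prev ? extent_map_end(prev) : em->start;
    u64 next_start = next ? next->start : extent_map_end(em);
    u64 start = max(em->start, prev_end);
    u64 end = min(extent_map_end(em), next_start);

    em->start = start;
    em->len = end - start;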
-
Submitted by Miao Xie
After the data is written successfully, we should clean up the read failure record in that range because:

- If we set data COW for the file, the range the failure record points to is mapped to a new place, so it is invalid.
- If we set no data COW for the file, and there is no error during writing, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it because the failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so failure records will be left in the tree; we need to free them when we free the inode, or a memory leak happens.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Miao Xie
This patch implements the data repair function for when a direct read fails. The detail of the implementation is:

- When we find the data is not right, we try to read the data from the other mirror.
- When the io on the mirror ends, we insert the endio work into the dedicated btrfs workqueue, not the common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue; if we inserted the endio work of the io on the mirror into that workqueue, a deadlock would happen.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next mirror until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
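A sketch of the mirror walk these steps describe; every helper name here is hypothetical, for illustration only:

    /* Try each remaining mirror, write the good copy back over the
     * corrupted one, and give up when all mirrors are exhausted. */
    int mirror = failed_mirror;
    int uptodate = 0;

    while ((mirror = next_mirror(failrec, mirror)) != failed_mirror) {
            if (read_from_mirror(inode, failrec, mirror) == 0) {
                    write_back_good_copy(inode, failrec, failed_mirror);
                    uptodate = 1;
                    break;
            }
    }
    set_dio_uptodate(dip, uptodate);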
-
Submitted by Miao Xie
Direct IO splits the original bio into several sub-bios because of the limit of the raid stripe, and the filesystem waits for all sub-bios and then runs the final end io process. But it was very hard to implement data repair when a dio read failure happens, because at the final end io function we didn't know which mirror the data was read from. So in order to implement data repair, we have to move the file data check from the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we accessed. This patch does this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Miao Xie
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Miao Xie
The current code would load checksum data several times when we split a whole direct read io because of the limit of the raid stripe, making us search the csum tree several times. In fact, it just wasted time and made the contention on the csum tree root more serious. This patch improves on this by loading the data at once. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Wang Shilong
Btrfs can still inline file data if its size is the same as the page size, so don't skip the max value here. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Wang Shilong
If the NOCOMPRESS flag is set, which means a bad compression ratio, we can avoid calling cow_file_range_async() for this case earlier. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Wang Shilong
If a file's compression ratio is bad, we set the NOCOMPRESS flag for it, and compression is skipped for that inode from then on. However, if we remount the fs with COMPRESS_FORCE, it should still try to compress pages for that inode. This patch fixes the wrong check for this problem. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
-
Submitted by Filipe Manana
We were returning 0 (success) because we weren't extracting the error code from em (PTR_ERR(em)). Fix it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
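The fix restores the usual ERR_PTR idiom the message points at (a sketch; the call site is illustrative):

    em = btrfs_get_extent(...);
    if (IS_ERR(em)) {
            ret = PTR_ERR(em);  /* was: ret left at 0 */
            goto out;
    }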
-
Submitted by Wang Shilong
Btrfs defragmentation utilizes the COW feature, which means it does not work with the nodatacow option; this problem was detected by xfstests generic/018 with the nodatacow mount option. Fix this problem by forcing COW for an extent with the state @EXTENT_DEFRAG set. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
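A sketch of forcing COW off the EXTENT_DEFRAG bit, using the test_range_bit() helper from extent_io.c; the wrapper name need_force_cow is an assumption:

    /* Defrag only works by rewriting (COWing) extents, so a nodatacow
     * inode must still COW any range flagged EXTENT_DEFRAG. */
    static inline int need_force_cow(struct inode *inode, u64 start, u64 end)
    {
            return test_range_bit(&BTRFS_I(inode)->io_tree, start, end,
                                  EXTENT_DEFRAG, 0, NULL);
    }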
-
Submitted by David Sterba
btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
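The direct member access the commit standardizes on, next to the helper form it replaces (key values are just examples):

    struct btrfs_key key;

    /* before: a helper for .type, direct access for the rest */
    key.objectid = objectid;
    btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
    key.offset = 0;

    /* after: all three members accessed directly */
    key.objectid = objectid;
    key.type = BTRFS_INODE_ITEM_KEY;
    key.offset = 0;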
-