- 21 10月, 2015 1 次提交
-
-
由 Qu Wenruo 提交于
Current code will always truncate tailing page if its alloc_start is smaller than inode size. For example, the file extent layout is like: 0 4K 8K 16K 32K |<-----Extent A---------------->| |<--Inode size: 18K---------->| But if calling fallocate even for range [0,4K), it will cause btrfs to re-truncate the range [16,32K), causing COW and a new extent. 0 4K 8K 16K 32K |///////| <- Fallocate call range |<-----Extent A-------->|<--B-->| The cause is quite easy, just a careless btrfs_truncate_inode() in a else branch without extra judgment. Fix it by add judgment on whether the fallocate range is beyond isize. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 24 6月, 2015 1 次提交
-
-
由 Jan Kara 提交于
file_remove_suid() is a misnomer since it removes also file capabilities stored in xattrs and sets S_NOSEC flag. Also should_remove_suid() tells something else than whether file_remove_suid() call is necessary which leads to bugs. Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 10 6月, 2015 1 次提交
-
-
由 Filipe Manana 提交于
Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added a performance regression for that causes an unnecessary sync of the log trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done against a file, without no writes or any metadata updates to the inode in between them and if a transaction is committed before the second fsync is called. Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99) after a test sysbench test that measured a -62% decrease of file io requests per second for that tests' workload. The test is: echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor mkfs -t btrfs /dev/sda2 mount -t btrfs /dev/sda2 /fs/sda2 cd /fs/sda2 for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \ --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \ --file-num=1024 run A test on kvm guest, running a debug kernel gave me the following results: Without 3a8b36f3: 16.01 reqs/sec With 3a8b36f3: 3.39 reqs/sec With 3a8b36f3 and this patch: 16.04 reqs/sec Reported-by: NHuang Ying <ying.huang@intel.com> Tested-by: NHuang, Ying <ying.huang@intel.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 16 4月, 2015 1 次提交
-
-
由 David Howells 提交于
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 13 4月, 2015 2 次提交
-
-
由 Dongsheng Yang 提交于
There are two problems in qgroup: a). The PAGE_CACHE is 4K, even when we are writing a data of 1K, qgroup will reserve a 4K size. It will cause the last 3K in a qgroup is not available to user. b). When user is writing a inline data, qgroup will not reserve it, it means this is a window we can exceed the limit of a qgroup. The main idea of this patch is reserving the data size of write_bytes rather than the reserve_bytes. It means qgroup will not care about the data size btrfs will reserve for user, but only care about the data size user is going to write. Then reserve it when user want to write and release it in transaction committed. In this way, qgroup can be released from the complex procedure in btrfs and only do the reserve when user want to write and account when the data is written in commit_transaction(). Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Dongsheng Yang 提交于
Currenly, in data writing, ->reserved is accounted in fill_delalloc(), but ->may_use is released in clear_bit_hook() which is called by btrfs_finish_ordered_io(). That's too late, that said, between fill_delalloc() and btrfs_finish_ordered_io(), the data is doublely accounted by qgroup. It will cause some unexpected -EDQUOT. Example: # btrfs quota enable /root/btrfs-auto-test/ # btrfs subvolume create /root/btrfs-auto-test//sub Create subvolume '/root/btrfs-auto-test/sub' # btrfs qgroup limit 1G /root/btrfs-auto-test//sub dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=1500000 dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded 681353+0 records in 681352+0 records out 697704448 bytes (698 MB) copied, 8.15563 s, 85.5 MB/s It's (698 MB) when we got an -EDQUOT, but we limit it by 1G. This patch move the btrfs_qgroup_reserve/free() for data from btrfs_delalloc_reserve/release_metadata() to btrfs_check_data_free_space() and btrfs_free_reserved_data_space(). Then the accounter in qgroup will be updated at the same time with the accounter in space_info updated. In this way, the unexpected -EDQUOT will be killed. Reported-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 12 4月, 2015 4 次提交
-
-
由 Al Viro 提交于
... avoiding write_iter/fcntl races. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... returning -E... upon error and amount of data left in iter after (possible) truncation upon success. Note, that normal case gives a non-zero (positive) return value, so any tests for != 0 _must_ be updated. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Conflicts: fs/ext4/file.c
-
由 Al Viro 提交于
all remaining callers are passing 0; some just obscure that fact. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
All places outside of core VFS that checked ->read and ->write for being NULL or called the methods directly are gone now, so NULL {read,write} with non-NULL {read,write}_iter will do the right thing in all cases. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 27 3月, 2015 2 次提交
-
-
由 Filipe Manana 提交于
We can get into inconsistency between inodes and directory entries after fsyncing a directory. The issue is that while a directory gets the new dentries persisted in the fsync log and replayed at mount time, the link count of the inode that directory entries point to doesn't get updated, staying with an incorrect link count (smaller then the correct value). This later leads to stale file handle errors when accessing (including attempt to delete) some of the links if all the other ones are removed, which also implies impossibility to delete the parent directories, since the dentries can not be removed. Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2), when fsyncing a directory, new files aren't logged (their metadata and dentries) nor any child directories. So this patch fixes this issue too, since it has the same resolution as the incorrect inode link count issue mentioned before. This is very easy to reproduce, and the following excerpt from my test case for xfstests shows how: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file and directory. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io mkdir $SCRATCH_MNT/mydir # Make sure all metadata and data are durably persisted. sync # Add a hard link to 'foo' inside our test directory and fsync only the # directory. The btrfs fsync implementation had a bug that caused the new # directory entry to be visible after the fsync log replay but, the inode # of our file remained with a link count of 1. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2 # Add a few more links and new files. # This is just to verify nothing breaks or gives incorrect results after the # fsync log is replayed. ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3 $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2 # Add some subdirectories and new files and links to them. This is to verify # that after fsyncing our top level directory 'mydir', all the subdirectories # and their files/links are registered in the fsync log and exist after the # fsync log is replayed. mkdir -p $SCRATCH_MNT/mydir/x/y/z ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link touch $SCRATCH_MNT/mydir/x/y/z/qwerty # Now fsync only our top directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir # And fsync now our new file named 'hello', just to verify later that it has # the expected content and that the previous fsync on the directory 'mydir' had # no bad influence on this fsync. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Verify the content of our file 'foo' remains the same as before, 8192 bytes, # all with the value 0xaa. echo "File 'foo' content after log replay:" od -t x1 $SCRATCH_MNT/foo # Remove the first name of our inode. Because of the directory fsync bug, the # inode's link count was 1 instead of 5, so removing the 'foo' name ended up # deleting the inode and the other names became stale directory entries (still # visible to applications). Attempting to remove or access the remaining # dentries pointing to that inode resulted in stale file handle errors and # made it impossible to remove the parent directories since it was impossible # for them to become empty. echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)" rm -f $SCRATCH_MNT/foo # Now verify that all files, links and directories created before fsyncing our # directory exist after the fsync log was replayed. [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing" [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing" [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing" [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing" [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \ echo "Link mydir/x/y/foo_y_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \ echo "Link mydir/x/y/z/foo_z_link is missing" [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \ echo "File mydir/x/y/z/qwerty is missing" # We expect our file here to have a size of 64Kb and all the bytes having the # value 0xff. echo "file 'hello' content after log replay:" od -t x1 $SCRATCH_MNT/hello # Now remove all files/links, under our test directory 'mydir', and verify we # can remove all the directories. rm -f $SCRATCH_MNT/mydir/x/y/z/* rmdir $SCRATCH_MNT/mydir/x/y/z rm -f $SCRATCH_MNT/mydir/x/y/* rmdir $SCRATCH_MNT/mydir/x/y rmdir $SCRATCH_MNT/mydir/x rm -f $SCRATCH_MNT/mydir/* rmdir $SCRATCH_MNT/mydir # An fsck, run by the fstests framework everytime a test finishes, also detected # the inconsistency and printed the following error message: # # root 5 inode 257 errors 2001, no inode item, link count wrong # unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref # unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref status=0 exit The expected golden output for the test is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 5 file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 Which is the output after this patch and when running the test against ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's output is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 65536/65536 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File 'foo' content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 file 'foo' link count after log replay: 1 Link mydir/foo_2 is missing Link mydir/foo_3 is missing Link mydir/x/y/foo_y_link is missing Link mydir/x/y/z/foo_z_link is missing File mydir/x/y/z/qwerty is missing file 'hello' content after log replay: 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0200000 rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty Fsck, without this fix, also complains about the wrong link count: root 5 inode 257 errors 2001, no inode item, link count wrong unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref So fix this by logging the inodes that the dentries point to when fsyncing a directory. A test case for xfstests follows. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
If we fallocate(), without the keep size flag, into an area already covered by an extent previously fallocated, we were updating the inode's i_size but we weren't updating the inode item in the fs/subvol tree. A following umount + mount would result in a loss of the inode's size (and an fsync would miss too the fact that the inode changed). Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ fallocate -n -l 1M /mnt/foobar $ fallocate -l 512K /mnt/foobar $ umount /mnt $ mount /dev/sdd /mnt $ od -t x1 /mnt/foobar 0000000 The expected result is: $ od -t x1 /mnt/foobar 0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 2000000 A test case for fstests follows soon. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 26 3月, 2015 1 次提交
-
-
由 Christoph Hellwig 提交于
struct kiocb now is a generic I/O container, so move it to fs.h. Also do a #include diet for aio.h while we're at it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 06 3月, 2015 1 次提交
-
-
由 Filipe Manana 提交于
When using the fast file fsync code path we can miss the fact that new writes happened since the last file fsync and therefore return without waiting for the IO to finish and write the new extents to the fsync log. Here's an example scenario where the fsync will miss the fact that new file data exists that wasn't yet durably persisted: 1. fs_info->last_trans_committed == N - 1 and current transaction is transaction N (fs_info->generation == N); 2. do a buffered write; 3. fsync our inode, this clears our inode's full sync flag, starts an ordered extent and waits for it to complete - when it completes at btrfs_finish_ordered_io(), the inode's last_trans is set to the value N (via btrfs_update_inode_fallback -> btrfs_update_inode -> btrfs_set_inode_last_trans); 4. transaction N is committed, so fs_info->last_trans_committed is now set to the value N and fs_info->generation remains with the value N; 5. do another buffered write, when this happens btrfs_file_write_iter sets our inode's last_trans to the value N + 1 (that is fs_info->generation + 1 == N + 1); 6. transaction N + 1 is started and fs_info->generation now has the value N + 1; 7. transaction N + 1 is committed, so fs_info->last_trans_committed is set to the value N + 1; 8. fsync our inode - because it doesn't have the full sync flag set, we only start the ordered extent, we don't wait for it to complete (only in a later phase) therefore its last_trans field has the value N + 1 set previously by btrfs_file_write_iter(), and so we have: inode->last_trans <= fs_info->last_trans_committed (N + 1) (N + 1) Which made us not log the last buffered write and exit the fsync handler immediately, returning success (0) to user space and resulting in data loss after a crash. This can actually be triggered deterministically and the following excerpt from a testcase I made for xfstests triggers the issue. It moves a dummy file across directories and then fsyncs the old parent directory - this is just to trigger a transaction commit, so moving files around isn't directly related to the issue but it was chosen because running 'sync' for example does more than just committing the current transaction, as it flushes/waits for all file data to be persisted. The issue can also happen at random periods, since the transaction kthread periodicaly commits the current transaction (about every 30 seconds by default). The body of the test is: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our main test file 'foo', the one we check for data loss. # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync' # bit from its flags (btrfs inode specific flags). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \ -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io # Now create one other file and 2 directories. We will move this second file # from one directory to the other later because it forces btrfs to commit its # currently open transaction if we fsync the old parent directory. This is # necessary to trigger the data loss bug that affected btrfs. mkdir $SCRATCH_MNT/testdir_1 touch $SCRATCH_MNT/testdir_1/bar mkdir $SCRATCH_MNT/testdir_2 # Make sure everything is durably persisted. sync # Write more 8Kb of data to our file. $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io # Move our 'bar' file into a new directory. mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar # Fsync our first directory. Because it had a file moved into some other # directory, this made btrfs commit the currently open transaction. This is # a condition necessary to trigger the data loss bug. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1 # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of # data we wrote previously to be persisted and available if a crash happens. # This did not happen with btrfs, because of the transaction commit that # happened when we fsynced the parent directory. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Now check that all data we wrote before are available. echo "File content after log replay:" od -t x1 $SCRATCH_MNT/foo status=0 exit The expected golden output for the test, which is what we get with this fix applied (or when running against ext3/4 and xfs), is: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb * 0040000 Without this fix applied, the output shows the test file does not have the second 8Kb extent that we successfully fsynced: wrote 8192/8192 bytes at offset 0 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) wrote 8192/8192 bytes at offset 8192 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) File content after log replay: 0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa * 0020000 So fix this by skipping the fsync only if we're doing a full sync and if the inode's last_trans is <= fs_info->last_trans_committed, or if the inode is already in the log. Also remove setting the inode's last_trans in btrfs_file_write_iter since it's useless/unreliable. Also because btrfs_file_write_iter no longer sets inode->last_trans to fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't bail out if last_trans is 0, otherwise something as simple as the following example wouldn't log the second write on the last fsync: 1. write to file 2. fsync file 3. fsync file |--> btrfs_inode_in_log() returns true and it set last_trans to 0 4. write to file |--> btrfs_file_write_iter() no longers sets last_trans, so it remained with a value of 0 5. fsync |--> inode->last_trans == 0, so it bails out without logging the second write A test case for xfstests will be sent soon. CC: <stable@vger.kernel.org> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 04 3月, 2015 3 次提交
-
-
由 David Sterba 提交于
There are lockstart and lockend defined in the function and not used after their duplicate definition scope ends, it's safe to reuse them. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Convert kmalloc(nr * size, ..) to kmalloc_array that does additional overflow checks, the zeroing variant is kcalloc. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
由 David Sterba 提交于
Clean the opencoded variant, cond_resched_lock also checks the lock for contention so it might help in some cases that were not covered by simple need_resched(). Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
- 03 3月, 2015 1 次提交
-
-
由 Filipe Manana 提交于
When punching a file hole if we endup only zeroing parts of a page, because the start offset isn't a multiple of the sector size or the start offset and length fall within the same page, we were not updating the inode item. This prevented an fsync from doing anything, if no other file changes happened in the current transaction, because the fields in btrfs_inode used to check if the inode needs to be fsync'ed weren't updated. This issue is easy to reproduce and the following excerpt from the xfstest case I made shows how to trigger it: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey # Create our test file. $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Fsync the file, this makes btrfs update some btrfs inode specific fields # that are used to track if the inode needs to be written/updated to the fsync # log or not. After this fsync, the new values for those fields indicate that # a subsequent fsync does not need to touch the fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo # Force a commit of the current transaction. After this point, any operation # that modifies the data or metadata of our file, should update those fields in # the btrfs inode with values that make the next fsync operation write to the # fsync log. sync # Punch a hole in our file. This small range affects only 1 page. # This made the btrfs hole punching implementation write only some zeroes in # one page, but it did not update the btrfs inode fields used to determine if # the next fsync needs to write to the fsync log. $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo # Another variation of the previously mentioned case. $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo # Now fsync the file. This was a no-operation because the previous hole punch # operation didn't update the inode's fields mentioned before, so they remained # with the values they had after the first fsync - that is, they indicate that # it is not needed to write to fsync log. $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo echo "File content before:" od -t x1 $SCRATCH_MNT/foo # Simulate a crash/power loss. _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey # Enable writes and mount the fs. This makes the fsync log replay code run. _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey # Because the last fsync didn't do anything, here the file content matched what # it was after the first fsync, before the holes were punched, and not what it # was after the holes were punched. echo "File content after:" od -t x1 $SCRATCH_MNT/foo This issue has been around since 2012, when the punch hole implementation was added, commit 2aaa6655 ("Btrfs: add hole punching"). A test case for xfstests follows soon. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 17 2月, 2015 1 次提交
-
-
由 Daniel Dressler 提交于
This patch is part of a larger project to cleanup btrfs's internal usage of struct btrfs_root. Many functions take btrfs_root only to grab a pointer to fs_info. This causes programmers to ponder which root can be passed. Since only the fs_info is read affected functions can accept any root, except this is only obvious upon inspection. This patch reduces the specificty of such functions to accept the fs_info directly. This patch does not address the two functions in ctree.c (insert_ptr, and split_item) which only use root for BUG_ONs in ctree.c This patch affects the following functions: 1) fixup_low_keys 2) btrfs_set_item_key_safe Signed-off-by: NDaniel Dressler <danieru.dressler@gmail.com> Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
- 11 2月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
Nobody uses it anymore. [akpm@linux-foundation.org: fix filemap_xip.c] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 1月, 2015 1 次提交
-
-
由 Christoph Hellwig 提交于
Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 11月, 2014 1 次提交
-
-
由 Filipe Manana 提交于
If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it. For example, if we perform the following file operations: $ mkfs.btrfs -f /dev/vdd $ mount /dev/vdd /mnt $ xfs_io -f \ -c "pwrite -S 0xaa -b 32K 0 32K" \ -c "fsync" \ -c "pwrite -S 0xbb -b 32770 16K 32770" \ -c "truncate 90123" \ /mnt/foobar and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree: item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160 inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0 item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20 inode ref index 282 namelen 10 name: foobar item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53 extent data disk byte 1104855040 nr 32768 extent data offset 0 nr 32768 ram 32768 extent compression 0 item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 40960 ram 40960 extent compression 0 There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[ for which there's no file extent item covering it. This is because the file write and file truncate operations happened both right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. So this made it possibe to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl. Btrfs' fsck tool complains about such cases with a message like the following: "root 331 inode 257 errors 100, file extent discount" >From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file that either: 1) is empty 2) only the first write was captured 3) only the 2 writes were captured 4) both writes and the truncation were captured But never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation). A test case for xfstests follows. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 21 11月, 2014 2 次提交
-
-
由 Filipe Manana 提交于
To avoid duplicating this double filemap_fdatawrite_range() call for inodes with async extents (compressed writes) so often. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
For compressed writes, after doing the first filemap_fdatawrite_range() we don't get the pages tagged for writeback immediately. Instead we create a workqueue task, which is run by other kthread, and keep the pages locked. That other kthread compresses data, creates the respective ordered extent/s, tags the pages for writeback and unlocks them. Therefore we need a second call to filemap_fdatawrite_range() if we have compressed writes, as this second call will wait for the pages to become unlocked, then see they became tagged for writeback and finally wait for the writeback to finish. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 02 10月, 2014 1 次提交
-
-
由 David Sterba 提交于
There are the branch hints that obviously depend on the data being processed, the CPU predictor will do better job according to the actual load. It also does not make sense to use the hints in slow paths that do a lot of other operations like locking, waiting or IO. Signed-off-by: NDavid Sterba <dsterba@suse.cz>
-
- 19 9月, 2014 2 次提交
-
-
由 Filipe Manana 提交于
When we do a fast fsync, we start all ordered operations and then while they're running in parallel we visit the list of modified extent maps and construct their matching file extent items and write them to the log btree. After that, in btrfs_sync_log() we wait for all the ordered operations to finish (via btrfs_wait_logged_extents). The problem with this is that we were completely ignoring errors that can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM or -EIO errors for example. When such error happens, it means we have parts of the on disk extent that weren't written to, and so we end up logging file extent items that point to these extents that contain garbage/random data - so after a crash/reboot plus log replay, we get our inode's metadata pointing to those extents. This worked in contrast with the full (non-fast) fsync path, where we start all ordered operations, wait for them to finish and then write to the log btree. In this path, after each ordered operation completes we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return -EIO if so (via btrfs_wait_ordered_range). So if an error happens with any ordered operation, just return a -EIO error to userspace, so that it knows that not all of its previous writes were durably persisted and the application can take proper action (like redo the writes for e.g.) - and definitely not leave any file extent items in the log refer to non fully written extents. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Filipe Manana 提交于
When the fsync callback (btrfs_sync_file) starts, it first waits for the writeback of any dirty pages to start and finish without holding the inode's mutex (to reduce contention). After this it acquires the inode's mutex and repeats that process via btrfs_wait_ordered_range only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag is set on the inode). This is not safe for a non full sync - we need to start and wait for writeback to finish for any pages that might have been made dirty before acquiring the inode's mutex and after that first step mentioned before. Why this is needed is explained by the following comment added to btrfs_sync_file: "Right before acquiring the inode's mutex, we might have new writes dirtying pages, which won't immediately start the respective ordered operations - that is done through the fill_delalloc callbacks invoked from the writepage and writepages address space operations. So make sure we start all ordered operations before starting to log our inode. Not doing this means that while logging the inode, writeback could start and invoke writepage/writepages, which would call the fill_delalloc callbacks (cow_file_range, submit_compressed_extents). These callbacks add first an extent map to the modified list of extents and then create the respective ordered operation, which means in tree-log.c:btrfs_log_inode() we might capture all existing ordered operations (with btrfs_get_logged_extents()) before the fill_delalloc callback adds its ordered operation, and by the time we visit the modified list of extent maps (with btrfs_log_changed_extents()), we see and process the extent map they created. We then use the extent map to construct a file extent item for logging without waiting for the respective ordered operation to finish - this file extent item points to a disk location that might not have yet been written to, containing random data - so after a crash a log replay will make our inode have file extent items that point to disk locations containing invalid data, as we returned success to userspace without waiting for the respective ordered operation to finish, because it wasn't captured by btrfs_get_logged_extents()." Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 18 9月, 2014 4 次提交
-
-
由 Liu Bo 提交于
An user reported this, it is because that lseek's SEEK_SET/SEEK_CUR/SEEK_END allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't prepare for that and convert the negative @offset into unsigned type, so we get (end < start) warning. [ 1269.835374] ------------[ cut here ]------------ [ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140() [ 1269.838816] BTRFS: end < start 4094 18446744073709551615 [ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G W 3.16.0+ #306 [ 1269.858229] Call Trace: [ 1269.858612] [<ffffffff81801a69>] dump_stack+0x4e/0x68 [ 1269.858952] [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0 [ 1269.859416] [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50 [ 1269.859929] [<ffffffff813b0fbd>] insert_state+0x11d/0x140 [ 1269.860409] [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0 [ 1269.860805] [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200 [ 1269.861697] [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0 [ 1269.862168] [<ffffffff811f201e>] SyS_lseek+0xae/0xc0 [ 1269.862620] [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b [ 1269.862970] ---[ end trace 4d33ea885832054b ]--- This assumes that btrfs starts finding DATA/HOLE from the beginning of file if the assigned @offset is negative. Also we add alignment for lock_extent_bits 's range. Reported-by: NToralf Förster <toralf.foerster@gmx.de> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The form (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT is equivalent to (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE The rest is a simple subsitution, no difference in the generated assembly code. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
由 David Sterba 提交于
btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: NDavid Sterba <dsterba@suse.cz> Signed-off-by: NChris Mason <clm@fb.com>
-
- 09 9月, 2014 1 次提交
-
-
由 Filipe Manana 提交于
While we're doing a full fsync (when the inode has the flag BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a portion of the file), we might have ordered operations that are started before or while we're logging the inode and that fall outside the fsync range. Therefore when a full ranged fsync finishes don't remove every extent map from the list of modified extent maps - as for some of them, that fall outside our fsync range, their respective ordered operation hasn't finished yet, meaning the corresponding file extent item wasn't inserted into the fs/subvol tree yet and therefore we didn't log it, and we must let the next fast fsync (one that checks only the modified list) see this extent map and log a matching file extent item to the log btree and wait for its ordered operation to finish (if it's still ongoing). A test case for xfstests follows. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 21 8月, 2014 2 次提交
-
-
由 Chris Mason 提交于
We should only be flushing on close if the file was flagged as needing it during truncate. I broke this with my ordered data vs transaction commit deadlock fix. Thanks to Miao Xie for catching this. Signed-off-by: NChris Mason <clm@fb.com> Reported-by: NMiao Xie <miaox@cn.fujitsu.com> Reported-by: NFengguang Wu <fengguang.wu@intel.com>
-
由 Qu Wenruo 提交于
When current btrfs finds that a new extent map is going to be insereted but failed with -EEXIST, it will try again to insert the extent map but with the length of sectorsize. This is OK if we don't enable 'no-holes' feature since all extent space is continuous, we will not go into the not found->insert routine. But if we enable 'no-holes' feature, it will make things out of control. e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent(): btrfs_get_extent() args: start: 27874 len 4100 28672 27874 28672 27874+4100 32768 |-----------------------| |---------hole--------------------|---------data----------| 1) not found and insert Since no extent map containing the range, btrfs_get_extent() will go into the not_found and insert routine, which will try to insert the extent map (27874, 27847 + 4100). 2) first overlap But it overlaps with (28672, 32768) extent, so -EEXIST will be returned by add_extent_mapping(). 3) retry but still overlap After catching the -EEXIST, then btrfs_get_extent() will try insert it again but with 4K length, which still overlaps, so -EEXIST will be returned. This makes the following patch fail to punch hole. d7781546 btrfs: Avoid trucating page or punching hole in a already existed hole. This patch will use the right length, which is the (exsisting->start - em->start) to insert, making the above patch works in 'no-holes' mode. Also, some small code style problems in above patch is fixed too. Reported-by: NFilipe David Manana <fdmanana@gmail.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NFilipe David Manana <fdmanana@suse.com> Tested-by: NFilipe David Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 19 8月, 2014 1 次提交
-
-
由 chandan 提交于
For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot where the key could have been present, which in this case would be the slot containing the extent item which would be the next neighbor of the file range being punched. The current code passes an incremented path->slots[0] and we skip to the wrong file extent item. This would mean that we would fail to merge the "yet to be created" hole with the next neighboring hole (if one exists). Fix this. Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
- 15 8月, 2014 1 次提交
-
-
由 Chris Mason 提交于
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: NChris Mason <clm@fb.com>
-
- 10 6月, 2014 4 次提交
-
-
由 Filipe Manana 提交于
If btrfs_log_dentry_safe() returns an error, we set ret to 1 and fall through with the goal of committing the transaction. However, in the case where the inode doesn't need a full sync, we would call btrfs_wait_ordered_range() against the target range for our inode, and if it returned an error, we would return without commiting or ending the transaction. Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Qu Wenruo 提交于
btrfs_punch_hole() will truncate unaligned pages or punch hole on a already existed hole. This will cause unneeded zero page or holes splitting the original huge hole. This patch will skip already existed holes before any page truncating or hole punching. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NChris Mason <clm@fb.com>
-
由 Alex Gartrell 提交于
In these instances, we are trying to determine if a page has been accessed since we began the operation for the sake of retry. This is easily accomplished by doing a gang lookup in the page mapping radix tree, and it saves us the dependency on the flag (so that we might eventually delete it). btrfs_page_exists_in_range borrows heavily from find_get_page, replacing the radix tree look up with a gang lookup of 1, so that we can find the next highest page >= index and see if it falls into our lock range. Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NAlex Gartrell <agartrell@fb.com>
-
由 Josef Bacik 提交于
Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NChris Mason <clm@fb.com>
-