- 21 8月, 2017 2 次提交
-
-
由 Filipe Manana 提交于
When doing an incremental send it's possible that the computed send stream contains clone operations that will fail on the receiver if the receiver has compression enabled and the clone operations target a sector sized extent that starts at a zero file offset, is not compressed on the source filesystem but ends up being compressed and inlined at the destination filesystem. Example scenario: $ mkfs.btrfs -f /dev/sdb $ mount -o compress /dev/sdb /mnt # By doing a direct IO write, the data is not compressed. $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1 $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2 $ btrfs send -f /tmp/1.snap /mnt/mysnap1 $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount -o compress /dev/sdc /mnt $ btrfs receive -f /tmp/1.snap /mnt $ btrfs receive -f /tmp/2.snap /mnt ERROR: failed to clone extents to foobar Operation not supported The same could be achieved by mounting the source filesystem without compression and doing a buffered IO write instead of a direct IO one, and mounting the destination filesystem with compression enabled. So fix this by issuing regular write operations in the send stream instead of clone operations when the source offset is zero and the range has a length matching the sector size. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
There is a corner case that slips through the checkers in functions reading extent buffer, ie. if (start < eb->len) and (start + len > eb->len), then a) map_private_extent_buffer() returns immediately because it's thinking the range spans across two pages, b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len) and WARN_ON(start + len > eb->start + eb->len), both are OK in this corner case, but it'd actually try to access the eb->pages out of bounds because of (start + len > eb->len). The case is found by switching extent inline ref type from shared data ref to non-shared data ref, which is a kind of metadata corruption. It'd use the wrong helper to access the eb, eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing here is "struct btrfs_shared_data_ref". And if the extent item happens to be the first item in the eb, then offset/length will get over eb->len which ends up an invalid memory access. This is adding proper checks in order to avoid invalid memory access, ie. 'general protection fault', before it's too late. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 18 8月, 2017 12 次提交
-
-
由 Nikolay Borisov 提交于
The buffer passed to btrfs_ioctl_tree_search* functions have to be at least sizeof(struct btrfs_ioctl_search_header). If this is not the case then the ioctl should return -EOVERFLOW and set the uarg->buf_size to the minimum required size. Currently btrfs_ioctl_tree_search_v2 would return an -EOVERFLOW error with ->buf_size being set to the value passed by user space. Fix this by removing the size check and relying on search_ioctl, which already includes it and correctly sets buf_size. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Currently the code checks whether we should do data checksumming in btrfs_submit_direct and the boolean result of this check is passed to btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which actually consumes it. The last function actually has all the necessary context to figure out whether to skip the check or not, so let's move the check closer to where it's being consumed. No functional changes. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NChris Mason <clm@fb.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
When logging an inode in full mode that has an inline compressed extent that represents a range with a size matching the sector size (currently the same as the page size), has a trailing hole and the no-holes feature is enabled, we end up failing an assertion leading to a trace like the following: [141812.031528] assertion failed: len == i_size, file: fs/btrfs/tree-log.c, line: 4453 [141812.033069] ------------[ cut here ]------------ [141812.034330] kernel BUG at fs/btrfs/ctree.h:3452! [141812.035137] invalid opcode: 0000 [#1] PREEMPT SMP [141812.035932] Modules linked in: btrfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio dm_flakey dm_mod dax ppdev evdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 tpm_tis psmouse crypto_simd parport_pc sg pcspkr tpm_tis_core cryptd parport serio_raw glue_helper tpm i2c_piix4 i2c_core button sunrpc loop autofs4 ext4 crc16 jbd2 mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod ata_generic virtio_scsi ata_piix floppy crc32c_intel libata scsi_mod virtio_pci virtio_ring e1000 virtio [last unloaded: btrfs] [141812.036790] CPU: 3 PID: 845 Comm: fdm-stress Tainted: G B W 4.12.3-btrfs-next-52+ #1 [141812.036790] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014 [141812.036790] task: ffff8801e6694180 task.stack: ffffc90009004000 [141812.036790] RIP: 0010:assfail.constprop.18+0x1c/0x1e [btrfs] [141812.036790] RSP: 0018:ffffc90009007bc0 EFLAGS: 00010282 [141812.036790] RAX: 0000000000000046 RBX: ffff88017512c008 RCX: 0000000000000001 [141812.036790] RDX: ffff88023fd95201 RSI: ffffffff8182264c RDI: 00000000ffffffff [141812.036790] RBP: ffffc90009007bc0 R08: 0000000000000001 R09: 0000000000000001 [141812.036790] R10: 0000000000001000 R11: ffffffff82f5a0c9 R12: ffff88014e5947e8 [141812.036790] R13: 00000000000b4000 R14: ffff8801b234d008 R15: 0000000000000000 [141812.036790] FS: 00007fdba6ffd700(0000) GS:ffff88023fd80000(0000) knlGS:0000000000000000 [141812.036790] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [141812.036790] CR2: 00007fdb9c000010 CR3: 000000016efa2000 CR4: 00000000001406e0 [141812.036790] Call Trace: [141812.036790] btrfs_log_inode+0x9f0/0xd3d [btrfs] [141812.036790] ? __mutex_lock+0x120/0x3ce [141812.036790] btrfs_log_inode_parent+0x224/0x685 [btrfs] [141812.036790] ? lock_acquire+0x16b/0x1af [141812.036790] btrfs_log_dentry_safe+0x60/0x7b [btrfs] [141812.036790] btrfs_sync_file+0x32e/0x3f8 [btrfs] [141812.036790] vfs_fsync_range+0x8a/0x9d [141812.036790] vfs_fsync+0x1c/0x1e [141812.036790] do_fsync+0x31/0x4a [141812.036790] SyS_fdatasync+0x13/0x17 [141812.036790] entry_SYSCALL_64_fastpath+0x18/0xad [141812.036790] RIP: 0033:0x7fdbac41a47d [141812.036790] RSP: 002b:00007fdba6ffce30 EFLAGS: 00000293 ORIG_RAX: 000000000000004b [141812.036790] RAX: ffffffffffffffda RBX: ffffffff81092c9f RCX: 00007fdbac41a47d [141812.036790] RDX: 0000004cf0160a40 RSI: 0000000000000000 RDI: 0000000000000006 [141812.036790] RBP: ffffc90009007f98 R08: 0000000000000000 R09: 0000000000000010 [141812.036790] R10: 00000000000002e8 R11: 0000000000000293 R12: ffffffff8110cd90 [141812.036790] R13: ffffc90009007f78 R14: 0000000000000000 R15: 0000000000000000 [141812.036790] ? time_hardirqs_off+0x9/0x14 [141812.036790] ? trace_hardirqs_off_caller+0x1f/0xa3 [141812.036790] Code: c7 d6 61 6b a0 48 89 e5 e8 ba ef a8 e0 0f 0b 55 89 f1 48 c7 c2 6d 65 6b a0 48 89 fe 48 c7 c7 81 65 6b a0 48 89 e5 e8 9c ef a8 e0 <0f> 0b 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 [141812.036790] RIP: assfail.constprop.18+0x1c/0x1e [btrfs] RSP: ffffc90009007bc0 [141812.084448] ---[ end trace 44e472684c7a32cc ]--- Which happens because the code that logs a trailing hole when the no-holes feature is enabled, did not consider that a compressed inline extent can represent a range with a size matching the sector size, in which case expanding the inode's i_size, through a truncate operation, won't lead to padding with zeroes the page that represents the inline extent, and therefore the inline extent remains after the truncation. Fix this by adapting the assertion to accept inline extents representing data with a sector size length if, and only if, the inline extents are compressed. A sample and trivial reproducer (for systems with a 4K page size) for this issue: mkfs.btrfs -O no-holes -f /dev/sdc mount -o compress /dev/sdc /mnt xfs_io -f -c "pwrite -S 0xab 0 4K" /mnt/foobar sync xfs_io -c "truncate 32K" /mnt/foobar xfs_io -c "fsync" /mnt/foobar Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
If the range being cleared was not marked for defrag and we are not about to clear the range from the defrag status, we don't need to lock and unlock the inode. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NChris Mason <clm@fb.com> Reviewed-by: NWang Shilong <wangshilong1991@gmail.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Colin Ian King 提交于
The error return variable ret is initialized to zero and then is checked to see if it is non-zero in the if-block that follows it. It is therefore impossible for ret to be non-zero after the if-block hence the check is redundant and can be removed. Detected by CoverityScan, CID#1021040 ("Logically dead code") Signed-off-by: NColin Ian King <colin.king@canonical.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
The internal free space tree management routines are always exposed for testing purposes. Make them dependent on SANITY_TESTS being on so that they are exposed only when they really have to. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
This variable was added in 1abe9b8a ("Btrfs: add initial tracepointi support for btrfs"), yet it never really got used, only assigned to. So let's remove it. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
We have a WARN_ON(!var) inside an if branch which is executed (among others) only when var is true. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
We aren't using this define, so removing it. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
We have define for FSID size so use it. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size, we should use the matching constant for the fsid buffer. Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
Our dir_context->pos is supposed to hold the next position we're supposed to look. If we successfully insert a delayed dir index we could end up with a duplicate entry because we don't increase ctx->pos after doing the dir_emit. Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 8月, 2017 26 次提交
-
-
由 Josef Bacik 提交于
Readdir does dir_emit while under the btree lock. dir_emit can trigger the page fault which means we can deadlock. Fix this by allocating a buffer on opening a directory and copying the readdir into this buffer and doing dir_emit from outside of the tree lock. Thread A readdir <holding tree lock> dir_emit <page fault> down_read(mmap_sem) Thread B mmap write down_write(mmap_sem) page_mkwrite wait_ordered_extents Process C finish_ordered_extent insert_reserved_file_extent try to lock leaf <hang> Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ copy the deadlock scenario to changelog ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Currently should_alloc_chunk uses ->total_bytes - ->bytes_readonly to signify the total amount of bytes in this space info. However, given Jeff's patch which adds bytes_pinned and bytes_may_use to the calculation of num_allocated it becomes a lot more clear to just eliminate num_bytes altogether and add the bytes_readonly to the amount of used space. That way we don't change the results of the following statements. In the process also start using btrfs_space_info_used. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
In a heavy write scenario, we can end up with a large number of pinned bytes. This can translate into (very) premature ENOSPC because pinned bytes must be accounted for when allowing a reservation but aren't accounted for when deciding whether to create a new chunk. This patch adds the accounting to should_alloc_chunk so that we can create the chunk. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
This is a minimal patch intended to be backported to older kernels. We're going to extend the string specifying the compression method and this would fail on kernels before that change (the string is compared exactly). Relax the string matching only to the prefix, ie. ignoring anything that goes after "zlib" or "lzo", regardless of th format extension we decide to use. This applies to the mount options and properties. That way, patched old kernels could be booted on systems already utilizing the new compression spec. Applicable since commit 63541927, v3.14. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Currently, the BTRFS_INODE_NOCOMPRESS will prevent any compression on a given file, except when the mount is force-compress. As users have reported on IRC, this will also prevent compression when requested by defrag (btrfs fi defrag -c file). The nocompress flag is set automatically by filesystem when the ratios are bad and the user would have to manually drop the bit in order to make defrag -c work. This is not good from the usability perspective. This patch will raise priority for the defrag -c over nocompress, ie. any file with NOCOMPRESS bit set will get defragmented. The bit will remain untouched. Alternate option was to also drop the nocompress bit and keep the decision logic as is, but I think this is not the right solution. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called. The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug. Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem). Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line). We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Timofey Titovets 提交于
Add skeleton code for compresison heuristics. Now it iterates over all the pages, but in the end always says "yes, compress please", ie it does not change the current behaviour. In the future we're going to add various heuristics to analyze the data. This patch can be used as a baseline for measuring if the effectivness and performance. Signed-off-by: NTimofey Titovets <nefelim4ag@gmail.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ enhanced changelog, modified comments ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Correctly account for IO when waiting for a submitted bio in scrub. This only for the accounting purposes and should not change other behaviour. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Correctly account for IO when waiting for a submitted DIO read, the case when we're retrying. This only for the accounting purposes and should not change other behaviour. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The pinned chunks might be left over so we clean them but at this point of close_ctree, there's noone to race with, the locking can be removed. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
It's a simple call page_cache_sync_readahead, same arguments in the same order. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There's no PageFsMisc. Added by patch 4881ee5a in 2008, the flag is not present in current kernels. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
The return value of flush_space was used to have significance in the early days when the code was first introduced and before the ticketed enospc rework. Since the latter got introduced the return value lost any significance whatsoever to its callers. So let's remove it. While at it also remove the unused ticket variable in btrfs_async_reclaim_metadata_space. It was used in the initial version of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem with this and fixed it in ce129655 ("btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress"). Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Userspace transactions were introduced in commit 6bf13c0c ("Btrfs: transaction ioctls") to provide semantics that Ceph's object store required. However, things have changed significantly since then, to the point where btrfs is no longer suitable as a backend for ceph and in fact it's actively advised against such usages. Considering this, there doesn't seem to be a widespread, legit use case of userspace transaction. They also clutter the file->private pointer. So to end the agony let's nuke the userspace transaction ioctls. As a first step let's give time for people to voice their objection by just WARN()ining when the userspace transaction is used. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ move the warning past perm checks, keep the has-been-printed state; we're ok with just one warning over all filesystems ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Superblock is read and written using buffer heads, we need to set the bdev blocksize. The magic constant has been hardcoded in several places, so replace it with a named constant. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There are two independent parts, one that writes the superblocks and another that waits for completion. No functional changes, but cleanups, reformatting and comment updates. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Polish the helper: * drop underscores, no special meaning here * pass fs_devices, as this is what the API implements * drop noinline, no apparent reason for such simple helper * constify uuid * add comment Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There are two helpers called in chain from one location, we can merge the functionaliy. Originally, alloc_fs_devices could fill the device uuid randomly if we we didn't give the uuid buffer. This happens for seed devices but the fsid is generated in btrfs_prepare_sprout, so we can remove it. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The function submit_extent_page has 15(!) parameters right now, op and op_flags are effectively one value stored to bio::bi_opf, no need to pass them separately. So it's 14 parameters now. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Unify types of local variables and parameters that store various REQ_* values to unsigned int. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
This function prints an informative message and then continues dev-replace. The message contains a progress percentage which is read from the status. The status is allocated dynamically, about 2600 bytes, just to read the single value. That's an overkill. We'll use the new helper and drop the allocation. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-