1. 03 January 2022, 3 commits
  2. 27 October 2021, 2 commits
  3. 18 September 2021, 1 commit
  4. 23 August 2021, 1 commit
  5. 22 June 2021, 1 commit
  6. 28 May 2021, 2 commits
    • btrfs: fix fsync failure and transaction abort after writes to prealloc extents · ea7036de
      Filipe Manana committed
      When doing a series of partial writes to different ranges of preallocated
      extents, with transaction commits and fsyncs in between, we can end up
      with checksum items in a log tree that have overlapping ranges. This
      causes an fsync to fail with -EIO and abort the transaction, turning the
      filesystem into RO mode, when syncing the log.
      
      For this to happen, we need to have a full fsync of a file following one
      or more fast fsyncs.
      
      The following example reproduces the problem and explains how it happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        # Create our test file with 2 preallocated extents. Leave a 1M hole
        # between them to ensure that we get two file extent items that will
        # never be merged into a single one. The extents are contiguous on disk,
        # which will later result in the checksums for their data to be merged
        # into a single checksum item in the csums btree.
        #
        $ xfs_io -f \
                 -c "falloc 0 1M" \
                 -c "falloc 3M 3M" \
                 /mnt/foobar
      
        # Now write to the second extent and leave only 1M of it as unwritten,
        # which corresponds to the file range [4M, 5M[.
        #
        # Then fsync the file to flush delalloc and to clear full sync flag from
        # the inode, so that a future fsync will use the fast code path.
        #
        # After the writeback triggered by the fsync we have 3 file extent items
        # that point to the second extent we previously allocated:
        #
        # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [3M, 4M[
        #
        # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
        #    the file range [4M, 5M[
        #
        # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [5M, 6M[
        #
        # All these file extent items have a generation of 6, which is the ID of
        # the transaction where they were created. The split of the original file
        # extent item is done at btrfs_mark_extent_written() when ordered extents
        # complete for the file ranges [3M, 4M[ and [5M, 6M[.
        #
        $ xfs_io -c "pwrite -S 0xab 3M 1M" \
                 -c "pwrite -S 0xef 5M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Commit the current transaction. This wipes out the log tree created by
        # the previous fsync.
        sync
      
        # Now write to the unwritten range of the second extent we allocated,
        # corresponding to the file range [4M, 5M[, and fsync the file, which
        # triggers the fast fsync code path.
        #
        # The fast fsync code path sees that there is a new extent map covering
        # the file range [4M, 5M[ and therefore it will log a checksum item
        # covering the range [1M, 2M[ of the second extent we allocated.
        #
        # Also, after the fsync finishes we no longer have the 3 file extent
        # items that pointed to 3 sections of the second extent we allocated.
        # Instead we end up with a single file extent item pointing to the whole
        # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
        # current transaction ID). This is due to the file extent item merging we
        # do when completing ordered extents into ranges that point to unwritten
        # (preallocated) extents. This merging is done at
        # btrfs_mark_extent_written().
        #
        $ xfs_io -c "pwrite -S 0xcd 4M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Now do some write to our file outside the range of the second extent
        # that we allocated with fallocate() and truncate the file size from 6M
        # down to 5M.
        #
        # The truncate operation sets the full sync runtime flag on the inode,
        # forcing the next fsync to use the slow code path. It also changes the
        # length of the second file extent item so that it represents the file
        # range [3M, 5M[ and not the range [3M, 6M[ anymore.
        #
        # Finally fsync the file. Since this is a fsync that triggers the slow
        # code path, it will remove all items associated to the inode from the
        # log tree and then it will scan for file extent items in the
        # fs/subvolume tree that have a generation matching the current
        # transaction ID, which is 7. This means it will log 2 file extent
        # items:
        #
        # 1) One for the first extent we allocated, covering the file range
        #    [0, 1M[
        #
        # 2) Another for the first 2M of the second extent we allocated,
        #    covering the file range [3M, 5M[
        #
        # When logging the first file extent item we log a single checksum item
        # that has all the checksums for the entire extent.
        #
        # When logging the second file extent item, we also lookup for the
        # checksums that are associated with the range [0, 2M[ of the second
        # extent we allocated (file range [3M, 5M[), and then we log them with
        # btrfs_csum_file_blocks(). However that results in ending up with a log
        # that has two checksum items with ranges that overlap:
        #
        # 1) One for the range [1M, 2M[ of the second extent we allocated,
        #    corresponding to the file range [4M, 5M[, which we logged in the
        #    previous fsync that used the fast code path;
        #
        # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
        #    extents, respectively, corresponding to the file ranges [0, 1M[
        #    and [3M, 5M[. This one was added during this last fsync that uses
        #    the slow code path and overlaps with the previous one logged by
        #    the previous fast fsync.
        #
        # This happens because when logging the checksums for the second
        # extent, we notice they start at an offset that matches the end of the
        # checksums item that we logged for the first extent, and because both
        # extents are contiguous on disk, btrfs_csum_file_blocks() decides to
        # extend that existing checksums item and append the checksums for the
        # second extent to this item. The end result is that we end up with
        # two checksum items in the log tree that have overlapping ranges, as
        # listed before, resulting in the fsync failing with -EIO and the
        # transaction being aborted, turning the filesystem into RO mode.
        #
        $ xfs_io -c "pwrite -S 0xff 0 1M" \
                 -c "truncate 5M" \
                 -c "fsync" \
                 /mnt/foobar
        fsync: Input/output error
      
      After running the example, dmesg/syslog shows the tree checker complained
      about the checksum items with overlapping ranges and we aborted the
      transaction:
      
        $ dmesg
        (...)
        [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
        [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
        [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
        [756289.563654] 	item 0 key (257 1 0) itemoff 16123 itemsize 160
        [756289.564649] 		inode generation 6 size 5242880 mode 100600
        [756289.565636] 	item 1 key (257 12 256) itemoff 16107 itemsize 16
        [756289.566694] 	item 2 key (257 108 0) itemoff 16054 itemsize 53
        [756289.567725] 		extent data disk bytenr 13631488 nr 1048576
        [756289.568697] 		extent data offset 0 nr 1048576 ram 1048576
        [756289.569689] 	item 3 key (257 108 1048576) itemoff 16001 itemsize 53
        [756289.570682] 		extent data disk bytenr 0 nr 0
        [756289.571363] 		extent data offset 0 nr 2097152 ram 2097152
        [756289.572213] 	item 4 key (257 108 3145728) itemoff 15948 itemsize 53
        [756289.573246] 		extent data disk bytenr 14680064 nr 3145728
        [756289.574121] 		extent data offset 0 nr 2097152 ram 3145728
        [756289.574993] 	item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
        [756289.576113] 	item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
        [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
        [756289.578644] ------------[ cut here ]------------
        [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
        [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G        W         5.12.0-rc8-btrfs-next-87 #1
        [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.595122] Code: 5d c3 e8 76 60 (...)
        [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
        [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
        [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
        [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
        [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
        [756289.602278] FS:  00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
        [756289.603217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
        [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [756289.606400] Call Trace:
        [756289.606704]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
        [756289.607313]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
        [756289.608040]  submit_one_bio+0x61/0x70 [btrfs]
        [756289.608587]  btree_write_cache_pages+0x587/0x610 [btrfs]
        [756289.609258]  ? free_debug_processing+0x1d5/0x240
        [756289.609812]  ? __module_address+0x28/0xf0
        [756289.610298]  ? lock_acquire+0x1a0/0x3e0
        [756289.610754]  ? lock_acquired+0x19f/0x430
        [756289.611220]  ? lock_acquire+0x1a0/0x3e0
        [756289.611675]  do_writepages+0x43/0xf0
        [756289.612101]  ? __filemap_fdatawrite_range+0xa4/0x100
        [756289.612800]  __filemap_fdatawrite_range+0xc5/0x100
        [756289.613393]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
        [756289.614085]  btrfs_sync_log+0x21c/0xf20 [btrfs]
        [756289.614661]  ? finish_wait+0x90/0x90
        [756289.615096]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [756289.615661]  ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
        [756289.616338]  ? lock_acquire+0x1a0/0x3e0
        [756289.616801]  ? lock_acquired+0x19f/0x430
        [756289.617284]  ? lock_acquire+0x1a0/0x3e0
        [756289.617750]  ? lock_release+0x214/0x470
        [756289.618221]  ? lock_acquired+0x19f/0x430
        [756289.618704]  ? dput+0x20/0x4a0
        [756289.619079]  ? dput+0x20/0x4a0
        [756289.619452]  ? lockref_put_or_lock+0x9/0x30
        [756289.619969]  ? lock_release+0x214/0x470
        [756289.620445]  ? lock_release+0x214/0x470
        [756289.620924]  ? lock_release+0x214/0x470
        [756289.621415]  btrfs_sync_file+0x46a/0x5b0 [btrfs]
        [756289.621982]  do_fsync+0x38/0x70
        [756289.622395]  __x64_sys_fsync+0x10/0x20
        [756289.622907]  do_syscall_64+0x33/0x80
        [756289.623438]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [756289.624063] RIP: 0033:0x7f08b27fbb7b
        [756289.624588] Code: 0f 05 48 3d 00 (...)
        [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
        [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
        [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
        [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
        [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
        [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
        [756289.631819] irq event stamp: 0
        [756289.632188] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.633893] softirqs last  enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
        [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
        [756289.637082] BTRFS info (device sdc): forced readonly
      
      Having checksum items covering ranges that overlap is dangerous as in some
      cases it can lead to having extent ranges for which we miss checksums
      after log replay or getting the wrong checksum item. There were some fixes
      in the past for bugs that resulted in this problem, and were explained and
      fixed by the following commits:
      
        27b9a812 ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
        b84b8390 ("Btrfs: fix file read corruption after extent cloning and fsync")
        40e046ac ("Btrfs: fix missing data checksums after replaying a log tree")
        e289f03e ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
      
      Fix the issue by making btrfs_csum_file_blocks() take into account the
      start offset of the next checksum item when it decides to extend an
      existing checksum item, so that it never extends the checksum item to
      end at a range that goes beyond the start range of the next checksum
      item.
      
      When we cannot access the next checksum item without releasing the path,
      simply drop the optimization of extending the previous checksum item and
      fall back to inserting a new checksum item - this happens rarely and the
      optimization is not significant enough for a log tree to justify the
      extra complexity, as it would only save a few bytes (the size of a
      struct btrfs_item) of leaf space.
      
      This behaviour is only needed when inserting into a log tree because
      for the regular checksums tree we never have a case where we try to
      insert a range of checksums that overlap with a range that was previously
      inserted.
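
      As a rough userspace sketch (all names here are hypothetical, not the
      actual kernel code), the clamping rule the fix introduces can be modeled
      as:

```c
#include <assert.h>
#include <stdint.h>

/*
 * When extending an existing checksum item, never let the new end go
 * beyond the start offset of the next checksum item in the leaf.
 */
static uint64_t clamped_csum_end(uint64_t item_end, uint64_t want_end,
                                 uint64_t next_item_start)
{
    if (want_end > next_item_start)
        want_end = next_item_start;  /* stop at the next item's start */
    /* never shrink the item below its current end */
    return want_end > item_end ? want_end : item_end;
}
```

      With the bytenrs from the dmesg output above, an extension ending at
      16777216 would be clamped to 15728640, the start of the next csum item.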
      
      A test case for fstests will follow soon.
      Reported-by: Philipp Fent <fent@in.tum.de>
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      CC: stable@vger.kernel.org # 5.4+
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ea7036de
    • btrfs: fix error handling in btrfs_del_csums · b86652be
      Josef Bacik committed
      Error injection stress would sometimes fail with checksums on disk that
      did not have a corresponding extent.  This occurred because the pattern
      in btrfs_del_csums was
      
      	while (1) {
      		ret = btrfs_search_slot();
      		if (ret < 0)
      			break;
      	}
      	ret = 0;
      out:
      	btrfs_free_path(path);
      	return ret;
      
      If we got an error from btrfs_search_slot() we'd clear the error,
      because we were breaking out of the loop instead of jumping to 'out'.
      Instead of using 'goto out', simply handle the cases where we may leave
      a random value in ret, and get rid of the

      	ret = 0;
      out:

      pattern, simply allowing break to report the proper error.  With this
      fix we properly abort the transaction and do not commit thinking we
      successfully deleted the csums.
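
      The corrected control flow can be sketched in plain userspace C (the
      mock helper is hypothetical, only there to show that break now carries
      the error out):

```c
#include <assert.h>

/* Mocked search: fails with -5 (-EIO) on the third step. */
static int mock_search_slot(int step)
{
    return step == 2 ? -5 : 0;
}

/* Break preserves the error; there is no "ret = 0" after the loop. */
static int del_csums_pattern(int steps)
{
    int ret = 0;
    int i;

    for (i = 0; i < steps; i++) {
        ret = mock_search_slot(i);
        if (ret < 0)
            break;
        /* ... delete the csum items found in this leaf ... */
    }
    return ret;
}
```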
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b86652be
  7. 19 April 2021, 1 commit
  8. 09 February 2021, 1 commit
  9. 18 December 2020, 1 commit
    • btrfs: correctly calculate item size used when item key collision happens · 9a664971
      ethanwu committed
      Item key collision is allowed for some item types, like dir items and
      inode refs, but the overall item size is limited by the nodesize.

      The item size (ins_len) passed from btrfs_insert_empty_items to
      btrfs_search_slot already contains the size of struct btrfs_item.
      
      When btrfs_search_slot reaches a leaf, we'll see if we need to split the
      leaf. The check incorrectly reports that a leaf split is required,
      because it treats the space required by the newly inserted item as
      btrfs_item + item data. But in the item key collision case, only the
      item data is actually needed; the newly inserted item can merge into the
      existing one, so no new btrfs_item will be inserted.
      
      split_leaf then returns -EOVERFLOW from the following code:
      
        if (extend && data_size + btrfs_item_size_nr(l, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
            return -EOVERFLOW;
      
      In most cases, when callers receive EOVERFLOW, they either return this
      error or handle it in different ways. For example, in normal dir item
      creation userspace will get errno EOVERFLOW; in the inode ref case an
      INODE_EXTREF is used instead.
      
      However, this is not the case for rename. To avoid an unrecoverable
      situation in rename, btrfs_check_dir_item_collision is called in the
      early phase of rename. In this function, when an item key collision is
      detected, the leaf space is checked:
      
        data_size = sizeof(*di) + name_len;
        if (data_size + btrfs_item_size_nr(leaf, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
      
      The sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
      refers to the existing item size, so this condition correctly calculates
      the needed size for the collision case, unlike the wrong check above.
      
      The consequence of the inconsistent condition check between
      btrfs_check_dir_item_collision and btrfs_search_slot when an item key
      collision happens is that we might pass the check here but fail later at
      btrfs_search_slot. The rename fails and the volume is forced readonly:
      
        [436149.586170] ------------[ cut here ]------------
        [436149.586173] BTRFS: Transaction aborted (error -75)
        [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
        [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
        [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
        [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
        [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
        [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
        [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
        [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
        [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
        [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [436149.586296] Call Trace:
        [436149.586311]  vfs_rename+0x383/0x920
        [436149.586313]  ? vfs_rename+0x383/0x920
        [436149.586315]  do_renameat2+0x4ca/0x590
        [436149.586317]  __x64_sys_rename+0x20/0x30
        [436149.586324]  do_syscall_64+0x5a/0x120
        [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [436149.586332] RIP: 0033:0x7fa3133b1d37
        [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
        [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
        [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
        [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
      
      Thanks to Hans van Kranenburg for information about crc32 hash collision
      tools, I was able to reproduce the dir item collision with the following
      python script:
      https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py
      Running it under a btrfs volume will trigger the transaction abort. It
      simply creates files and renames them to forged names that lead to a
      hash collision.
      
      There are two ways to fix this. One is to simply revert the patch
      878f2d2c ("Btrfs: fix max dir item size calculation") to make the
      condition consistent, although that patch is correct about the size.

      The other way is to handle the leaf space check correctly when a
      collision happens. I prefer the second one, since it corrects the leaf
      space check in the collision case. This fix will not account for
      sizeof(struct btrfs_item) when the item already exists. There are,
      however, two places where ins_len doesn't contain
      sizeof(struct btrfs_item):
      
        1. extent-tree.c: lookup_inline_extent_backref
        2. file-item.c: btrfs_csum_file_blocks
      
      To make the logic of btrfs_search_slot clearer, we add a flag,
      search_for_extension, to btrfs_path.

      This flag indicates that the ins_len passed to btrfs_search_slot doesn't
      contain sizeof(struct btrfs_item). When the key exists, btrfs_search_slot
      will use the actual size needed to calculate the required leaf space.
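
      The changed accounting can be illustrated with a small sketch (the
      constant and helper name are hypothetical, for illustration only):

```c
#include <assert.h>
#include <stddef.h>

#define BTRFS_ITEM_SZ 25  /* on-disk size of a struct btrfs_item */

/*
 * Leaf space needed for an insertion: when the key already exists
 * (collision), the data merges into the existing item and no new
 * struct btrfs_item is required.
 */
static size_t required_leaf_space(size_t data_size, int key_exists)
{
    return key_exists ? data_size : data_size + BTRFS_ITEM_SZ;
}
```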
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: ethanwu <ethanwu@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9a664971
  10. 10 December 2020, 3 commits
    • btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs · 6275193e
      Qu Wenruo committed
      Refactor btrfs_lookup_bio_sums() by:
      
      - Remove the @file_offset parameter
        There are two factors making the @file_offset parameter useless:
      
        * For csum lookup in csum tree, file offset makes no sense
          We only need disk_bytenr, which is unrelated to file_offset
      
        * page_offset (file offset) of each bvec is not contiguous.
          Pages can be added to the same bio as long as their on-disk bytenr
          is contiguous, meaning we could have pages at different file offsets
          in the same bio.
      
        Thus passing file_offset makes no sense any more.
        The only user of file_offset is for data reloc inode, we will use
        a new function, search_file_offset_in_bio(), to handle it.
      
      - Extract the csum tree lookup into search_csum_tree()
        The new function will handle the csum search in the csum tree.
        The return value is the same as btrfs_find_ordered_sum(): the number
        of found sectors which have a checksum.
      
      - Change how we do the main loop
        The only needed info from bio is:
        * the on-disk bytenr
        * the length
      
        After extracting the above info, we can do the search without bio
        at all, which makes the main loop much simpler:
      
      	for (cur_disk_bytenr = orig_disk_bytenr;
      	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
      	     cur_disk_bytenr += count * sectorsize) {
      
      		/* Lookup csum tree */
      		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
      					 search_len, csum_dst);
      		if (!count) {
      			/* Csum hole handling */
      		}
      	}
      
      - Use a single variable as the source to calculate all other offsets
        Instead of several variables of different types, we use only one main
        variable, cur_disk_bytenr, which represents the current disk bytenr.

        All involved values can be calculated from that variable, and all
        those variables are only visible in the inner loop.
      
      The above refactoring makes btrfs_lookup_bio_sums() way more robust than
      it used to be, especially related to the file offset lookup.  Now the
      file_offset lookup is only done for the data reloc inode; otherwise we
      don't need to bother with file_offset at all.
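
      A userspace sketch of that loop structure (mocked lookup, hypothetical
      names) could look like this, walking the range purely by disk bytenr:

```c
#include <assert.h>
#include <stdint.h>

#define SECTORSIZE 4096u

/* Mock: pretend each csum item covers at most 4 sectors. */
static uint32_t mock_search_csum_tree(uint64_t cur, uint32_t max_sectors)
{
    (void)cur;
    return max_sectors < 4 ? max_sectors : 4;
}

/* Count csum-covered sectors in [orig, orig + len) using one cursor. */
static uint32_t count_csum_sectors(uint64_t orig, uint64_t len)
{
    uint32_t found = 0;
    uint64_t cur = orig;

    while (cur < orig + len) {
        uint32_t max = (uint32_t)((orig + len - cur) / SECTORSIZE);
        uint32_t count = mock_search_csum_tree(cur, max);

        if (count == 0) {
            cur += SECTORSIZE;   /* csum hole: skip one sector */
            continue;
        }
        found += count;
        cur += (uint64_t)count * SECTORSIZE;
    }
    return found;
}
```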
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6275193e
    • btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums · 9e46458a
      Qu Wenruo committed
      The function btrfs_lookup_bio_sums() is only called for read bios,
      while btrfs_find_ordered_sum() searches ordered extent sums, which
      exist only in the write path.
      
      This means to read a page we either:
      
      - Submit read bio if it's not uptodate
        This means we only need to search csum tree for checksums.
      
      - The page is already uptodate
        It can have been marked uptodate by a previous read, or by being
        marked dirty, since we always mark a dirty page uptodate.
        In that case, we don't need to submit a read bio at all, and thus
        there is no need to search for any checksums.
      
      Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
      And since btrfs_lookup_bio_sums() is the only caller for
      btrfs_find_ordered_sum(), also remove the implementation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9e46458a
    • btrfs: drop casts of bio bi_sector · 1201b58b
      David Sterba committed
      Since commit 72deb455 ("block: remove CONFIG_LBDAF") (5.2) the
      sector_t type is u64 on all arches and configs so we don't need to
      typecast it.  It used to be unsigned long and the result of sector size
      shifts were not guaranteed to fit in the type.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1201b58b
  11. 08 December 2020, 10 commits
  12. 07 October 2020, 1 commit
  13. 27 July 2020, 2 commits
  14. 25 May 2020, 5 commits
    • btrfs: remove useless 'fail_unlock' label from btrfs_csum_file_blocks() · 918cdf44
      Filipe Manana committed
      The label 'fail_unlock' is pointless, all it does is to jump to the label
      'out', so just remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      918cdf44
    • btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums · 7e4a3f7e
      Filipe Manana committed
      We are currently treating any non-zero return value from btrfs_next_leaf()
      the same way, by going to the code that inserts a new checksum item in the
      tree. However if btrfs_next_leaf() returns an error (a value < 0), we
      should just stop and return the error, and not behave as if nothing has
      happened, since in that case we do not have a way to know if there is a
      next leaf or we are currently at the last leaf already.
      
      So fix that by returning the error from btrfs_next_leaf().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7e4a3f7e
    • btrfs: make checksum item extension more efficient · cc14600c
      Filipe Manana committed
      When we want to add checksums to the checksums tree, or a log tree, we
      try whenever possible to extend existing checksum items, as this helps
      reduce the amount of metadata space used, since adding a new item uses
      extra metadata space for a btrfs_item structure (25 bytes).
      
      However we have two inefficiencies in the current approach:
      
      1) After finding a checksum item that covers a range with an end offset
         that matches the start offset of the checksum range we want to insert,
         we release the search path populated by btrfs_lookup_csum() and then
         do another COW search on the tree with the goal of getting additional
         space for at least one checksum. Doing this path release and then
         searching again is a waste of time because very often the leaf already
         has enough free space for at least one more checksum;
      
      2) After the COW search that guarantees we get free space in the leaf for
         at least one more checksum, we end up not doing the extension of the
         previous checksum item, and fall back to insertion of a new checksum
         item, if the leaf doesn't have an amount of free space larger than the
         space required for 2 checksums plus one btrfs_item structure - this is
         pointless for two reasons:

         a) We want to extend an existing item, so we don't need to account for
            a btrfs_item structure (25 bytes);

         b) We made the COW search with an insertion size for 1 single checksum,
            so if the leaf ends up with a free space amount smaller than 2
            checksums plus the size of a btrfs_item structure, we give up on the
            extension of the existing item and jump to the 'insert' label, where
            we end up releasing the path and then doing yet another search to
            insert a new checksum item for a single checksum.
      
      Fix these inefficiencies by doing the following:
      
      - For case 1), before releasing the path just check if the leaf already
        has enough space for at least 1 more checksum, and if it does, jump
        directly to the item extension code, without releasing our current
        path, which was already COWed by btrfs_lookup_csum();

      - For case 2), fix the logic so that for item extension we require only
        that the leaf has enough free space for 1 checksum, and not a minimum
        of 2 checksums plus space for a btrfs_item structure.
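
      The change to the extension condition can be summarized with a tiny
      sketch (sizes are illustrative and the helper names are hypothetical;
      CSUM_SZ assumes e.g. crc32c):

```c
#include <assert.h>
#include <stddef.h>

#define BTRFS_ITEM_SZ 25  /* size of a btrfs_item structure */
#define CSUM_SZ 4         /* one checksum, e.g. crc32c */

/* Old logic: demanded room for 2 csums plus a whole btrfs_item. */
static int can_extend_old(size_t leaf_free)
{
    return leaf_free >= 2 * CSUM_SZ + BTRFS_ITEM_SZ;
}

/* New logic: extending an existing item only needs room for 1 csum. */
static int can_extend_new(size_t leaf_free)
{
    return leaf_free >= CSUM_SZ;
}
```

      A leaf with, say, 20 free bytes would previously force a new item
      insertion even though the existing checksum item could be extended.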
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cc14600c
    • btrfs: use crypto_shash_digest() instead of open coding · fd08001f
      Eric Biggers committed
      Use crypto_shash_digest() instead of crypto_shash_init() +
      crypto_shash_update() + crypto_shash_final().  This is more efficient.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fd08001f
    • btrfs: clarify btrfs_lookup_bio_sums documentation · fb30f470
      Omar Sandoval committed
      Fix a couple of issues in the btrfs_lookup_bio_sums documentation:
      
      * The bio doesn't need to be a btrfs_io_bio if dst was provided. Move
        the declaration in the code to make that clear, too.
      * dst must be large enough to hold nblocks * csum_size, not just
        csum_size.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb30f470
  15. 24 March 2020, 2 commits
  16. 20 January 2020, 4 commits