1. 07 11月, 2022 4 次提交
    • Q
      Revert "btrfs: scrub: use larger block size for data extent scrub" · b75b51f8
      Qu Wenruo 提交于
      This reverts commit 786672e9.
      
      [BUG]
      Since commit 786672e9 ("btrfs: scrub: use larger block size for data
      extent scrub"), btrfs scrub no longer reports errors if the corruption
      is not in the first sector of a STRIPE_LEN.
      
      The following script can expose the problem:
      
        mkfs.btrfs -f $dev
        mount $dev $mnt
        xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
        umount $mnt
      
        # 13631488 is the logical bytenr of above 8K extent
        btrfs-map-logical -l 13631488 -b 4096 $dev
        mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
      
        # Corrupt the 2nd sector of that extent
        xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
      
        mount $dev $mnt
        btrfs scrub start -B $mnt
        scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
        Scrub started:    Mon Nov  7 07:18:27 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   536.00MiB
        Rate:             0.00B/s
        Error summary:    no errors found <<<
      
      [CAUSE]
      That offending commit enlarges the data extent scrub size from sector
      size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated.
      
      But unfortunately the data extent scrub is still heavily relying on the
      fact that there is only one scrub_sector per scrub_block.
      
      Thus it will only check the first sector, and ignoring the remaining
      sectors.
      
      Furthermore the error reporting is not able to handle multiple sectors
      either.
      
      [FIX]
      For now just revert the offending commit.
      
      The consequence is just extra memory usage during scrub.
      We will need a proper change to make the remaining data scrub path to
      handle multiple sectors before we enlarging the data scrub size.
      Reported-by: NLi Zhang <zhanglikernel@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b75b51f8
    • D
      btrfs: don't print stack trace when transaction is aborted due to ENOMEM · 8bb808c6
      David Sterba 提交于
      Add ENOMEM among the error codes that don't print stack trace on
      transaction abort. We've got several reports from syzbot that detects
      stacks as errors but caused by limiting memory. As this is an artificial
      condition we don't need to know where exactly the error happens, the
      abort and error cleanup will continue like e.g. for EIO.
      
      As the transaction aborts code needs to be inline in a lot of code, the
      implementation cases about minimal bloat. The error codes are in a
      separate function and the WARN uses the condition directly. This
      increases the code size by 571 bytes on release build.
      
      Alternatives considered: add -ENOMEM among the errors, this increases
      size by 2340 bytes, various attempts to combine the WARN and helper
      calls, increase by 700 or more bytes.
      
      Example syzbot reports (error -12):
      
      - https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e
      - https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8bb808c6
    • Z
      btrfs: selftests: fix wrong error check in btrfs_free_dummy_root() · 9b2f2034
      Zhang Xiaoxu 提交于
      The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value
      rather than NULL, if error happened, there will be a NULL pointer
      dereference:
      
        BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
        Read of size 8 at addr 000000000000002c by task insmod/258926
      
        CPU: 2 PID: 258926 Comm: insmod Tainted: G        W          6.1.0-rc2+ #5
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         kasan_report+0xb7/0x140
         kasan_check_range+0x145/0x1a0
         btrfs_free_dummy_root+0x21/0x50 [btrfs]
         btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
         btrfs_run_sanity_tests+0x65/0x80 [btrfs]
         init_btrfs_fs+0xec/0x154 [btrfs]
         do_one_initcall+0x87/0x2a0
         do_init_module+0xdf/0x320
         load_module+0x3006/0x3390
         __do_sys_finit_module+0x113/0x1b0
         do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: aaedb55b ("Btrfs: add tests for btrfs_get_extent")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9b2f2034
    • L
      btrfs: fix match incorrectly in dev_args_match_device · 0fca385d
      Liu Shixin 提交于
      syzkaller found a failed assertion:
      
        assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921
      
      This can be triggered when we set devid to (u64)-1 by ioctl. In this
      case, the match of devid will be skipped and the match of device may
      succeed incorrectly.
      
      Patch 562d7b15 introduced this function which is used to match device.
      This function contains two matching scenarios, we can distinguish them by
      checking the value of args->missing rather than check whether args->devid
      and args->uuid is default value.
      
      Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com
      Fixes: 562d7b15 ("btrfs: handle device lookup with btrfs_dev_lookup_args")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NLiu Shixin <liushixin2@huawei.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0fca385d
  2. 03 11月, 2022 6 次提交
    • F
      btrfs: fix inode reserve space leak due to nowait buffered write · eb81b682
      Filipe Manana 提交于
      During a nowait buffered write, if we fail to balance dirty pages we exit
      btrfs_buffered_write() without releasing the delalloc space reserved for
      an extent, resulting in leaking space from the inode's block reserve.
      
      So fix that by releasing the delalloc space for the extent when balancing
      dirty pages fails.
      Reported-by: Nkernel test robot <yujie.liu@intel.com>
      Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com
      Fixes: 965f47ae ("btrfs: make btrfs_buffered_write nowait compatible")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eb81b682
    • F
      btrfs: fix nowait buffered write returning -ENOSPC · a348c8d4
      Filipe Manana 提交于
      If we are doing a buffered write in NOWAIT context and we can't reserve
      metadata space due to -ENOSPC, then we should return -EAGAIN so that we
      retry the write in a context allowed to block and do metadata reservation
      with flushing, which might succeed this time due to the allowed flushing.
      
      Returning -ENOSPC while in NOWAIT context simply makes some writes fail
      with -ENOSPC when they would likely succeed after switching from NOWAIT
      context to blocking context. That is unexpected behaviour and even fio
      complains about it with a warning like this:
      
        fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536
        fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device
      
      The fio's job config is this:
      
         [global]
         bs=64K
         ioengine=io_uring
         iodepth=1
         size=2236962133
         nr_files=1
         filesize=2236962133
         direct=0
         runtime=10
         fallocate=posix
         io_size=2236962133
         group_reporting
         time_based
      
         [task_0]
         rw=randwrite
         directory=/mnt/sdi
         numjobs=4
      
      So fix this by returning -EAGAIN if we are in NOWAIT context and the
      metadata reservation failed with -ENOSPC.
      
      Fixes: 304e45ac ("btrfs: plumb NOWAIT through the write path")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a348c8d4
    • F
      btrfs: remove pointless and double ulist frees in error paths of qgroup tests · d0ea17ae
      Filipe Manana 提交于
      Several places in the qgroup self tests follow the pattern of freeing the
      ulist pointer they passed to btrfs_find_all_roots() if the call to that
      function returned an error. That is pointless because that function always
      frees the ulist in case it returns an error.
      
      Also In some places like at test_multiple_refs(), after a call to
      btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots"
      pointing to ulists that were freed, because btrfs_qgroup_account_extent()
      has freed those ulists, and if after that the next call to
      btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots"
      ulist again, resulting in a double free.
      
      So remove those calls to reduce the code size and avoid double ulist
      free in case of an error.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d0ea17ae
    • F
      btrfs: fix ulist leaks in error paths of qgroup self tests · d37de92b
      Filipe Manana 提交于
      In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
      if we fail to add the tree ref, remove the extent item or remove the
      extent ref, we are returning from the test function without freeing the
      "old_roots" ulist that was allocated by the previous calls to
      btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.
      
      Fixes: 442244c9 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d37de92b
    • F
      btrfs: fix inode list leak during backref walking at find_parent_nodes() · 92876eec
      Filipe Manana 提交于
      During backref walking, at find_parent_nodes(), if we are dealing with a
      data extent and we get an error while resolving the indirect backrefs, at
      resolve_indirect_refs(), or in the while loop that iterates over the refs
      in the direct refs rbtree, we end up leaking the inode lists attached to
      the direct refs we have in the direct refs rbtree that were not yet added
      to the refs ulist passed as argument to find_parent_nodes(). Since they
      were not yet added to the refs ulist and prelim_release() does not free
      the lists, on error the caller can only free the lists attached to the
      refs that were added to the refs ulist, all the remaining refs get their
      inode lists never freed, therefore leaking their memory.
      
      Fix this by having prelim_release() always free any attached inode list
      to each ref found in the rbtree, and have find_parent_nodes() set the
      ref's inode list to NULL once it transfers ownership of the inode list
      to a ref added to the refs ulist passed to find_parent_nodes().
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92876eec
    • F
      btrfs: fix inode list leak during backref walking at resolve_indirect_refs() · 5614dc3a
      Filipe Manana 提交于
      During backref walking, at resolve_indirect_refs(), if we get an error
      we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
      which frees all the elements in the ulist - however that does not free
      any inode lists that may be attached to elements, through the 'aux' field
      of a ulist node, so we end up leaking lists if we have any attached to
      the unodes.
      
      Fix this by calling free_leaf_list() instead of ulist_free() when we exit
      from resolve_indirect_refs(). The static function free_leaf_list() is
      moved up for this to be possible and it's slightly simplified by removing
      unnecessary code.
      
      Fixes: 3301958b ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5614dc3a
  3. 31 10月, 2022 2 次提交
  4. 25 10月, 2022 1 次提交
    • Q
      btrfs: don't use btrfs_chunk::sub_stripes from disk · 76a66ba1
      Qu Wenruo 提交于
      [BUG]
      There are two reports (the earliest one from LKP, a more recent one from
      kernel bugzilla) that we can have some chunks with 0 as sub_stripes.
      
      This will cause divide-by-zero errors at btrfs_rmap_block, which is
      introduced by a recent kernel patch ac067734 ("btrfs: merge
      calculations for simple striped profiles in btrfs_rmap_block"):
      
      		if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
      				 BTRFS_BLOCK_GROUP_RAID10)) {
      			stripe_nr = stripe_nr * map->num_stripes + i;
      			stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<<
      		}
      
      [CAUSE]
      From the more recent report, it has been proven that we have some chunks
      with 0 as sub_stripes, mostly caused by older mkfs.
      
      It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa
      ("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk")
      which is included in v5.4 btrfs-progs release.
      
      So there would be quite some old filesystems with such 0 sub_stripes.
      
      [FIX]
      Just don't trust the sub_stripes values from disk.
      
      We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes
      numbers for each profile and that are fixed.
      
      By this, we can keep the compatibility with older filesystems while
      still avoid divide-by-zero bugs.
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Reported-by: NViktor Kuzmin <kvaster@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559
      Fixes: ac067734 ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block")
      CC: stable@vger.kernel.org # 6.0
      Reviewed-by: NSu Yue <glass@fydeos.io>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      76a66ba1
  5. 24 10月, 2022 7 次提交
    • D
      btrfs: fix type of parameter generation in btrfs_get_dentry · 2398091f
      David Sterba 提交于
      The type of parameter generation has been u32 since the beginning,
      however all callers pass a u64 generation, so unify the types to prevent
      potential loss.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2398091f
    • B
      btrfs: send: fix send failure of a subcase of orphan inodes · 9b8be45f
      BingJing Chang 提交于
      Commit 9ed0a72e ("btrfs: send: fix failures when processing inodes with
      no links") tries to fix all incremental send cases of orphan inodes the
      send operation will meet. However, there's still a bug causing the corner
      subcase fails with a ENOENT error.
      
      Here's shortened steps of that subcase:
      
        $ btrfs subvolume create vol
        $ touch vol/foo
      
        $ btrfs subvolume snapshot -r vol snap1
        $ btrfs subvolume snapshot -r vol snap2
      
        # Turn the second snapshot to RW mode and delete the file while
        # holding an open file descriptor on it
        $ btrfs property set snap2 ro false
        $ exec 73<snap2/foo
        $ rm snap2/foo
      
        # Set the second snapshot back to RO mode and do an incremental send
        # with an unusal reverse order
        $ btrfs property set snap2 ro true
        $ btrfs send -p snap2 snap1 > /dev/null
        At subvol snap1
        ERROR: send ioctl failed with -2: No such file or directory
      
      It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e
      ("btrfs: send: fix failures when processing inodes with no links"). And
      it's not a common case. We still have not met it in the real world.
      Theoretically, this case can happen in a batch cascading snapshot backup.
      In cascading backups, the receive operation in the middle may cause orphan
      inodes to appear because of the open file descriptors on the snapshot files
      during receiving. And if we don't do the batch snapshot backups in their
      creation order, then we can have an inode, which is an orphan in the parent
      snapshot but refers to a file in the send snapshot. Since an orphan inode
      has no paths, the send operation will fail with a ENOENT error if it
      tries to generate a path for it.
      
      In that patch, this subcase will be treated as an inode with a new
      generation. However, when the routine tries to delete the old paths in
      the parent snapshot, the function process_all_refs() doesn't check whether
      there are paths recorded or not before it calls the function
      process_recorded_refs(). And the function process_recorded_refs() try
      to get the first path in the parent snapshot in the beginning. Since it has
      no paths in the parent snapshot, the send operation fails.
      
      To fix this, we can easily put a link count check to avoid entering the
      deletion routine like what we do a link count check to avoid creating a
      new one. Moreover, we can assume that the function process_all_refs()
      can always collect references to process because we know it has a
      positive link count.
      
      Fixes: 9ed0a72e ("btrfs: send: fix failures when processing inodes with no links")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9b8be45f
    • Q
      btrfs: make thaw time super block check to also verify checksum · 3d17adea
      Qu Wenruo 提交于
      Previous commit a05d3c91 ("btrfs: check superblock to ensure the fs
      was not modified at thaw time") only checks the content of the super
      block, but it doesn't really check if the on-disk super block has a
      matching checksum.
      
      This patch will add the checksum verification to thaw time superblock
      verification.
      
      This involves the following extra changes:
      
      - Export btrfs_check_super_csum()
        As we need to call it in super.c.
      
      - Change the argument list of btrfs_check_super_csum()
        Instead of passing a char *, directly pass struct btrfs_super_block *
        pointer.
      
      - Verify that our checksum type didn't change before checking the
        checksum value, like it's done at mount time
      
      Fixes: a05d3c91 ("btrfs: check superblock to ensure the fs was not modified at thaw time")
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3d17adea
    • J
      btrfs: fix tree mod log mishandling of reallocated nodes · 968b7158
      Josef Bacik 提交于
      We have been seeing the following panic in production
      
        kernel BUG at fs/btrfs/tree-mod-log.c:677!
        invalid opcode: 0000 [#1] SMP
        RIP: 0010:tree_mod_log_rewind+0x1b4/0x200
        RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293
        RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000
        RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00
        RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001
        R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00
        FS:  00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         btrfs_get_old_root+0x12b/0x420
         btrfs_search_old_slot+0x64/0x2f0
         ? tree_mod_log_oldest_root+0x3d/0xf0
         resolve_indirect_ref+0xfd/0x660
         ? ulist_alloc+0x31/0x60
         ? kmem_cache_alloc_trace+0x114/0x2c0
         find_parent_nodes+0x97a/0x17e0
         ? ulist_alloc+0x30/0x60
         btrfs_find_all_roots_safe+0x97/0x150
         iterate_extent_inodes+0x154/0x370
         ? btrfs_search_path_in_tree+0x240/0x240
         iterate_inodes_from_logical+0x98/0xd0
         ? btrfs_search_path_in_tree+0x240/0x240
         btrfs_ioctl_logical_to_ino+0xd9/0x180
         btrfs_ioctl+0xe2/0x2ec0
         ? __mod_memcg_lruvec_state+0x3d/0x280
         ? do_sys_openat2+0x6d/0x140
         ? kretprobe_dispatcher+0x47/0x70
         ? kretprobe_rethook_handler+0x38/0x50
         ? rethook_trampoline_handler+0x82/0x140
         ? arch_rethook_trampoline_callback+0x3b/0x50
         ? kmem_cache_free+0xfb/0x270
         ? do_sys_openat2+0xd5/0x140
         __x64_sys_ioctl+0x71/0xb0
         do_syscall_64+0x2d/0x40
      
      Which is this code in tree_mod_log_rewind()
      
      	switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
      		BUG_ON(tm->slot < n);
      
      This occurs because we replay the nodes in order that they happened, and
      when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot,
      starting at 0.  'n' here is the number of items in this block, which in
      this case was 1, but we had 2 REMOVE_WHILE_FREEING operations.
      
      The actual root cause of this was that we were replaying operations for
      a block that shouldn't have been replayed.  Consider the following
      sequence of events
      
      1. We have an already modified root, and we do a btrfs_get_tree_mod_seq().
      2. We begin removing items from this root, triggering KEY_REPLACE for
         it's child slots.
      3. We remove one of the 2 children this root node points to, thus triggering
         the root node promotion of the remaining child, and freeing this node.
      4. We modify a new root, and re-allocate the above node to the root node of
         this other root.
      
      The tree mod log looks something like this
      
      	logical 0	op KEY_REPLACE (slot 1)			seq 2
      	logical 0	op KEY_REMOVE (slot 1)			seq 3
      	logical 0	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 4
      	logical 4096	op LOG_ROOT_REPLACE (old logical 0)	seq 5
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 1)	seq 6
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 7
      	logical 0	op LOG_ROOT_REPLACE (old logical 8192)	seq 8
      
      >From here the bug is triggered by the following steps
      
      1.  Call btrfs_get_old_root() on the new_root.
      2.  We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is
          currently logical 0.
      3.  tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which
          gives us the KEY_REPLACE seq 2, and since that's not a
          LOG_ROOT_REPLACE we incorrectly believe that we don't have an old
          root, because we expect that the most recent change should be a
          LOG_ROOT_REPLACE.
      4.  Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE,
          so we don't set old_root, we simply use our existing extent buffer.
      5.  Since we're using our existing extent buffer (logical 0) we call
          tree_mod_log_search(0) in order to get the newest change to start the
          rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8.
      6.  Again since we didn't find an old_root we simply clone logical 0 at
          it's current state.
      7.  We call tree_mod_log_rewind() with the cloned extent buffer.
      8.  Set n = btrfs_header_nritems(logical 0), which would be whatever the
          original nritems was when we COWed the original root, say for this
          example it's 2.
      9.  We start from the newest operation and work our way forward, so we
          see LOG_ROOT_REPLACE which we ignore.
      10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the
          BUG_ON(tm->slot < n), because it expects if we've done this we have a
          completely empty extent buffer to replay completely.
      
      The correct thing would be to find the first LOG_ROOT_REPLACE, and then
      get the old_root set to logical 8192.  In fact making that change fixes
      this particular problem.
      
      However consider the much more complicated case.  We have a child node
      in this tree and the above situation.  In the above case we freed one
      of the child blocks at the seq 3 operation.  If this block was also
      re-allocated and got new tree mod log operations we would have a
      different problem.  btrfs_search_old_slot(orig root) would get down to
      the logical 0 root that still pointed at that node.  However in
      btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly.  This
      is not context aware enough to know which operations we should be
      replaying.  If the block was re-allocated multiple times we may only
      want to replay a range of operations, and determining what that range is
      isn't possible to determine.
      
      We could maybe solve this by keeping track of which root the node
      belonged to at every tree mod log operation, and then passing this
      around to make sure we're only replaying operations that relate to the
      root we're trying to rewind.
      
      However there's a simpler way to solve this problem, simply disallow
      reallocations if we have currently running tree mod log users.  We
      already do this for leaf's, so we're simply expanding this to nodes as
      well.  This is a relatively uncommon occurrence, and the problem is
      complicated enough I'm worried that we will still have corner cases in
      the reallocation case.  So fix this in the most straightforward way
      possible.
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      CC: stable@vger.kernel.org # 3.3+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      968b7158
    • D
      btrfs: reorder btrfs_bio for better packing · ae0e5df4
      David Sterba 提交于
      After changes in commit 917f32a2 ("btrfs: give struct btrfs_bio a
      real end_io handler") the layout of btrfs_bio can be improved.  There
      are two holes and the structure size is 264 bytes on release build. By
      reordering the iterator we can get rid of the holes and the size is 256
      bytes which fits to slabs much better.
      
      Final layout:
      
      struct btrfs_bio {
      	unsigned int               mirror_num;           /*     0     4 */
      	struct bvec_iter           iter;                 /*     4    20 */
      	u64                        file_offset;          /*    24     8 */
      	struct btrfs_device *      device;               /*    32     8 */
      	u8 *                       csum;                 /*    40     8 */
      	u8                         csum_inline[64];      /*    48    64 */
      	/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
      	btrfs_bio_end_io_t         end_io;               /*   112     8 */
      	void *                     private;              /*   120     8 */
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	struct work_struct         end_io_work;          /*   128    32 */
      	struct bio                 bio;                  /*   160    96 */
      
      	/* size: 256, cachelines: 4, members: 10 */
      };
      
      Fixes: 917f32a2 ("btrfs: give struct btrfs_bio a real end_io handler")
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ae0e5df4
    • Q
      btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo 提交于
      Currently if full_stripe_write() failed to allocate the pages for
      parity, it will call __free_raid_bio() first, then return -ENOMEM.
      
      But some caller of full_stripe_write() will also call __free_raid_bio()
      again, this would cause double freeing.
      
      And it's not a logically sound either, normally we should either free
      the memory at the same level where we allocated it, or let endio to
      handle everything.
      
      So this patch will solve the double freeing by make
      raid56_parity_write() to handle the error and free the rbio.
      
      Just like what we do in raid56_parity_recover().
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ab4c54c6
    • Q
      btrfs: raid56: properly handle the error when unable to find the missing stripe · f15fb2cd
      Qu Wenruo 提交于
      In raid56_alloc_missing_rbio(), if we can not determine where the
      missing device is inside the full stripe, we just BUG_ON().
      
      This is not necessary especially the only caller inside scrub.c is
      already properly checking the return value, and will treat it as a
      memory allocation failure.
      
      Fix the error handling by:
      
      - Add an extra warning for the reason
        Although personally speaking it may be better to be an ASSERT().
      
      - Properly free the allocated rbio
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f15fb2cd
  6. 14 10月, 2022 1 次提交
  7. 11 10月, 2022 7 次提交
    • F
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana 提交于
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      63c84b46
    • F
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana 提交于
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      943553ef
    • F
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana 提交于
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fc7b572
    • D
      btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba 提交于
      There are two comments in btrfs_cache_block_group that I left when
      resolving conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished, the correct code is
      with the waiting and returning error. Thus the conflict resolution was
      OK.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      295a53cc
    • J
      btrfs: unlock locked extent area if we have contention · 9e769bd7
      Josef Bacik 提交于
      In production we hit the following deadlock
      
      task 1			task 2			task 3
      ------			------			------
      fiemap(file)		falloc(file)		fsync(file)
      						  write(0, 1MiB)
      						  btrfs_commit_transaction()
      						    wait_on(!pending_ordered)
      			  lock(512MiB, 1GiB)
      			  start_transaction
      			    wait_on_transaction
      
        lock(0, 1GiB)
          wait_extent_bit(512MiB)
      
      task 4
      ------
      finish_ordered_extent(0, 1MiB)
        lock(0, 1MiB)
        **DEADLOCK**
      
      This occurs because when task 1 does it's lock, it locks everything from
      0-512MiB, and then waits for the 512MiB chunk to unlock.  task 2 will
      never unlock because it's waiting on the transaction commit to happen,
      the transaction commit is waiting for the outstanding ordered extents,
      and then the ordered extent thread is blocked waiting on the 0-1MiB
      range to unlock.
      
      To fix this we have to clear anything we've locked so far, wait for the
      extent_state that we contended on, and then try to re-lock the entire
      range again.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9e769bd7
    • D
      btrfs: send: update command for protocol version check · c86eab81
      David Sterba 提交于
      For a protocol and command compatibility we have a helper that hasn't
      been updated for v3 yet. We use it for verity so update where necessary.
      
      Fixes: 38622010 ("btrfs: send: add support for fs-verity")
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c86eab81
    • B
      btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUG · 9971a741
      Boris Burkov 提交于
      We haven't finalized send stream v3 yet, so gate the send stream version
      behind CONFIG_BTRFS_DEBUG as we want some way to test it.
      
      The original verity send did not check the protocol version, so add that
      actual protection as well.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9971a741
  8. 07 10月, 2022 1 次提交
    • F
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana 提交于
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96dbcc00
  9. 29 9月, 2022 11 次提交
    • T
      btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer · cbddcc4f
      Tetsuo Handa 提交于
      syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
      commit bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
      not be moved to after clean_tree_block() because clean_tree_block() is
      calling btrfs_header_generation() since commit 55c69072 ("Btrfs:
      Fix extent_buffer usage when nodesize != leafsize").
      
      Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
      can't move btrfs_set_header_generation() to before memzero_extent_buffer().
      Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().
      
      Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
      Reported-by: Nsyzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
      Fixes: bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cbddcc4f
    • F
      btrfs: drop extent map range more efficiently · db21370b
      Filipe Manana 提交于
      Currently when dropping extent maps for a file range, through
      btrfs_drop_extent_map_range(), we do the following non-optimal things:
      
      1) We lookup for extent maps one by one, always starting the search from
         the root of the extent map tree. This is not efficient if we have
         multiple extent maps in the range;
      
      2) We check on every iteration if we have the 'split' and 'split2' spare
         extent maps in case we need to split an extent map that intersects our
         range but also crosses its boundaries (to the left, to the right or
         both cases). If our target range is for example:
      
             [2M, 8M)
      
         And we have 3 extents maps in the range:
      
             [1M, 3M) [3M, 6M) [6M, 10M[
      
         The on the first iteration we allocate two extent maps for 'split' and
         'split2', and use the 'split' to split the first extent map, so after
         the split we set 'split' to 'split2' and then set 'split2' to NULL.
      
         On the second iteration, we don't need to split the second extent map,
         but because 'split2' is now NULL, we allocate a new extent map for
         'split2'.
      
         On the third iteration we need to split the third extent map, so we
         use the extent map pointed by 'split'.
      
         So we ended up allocating 3 extent maps for splitting, but all we
         needed was 2 extent maps. We never need to allocate more than 2,
         because extent maps that need to be split are always the first one
         and the last one in the target range.
      
      Improve on this by:
      
      1) Using rb_next() to move on to the next extent map. This results in
         iterating over less nodes of the tree and it does not require comparing
         the ranges of nodes to our start/end offset;
      
      2) Allocate the 2 extent maps for splitting before entering the loop and
         never allocate more than 2. In practice it's very rare to have the
         combination of both extent map allocations fail, since we have a
         dedicated slab for extent maps, and also have the need to split two
         extent maps.
      
      This patch is part of a patchset comprised of the following patches:
      
         btrfs: fix missed extent on fsync after dropping extent maps
         btrfs: move btrfs_drop_extent_cache() to extent_map.c
         btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
         btrfs: use cond_resched_rwlock_write() during inode eviction
         btrfs: move open coded extent map tree deletion out of inode eviction
         btrfs: add helper to replace extent map range with a new extent map
         btrfs: remove the refcount warning/check at free_extent_map()
         btrfs: remove unnecessary extent map initializations
         btrfs: assert tree is locked when clearing extent map from logging
         btrfs: remove unnecessary NULL pointer checks when searching extent maps
         btrfs: remove unnecessary next extent map search
         btrfs: avoid pointless extent map tree search when flushing delalloc
         btrfs: drop extent map range more efficiently
      
      And the following fio test was done before and after applying the whole
      patchset, on a non-debug kernel (Debian's default kernel config) on a 12
      cores Intel box with 64G of ram:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=8
         fallocate=none
         group_reporting=1
         direct=0
         bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
         ioengine=psync
         filesize=2G
         runtime=300
         time_based
         directory=$MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         fio /tmp/fio-job.ini
      
         umount $MNT
      
      Result before applying the patchset:
      
         WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec
      
      Result after applying the patchset:
      
         WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      db21370b
    • F
      btrfs: avoid pointless extent map tree search when flushing delalloc · b54bb865
      Filipe Manana 提交于
      When flushing delalloc, in COW mode at cow_file_range(), before entering
      the loop that allocates extents and creates ordered extents, we do a call
      to btrfs_drop_extent_map_range() for the whole range. This is pointless
      because in the loop we call create_io_em(), which will also call
      btrfs_drop_extent_map_range() before inserting the new extent map.
      
      So remove that call at cow_file_range() not only because it is not needed,
      but also because it will make the btrfs_drop_extent_map_range() calls made
      from create_io_em() waste time searching the extent map tree, and that
      tree can be large for files with many extents. It also makes us waste time
      at btrfs_drop_extent_map_range() allocating and freeing the split extent
      maps for nothing.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b54bb865
    • F
      btrfs: remove unnecessary next extent map search · 6c05813e
      Filipe Manana 提交于
      At __tree_search(), and its single caller __lookup_extent_mapping(), there
      is no point in finding the next extent map that starts after the search
      offset if we were able to find the previous extent map that ends before
      our search offset, because __lookup_extent_mapping() ignores the next
      acceptable extent map if we were able to find the previous one.
      
      So just return immediately if we were able to find the previous extent
      map, therefore avoiding wasting time iterating the tree looking for the
      next extent map which will not be used by __lookup_extent_mapping().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6c05813e
    • F
      btrfs: remove unnecessary NULL pointer checks when searching extent maps · 08f088dd
      Filipe Manana 提交于
      The previous and next pointer arguments passed to __tree_search() are
      never NULL as the only caller of this function, __lookup_extent_mapping(),
      always passes the address of two on stack pointers. So remove the NULL
      checks and add assertions to verify the pointers.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      08f088dd
    • F
      btrfs: assert tree is locked when clearing extent map from logging · 74333c7d
      Filipe Manana 提交于
      When calling clear_em_logging() we should have a write lock on the extent
      map tree, as we will try to merge the extent map with the previous and
      next ones in the tree. So assert that we have a write lock.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74333c7d
    • F
      btrfs: remove unnecessary extent map initializations · 2e0cdaa0
      Filipe Manana 提交于
      When allocating an extent map, we use kmem_cache_zalloc() which guarantees
      the returned memory is initialized to zeroes, therefore it's pointless
      to initialize the generation and flags of the extent map to zero again.
      
      Remove those initializations, as they are pointless and slightly increase
      the object text size.
      
      Before removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o
      
      After removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2e0cdaa0
    • F
      btrfs: remove the refcount warning/check at free_extent_map() · ad5d6e91
      Filipe Manana 提交于
      At free_extent_map(), it's pointless to have a WARN_ON() to check if the
      refcount of the extent map is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ad5d6e91
    • F
      btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana 提交于
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then keep trying to add the new extent map in a loop that keeps
      retrying while the insertion of the new extent map fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a1ba4c08
    • F
      btrfs: move open coded extent map tree deletion out of inode eviction · 9c9d1b4f
      Filipe Manana 提交于
      Move the loop that removes all the extent maps from the inode's extent
      map tree during inode eviction out of inode.c and into extent_map.c, to
      btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
      extent map tree should be in extent_map.c.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9c9d1b4f
    • F
      btrfs: use cond_resched_rwlock_write() during inode eviction · 99ba0c81
      Filipe Manana 提交于
      At evict_inode_truncate_pages(), instead of manually checking if
      rescheduling is needed, then unlock the extent map tree, reschedule and
      then write lock again the tree, use the helper cond_resched_rwlock_write()
      which does all that.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      99ba0c81