1. 23 8月, 2021 1 次提交
  2. 07 7月, 2021 1 次提交
    • F
      btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Filipe Manana 提交于
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: NShin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79bd3712
  3. 21 6月, 2021 1 次提交
    • J
      btrfs: always abort the transaction if we abort a trans handle · 5963ffca
      Josef Bacik 提交于
      While stress testing our error handling I noticed that sometimes we
      would still commit the transaction even though we had aborted the
      transaction.
      
      Currently we track if a trans handle has dirtied any metadata, and if it
      hasn't we mark the filesystem as having an error (so no new transactions
      can be started), but we will allow the current transaction to complete
      as we do not mark the transaction itself as having been aborted.
      
      This sounds good in theory, but we were not properly tracking IO errors
      in btrfs_finish_ordered_io, and thus committing the transaction with
      bogus free space data.  This isn't necessarily a problem per-se with the
      free space cache, as the other guards in place would have kept us from
      accepting the free space cache as valid, but highlights a real world
      case where we had a bug and could have corrupted the filesystem because
      of it.
      
      This "skip abort on empty trans handle" is nice in theory, but assumes
      we have perfect error handling everywhere, which we clearly do not.
      Also we do not allow further transactions to be started, so all this
      does is save the last transaction that was happening, which doesn't
      necessarily gain us anything other than the potential for real
      corruption.
      
      Remove this particular bit of code, if we decide we need to abort the
      transaction then abort the current one and keep us from doing real harm
      to the file system, regardless of whether this specific trans handle
      dirtied anything or not.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5963ffca
  4. 19 4月, 2021 3 次提交
    • F
      btrfs: improve btree readahead for full send operations · ace75066
      Filipe Manana 提交于
      Currently a full send operation uses the standard btree readahead when
      iterating over the subvolume/snapshot btree, which despite bringing good
      performance benefits, it could be improved in a few aspects for use cases
      such as full send operations, which are guaranteed to visit every node
      and leaf of a btree, in ascending and sequential order. The limitations
      of that standard btree readahead implementation are the following:
      
      1) It only triggers readahead for leaves that are physically close
         to the leaf being read, within a 64K range;
      
      2) It only triggers readahead for the next or previous leaves if the
         leaf being read is not currently in memory;
      
      3) It never triggers readahead for nodes.
      
      So add a new readahead mode that addresses all these points and use it
      for full send operations.
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk and with 16GiB of RAM:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        initial_file_count=500000
        add_files $initial_file_count 0 4
      
        echo
        echo "Creating first snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        echo
        echo "Adding more files..."
        add_files $((initial_file_count / 4)) $initial_file_count 4
      
        echo
        echo "Updating 1/50th of the initial files..."
        for ((i = 1; i < $initial_file_count; i += 50)); do
            xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
        done
      
        echo
        echo "Creating second snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap2
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The durations of the full send operation in seconds were the following:
      
      Before this change:  217 seconds
      After this change:   205 seconds (-5.7%)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ace75066
    • F
      btrfs: use booleans where appropriate for the tree mod log functions · 406808ab
      Filipe Manana 提交于
      Several functions of the tree modification log use integers as booleans,
      so change them to use booleans instead, making their use more clear.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      406808ab
    • F
      btrfs: move the tree mod log code into its own file · f3a84ccd
      Filipe Manana 提交于
      The tree modification log, which records modifications done to btrees, is
      quite large and currently spread all over ctree.c, which is a huge file
      already.
      
      To make things better organized, move all that code into its own separate
      source and header files. Functions and definitions that are used outside
      of the module (mostly by ctree.c) are renamed so that they start with a
      "btrfs_" prefix. Everything else remains unchanged.
      
      This makes it easier to go over the tree modification log code every
      time I need to go read it to fix a bug.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor comment updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f3a84ccd
  5. 17 3月, 2021 1 次提交
    • F
      btrfs: fix race when cloning extent buffer during rewind of an old root · dbcc7d57
      Filipe Manana 提交于
      While resolving backreferences, as part of a logical ino ioctl call or
      fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
      operations of a root, triggering a stack trace like the following:
      
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:1210!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
        Code: 05 48 8d 74 10 (...)
        RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
        RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
        RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
        RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
        R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
        R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
        FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
        Call Trace:
         btrfs_search_old_slot+0x265/0x10d0
         ? lock_acquired+0xbb/0x600
         ? btrfs_search_slot+0x1090/0x1090
         ? free_extent_buffer.part.61+0xd7/0x140
         ? free_extent_buffer+0x13/0x20
         resolve_indirect_refs+0x3e9/0xfc0
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? add_prelim_ref.part.11+0x150/0x150
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? do_raw_spin_unlock+0xa8/0x140
         ? rb_insert_color+0x30/0x360
         ? prelim_ref_insert+0x12d/0x430
         find_parent_nodes+0x5c3/0x1830
         ? resolve_indirect_refs+0xfc0/0xfc0
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x160/0x210
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? poison_range+0x38/0x40
         ? unpoison_range+0x14/0x40
         ? trace_hardirqs_on+0x55/0x120
         btrfs_find_all_roots_safe+0x142/0x1e0
         ? find_parent_nodes+0x1830/0x1830
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         iterate_extent_inodes+0x20e/0x580
         ? tree_backref_for_extent+0x230/0x230
         ? lock_downgrade+0x3d0/0x3d0
         ? read_extent_buffer+0xdd/0x110
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? _raw_spin_unlock+0x22/0x30
         ? __kasan_check_write+0x14/0x20
         iterate_inodes_from_logical+0x129/0x170
         ? iterate_inodes_from_logical+0x129/0x170
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         ? iterate_extent_inodes+0x580/0x580
         ? __vmalloc_node+0x92/0xb0
         ? init_data_container+0x34/0xb0
         ? init_data_container+0x34/0xb0
         ? kvmalloc_node+0x60/0x80
         btrfs_ioctl_logical_to_ino+0x158/0x230
         btrfs_ioctl+0x205e/0x4040
         ? __might_sleep+0x71/0xe0
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? getrusage+0x4b6/0x9c0
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __might_fault+0x64/0xd0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __task_pid_nr_ns+0xd3/0x250
         ? lock_acquire+0xc7/0x510
         ? __fget_files+0x160/0x230
         ? __fget_light+0xf2/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fd1976e2427
        Code: 00 00 90 48 8b 05 (...)
        RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
        RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
        RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
        R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
        R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
        Modules linked in:
        ---[ end trace ec8931a1c36e57be ]---
      
        (gdb) l *(__tree_mod_log_rewind+0x3b1)
        0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
        1205                     * the modification. as we're going backwards, we do the
        1206                     * opposite of each operation here.
        1207                     */
        1208                    switch (tm->op) {
        1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        1210                            BUG_ON(tm->slot < n);
        1211                            fallthrough;
        1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        1213                    case MOD_LOG_KEY_REMOVE:
        1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);
      
      Here's what happens to hit that BUG_ON():
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            tree_mod_log_insert_root(eb X, eb Y, 1);
      
      3) At tree_mod_log_insert_root() we create tree mod log elements for each
         slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
         with a ->logical pointing to ebX->start. These are placed in an array
         named tm_list.
         Lets assume there are N elements (N pointers in eb X);
      
      4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
         element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
         ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
         to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         tm_list as argument. After that, tree_mod_log_free_eb() calls
         __tree_mod_log_insert() for each member of tm_list in reverse order,
         from highest slot in eb X, slot N - 1, to slot 0 of eb X;
      
      6) __tree_mod_log_insert() sets the sequence number of each given tree mod
         log operation - it increments fs_info->tree_mod_seq and sets
         fs_info->tree_mod_seq as the sequence number of the given tree mod log
         operation.
      
         This means that for the tm_list created at tree_mod_log_insert_root(),
         the element corresponding to slot 0 of eb X has the highest sequence
         number (1 + N), and the element corresponding to the last slot has the
         lowest sequence number (2);
      
      7) Then, after inserting tm_list's elements into the tree mod log rbtree,
         the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
         sequence number, which is N + 2;
      
      8) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      9) Later some other task T allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      10) The tree mod log user task calls btrfs_search_old_slot(), which calls
          get_old_root(), and finally that calls __tree_mod_log_oldest_root()
          with time_seq == 1 and eb_root == eb Y;
      
      11) First iteration of the while loop finds the tree mod log element with
          sequence number N + 2, for the logical address of eb Y and of type
          MOD_LOG_ROOT_REPLACE;
      
      12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
          of the loop, and set root_logical to point to tm->old_root.logical
          which corresponds to the logical address of eb X;
      
      13) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
          the old slot N - 1 of eb X (eb X had N items in it before being freed);
      
      14) We then break out of the while loop and return the tree mod log operation
          of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
          eb X, to get_old_root();
      
      15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
          and set "logical" to the logical address of eb X, which was the old
          root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
          which results in adding a tree mod log operation of type
          MOD_LOG_KEY_ADD to the tree mod log - this is done at
          ctree.c:insert_ptr() - but after adding the tree mod log operation
          and before updating the number of items in eb X from 0 to 1...
      
      17) The task at get_old_root() calls tree_mod_log_search() and gets the
          tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
          Then it enters the following if branch:
      
          if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
             (...)
          } (...)
      
          Calls read_tree_block() for eb X, which gets a reference on eb X but
          does not lock it - task T has it locked.
          Then it clones eb X while it has nritems set to 0 in its header, before
          task T sets nritems to 1 in eb X's header. From hereupon we use the
          clone of eb X which no other task has access to;
      
      18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
          mod log operation we just got from tree_mod_log_search() in the
          previous step and the cloned version of eb X;
      
      19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
          of items set in eb X's clone, which is 0. Then we enter the while loop,
          and in its first iteration we process the MOD_LOG_KEY_ADD operation,
          which just decrements "n" from 0 to (u32)-1, since "n" is declared with
          a type of u32. At the end of this iteration we call rb_next() to find the
          next tree mod log operation for eb X, that gives us the mod log operation
          of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
          number of N + 1 (steps 3 to 6);
      
      20) Then we go back to the top of the while loop and trigger the following
          BUG_ON():
      
              (...)
              switch (tm->op) {
              case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                       BUG_ON(tm->slot < n);
                       fallthrough;
              (...)
      
          Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
      
      Fix this by taking a read lock on the extent buffer before cloning it at
      ctree.c:get_old_root(). This should be done regardless of the extent
      buffer having been freed and reused, as a concurrent task might be
      modifying it (while holding a write lock on it).
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dbcc7d57
  6. 09 2月, 2021 3 次提交
  7. 18 12月, 2020 1 次提交
    • E
      btrfs: correctly calculate item size used when item key collision happens · 9a664971
      ethanwu 提交于
      Item key collision is allowed for some item types, like dir item and
      inode refs, but the overall item size is limited by the nodesize.
      
      item size(ins_len) passed from btrfs_insert_empty_items to
      btrfs_search_slot already contains size of btrfs_item.
      
      When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
      The check incorrectly reports that split leaf is required, because
      it treats the space required by the newly inserted item as
      btrfs_item + item data. But in item key collision case, only item data
      is actually needed, the newly inserted item could merge into the existing
      one. No new btrfs_item will be inserted.
      
      And split_leaf return EOVERFLOW from following code:
      
        if (extend && data_size + btrfs_item_size_nr(l, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
            return -EOVERFLOW;
      
      In most cases, when callers receive EOVERFLOW, they either return
      this error or handle in different ways. For example, in normal dir item
      creation the userspace will get errno EOVERFLOW; in inode ref case
      INODE_EXTREF is used instead.
      
      However, this is not the case for rename. To avoid the unrecoverable
      situation in rename, btrfs_check_dir_item_collision is called in
      early phase of rename. In this function, when item key collision is
      detected leaf space is checked:
      
        data_size = sizeof(*di) + name_len;
        if (data_size + btrfs_item_size_nr(leaf, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
      
      the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
      refers to existing item size, the condition here correctly calculates
      the needed size for collision case rather than the wrong case above.
      
      The consequence of inconsistent condition check between
      btrfs_check_dir_item_collision and btrfs_search_slot when item key
      collision happens is that we might pass check here but fail
      later at btrfs_search_slot. Rename fails and volume is forced readonly
      
        [436149.586170] ------------[ cut here ]------------
        [436149.586173] BTRFS: Transaction aborted (error -75)
        [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
        [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
        [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
        [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
        [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
        [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
        [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
        [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
        [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
        [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [436149.586296] Call Trace:
        [436149.586311]  vfs_rename+0x383/0x920
        [436149.586313]  ? vfs_rename+0x383/0x920
        [436149.586315]  do_renameat2+0x4ca/0x590
        [436149.586317]  __x64_sys_rename+0x20/0x30
        [436149.586324]  do_syscall_64+0x5a/0x120
        [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [436149.586332] RIP: 0033:0x7fa3133b1d37
        [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
        [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
        [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
        [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
      
      Thanks to Hans van Kranenburg for information about crc32 hash collision
      tools, I was able to reproduce the dir item collision with following
      python script.
      https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
      it under a btrfs volume will trigger the abort transaction.  It simply
      creates files and rename them to forged names that leads to
      hash collision.
      
      There are two ways to fix this. One is to simply revert the patch
      878f2d2c ("Btrfs: fix max dir item size calculation") to make the
      condition consistent although that patch is correct about the size.
      
      The other way is to handle the leaf space check correctly when
      collision happens. I prefer the second one since it correct leaf
      space check in collision case. This fix will not account
      sizeof(struct btrfs_item) when the item already exists.
      There are two places where ins_len doesn't contain
      sizeof(struct btrfs_item), however.
      
        1. extent-tree.c: lookup_inline_extent_backref
        2. file-item.c: btrfs_csum_file_blocks
      
      to make the logic of btrfs_search_slot more clear, we add a flag
      search_for_extension in btrfs_path.
      
      This flag indicates that ins_len passed to btrfs_search_slot doesn't
      contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
      will use the actual size needed to calculate the required leaf space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: Nethanwu <ethanwu@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a664971
  8. 10 12月, 2020 1 次提交
    • Q
      btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors · 884b07d0
      Qu Wenruo 提交于
      To support sectorsize < PAGE_SIZE case, we need to take extra care of
      extent buffer accessors.
      
      Since sectorsize is smaller than PAGE_SIZE, one page can contain
      multiple tree blocks, we must use eb->start to determine the real offset
      to read/write for extent buffer accessors.
      
      This patch introduces two helpers to do this:
      
      - get_eb_page_index()
        This is to calculate the index to access extent_buffer::pages.
        It's just a simple wrapper around "start >> PAGE_SHIFT".
      
        For sectorsize == PAGE_SIZE case, nothing is changed.
        For sectorsize < PAGE_SIZE case, we always get index as 0, and
        the existing page shift also works.
      
      - get_eb_offset_in_page()
        This is to calculate the offset to access extent_buffer::pages.
        This needs to take extent_buffer::start into consideration.
      
        For sectorsize == PAGE_SIZE case, extent_buffer::start is always
        aligned to PAGE_SIZE, thus adding extent_buffer::start to
        offset_in_page() won't change the result.
        For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
        us the correct offset to access.
      
      This patch will touch the following parts to cover all extent buffer
      accessors:
      
      - BTRFS_SETGET_HEADER_FUNCS()
      - read_extent_buffer()
      - read_extent_buffer_to_user()
      - memcmp_extent_buffer()
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer_full()
      - copy_extent_buffer()
      - memcpy_extent_buffer()
      - memmove_extent_buffer()
      - btrfs_get_token_##bits()
      - btrfs_get_##bits()
      - btrfs_set_token_##bits()
      - btrfs_set_##bits()
      - generic_bin_search()
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      884b07d0
  9. 08 12月, 2020 12 次提交
  10. 07 10月, 2020 16 次提交
    • J
      btrfs: cleanup cow block on error · 572c83ac
      Josef Bacik 提交于
      In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
      to a system lockup. It gets stuck trying to write back inodes, and the
      write back thread was trying to lock an extent buffer:
      
        $ cat /proc/2143497/stack
        [<0>] __btrfs_tree_lock+0x108/0x250
        [<0>] lock_extent_buffer_for_io+0x35e/0x3a0
        [<0>] btree_write_cache_pages+0x15a/0x3b0
        [<0>] do_writepages+0x28/0xb0
        [<0>] __writeback_single_inode+0x54/0x5c0
        [<0>] writeback_sb_inodes+0x1e8/0x510
        [<0>] wb_writeback+0xcc/0x440
        [<0>] wb_workfn+0xd7/0x650
        [<0>] process_one_work+0x236/0x560
        [<0>] worker_thread+0x55/0x3c0
        [<0>] kthread+0x13a/0x150
        [<0>] ret_from_fork+0x1f/0x30
      
      This is because we got an error while COWing a block, specifically here
      
              if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
                      ret = btrfs_reloc_cow_block(trans, root, buf, cow);
                      if (ret) {
                              btrfs_abort_transaction(trans, ret);
                              return ret;
                      }
              }
      
        [16402.241552] BTRFS: Transaction aborted (error -2)
        [16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
        [16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
        [16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        [16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
        [16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
        [16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
        [16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
        [16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
        [16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
        [16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
        [16402.256321] FS:  00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
        [16402.256973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
        [16402.257867] Call Trace:
        [16402.258072]  btrfs_cow_block+0x109/0x230
        [16402.258356]  btrfs_search_slot+0x530/0x9d0
        [16402.258655]  btrfs_lookup_file_extent+0x37/0x40
        [16402.259155]  __btrfs_drop_extents+0x13c/0xd60
        [16402.259628]  ? btrfs_block_rsv_migrate+0x4f/0xb0
        [16402.259949]  btrfs_replace_file_extents+0x190/0x820
        [16402.260873]  btrfs_clone+0x9ae/0xc00
        [16402.261139]  btrfs_extent_same_range+0x66/0x90
        [16402.261771]  btrfs_remap_file_range+0x353/0x3b1
        [16402.262333]  vfs_dedupe_file_range_one.part.0+0xd5/0x140
        [16402.262821]  vfs_dedupe_file_range+0x189/0x220
        [16402.263150]  do_vfs_ioctl+0x552/0x700
        [16402.263662]  __x64_sys_ioctl+0x62/0xb0
        [16402.264023]  do_syscall_64+0x33/0x40
        [16402.264364]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [16402.264862] RIP: 0033:0x7f542c7d15cb
        [16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
        [16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
        [16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
        [16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
        [16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
        [16402.271498] irq event stamp: 0
        [16402.271846] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273343] softirqs last  enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
        [16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [16402.274338] ---[ end trace 737874a5a41a8236 ]---
        [16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.276179] BTRFS info (device dm-9): forced readonly
        [16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
        [16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
        [16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
      
      The problem here is that as soon as we allocate the new block it is
      locked and marked dirty in the btree inode.  This means that we could
      attempt to writeback this block and need to lock the extent buffer.
      However we're not unlocking it here and thus we deadlock.
      
      Fix this by unlocking the cow block if we have any errors inside of
      __btrfs_cow_block, and also free it so we do not leak it.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      572c83ac
    • N
      btrfs: improve error message in setup_items_for_insert · 7269ddd2
      Nikolay Borisov 提交于
      Reword and update formats to match variable types.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update formats ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7269ddd2
    • N
    • N
      btrfs: sink total_data parameter in setup_items_for_insert · fc0d82e1
      Nikolay Borisov 提交于
      That parameter can easily be derived based on the "data_size" and "nr"
      parameters exploit this fact to simply the function's signature. No
      functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc0d82e1
    • N
      btrfs: eliminate total_size parameter from setup_items_for_insert · 3dc9dc89
      Nikolay Borisov 提交于
      The value of this argument can be derived from the total_data as it's
      simply the value of the data size + size of btrfs_items being touched.
      Move the parameter calculation inside the function. This results in a
      simpler interface and also a minor size reduction:
      
      ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
      Function                                     old     new   delta
      btrfs_duplicate_item                         260     259      -1
      setup_items_for_insert                      1200    1190     -10
      btrfs_insert_empty_items                     177     154     -23
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3dc9dc89
    • N
      btrfs: re-arrange statements in setup_items_for_insert · fc0716c2
      Nikolay Borisov 提交于
      Rearrange statements calculating the offset of the newly added items so
      that the calculation has to be done only once. No functional change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc0716c2
    • J
      btrfs: use BTRFS_NESTED_NEW_ROOT for double splits · ca9d473a
      Josef Bacik 提交于
      I've made this change separate since it requires both of the newly added
      NESTED flags and I didn't want to slip it into one of those changes.
      
      If we do a double split of a node we can end up doing a
      BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it
      appears as a double lock.  Since we're maxed out on subclasses, use
      BTRFS_NESTED_NEW_ROOT if we had to do a double split.  This is OK
      because we won't have to do a double split if we had to insert a new
      root, and the new root would be at a higher level anyway.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ca9d473a
    • J
      btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots · cf6f34aa
      Josef Bacik 提交于
      The way we add new roots is confusing from a locking perspective for
      lockdep.  We generally have the rule that we lock things in order from
      highest level to lowest, but in the case of adding a new level to the
      tree we actually allocate a new block for the root, which makes the
      locking go in reverse.  A similar issue exists for snapshotting, we cow
      the original root for the root of a new tree, however they're at the
      same level.  Address this by using BTRFS_NESTING_NEW_ROOT for these
      operations.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cf6f34aa
    • J
      btrfs: introduce BTRFS_NESTING_SPLIT for split blocks · 4dff97e6
      Josef Bacik 提交于
      If we are splitting a leaf/node, we could do something like the
      following
      
      lock(leaf)  BTRFS_NESTING_NORMAL
        lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
          push from leaf -> left
            reset path to point to left
              split left
                allocate new block, lock block BTRFS_NESTING_SPLIT
      
      at the new block point we need to have a different nesting level,
      because we have already used either BTRFS_NESTING_LEFT or
      BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
      adjacent leaves.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4dff97e6
    • J
      btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW · bf59a5a2
      Josef Bacik 提交于
      For similar reasons as BTRFS_NESTING_COW, we need
      BTRFS_NESTING_LEFT/RIGHT_COW.  The pattern is this
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
            lock left -> BTRFS_NESTING_LEFT
              cow left -> BTRFS_NESTING_LEFT_COW
      
      We need this in order to indicate to lockdep that these locks are
      discrete and are being taken in a safe order.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf59a5a2
    • J
      btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT · bf77467a
      Josef Bacik 提交于
      Our lockdep maps are based on rootid+level, however in some cases we
      will lock adjacent blocks on the same level, namely in searching forward
      or in split/balance.  Because of this lockdep will complain, so we need
      a separate subclass to indicate to lockdep that these are different
      locks.
      
      lock leaf -> BTRFS_NESTING_NORMAL
        cow leaf -> BTRFS_NESTING_COW
          split leaf
             lock left -> BTRFS_NESTING_LEFT
             lock right -> BTRFS_NESTING_RIGHT
      
      The above graph illustrates the need for this new nesting subclass.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf77467a
    • J
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik 提交于
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9631e4cc
    • J
      btrfs: add nesting tags to the locking helpers · fd7ba1c1
      Josef Bacik 提交于
      We will need these when we switch to an rwsem, so plumb in the
      infrastructure here to use later on.  I violate the 80 character limit
      some here because it'll be cleaned up later.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd7ba1c1
    • J
      btrfs: introduce btrfs_path::recurse · 51899412
      Josef Bacik 提交于
      Our current tree locking stuff allows us to recurse with read locks if
      we're already holding the write lock.  This is necessary for the space
      cache inode, as we could be holding a lock on the root_tree root when we
      need to cache a block group, and thus need to be able to read down the
      root_tree to read in the inode cache.
      
      We can get away with this in our current locking, but we won't be able
      to with a rwsem.  Handle this by purposefully annotating the places
      where we require recursion, so that in the future we can maybe come up
      with a way to avoid the recursion.  In the case of the free space inode,
      this will be superseded by the free space tree.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      51899412
    • Q
      btrfs: ctree: check key order before merging tree blocks · d16c702f
      Qu Wenruo 提交于
      [BUG]
      With a crafted image, btrfs can panic at btrfs_del_csums():
      
        kernel BUG at fs/btrfs/ctree.c:3188!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180
        RSP: 0018:ffff976141257ab8 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000
        RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf
        RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe
        R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae
        FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0
        Call Trace:
        truncate_one_csum+0xac/0xf0
        btrfs_del_csums+0x24f/0x3a0
        __btrfs_free_extent.isra.72+0x5a7/0xbe0
        __btrfs_run_delayed_refs+0x539/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 93bf9db00e6c374e ]---
      
      [CAUSE]
      This crafted image has a tricky key order corruption:
      
        checksum tree key (CSUM_TREE ROOT_ITEM 0)
        node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE
                ...
                key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19
                key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19
                ...
      
        leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE
                item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672
                        range start 73785344 end 75497472 length 1712128
                item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4
                        range start 75497472 end 75501568 length 4096
                item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740
                        range start 75501568 end 77283328 length 1781760
                item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4
                        range start 77283328 end 77287424 length 4096
                item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<<
                        range start 4120596480 end 4120903680 length 307200
        leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE
                item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672
                        range start 77594624 end 79306752 length 1712128
                ...
      
      Note the item 4 key of leaf 29757440, which is obviously too large, and
      even larger than the first key of the next leaf.
      
      However it still follows the key order in that tree block, thus tree
      checker is unable to detect it at read time, since tree checker can only
      work inside one leaf, thus such complex corruption can't be detected in
      advance.
      
      [FIX]
      The next time to detect such problem is at tree block merge time,
      which is in push_node_left(), balance_node_right(), push_leaf_left() or
      push_leaf_right().
      
      Now we check if the key order of the right-most key of the left node is
      larger than the left-most key of the right node.
      
      By this we don't need to call the full tree-checker, while still keeping
      the key order correct as key order in each node is already checked by
      tree checker thus we only need to check the above two slots.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d16c702f
    • R
      btrfs: delete duplicated words + other fixes in comments · 260db43c
      Randy Dunlap 提交于
      Delete repeated words in fs/btrfs/.
      {to, the, a, and old}
      and change "into 2 part" to "into 2 parts".
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      260db43c