1. 03 Jan, 2022 20 commits
    • btrfs: handle priority ticket failures in their respective helpers · 9f35f76d
      Josef Bacik committed
      Currently the error case for the priority tickets is handled where we
      deal with all of the tickets, priority and non-priority.  This is OK in
      general, but it makes for some awkward locking.  We take and drop the
      space_info->lock back to back because of these different types of
      tickets.
      
      Rework the code to handle priority ticket failures in their respective
      helpers.  This allows us to be less wonky with our space_info->lock
      usage, and means that the main handler simply has to check
      ticket->error, as the ticket is guaranteed to be off any list and
      completely handled by the time it exits one of the handlers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: cache reported zone during mount · 16beac87
      Naohiro Aota committed
      When mounting a device, we are reporting the zones twice: once for
      checking the zone attributes in btrfs_get_dev_zone_info and once for
      loading block groups' zone info in
      btrfs_load_block_group_zone_info(). With a lot of block groups, that
      leads to a lot of REPORT ZONE commands and slows down the mount
      process.
      
      This patch introduces a zone info cache in struct
      btrfs_zoned_device_info. The cache is populated while in
      btrfs_get_dev_zone_info() and used for
      btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
      commands. The zone cache is then released after loading the block
      groups, as it is of little use at run time.
      
      Benchmark: Mount an HDD with 57,007 block groups
      Before patch: 171.368 seconds
      After patch: 64.064 seconds
      
      While it still takes a minute due to the slowness of loading all the
      block groups, the patch reduces the mount time to roughly 1/3 of the
      original (about 63% less).
      
      Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
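The caching idea in this commit can be illustrated with a small user-space sketch. Everything here (`toy_zoned_dev`, `toy_report_zone()`, `NR_ZONES`, the zone layout) is hypothetical and only models the counting of REPORT ZONE commands; it is not the btrfs implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_ZONES 8 /* hypothetical zone count for the toy device */

struct zone_desc {
	unsigned long long start; /* first sector of the zone */
	unsigned long long wp;    /* write pointer */
};

struct toy_zoned_dev {
	struct zone_desc cache[NR_ZONES];
	bool cached[NR_ZONES];
	bool cache_enabled;
	unsigned int report_cmds; /* simulated REPORT ZONE commands issued */
};

/* Stand-in for the expensive device command; it only counts invocations. */
static struct zone_desc toy_report_zone(struct toy_zoned_dev *dev, size_t idx)
{
	struct zone_desc z = { .start = idx * 256ULL, .wp = idx * 256ULL };

	dev->report_cmds++;
	return z;
}

/* Serve a zone from the cache when possible, otherwise ask the device. */
static struct zone_desc toy_get_zone(struct toy_zoned_dev *dev, size_t idx)
{
	struct zone_desc z;

	if (dev->cache_enabled && dev->cached[idx])
		return dev->cache[idx];

	z = toy_report_zone(dev, idx);
	if (dev->cache_enabled) {
		dev->cache[idx] = z;
		dev->cached[idx] = true;
	}
	return z;
}

/* Drop the cache once mount-time loading is finished. */
static void toy_drop_zone_cache(struct toy_zoned_dev *dev)
{
	for (size_t i = 0; i < NR_ZONES; i++)
		dev->cached[i] = false;
	dev->cache_enabled = false;
}
```

With the cache enabled, a second pass over all zones issues no further simulated commands; after `toy_drop_zone_cache()` every lookup goes back to the device, which mirrors the commit's choice to release the cache after the block groups are loaded.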
    • btrfs: remove unused parameter fs_devices from btrfs_init_workqueues · d21deec5
      Su Yue committed
      Since commit ba8a9d07 ("Btrfs: delete the entire async bio submission
      framework") removed submit workqueues, the parameter fs_devices is not used
      anymore.
      
      Remove it, no functional changes.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce the scope of the tree log mutex during transaction commit · dfba78dc
      Filipe Manana committed
      In the transaction commit path we are acquiring the tree log mutex too
      early and we have a stale comment because:
      
      1) It mentions a function named btrfs_commit_tree_roots(), which no
         longer exists. It was the old name of commit_cowonly_roots(), renamed
         a very long time ago by commit 5d4f98a2 ("Btrfs: Mixed back
         reference  (FORWARD ROLLING FORMAT CHANGE)");
      
      2) It mentions that we need to acquire the tree log mutex at that point
         to ensure we have no running log writers. That is not correct anymore,
         for many years at least, since we are guaranteed that we do not have
         any log writers at that point simply because we have set the state of
         the transaction to TRANS_STATE_COMMIT_DOING and have waited for all
         writers to complete - meaning no one can log until we change the state
         of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the
         transaction or start a new one will block until we do that state
         transition;
      
      3) The comment mentions a "trans mutex", which has not existed since
         2011, when commit a4abeea4 ("Btrfs: kill trans_mutex") removed it;
      
      4) The current use of the tree log mutex is to ensure proper serialization
         of super block writes - if someone started a new transaction and uses it
         for logging, it will wait for the previous transaction to write its
         super block before writing the super block when attempting to sync the
         log.
      
      So acquire the tree log mutex only when it's absolutely needed, before
      setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the
      stale comment, add some assertions and new comments where appropriate.
      
      Also, this has no effect on concurrency or performance, since the new
      start of the critical section is still when the transaction is in the
      state TRANS_STATE_COMMIT_DOING.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: consolidate device_list_mutex in prepare_sprout to its parent · 849eae5e
      Anand Jain committed
      btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
      so that its parent function btrfs_init_new_device() can add the new sprout
      device to fs_info->fs_devices.
      
      Both btrfs_prepare_sprout() and btrfs_init_new_device() need
      device_list_mutex, but they take it separately, which leaves a small
      race window. Close it by holding device_list_mutex across both
      functions btrfs_init_new_device() and btrfs_prepare_sprout().
      
      Split btrfs_prepare_sprout() into btrfs_init_sprout() and
      btrfs_setup_sprout(). This split is essential because device_list_mutex
      must not be held for allocations in btrfs_init_sprout() but must be held
      for btrfs_setup_sprout(). So now a common device_list_mutex can be used
      between btrfs_init_new_device() and btrfs_setup_sprout().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch seeding_dev in init_new_device to bool · fd880809
      Anand Jain committed
      Declare int seeding_dev as a bool. Also, move its declaration a line
      below to adjust packing.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: remove unused type parameter to iterate_inode_ref_t · b1dea4e7
      Omar Sandoval committed
      Again, I don't think this was ever used since iterate_dir_item() is only
      used for xattrs. No functional change.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: remove unused found_type parameter to lookup_dir_item_inode() · eab67c06
      Omar Sandoval committed
      As far as I can tell, this was never used. No functional change.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_item_end_nr to btrfs_item_data_end · dc2e724e
      Josef Bacik committed
      The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
      the offset of the end of the data the item points to.  In fact, all of
      the helpers that we use with btrfs_item_end_nr() have data in their
      name, like BTRFS_LEAF_DATA_SIZE() and leaf_data().  Rename it to
      btrfs_item_data_end() to make it clear what this helper is giving us.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the btrfs_item_end() helper · 5a08663d
      Josef Bacik committed
      We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
      be collapsed.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: drop the _nr from the item helpers · 3212fa14
      Josef Bacik committed
      Now that all call sites are using the slot number to modify item values,
      rename the SETGET helpers to raw_item_*(), rework the _nr() helpers to
      be the btrfs_item_*() and btrfs_set_item_*() helpers, and then convert
      all of the callers to the new helpers.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce item_nr token variant helpers · 74794207
      Josef Bacik committed
      The last remaining place where we have the pattern of
      
      	item = btrfs_item_nr(slot)
      	<do something with the item>
      
      are the token helpers.  Handle this by introducing token helpers that
      will do the btrfs_item_nr() work inside of the helper itself, and then
      convert all users of the btrfs_item token helpers to the new _nr()
      variants.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_file_extent_inline_item_len take a slot · 437bd07e
      Josef Bacik committed
      Instead of getting the btrfs_item for this, simply pass in the slot of
      the item and then use the btrfs_item_size_nr() helper inside of
      btrfs_file_extent_inline_item_len().
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add btrfs_set_item_*_nr() helpers · c91666b1
      Josef Bacik committed
      We have the pattern of
      
      	item = btrfs_item_nr(slot);
      	btrfs_set_item_*(leaf, item);
      
      in a bunch of places in our code.  Fix this by adding
      btrfs_set_item_*_nr() helpers which will do the appropriate work, and
      replace those calls with
      
      	btrfs_set_item_*_nr(leaf, slot);
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
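The slot-based helper pattern used throughout this series can be sketched in isolation. The types and names below (`toy_leaf`, `toy_item`, `toy_item_nr()` and friends) are hypothetical simplifications: the real btrfs helpers operate on extent buffers and on-disk offsets rather than plain C structs.

```c
#include <stdint.h>

#define TOY_MAX_ITEMS 4 /* hypothetical leaf capacity */

struct toy_item {
	uint32_t offset;
	uint32_t size;
};

struct toy_leaf {
	struct toy_item items[TOY_MAX_ITEMS];
};

/* Map a slot number to its item header, loosely like btrfs_item_nr(). */
static struct toy_item *toy_item_nr(struct toy_leaf *leaf, int slot)
{
	return &leaf->items[slot];
}

/* Old-style accessors: the caller has to look up the item first. */
static void toy_set_item_size(struct toy_item *item, uint32_t size)
{
	item->size = size;
}

static uint32_t toy_item_size(const struct toy_item *item)
{
	return item->size;
}

/* Slot-based wrappers: the lookup happens inside the helper. */
static void toy_set_item_size_nr(struct toy_leaf *leaf, int slot, uint32_t size)
{
	toy_set_item_size(toy_item_nr(leaf, slot), size);
}

static uint32_t toy_item_size_nr(struct toy_leaf *leaf, int slot)
{
	return toy_item_size(toy_item_nr(leaf, slot));
}
```

Callers then write `toy_set_item_size_nr(&leaf, slot, size)` instead of the two-step lookup-then-set sequence, which is the shape of the conversion described in these commits.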
    • btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhere · 227f3cd0
      Josef Bacik committed
      We have this pattern in a lot of places
      
      	item = btrfs_item_nr(slot);
      	btrfs_item_size(leaf, item);
      
      when we could simply use
      
      	btrfs_item_size_nr(leaf, slot);
      
      Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
      _nr variation of the helpers.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove no longer needed logic for replaying directory deletes · ccae4a19
      Filipe Manana committed
      Now that we log only dir index keys when logging a directory, we no longer
      need to deal with dir item keys in the log replay code for replaying
      directory deletes. This is also true for the case when we replay a log
      tree created by a kernel that still logs dir items.
      
      So remove the remaining code in the directory delete replay algorithm
      that deals with dir item keys.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: only copy dir index keys when logging a directory · 339d0354
      Filipe Manana committed
      Currently, when logging a directory, we copy both dir items and dir index
      items from the fs/subvolume tree to the log tree. Both items have exactly
      the same data (same struct btrfs_dir_item), the difference lies in the key
      values, where a dir index key contains the index number of a directory
      entry while the dir item key does not, as it's used for doing fast lookups
      of an entry by name, while the former is used for sorting entries when
      listing a directory.
      
      We can exploit that and log only the dir index items, since they contain
      all the information needed to correctly add, replace and delete directory
      entries when replaying a log tree. Logging only the dir index items is
      also backward and forward compatible: an unpatched kernel (without this
      change) can correctly replay a log tree generated by a patched kernel
      (with this patch), and a patched kernel can correctly replay a log tree
      generated by an unpatched kernel.
      
      The backward compatibility is ensured because:
      
      1) For inserting a new dentry: a dentry is only inserted when we find a
         new dir index key - we can only insert if we know the dir index offset,
         which is encoded in the dir index key's offset;
      
      2) For deleting dentries: during log replay, before adding or replacing
         dentries, we first replay dentry deletions. Whenever we find a dir item
         key or a dir index key in the subvolume/fs tree that is not logged in
         a range for which the log tree is authoritative, we do the unlink of
         the dentry, which removes both the existing dir item key and the dir
         index key. Therefore logging just dir index keys is enough to ensure
         dentry deletions are correctly replayed;
      
      3) For dentry replacements: they work when we log only dir index keys
         and this is mostly due to a combination of 1) and 2). If we replace a
         dentry with name "foobar" to point from inode A to inode B, then we
         know the dir index key for the new dentry is different from the old
         one, as it has an index number (key offset) larger than the old one.
         This results in replaying a deletion, through replay_dir_deletes(),
         that causes the old dentry to be removed, both the dir item key and
         the dir index key, as mentioned at 2). Then when processing the new
         dir index key, we add the new dentry, adding both a new dir item key
         and a new index key pointing to inode B, as stated in 1).
      
      The forward compatibility, the ability for a patched kernel to replay a
      log created by an older, unpatched kernel, comes from the changes required
      for making sure we are able to replay a log that only contains dir index
      keys - we simply ignore every dir item key we find.
      
      So modify directory logging to log only dir index items, and modify the
      log replay process to ignore dir item keys from log trees created by an
      unpatched kernel and process only dir index keys. This reduces the
      amount of logged metadata by about half, and therefore the time spent
      logging or fsyncing large directories (less CPU time and less IO).
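The forward-compatibility rule described above (skip every dir item key, act only on dir index keys) can be sketched as a simple filter. The key types and the `toy_replay_dir_keys()` function are hypothetical stand-ins and do not reproduce the real log replay code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two directory key types. */
enum toy_key_type { TOY_DIR_ITEM_KEY, TOY_DIR_INDEX_KEY };

struct toy_log_key {
	enum toy_key_type type;
	unsigned long long offset; /* index number for dir index keys */
};

/*
 * Walk the logged keys and count the entries that would be replayed.
 * Dir item keys, which only an unpatched kernel logs, are skipped
 * because the matching dir index key carries the same entry data.
 */
static size_t toy_replay_dir_keys(const struct toy_log_key *keys, size_t nr)
{
	size_t replayed = 0;

	for (size_t i = 0; i < nr; i++) {
		if (keys[i].type != TOY_DIR_INDEX_KEY)
			continue;
		replayed++;
	}
	return replayed;
}
```

In this model, a log written by an unpatched kernel (both key types per entry) and a log written by a patched kernel (index keys only) replay the same number of entries.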
      
      The following test script was used to measure this change:
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         NUM_NEW_FILES=1000000
         NUM_FILE_DELETES=10000
      
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         mkdir $MNT/testdir
      
         for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
                 echo -n > $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
         # sync to force transaction commit and wipeout the log.
         sync
      
         del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
         for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
                 rm -f $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
         echo
      
         umount $MNT
      
      The tests were run on a physical machine, with a non-debug kernel (Debian's
      default kernel config), for different values of $NUM_NEW_FILES and
      $NUM_FILE_DELETES, and the results were the following:
      
      ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 8412 ms after adding 1000000 files
      dir fsync took 500 ms after deleting 10000 files
      
      ** After patch, NUM_NEW_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 **
      
      dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
      dir fsync took 269 ms after deleting 10000 files    (-46.2%)
      
      ** Before patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 745 ms after adding 100000 files
      dir fsync took 59 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 100 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 404 ms after adding 100000 files   (-45.8%)
      dir fsync took 31 ms after deleting 1000 files    (-47.5%)
      
      ** Before patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 67 ms after adding 10000 files
      dir fsync took 9 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 10 000, NUM_FILE_DELETES = 1 000 **
      
      dir fsync took 36 ms after adding 10000 files   (-46.3%)
      dir fsync took 5 ms after deleting 1000 files   (-44.4%)
      
      ** Before patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 9 ms after adding 1000 files
      dir fsync took 4 ms after deleting 100 files
      
      ** After patch, NUM_NEW_FILES = 1 000, NUM_FILE_DELETES = 100 **
      
      dir fsync took 7 ms after adding 1000 files     (-22.2%)
      dir fsync took 3 ms after deleting 100 files    (-25.0%)
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
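The percentages quoted next to the "after patch" timings are relative changes against the "before patch" value. A minimal helper to recompute them, with the hypothetical name `percent_change()`, could look like this:

```c
/*
 * Relative change from before_ms to after_ms, in percent.
 * Negative values mean the operation became faster.
 */
static double percent_change(double before_ms, double after_ms)
{
	return (after_ms - before_ms) / before_ms * 100.0;
}
```

For instance, `percent_change(8412, 4252)` gives roughly -49.45, which rounds to the -49.5% quoted for the 1 000 000 file case, and `percent_change(4, 3)` gives exactly -25.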
    • btrfs: remove spurious unlock/lock of unused_bgs_lock · 17130a65
      Nikolay Borisov committed
      Since both the unused block groups list and the reclaim bgs list are
      protected by unused_bgs_lock, free them in the same critical section
      without doing an extra unlock/lock pair.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between quota enable and other quota operations · 232796df
      Filipe Manana committed
      When enabling quotas, we attempt to commit a transaction while holding the
      mutex fs_info->qgroup_ioctl_lock. This can result in a deadlock with other
      quota operations such as:
      
      - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
      
      - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
      
      This is because these operations join a transaction and after that they
      attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
      after joining or starting a transaction is a pattern followed everywhere
      in qgroups, so the quota enablement operation is the one at fault here,
      and should not commit a transaction while holding that mutex.
      
      Fix this by making the transaction commit while not holding the mutex.
      We are safe from two concurrent tasks trying to enable quotas because
      we are serialized by the rw semaphore fs_info->subvol_sem at
      btrfs_ioctl_quota_ctl(), which is the only call site for enabling
      quotas.
      
      When this deadlock happens, it produces a trace like the following:
      
        INFO: task syz-executor:25604 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
        btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
        btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
        btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f86920b2c4d
        RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
        RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
        RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
        R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
        INFO: task syz-executor:25628 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
        __mutex_lock_common kernel/locking/mutex.c:669 [inline]
        __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
        btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
        btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
        btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
      Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
      CC: stable@vger.kernel.org # 4.19
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
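The deadlock described in this commit is a classic opposite-order acquisition of two resources. The toy model below only encodes acquisition order and checks for an inversion. The names (`toy_res`, `toy_order`, `toy_can_deadlock()`) are hypothetical, and treating "waiting for a transaction commit" as acquiring a resource is a simplifying assumption.

```c
#include <stdbool.h>

/* The two resources involved, reduced to labels. */
enum toy_res { TOY_TRANSACTION, TOY_QGROUP_IOCTL_LOCK };

/* Order in which a code path takes (or waits on) the two resources. */
struct toy_order {
	enum toy_res first;
	enum toy_res second;
};

/* Two paths can deadlock if they take the same pair in opposite order. */
static bool toy_can_deadlock(struct toy_order a, struct toy_order b)
{
	return a.first != a.second &&
	       a.first == b.second && a.second == b.first;
}
```

Before the fix, quota enable is modelled as `{ TOY_QGROUP_IOCTL_LOCK, TOY_TRANSACTION }` while qgroup creation is `{ TOY_TRANSACTION, TOY_QGROUP_IOCTL_LOCK }`, so the check reports a possible deadlock. Once the commit no longer happens under the mutex, both paths are modelled with the same order and the inversion disappears.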
    • btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      Filipe Manana committed
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data block groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
      btrfs_dio_iomap_begin(). This also simplifies the logic a bit and removes
      the inefficiency of doing unnecessary data reservations.
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
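The ordering problem fixed here can be reduced to a few lines. The sketch below is a hypothetical model (`toy_space`, `toy_dio_write_old()`, `toy_dio_write_new()`), not the btrfs code: it only contrasts "reserve first, then check for NOCOW" with "reserve only when a COW write is needed".

```c
#include <errno.h>
#include <stdbool.h>

struct toy_space {
	long long free_bytes; /* data space still available for reservation */
};

/* Reserve len bytes of data space, or fail with -ENOSPC. */
static int toy_reserve_data(struct toy_space *s, long long len)
{
	if (s->free_bytes < len)
		return -ENOSPC;
	s->free_bytes -= len;
	return 0;
}

/* Old ordering: reserve up front, so a full filesystem fails early. */
static int toy_dio_write_old(struct toy_space *s, long long len, bool can_nocow)
{
	int ret = toy_reserve_data(s, len);

	if (ret)
		return ret; /* the NOCOW check is never reached */
	if (can_nocow)
		s->free_bytes += len; /* release the unneeded reservation */
	return 0;
}

/* New ordering: only reserve when the write really has to COW. */
static int toy_dio_write_new(struct toy_space *s, long long len, bool can_nocow)
{
	if (can_nocow)
		return 0;
	return toy_reserve_data(s, len);
}
```

With no free data space, the old ordering returns -ENOSPC even for a NOCOW-capable range, while the new ordering succeeds for NOCOW and still fails when a COW write is required.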
  2. 31 Dec, 2021 1 commit
  3. 23 Dec, 2021 1 commit
  4. 19 Dec, 2021 1 commit
    • NFSD: Fix READDIR buffer overflow · 53b1119a
      Chuck Lever committed
      If a client sends a READDIR count argument that is too small (say,
      zero), then the buffer size calculation in the new init_dirlist
      helper functions results in an underflow, allowing the XDR stream
      functions to write beyond the actual buffer.
      
      This calculation has always been suspect. NFSD has never sanity-
      checked the READDIR count argument, but the old entry encoders
      managed the problem correctly.
      
      With the commits below, entry encoding changed, exposing the
      underflow to the pointer arithmetic in xdr_reserve_space().
      
      Modern NFS clients attempt to retrieve as much data as possible
      for each READDIR request. Also, we have no unit tests that
      exercise the behavior of READDIR at the lower bound of @count
      values. Thus this case was missed during testing.
      Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
      Fixes: f5dcccd6 ("NFSD: Update the NFSv2 READDIR entry encoder to use struct xdr_stream")
      Fixes: 7f87fc2d ("NFSD: Update NFSv3 READDIR entry encoders to use struct xdr_stream")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
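The underflow described above is ordinary unsigned wrap-around. The sketch below assumes a hypothetical fixed overhead (`TOY_DIRLIST_OVERHEAD`) subtracted from the client-supplied count; the actual NFSD calculation and the actual fix are not shown in this log, so the guarded variant is only one plausible way to avoid the wrap.

```c
#include <stdint.h>

/* Hypothetical number of bytes reserved for fixed trailing fields. */
#define TOY_DIRLIST_OVERHEAD 16u

/* Unchecked: wraps to a huge value when count is below the overhead. */
static uint32_t toy_dirlist_buflen_unchecked(uint32_t count)
{
	return count - TOY_DIRLIST_OVERHEAD;
}

/* Guarded: a too-small count yields an empty buffer instead of wrapping. */
static uint32_t toy_dirlist_buflen_checked(uint32_t count)
{
	if (count < TOY_DIRLIST_OVERHEAD)
		return 0;
	return count - TOY_DIRLIST_OVERHEAD;
}
```

A count of zero makes the unchecked version report a buffer length close to `UINT32_MAX`, which is the kind of value that lets a stream encoder write past the real buffer.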
  5. 18 Dec, 2021 3 commits
  6. 17 Dec, 2021 4 commits
  7. 16 Dec, 2021 4 commits
    • btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() · 4989d4a0
      Shin'ichiro Kawasaki committed
      The function btrfs_scan_one_device() calls blkdev_get_by_path() and
      blkdev_put() to get and release its target block device. However, when
      btrfs_sb_log_location_bdev() fails, blkdev_put() is not called and the
      block device is left without cleanup. This triggered a failure of fstests
      generic/085. Fix the failure path of btrfs_sb_log_location_bdev() to
      call blkdev_put().
      
      Fixes: 12659251 ("btrfs: implement log-structured superblock for ZONED mode")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
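The bug class fixed here, a get without a matching put on one error path, is commonly avoided with a single cleanup label. The sketch below uses hypothetical stand-ins (`toy_bdev`, `toy_bdev_get()`, `toy_bdev_put()`, `toy_scan_one_device()`) and a reference counter to make the leak observable; it does not reproduce btrfs_scan_one_device().

```c
#include <errno.h>
#include <stdbool.h>

struct toy_bdev {
	int refs; /* outstanding references to the device */
};

static struct toy_bdev *toy_bdev_get(struct toy_bdev *bdev)
{
	bdev->refs++;
	return bdev;
}

static void toy_bdev_put(struct toy_bdev *bdev)
{
	bdev->refs--;
}

/*
 * Every exit after the get goes through out_put, so a failing
 * superblock lookup can no longer skip the release.
 */
static int toy_scan_one_device(struct toy_bdev *dev, bool sb_lookup_fails)
{
	struct toy_bdev *bdev = toy_bdev_get(dev);
	int ret = 0;

	if (sb_lookup_fails) {
		ret = -ENOENT;
		goto out_put;
	}

	/* ... read and inspect the superblock here ... */

out_put:
	toy_bdev_put(bdev);
	return ret;
}
```

Whether the simulated lookup succeeds or fails, the reference count returns to its starting value once the function exits.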
    • btrfs: fix warning when freeing leaf after subvolume creation failure · 212a58fd
      Filipe Manana committed
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the root item for the new subvolume into the root tree, we can
      trigger the following warning:
      
      [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.743344] Modules linked in:
      [78961.749440]  dm_snapshot dm_thin_pool (...)
      [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1
      [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.778398] Code: 17 00 48 85 (...)
      [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202
      [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000
      [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff
      [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000
      [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8
      [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000
      [78961.786620] FS:  00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000
      [78961.787717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0
      [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [78961.791684] Call Trace:
      [78961.792082]  <TASK>
      [78961.792359]  create_subvol+0x5d1/0x9a0 [btrfs]
      [78961.793054]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [78961.794009]  ? preempt_count_add+0x49/0xa0
      [78961.794705]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [78961.795712]  ? _copy_from_user+0x66/0xa0
      [78961.796382]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [78961.797392]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [78961.798172]  ? __slab_free+0x10a/0x360
      [78961.798820]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.799664]  ? lock_release+0x223/0x4a0
      [78961.800321]  ? lock_acquired+0x19f/0x420
      [78961.800992]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.801796]  ? trace_hardirqs_on+0x1b/0xe0
      [78961.802495]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [78961.803358]  ? kmem_cache_free+0x321/0x3c0
      [78961.804071]  ? __x64_sys_ioctl+0x83/0xb0
      [78961.804711]  __x64_sys_ioctl+0x83/0xb0
      [78961.805348]  do_syscall_64+0x3b/0xc0
      [78961.805969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [78961.806830] RIP: 0033:0x7f31284bc957
      [78961.807517] Code: 3c 1c 48 f7 d8 (...)
      
      This is because we are calling btrfs_free_tree_block() on an extent
      buffer that is dirty. Fix that by cleaning the extent buffer, with
      btrfs_clean_tree_block(), before freeing it.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      212a58fd
    • F
      btrfs: fix invalid delayed ref after subvolume creation failure · 7a163608
      Filipe Manana committed
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the new root's root item into the root tree, we are freeing the
      metadata extent we reserved for the new root to prevent a metadata
      extent leak, as we don't abort the transaction at that point (since
      there is nothing at that point that is irreversible).
      
      However we allocated the metadata extent for the new root which we are
      creating for the new subvolume, so its delayed reference refers to the
      ID of this new root. But when we free the metadata extent we pass the
      root of the subvolume where the new subvolume is located to
      btrfs_free_tree_block() - this is incorrect because this will generate
      a delayed reference that refers to the ID of the parent subvolume's root,
      and not to the ID of the new root.
      
      This results in a failure when running delayed references that leads to
      a transaction abort and a trace like the following:
      
      [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
      [3868.739857] Code: 68 0f 85 e6 fb ff (...)
      [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
      [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
      [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
      [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
      [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
      [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
      [3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
      [3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
      [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [3868.757803] Call Trace:
      [3868.758281]  <TASK>
      [3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
      [3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
      [3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
      [3868.762069]  ? lock_acquired+0x19f/0x420
      [3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
      [3868.763860]  ? _raw_spin_unlock+0x29/0x40
      [3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
      [3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
      [3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [3868.767669]  ? preempt_count_add+0x49/0xa0
      [3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [3868.769639]  ? _copy_from_user+0x66/0xa0
      [3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [3868.772364]  ? __slab_free+0x10a/0x360
      [3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.774121]  ? lock_release+0x223/0x4a0
      [3868.774863]  ? lock_acquired+0x19f/0x420
      [3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
      [3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [3868.778280]  ? kmem_cache_free+0x321/0x3c0
      [3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
      [3868.779718]  __x64_sys_ioctl+0x83/0xb0
      [3868.780387]  do_syscall_64+0x3b/0xc0
      [3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [3868.781953] RIP: 0033:0x7f281c59e957
      [3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
      [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
      [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
      [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
      [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
      [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
      [3868.793765]  </TASK>
      [3868.794037] irq event stamp: 0
      [3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [3868.799284] ---[ end trace be24c7002fe27747 ]---
      [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
      [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
      [3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
      [3868.802863]          extent refs 1 gen 1265 flags 2
      [3868.803447]          ref#0: tree block backref root 1610
      (...)
      [3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
      [3869.065421]          extent refs 1 gen 1268 flags 2
      [3869.066115]          ref#0: tree block backref root 1689
      (...)
      [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
      [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
      [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
      
      Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
      This requires changing the root argument of btrfs_free_tree_block() from
      struct btrfs_root * to a u64, since at this point during the subvolume
      creation we have not yet created the struct btrfs_root for the new
      subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
      else from a struct btrfs_root.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7a163608
    • J
      btrfs: check WRITE_ERR when trying to read an extent buffer · 651740a5
      Josef Bacik committed
      Filipe reported a hang when we have errors on btrfs.  This turned out to
      be a side-effect of my fix c2e39305 ("btrfs: clear extent buffer
      uptodate when we fail to write it") which made it so we clear
      EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out.
      
      Below is a paste of the analysis Filipe got from using drgn to debug
      the hang:
      
      """
      btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to
      a value while writeback of all pages has not yet completed:
         --> writeback for the first 3 pages finishes, we clear
             EXTENT_BUFFER_UPTODATE from eb on the first page when we get an
             error.
         --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the
             first 3 pages
         --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so
             it continues, it's able to lock the pages since we obviously don't
             hold the pages locked during writeback
         --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets
             eb->io_pages to 3, since only the first page does not have Uptodate
             bit set at this point
         --> writeback for the remaining page completes, we ended decrementing
             eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore
             never calling end_extent_buffer_writeback(), so
             EXTENT_BUFFER_WRITEBACK remains in the eb's flags
         --> of course, when the read bio completes, it doesn't and shouldn't
             call end_extent_buffer_writeback()
         --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of
             the eb finished writeback?  or maybe make the read pages code
             wait for writeback of all pages of the eb to complete before
             checking which pages need to be read, touch ->io_pages, submit
             read bio, etc
      
      writeback bit never cleared means we can hang when aborting a
      transaction, at:
      
          btrfs_cleanup_one_transaction()
             btrfs_destroy_marked_extents()
               wait_on_extent_buffer_writeback()
      """
      
      This is a problem because our writes are not synchronized with reads in
      any way.  We clear the UPTODATE flag and then we can easily come in and
      try to read the EB while we're still waiting on other bio's to
      complete.
      
      We have two options here. We could lock all the pages, and then check to
      see if eb->io_pages != 0 to know if we've already got an outstanding
      write on the eb.
      
      Or we can simply check to see if we have WRITE_ERR set on this extent
      buffer.  We set this bit _before_ we clear UPTODATE, so if the read gets
      triggered because we aren't UPTODATE because of a write error we're
      guaranteed to have WRITE_ERR set, and in this case we can simply return
      -EIO.  This will fix the reported hang.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      651740a5
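
The second option from the commit above relies on ordering: WRITE_ERR is set before UPTODATE is cleared, so a reader that finds the buffer not up to date can tell a failed write apart from a buffer that was never read. The following is a single-threaded sketch of that check. The flag names `EB_UPTODATE` and `EB_WRITE_ERR`, `struct eb`, and both functions are simplified assumptions for illustration; the kernel uses atomic bit operations on the real extent buffer flags.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits, modeled on the names used in the commit message. */
enum {
	EB_UPTODATE  = 1u << 0,
	EB_WRITE_ERR = 1u << 1,
};

struct eb {
	unsigned int flags;
	int reads_submitted;
};

/* Write-failure path: set WRITE_ERR first, then clear UPTODATE. */
static void eb_write_failed(struct eb *eb)
{
	eb->flags |= EB_WRITE_ERR;
	eb->flags &= ~EB_UPTODATE;
}

/* Read path: return -EIO instead of racing with pending writeback. */
static int eb_read(struct eb *eb)
{
	if (eb->flags & EB_UPTODATE)
		return 0;
	if (eb->flags & EB_WRITE_ERR)
		return -EIO;
	eb->reads_submitted++;
	eb->flags |= EB_UPTODATE;
	return 0;
}
```

Because the error bit is visible before the up-to-date bit disappears, the read path never submits a read against a buffer whose writeback accounting is still in flight.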
  8. 14 Dec, 2021 5 commits
    • F
      btrfs: fix missing last dir item offset update when logging directory · 1b2e5e5c
      Filipe Manana committed
      When logging a directory, once we finish processing a leaf that is full
      of dir items, if we find the next leaf was not modified in the current
      transaction, we grab the first key of that next leaf and log it so as
      to mark the end of a key range boundary.
      
      However we did not update the value of ctx->last_dir_item_offset, which
      tracks the offset of the last logged key. This can result in subsequent
      logging of the same directory in the current transaction to not realize
      that key was already logged, and then add it to the middle of a batch
      that starts with a lower key, resulting later in a leaf with one key
      that is duplicated and at non-consecutive slots. When that happens we get
      an error later when writing out the leaf, reporting that there is a pair
      of keys in wrong order. The report is something like the following:
      
      Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
      root=18446744073709551610 block=118444032 slot=21, bad key order, prev
      (704687 84 4146773349) current (704687 84 1063561078)
      Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
      91449 total ptrs 39 free space 546 owner 18446744073709551610
      Dec 13 21:44:50 kernel:         item 0 key (704687 1 0) itemoff 3835
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35532 size
      1026 mode 40755
      Dec 13 21:44:50 kernel:         item 1 key (704687 12 704685) itemoff
      3822 itemsize 13
      Dec 13 21:44:50 kernel:         item 2 key (704687 24 3817753667)
      itemoff 3736 itemsize 86
      Dec 13 21:44:50 kernel:         item 3 key (704687 60 0) itemoff 3728 itemsize 8
      Dec 13 21:44:50 kernel:         item 4 key (704687 72 0) itemoff 3720 itemsize 8
      Dec 13 21:44:50 kernel:         item 5 key (704687 84 140445108)
      itemoff 3666 itemsize 54
      Dec 13 21:44:50 kernel:                 dir oid 704793 type 1
      Dec 13 21:44:50 kernel:         item 6 key (704687 84 298800632)
      itemoff 3599 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707849 type 2
      Dec 13 21:44:50 kernel:         item 7 key (704687 84 476147658)
      itemoff 3532 itemsize 67
      Dec 13 21:44:50 kernel:                 dir oid 707901 type 2
      Dec 13 21:44:50 kernel:         item 8 key (704687 84 633818382)
      itemoff 3471 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704694 type 2
      Dec 13 21:44:50 kernel:         item 9 key (704687 84 654256665)
      itemoff 3403 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707841 type 1
      Dec 13 21:44:50 kernel:         item 10 key (704687 84 995843418)
      itemoff 3331 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167736 type 1
      Dec 13 21:44:50 kernel:         item 11 key (704687 84 1063561078)
      itemoff 3278 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 12 key (704687 84 1101156010)
      itemoff 3225 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704696 type 1
      Dec 13 21:44:50 kernel:         item 13 key (704687 84 2521936574)
      itemoff 3173 itemsize 52
      Dec 13 21:44:50 kernel:                 dir oid 704704 type 2
      Dec 13 21:44:50 kernel:         item 14 key (704687 84 2618368432)
      itemoff 3112 itemsize 61
      Dec 13 21:44:50 kernel:                 dir oid 704738 type 1
      Dec 13 21:44:50 kernel:         item 15 key (704687 84 2676316190)
      itemoff 3046 itemsize 66
      Dec 13 21:44:50 kernel:                 dir oid 2167729 type 1
      Dec 13 21:44:50 kernel:         item 16 key (704687 84 3319104192)
      itemoff 2986 itemsize 60
      Dec 13 21:44:50 kernel:                 dir oid 704745 type 2
      Dec 13 21:44:50 kernel:         item 17 key (704687 84 3908046265)
      itemoff 2929 itemsize 57
      Dec 13 21:44:50 kernel:                 dir oid 2167734 type 1
      Dec 13 21:44:50 kernel:         item 18 key (704687 84 3945713089)
      itemoff 2857 itemsize 72
      Dec 13 21:44:50 kernel:                 dir oid 2167730 type 1
      Dec 13 21:44:50 kernel:         item 19 key (704687 84 4077169308)
      itemoff 2795 itemsize 62
      Dec 13 21:44:50 kernel:                 dir oid 704688 type 1
      Dec 13 21:44:50 kernel:         item 20 key (704687 84 4146773349)
      itemoff 2727 itemsize 68
      Dec 13 21:44:50 kernel:                 dir oid 707892 type 1
      Dec 13 21:44:50 kernel:         item 21 key (704687 84 1063561078)
      itemoff 2674 itemsize 53
      Dec 13 21:44:50 kernel:                 dir oid 704799 type 2
      Dec 13 21:44:50 kernel:         item 22 key (704687 96 2) itemoff 2612
      itemsize 62
      Dec 13 21:44:50 kernel:         item 23 key (704687 96 6) itemoff 2551
      itemsize 61
      Dec 13 21:44:50 kernel:         item 24 key (704687 96 7) itemoff 2498
      itemsize 53
      Dec 13 21:44:50 kernel:         item 25 key (704687 96 12) itemoff
      2446 itemsize 52
      Dec 13 21:44:50 kernel:         item 26 key (704687 96 14) itemoff
      2385 itemsize 61
      Dec 13 21:44:50 kernel:         item 27 key (704687 96 18) itemoff
      2325 itemsize 60
      Dec 13 21:44:50 kernel:         item 28 key (704687 96 24) itemoff
      2271 itemsize 54
      Dec 13 21:44:50 kernel:         item 29 key (704687 96 28) itemoff
      2218 itemsize 53
      Dec 13 21:44:50 kernel:         item 30 key (704687 96 62) itemoff
      2150 itemsize 68
      Dec 13 21:44:50 kernel:         item 31 key (704687 96 66) itemoff
      2083 itemsize 67
      Dec 13 21:44:50 kernel:         item 32 key (704687 96 75) itemoff
      2015 itemsize 68
      Dec 13 21:44:50 kernel:         item 33 key (704687 96 79) itemoff
      1948 itemsize 67
      Dec 13 21:44:50 kernel:         item 34 key (704687 96 82) itemoff
      1882 itemsize 66
      Dec 13 21:44:50 kernel:         item 35 key (704687 96 83) itemoff
      1810 itemsize 72
      Dec 13 21:44:50 kernel:         item 36 key (704687 96 85) itemoff
      1753 itemsize 57
      Dec 13 21:44:50 kernel:         item 37 key (704687 96 87) itemoff
      1681 itemsize 72
      Dec 13 21:44:50 kernel:         item 38 key (704694 1 0) itemoff 1521
      itemsize 160
      Dec 13 21:44:50 kernel:                 inode generation 35534 size 30
      mode 40755
      Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
      write time tree block corruption detected
      
      So fix that by adding the missing update of ctx->last_dir_item_offset with
      the offset of the boundary key.
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
      Fixes: dc287224 ("btrfs: keep track of the last logged keys when logging a directory")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1b2e5e5c
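
The effect of the missing update in the commit above can be illustrated with a deliberately simplified model: a context remembers the offset of the last logged key, and a boundary key that is logged without updating that offset can be logged a second time later. Everything here (`struct log_ctx` as a flat array, `log_key()`, `log_boundary_key()`, `has_duplicate()`) is a hypothetical sketch and does not reproduce the real batching logic in the btrfs tree log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAP 16

struct log_ctx {
	uint64_t last_dir_item_offset; /* offset of the last key logged so far */
	uint64_t logged[LOG_CAP];
	size_t nr_logged;
};

/* Log a key only if its offset is beyond everything logged so far. */
static void log_key(struct log_ctx *ctx, uint64_t offset)
{
	if (ctx->nr_logged > 0 && offset <= ctx->last_dir_item_offset)
		return; /* already logged in this transaction */
	assert(ctx->nr_logged < LOG_CAP);
	ctx->logged[ctx->nr_logged++] = offset;
	ctx->last_dir_item_offset = offset;
}

/* Log the first key of the next, unmodified leaf as a range boundary. */
static void log_boundary_key(struct log_ctx *ctx, uint64_t offset, int update_ctx)
{
	assert(ctx->nr_logged < LOG_CAP);
	ctx->logged[ctx->nr_logged++] = offset;
	if (update_ctx)
		ctx->last_dir_item_offset = offset; /* the update the fix adds */
}

/* Returns 1 if any offset was logged more than once. */
static int has_duplicate(const struct log_ctx *ctx)
{
	for (size_t i = 0; i < ctx->nr_logged; i++)
		for (size_t j = i + 1; j < ctx->nr_logged; j++)
			if (ctx->logged[i] == ctx->logged[j])
				return 1;
	return 0;
}
```

With `update_ctx` set to 0, logging offsets 10 and 20, then a boundary key at 30, then 30 again leaves the key recorded twice; with `update_ctx` set to 1 the second attempt is skipped.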
    • F
      btrfs: fix double free of anon_dev after failure to create subvolume · 33fab972
      Filipe Manana committed
      When creating a subvolume, at create_subvol(), we allocate an anonymous
      device and later call btrfs_get_new_fs_root(), which in turn just calls
      btrfs_get_root_ref(). There we call btrfs_init_fs_root() which assigns
      the anonymous device to the root, but if after that call there's an error,
      when we jump to 'fail' label, we call btrfs_put_root(), which frees the
      anonymous device and then returns an error that is propagated back to
      create_subvol(). Then create_subvol() frees the anonymous device again.
      
      When this happens, if the anonymous device was not reallocated after
      the first time it was freed with btrfs_put_root(), we get a kernel
      message like the following:
      
        (...)
        [13950.282466] BTRFS: error (device dm-0) in create_subvol:663: errno=-5 IO failure
        [13950.283027] ida_free called for id=65 which is not allocated.
        [13950.285974] BTRFS info (device dm-0): forced readonly
        (...)
      
      If the anonymous device gets reallocated by another btrfs filesystem
      or any other kernel subsystem, then bad things can happen.
      
      So fix this by setting the root's anonymous device to 0 at
      btrfs_get_root_ref(), before we call btrfs_put_root(), if an error
      happened.
      
      Fixes: 2dfb1e43 ("btrfs: preallocate anon block device at first phase of snapshot creation")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      33fab972
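
The ownership rule behind the commit above can be sketched in userspace: when initialization fails, the callee detaches the caller-allocated ID from the object before releasing it, so only the caller frees the ID. The allocator (`id_alloc()`, `id_free()`), `struct root`, `root_put()`, `root_init()` and `create()` are hypothetical names for illustration, and the `apply_fix` switch exists only to compare both behaviours.

```c
#include <assert.h>

/* Hypothetical ID allocator; counts frees of IDs that were not allocated. */
#define MAX_IDS 8
static int allocated[MAX_IDS];
static int bad_frees;

static int id_alloc(void)
{
	for (int i = 1; i < MAX_IDS; i++) {
		if (!allocated[i]) {
			allocated[i] = 1;
			return i;
		}
	}
	return 0; /* 0 means "no ID" */
}

static void id_free(int id)
{
	if (id == 0)
		return;
	if (!allocated[id])
		bad_frees++; /* analogous to the "ida_free called for id=..." message */
	allocated[id] = 0;
}

struct root {
	int anon_dev;
};

/* Releasing a root frees whatever ID it still owns. */
static void root_put(struct root *r)
{
	id_free(r->anon_dev);
	r->anon_dev = 0;
}

/*
 * Takes a caller-allocated ID. On failure the caller keeps ownership,
 * so the ID is detached from the root before the root is released.
 */
static int root_init(struct root *r, int anon_dev, int fail, int apply_fix)
{
	r->anon_dev = anon_dev;
	if (fail) {
		if (apply_fix)
			r->anon_dev = 0;
		root_put(r);
		return -1;
	}
	return 0;
}

/* Caller side: frees the ID itself when initialization fails. */
static int create(int fail, int apply_fix)
{
	struct root r = { 0 };
	int id = id_alloc();
	int ret = root_init(&r, id, fail, apply_fix);

	if (ret)
		id_free(id);
	else
		root_put(&r);
	return ret;
}
```

Calling `create(1, 0)` frees the same ID twice and increments `bad_frees`; `create(1, 1)` frees it exactly once.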
    • J
      btrfs: fix memory leak in __add_inode_ref() · f35838a6
      Jianglei Nie committed
      Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
      but when the function returns in line 1184 (#4) victim_name allocated
      by line 1169 (#3) is not freed, which will lead to a memory leak.
      A similar snippet in this function allocates a memory chunk for
      victim_name in line 1104 (#1) and releases that memory in line 1116
      (#2).
      
      We should kfree() victim_name when the return value of backref_in_log()
      is less than zero and before the function returns in line 1184 (#4).
      
      1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
      1058 				  struct btrfs_root *root,
      1059 				  struct btrfs_path *path,
      1060 				  struct btrfs_root *log_root,
      1061 				  struct btrfs_inode *dir,
      1062 				  struct btrfs_inode *inode,
      1063 				  u64 inode_objectid, u64 parent_objectid,
      1064 				  u64 ref_index, char *name, int namelen,
      1065 				  int *search_done)
      1066 {
      
      1104 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
      	// #1: kmalloc (victim_name-1)
      1105 	if (!victim_name)
      1106 		return -ENOMEM;
      
      1112	ret = backref_in_log(log_root, &search_key,
      1113			parent_objectid, victim_name,
      1114			victim_name_len);
      1115	if (ret < 0) {
      1116		kfree(victim_name); // #2: kfree (victim_name-1)
      1117		return ret;
      1118	} else if (!ret) {
      
      1169 	victim_name = kmalloc(victim_name_len, GFP_NOFS);
      	// #3: kmalloc (victim_name-2)
      1170 	if (!victim_name)
      1171 		return -ENOMEM;
      
      1180 	ret = backref_in_log(log_root, &search_key,
      1181 			parent_objectid, victim_name,
      1182 			victim_name_len);
      1183 	if (ret < 0) {
      1184 		return ret; // #4: missing kfree (victim_name-2)
      1185 	} else if (!ret) {
      
      1241 	return 0;
      1242 }
      
      Fixes: d3316c82 ("btrfs: Properly handle backref_in_log retval")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Jianglei Nie <niejianglei2021@163.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f35838a6
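
The shape of the fix in the commit above, freeing the buffer before returning on the error branch, can be shown with a small userspace analogue. `name_alloc()`, `name_free()`, `lookup()` and `process_ref()` are hypothetical helpers standing in for kmalloc(), kfree(), backref_in_log() and the second allocation site; the `live_allocs` counter exists only so the leak can be observed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs; /* outstanding allocations, for the demo only */

static char *name_alloc(size_t len)
{
	char *p = malloc(len);

	if (p)
		live_allocs++;
	return p;
}

static void name_free(char *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* Hypothetical lookup: <0 is an error, 0 is "not found", >0 is "found". */
static int lookup(const char *name, int forced_result)
{
	(void)name;
	return forced_result;
}

static int process_ref(const char *src, size_t len, int forced_result)
{
	char *victim_name = name_alloc(len);
	int ret;

	if (!victim_name)
		return -ENOMEM;
	memcpy(victim_name, src, len);

	ret = lookup(victim_name, forced_result);
	if (ret < 0) {
		name_free(victim_name); /* the free that was missing at #4 */
		return ret;
	}

	/* ... normal processing ... */
	name_free(victim_name);
	return 0;
}
```

After either outcome `live_allocs` returns to zero, which is the property the original error branch violated.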
    • L
      fget: clarify and improve __fget_files() implementation · e386dfc5
      Linus Torvalds committed
      Commit 054aa8d4 ("fget: check that the fd still exists after getting
      a ref to it") fixed a race with getting a reference to a file just as it
      was being closed.  It was a fairly minimal patch, and I didn't think
      re-checking the file pointer lookup would be a measurable overhead,
      since it was all right there and cached.
      
      But I was wrong, as pointed out by the kernel test robot.
      
      The 'poll2' case of the will-it-scale.per_thread_ops benchmark regressed
      quite noticeably.  Admittedly it seems to be a very artificial test:
      doing "poll()" system calls on regular files in a very tight loop in
      multiple threads.
      
      That means that basically all the time is spent just looking up file
      descriptors without ever doing anything useful with them (not that doing
      'poll()' on a regular file is useful to begin with).  And as a result it
      shows the extra "re-check fd" cost as a sore thumb.
      
      Happily, the regression is fixable by just writing the code to look up
      the fd to be better and clearer.  There's still a cost to verify the
      file pointer, but now it's basically in the noise even for that
      benchmark that does nothing else - and the code is more understandable
      and has better comments too.
      
      [ Side note: this patch is also a classic case of one that looks very
        messy with the default greedy Myers diff - it's much more legible with
        either the patience or histogram diff algorithm ]
      
      Link: https://lore.kernel.org/lkml/20211210053743.GA36420@xsang-OptiPlex-9020/
      Link: https://lore.kernel.org/lkml/20211213083154.GA20853@linux.intel.com/
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Tested-by: Carel Si <beibei.si@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e386dfc5
    • J
      io-wq: drop wqe lock before creating new worker · d800c65c
      Jens Axboe committed
      We have two io-wq creation paths:
      
      - On queue enqueue
      - When a worker goes to sleep
      
      The latter invokes worker creation with the wqe->lock held, but that can
      run into problems if we end up exiting and need to cancel the queued work.
      syzbot caught this:
      
      ============================================
      WARNING: possible recursive locking detected
      5.16.0-rc4-syzkaller #0 Not tainted
      --------------------------------------------
      iou-wrk-6468/6471 is trying to acquire lock:
      ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187
      
      but task is already holding lock:
      ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&wqe->lock);
        lock(&wqe->lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      1 lock held by iou-wrk-6468/6471:
       #0: ffff88801aa98018 (&wqe->lock){+.+.}-{2:2}, at: io_wq_worker_sleeping+0xb6/0x140 fs/io-wq.c:700
      
      stack backtrace:
      CPU: 1 PID: 6471 Comm: iou-wrk-6468 Not tainted 5.16.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x1dc/0x2d8 lib/dump_stack.c:106
       print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
       check_deadlock kernel/locking/lockdep.c:2999 [inline]
       validate_chain+0x5984/0x8240 kernel/locking/lockdep.c:3788
       __lock_acquire+0x1382/0x2b00 kernel/locking/lockdep.c:5027
       lock_acquire+0x19f/0x4d0 kernel/locking/lockdep.c:5637
       __raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
       _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:154
       io_worker_cancel_cb+0xb7/0x210 fs/io-wq.c:187
       io_wq_cancel_tw_create fs/io-wq.c:1220 [inline]
       io_queue_worker_create+0x3cf/0x4c0 fs/io-wq.c:372
       io_wq_worker_sleeping+0xbe/0x140 fs/io-wq.c:701
       sched_submit_work kernel/sched/core.c:6295 [inline]
       schedule+0x67/0x1f0 kernel/sched/core.c:6323
       schedule_timeout+0xac/0x300 kernel/time/timer.c:1857
       wait_woken+0xca/0x1b0 kernel/sched/wait.c:460
       unix_msg_wait_data net/unix/unix_bpf.c:32 [inline]
       unix_bpf_recvmsg+0x7f9/0xe20 net/unix/unix_bpf.c:77
       unix_stream_recvmsg+0x214/0x2c0 net/unix/af_unix.c:2832
       sock_recvmsg_nosec net/socket.c:944 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       sock_read_iter+0x3a7/0x4d0 net/socket.c:1035
       call_read_iter include/linux/fs.h:2156 [inline]
       io_iter_do_read fs/io_uring.c:3501 [inline]
       io_read fs/io_uring.c:3558 [inline]
       io_issue_sqe+0x144c/0x9590 fs/io_uring.c:6671
       io_wq_submit_work+0x2d8/0x790 fs/io_uring.c:6836
       io_worker_handle_work+0x808/0xdd0 fs/io-wq.c:574
       io_wqe_worker+0x395/0x870 fs/io-wq.c:630
       ret_from_fork+0x1f/0x30
      
      We can safely drop the lock before doing work creation, making the two
      contexts the same in that regard.
      
      Reported-by: syzbot+b18b8be69df33a3918e9@syzkaller.appspotmail.com
      Fixes: 71a85387 ("io-wq: check for wq exit after adding new worker task_work")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d800c65c
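
The recursive-locking report in the commit above comes from calling, with a lock held, a function that can end up taking the same lock. The sketch below models this with a toy non-recursive lock that counts recursive acquisitions instead of deadlocking. `struct lock`, `struct wqe` as defined here, `cancel_pending()`, `create_worker()` and `worker_sleeping()` are hypothetical simplifications for illustration, not the io-wq code.

```c
#include <assert.h>

/* Hypothetical non-recursive lock that records recursive acquisition. */
struct lock {
	int held;
	int recursion_bugs;
};

static void lock_acquire(struct lock *l)
{
	if (l->held)
		l->recursion_bugs++; /* a real spinlock would deadlock here */
	l->held = 1;
}

static void lock_release(struct lock *l)
{
	l->held = 0;
}

struct wqe {
	struct lock lock;
	int exiting;
	int workers_created;
};

/* Cancellation takes the lock itself, like the cancel callback in the trace. */
static void cancel_pending(struct wqe *wqe)
{
	lock_acquire(&wqe->lock);
	/* ... drop pending creation work ... */
	lock_release(&wqe->lock);
}

/* Worker creation may fall into the cancel path when the queue is exiting. */
static void create_worker(struct wqe *wqe)
{
	if (wqe->exiting)
		cancel_pending(wqe);
	else
		wqe->workers_created++;
}

/* Sleeping path: do bookkeeping under the lock, create only after dropping it. */
static void worker_sleeping(struct wqe *wqe, int drop_lock_first)
{
	lock_acquire(&wqe->lock);
	/* ... bookkeeping that needs the lock ... */
	if (drop_lock_first) {
		lock_release(&wqe->lock);
		create_worker(wqe);
	} else {
		create_worker(wqe);
		lock_release(&wqe->lock);
	}
}
```

With `drop_lock_first` set to 0 and the queue exiting, the cancel path re-acquires the held lock; with it set to 1 both creation contexts run without the lock, matching the commit's description.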
  9. 11 Dec, 2021 1 commit
    • J
      io-wq: check for wq exit after adding new worker task_work · 71a85387
      Jens Axboe committed
      We check IO_WQ_BIT_EXIT before attempting to create a new worker, and
      wq exit cancels pending work if we have any. But it's possible to have
      a race between the two, where creation checks exit finding it not set,
      but we're in the process of exiting. The exit side will cancel pending
      creation task_work, but there's a gap where we add task_work after we've
      canceled existing creations at exit time.
      
      Fix this by checking the EXIT bit post adding the creation task_work.
      If it's set, run the same cancelation that exit does.
      
      Reported-and-tested-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com
      Reviewed-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      71a85387
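
The window described in the commit above can be replayed sequentially: the creation side checks the exit bit, the exit side then sets the bit and cancels what is queued, and only afterwards does the creation side add its work. The sketch below is a single-threaded model with hypothetical names (`struct wq`, `wq_exit()`, `create_check()`, `create_add()`); the real code uses atomic bit operations and task_work, which are not modeled here.

```c
#include <assert.h>

struct wq {
	int exit_bit;
	int pending; /* queued creation work items */
};

/* Exit side: set the bit, then cancel whatever is queued at that moment. */
static void wq_exit(struct wq *wq)
{
	wq->exit_bit = 1;
	wq->pending = 0;
}

/*
 * Creation side, split into two steps so a caller can interleave wq_exit()
 * between the first check and the add, which is the gap being closed.
 */
static int create_check(const struct wq *wq)
{
	return !wq->exit_bit;
}

static void create_add(struct wq *wq, int recheck_after_add)
{
	wq->pending++;
	if (recheck_after_add && wq->exit_bit)
		wq->pending = 0; /* run the same cancellation that exit does */
}
```

Without the re-check, the interleaving check, exit, add leaves one item queued that nobody will cancel; with the re-check the late item is cancelled by the creator itself.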