1. 03 1月, 2022 23 次提交
    • N
      btrfs: zoned: cache reported zone during mount · 16beac87
      Naohiro Aota 提交于
      When mounting a device, we are reporting the zones twice: once for
      checking the zone attributes in btrfs_get_dev_zone_info and once for
      loading block groups' zone info in
      btrfs_load_block_group_zone_info(). With a lot of block groups, that
      leads to a lot of REPORT ZONE commands and slows down the mount
      process.
      
      This patch introduces a zone info cache in struct
      btrfs_zoned_device_info. The cache is populated while in
      btrfs_get_dev_zone_info() and used for
      btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE
      commands. The zone cache is then released after loading the block
      groups, as it will not be much effective during the run time.
      
      Benchmark: Mount an HDD with 57,007 block groups
      Before patch: 171.368 seconds
      After patch: 64.064 seconds
      
      While it still takes a minute due to the slowness of loading all the
      block groups, the patch reduces the mount time by 1/3.
      
      Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      16beac87
    • S
      btrfs: remove unused parameter fs_devices from btrfs_init_workqueues · d21deec5
      Su Yue 提交于
      Since commit ba8a9d07 ("Btrfs: delete the entire async bio submission
      framework") removed submit workqueues, the parameter fs_devices is not used
      anymore.
      
      Remove it, no functional changes.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NSu Yue <l@damenly.su>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d21deec5
    • F
      btrfs: reduce the scope of the tree log mutex during transaction commit · dfba78dc
      Filipe Manana 提交于
      In the transaction commit path we are acquiring the tree log mutex too
      early and we have a stale comment because:
      
      1) It mentions a function named btrfs_commit_tree_roots(), which does not
         exists anymore, it was the old name of commit_cowonly_roots(), renamed
         a very long time ago by commit 5d4f98a2 ("Btrfs: Mixed back
         reference  (FORWARD ROLLING FORMAT CHANGE)"));
      
      2) It mentions that we need to acquire the tree log mutex at that point
         to ensure we have no running log writers. That is not correct anymore,
         for many years at least, since we are guaranteed that we do not have
         any log writers at that point simply because we have set the state of
         the transaction to TRANS_STATE_COMMIT_DOING and have waited for all
         writers to complete - meaning no one can log until we change the state
         of the transaction to TRANS_STATE_UNBLOCKED. Any attempts to join the
         transaction or start a new one will block until we do that state
         transition;
      
      3) The comment mentions a "trans mutex" which doesn't exists since 2011,
         commit a4abeea4 ("Btrfs: kill trans_mutex") removed it;
      
      4) The current use of the tree log mutex is to ensure proper serialization
         of super block writes - if someone started a new transaction and uses it
         for logging, it will wait for the previous transaction to write its
         super block before writing the super block when attempting to sync the
         log.
      
      So acquire the tree log mutex only when it's absolutely needed, before
      setting the transaction state to TRANS_STATE_UNBLOCKED, fix and move the
      stale comment, add some assertions and new comments where appropriate.
      
      Also, this has no effect on concurrency or performance, since the new
      start of the critical section is still when the transaction is in the
      state TRANS_STATE_COMMIT_DOING.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dfba78dc
    • A
      btrfs: consolidate device_list_mutex in prepare_sprout to its parent · 849eae5e
      Anand Jain 提交于
      btrfs_prepare_sprout() splices seed devices into its own struct fs_devices,
      so that its parent function btrfs_init_new_device() can add the new sprout
      device to fs_info->fs_devices.
      
      Both btrfs_prepare_sprout() and btrfs_init_new_device() need
      device_list_mutex. But they are holding it separately, thus create a
      small race window. Close it and hold device_list_mutex across both
      functions btrfs_init_new_device() and btrfs_prepare_sprout().
      
      Split btrfs_prepare_sprout() into btrfs_init_sprout() and
      btrfs_setup_sprout(). This split is essential because device_list_mutex
      must not be held for allocations in btrfs_init_sprout() but must be held
      for btrfs_setup_sprout(). So now a common device_list_mutex can be used
      between btrfs_init_new_device() and btrfs_setup_sprout().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      849eae5e
    • A
      btrfs: switch seeding_dev in init_new_device to bool · fd880809
      Anand Jain 提交于
      Declare int seeding_dev as a bool. Also, move its declaration a line
      below to adjust packing.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd880809
    • O
      btrfs: send: remove unused type parameter to iterate_inode_ref_t · b1dea4e7
      Omar Sandoval 提交于
      Again, I don't think this was ever used since iterate_dir_item() is only
      used for xattrs. No functional change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b1dea4e7
    • O
      btrfs: send: remove unused found_type parameter to lookup_dir_item_inode() · eab67c06
      Omar Sandoval 提交于
      As far as I can tell, this was never used. No functional change.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eab67c06
    • J
      btrfs: rename btrfs_item_end_nr to btrfs_item_data_end · dc2e724e
      Josef Bacik 提交于
      The name btrfs_item_end_nr() is a bit of a misnomer, as it's actually
      the offset of the end of the data the item points to.  In fact all of
      the helpers that we use btrfs_item_end_nr() use data in their name, like
      BTRFS_LEAF_DATA_SIZE() and leaf_data().  Rename to btrfs_item_data_end()
      to make it clear what this helper is giving us.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dc2e724e
    • J
      btrfs: remove the btrfs_item_end() helper · 5a08663d
      Josef Bacik 提交于
      We're only using btrfs_item_end() from btrfs_item_end_nr(), so this can
      be collapsed.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5a08663d
    • J
      btrfs: drop the _nr from the item helpers · 3212fa14
      Josef Bacik 提交于
      Now that all call sites are using the slot number to modify item values,
      rename the SETGET helpers to raw_item_*(), and then rework the _nr()
      helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
      rename all of the callers to the new helpers.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3212fa14
    • J
      btrfs: introduce item_nr token variant helpers · 74794207
      Josef Bacik 提交于
      The last remaining place where we have the pattern of
      
      	item = btrfs_item_nr(slot)
      	<do something with the item>
      
      are the token helpers.  Handle this by introducing token helpers that
      will do the btrfs_item_nr() work inside of the helper itself, and then
      convert all users of the btrfs_item token helpers to the new _nr()
      variants.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74794207
    • J
      btrfs: make btrfs_file_extent_inline_item_len take a slot · 437bd07e
      Josef Bacik 提交于
      Instead of getting the btrfs_item for this, simply pass in the slot of
      the item and then use the btrfs_item_size_nr() helper inside of
      btrfs_file_extent_inline_item_len().
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      437bd07e
    • J
      btrfs: add btrfs_set_item_*_nr() helpers · c91666b1
      Josef Bacik 提交于
      We have the pattern of
      
      	item = btrfs_item_nr(slot);
      	btrfs_set_item_*(leaf, item);
      
      in a bunch of places in our code.  Fix this by adding
      btrfs_set_item_*_nr() helpers which will do the appropriate work, and
      replace those calls with
      
      	btrfs_set_item_*_nr(leaf, slot);
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c91666b1
    • J
      btrfs: use btrfs_item_size_nr/btrfs_item_offset_nr everywhere · 227f3cd0
      Josef Bacik 提交于
      We have this pattern in a lot of places
      
      	item = btrfs_item_nr(slot);
      	btrfs_item_size(leaf, item);
      
      when we could simply use
      
      	btrfs_item_size(leaf, slot);
      
      Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
      _nr variation of the helpers.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      227f3cd0
    • F
      btrfs: remove no longer needed logic for replaying directory deletes · ccae4a19
      Filipe Manana 提交于
      Now that we log only dir index keys when logging a directory, we no longer
      need to deal with dir item keys in the log replay code for replaying
      directory deletes. This is also true for the case when we replay a log
      tree created by a kernel that still logs dir items.
      
      So remove the remaining code of the replay of directory deletes algorithm
      that deals with dir item keys.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ccae4a19
    • F
      btrfs: only copy dir index keys when logging a directory · 339d0354
      Filipe Manana 提交于
      Currently, when logging a directory, we copy both dir items and dir index
      items from the fs/subvolume tree to the log tree. Both items have exactly
      the same data (same struct btrfs_dir_item), the difference lies in the key
      values, where a dir index key contains the index number of a directory
      entry while the dir item key does not, as it's used for doing fast lookups
      of an entry by name, while the former is used for sorting entries when
      listing a directory.
      
      We can exploit that and log only the dir index items, since they contain
      all the information needed to correctly add, replace and delete directory
      entries when replaying a log tree. Logging only the dir index items is
      also backward and forward compatible: an unpatched kernel (without this
      change) can correctly replay a log tree generated by a patched kernel
      (with this patch), and a patched kernel can correctly replay a log tree
      generated by an unpatched kernel.
      
      The backward compatibility is ensured because:
      
      1) For inserting a new dentry: a dentry is only inserted when we find a
         new dir index key - we can only insert if we know the dir index offset,
         which is encoded in the dir index key's offset;
      
      2) For deleting dentries: during log replay, before adding or replacing
         dentries, we first replay dentry deletions. Whenever we find a dir item
         key or a dir index key in the subvolume/fs tree that is not logged in
         a range for which the log tree is authoritative, we do the unlink of
         the dentry, which removes both the existing dir item key and the dir
         index key. Therefore logging just dir index keys is enough to ensure
         dentry deletions are correctly replayed;
      
      3) For dentry replacements: they work when we log only dir index keys
         and this is mostly due to a combination of 1) and 2). If we replace a
         dentry with name "foobar" to point from inode A to inode B, then we
         know the dir index key for the new dentry is different from the old
         one, as it has an index number (key offset) larger than the old one.
         This results in replaying a deletion, through replay_dir_deletes(),
         that causes the old dentry to be removed, both the dir item key and
         the dir index key, as mentioned at 2). Then when processing the new
         dir index key, we add the new dentry, adding both a new dir item key
         and a new index key pointing to inode B, as stated in 1).
      
      The forward compatibility, the ability for a patched kernel to replay a
      log created by an older, unpatched kernel, comes from the changes required
      for making sure we are able to replay a log that only contains dir index
      keys - we simply ignore every dir item key we find.
      
      So modify directory logging to log only dir index items, and modify the
      log replay process to ignore dir item keys, from log trees created by an
      unpatched kernel, and process only with dir index keys. This reduces the
      amount of logged metadata by about half, and therefore the time spent
      logging or fsyncing large directories (less CPU time and less IO).
      
      The following test script was used to measure this change:
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         NUM_NEW_FILES=1000000
         NUM_FILE_DELETES=10000
      
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         mkdir $MNT/testdir
      
         for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
                 echo -n > $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
         # sync to force transaction commit and wipeout the log.
         sync
      
         del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
         for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
                 rm -f $MNT/testdir/file_$i
         done
      
         start=$(date +%s%N)
         xfs_io -c "fsync" $MNT/testdir
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
         echo
      
         umount $MNT
      
      The tests were run on a physical machine, with a non-debug kernel (Debian's
      default kernel config), for different values of $NUM_NEW_FILES and
      $NUM_FILE_DELETES, and the results were the following:
      
      ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 8412 ms after adding 1000000 files
      dir fsync took 500 ms after deleting 10000 files
      
      ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 **
      
      dir fsync took 4252 ms after adding 1000000 files   (-49.5%)
      dir fsync took 269 ms after deleting 10000 files    (-46.2%)
      
      ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 745 ms after adding 100000 files
      dir fsync took 59 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 404 ms after adding 100000 files   (-45.8%)
      dir fsync took 31 ms after deleting 1000 files    (-47.5%)
      
      ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 67 ms after adding 10000 files
      dir fsync took 9 ms after deleting 1000 files
      
      ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 **
      
      dir fsync took 36 ms after adding 10000 files   (-46.3%)
      dir fsync took 5 ms after deleting 1000 files   (-44.4%)
      
      ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 9 ms after adding 1000 files
      dir fsync took 4 ms after deleting 100 files
      
      ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 **
      
      dir fsync took 7 ms after adding 1000 files     (-22.2%)
      dir fsync took 3 ms after deleting 100 files    (-25.0%)
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      339d0354
    • N
      btrfs: remove spurious unlock/lock of unused_bgs_lock · 17130a65
      Nikolay Borisov 提交于
      Since both unused block groups and reclaim bgs lists are protected by
      unused_bgs_lock then free them in the same critical section without
      doing an extra unlock/lock pair.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17130a65
    • F
      btrfs: fix deadlock between quota enable and other quota operations · 232796df
      Filipe Manana 提交于
      When enabling quotas, we attempt to commit a transaction while holding the
      mutex fs_info->qgroup_ioctl_lock. This can result on a deadlock with other
      quota operations such as:
      
      - qgroup creation and deletion, ioctl BTRFS_IOC_QGROUP_CREATE;
      
      - adding and removing qgroup relations, ioctl BTRFS_IOC_QGROUP_ASSIGN.
      
      This is because these operations join a transaction and after that they
      attempt to lock the mutex fs_info->qgroup_ioctl_lock. Acquiring that mutex
      after joining or starting a transaction is a pattern followed everywhere
      in qgroups, so the quota enablement operation is the one at fault here,
      and should not commit a transaction while holding that mutex.
      
      Fix this by making the transaction commit while not holding the mutex.
      We are safe from two concurrent tasks trying to enable quotas because
      we are serialized by the rw semaphore fs_info->subvol_sem at
      btrfs_ioctl_quota_ctl(), which is the only call site for enabling
      quotas.
      
      When this deadlock happens, it produces a trace like the following:
      
        INFO: task syz-executor:25604 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:24800 pid:25604 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        btrfs_commit_transaction+0x994/0x2e90 fs/btrfs/transaction.c:2201
        btrfs_quota_enable+0x95c/0x1790 fs/btrfs/qgroup.c:1120
        btrfs_ioctl_quota_ctl fs/btrfs/ioctl.c:4229 [inline]
        btrfs_ioctl+0x637e/0x7b70 fs/btrfs/ioctl.c:5010
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f86920b2c4d
        RSP: 002b:00007f868f61ac58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007f86921d90a0 RCX: 00007f86920b2c4d
        RDX: 0000000020005e40 RSI: 00000000c0109428 RDI: 0000000000000008
        RBP: 00007f869212bd80 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000246 R12: 00007f86921d90a0
        R13: 00007fff6d233e4f R14: 00007fff6d233ff0 R15: 00007f868f61adc0
        INFO: task syz-executor:25628 blocked for more than 143 seconds.
        Not tainted 5.15.0-rc6 #4
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor state:D stack:29080 pid:25628 ppid: 24873 flags:0x00004004
        Call Trace:
        context_switch kernel/sched/core.c:4940 [inline]
        __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
        schedule+0xd3/0x270 kernel/sched/core.c:6366
        schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
        __mutex_lock_common kernel/locking/mutex.c:669 [inline]
        __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
        btrfs_remove_qgroup+0xb7/0x7d0 fs/btrfs/qgroup.c:1548
        btrfs_ioctl_qgroup_create fs/btrfs/ioctl.c:4333 [inline]
        btrfs_ioctl+0x683c/0x7b70 fs/btrfs/ioctl.c:5014
        vfs_ioctl fs/ioctl.c:51 [inline]
        __do_sys_ioctl fs/ioctl.c:874 [inline]
        __se_sys_ioctl fs/ioctl.c:860 [inline]
        __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
        do_syscall_x64 arch/x86/entry/common.c:50 [inline]
        do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      Reported-by: NHao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsZQF19bQ1C6=yetF3BvL10OSORpFUcWXTP6HErshDB4dQ@mail.gmail.com/
      Fixes: 340f1aa2 ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
      CC: stable@vger.kernel.org # 4.19
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      232796df
    • F
      btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range · f0bfa76a
      Filipe Manana 提交于
      When doing a direct IO write against a file range that either has
      preallocated extents in that range or has regular extents and the file
      has the NOCOW attribute set, the write fails with -ENOSPC when all of
      the following conditions are met:
      
      1) There are no data blocks groups with enough free space matching
         the size of the write;
      
      2) There's not enough unallocated space for allocating a new data block
         group;
      
      3) The extents in the target file range are not shared, neither through
         snapshots nor through reflinks.
      
      This is wrong because a NOCOW write can be done in such case, and in fact
      it's possible to do it using a buffered IO write, since when failing to
      allocate data space, the buffered IO path checks if a NOCOW write is
      possible.
      
      The failure in direct IO write path comes from the fact that early on,
      at btrfs_dio_iomap_begin(), we try to allocate data space for the write
      and if it that fails we return the error and stop - we never check if we
      can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check
      if we can do a NOCOW write into the range, or a subset of the range, and
      then release the previously reserved data space.
      
      Fix this by doing the data reservation only if needed, when we must COW,
      at btrfs_get_blocks_direct_write() instead of doing it at
      btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes
      the inneficiency of doing unnecessary data reservations.
      
      The following example test script reproduces the problem:
      
        $ cat dio-nocow-enospc.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        # Use a small fixed size (1G) filesystem so that it's quick to fill
        # it up.
        # Make sure the mixed block groups feature is not enabled because we
        # later want to not have more space available for allocating data
        # extents but still have enough metadata space free for the file writes.
        mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV
        mount $DEV $MNT
      
        # Create our test file with the NOCOW attribute set.
        touch $MNT/foobar
        chattr +C $MNT/foobar
      
        # Now fill in all unallocated space with data for our test file.
        # This will allocate a data block group that will be full and leave
        # no (or a very small amount of) unallocated space in the device, so
        # that it will not be possible to allocate a new block group later.
        echo
        echo "Creating test file with initial data..."
        xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar
      
        # Now try a direct IO write against file range [0, 10M[.
        # This should succeed since this is a NOCOW file and an extent for the
        # range was previously allocated.
        echo
        echo "Trying direct IO write over allocated space..."
        xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar
      
        umount $MNT
      
      When running the test:
      
        $ ./dio-nocow-enospc.sh
        (...)
      
        Creating test file with initial data...
        wrote 943718400/943718400 bytes at offset 0
        900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec)
      
        Trying direct IO write over allocated space...
        pwrite: No space left on device
      
      A test case for fstests will follow, testing both this direct IO write
      scenario as well as the buffered IO write scenario to make it less likely
      to get future regressions on the buffered IO case.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f0bfa76a
    • L
      Linux 5.16-rc8 · c9e6606c
      Linus Torvalds 提交于
      c9e6606c
    • L
      Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of... · 24a0b220
      Linus Torvalds 提交于
      Merge tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix TUI exit screen refresh race condition in 'perf top'.
      
       - Fix parsing of Intel PT VM time correlation arguments.
      
       - Honour CPU filtering command line request of a script's switch events
         in 'perf script'.
      
       - Fix printing of switch events in Intel PT python script.
      
       - Fix duplicate alias events list printing in 'perf list', noticed on
         heterogeneous arm64 systems.
      
       - Fix return value of ids__new(), users expect NULL for failure, not
         ERR_PTR(-ENOMEM).
      
      * tag 'perf-tools-fixes-for-v5.16-2022-01-02' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf top: Fix TUI exit screen refresh race condition
        perf pmu: Fix alias events list
        perf scripts python: intel-pt-events.py: Fix printing of switch events
        perf script: Fix CPU filtering of a script's switch events
        perf intel-pt: Fix parsing of VM time correlation arguments
        perf expr: Fix return value of ids__new()
      24a0b220
    • L
      Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 859431ac
      Linus Torvalds 提交于
      Pull i2c fixes from Wolfram Sang:
       "Better input validation for compat ioctls and a documentation bugfix
        for 5.16"
      
      * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        Docs: Fixes link to I2C specification
        i2c: validate user data in compat ioctl
      859431ac
    • L
      Merge tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1286cc48
      Linus Torvalds 提交于
      Pull x86 fix from Borislav Petkov:
      
       - Use the proper CONFIG symbol in a preprocessor check.
      
      * tag 'x86_urgent_for_v5.16_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/build: Use the proper name CONFIG_FW_LOADER
      1286cc48
  2. 02 1月, 2022 3 次提交
    • Y
      perf top: Fix TUI exit screen refresh race condition · 64f18d2d
      yaowenbin 提交于
      When the following command is executed several times, a coredump file is
      generated.
      
      	$ timeout -k 9 5 perf top -e task-clock
      	*******
      	*******
      	*******
      	0.01%  [kernel]                  [k] __do_softirq
      	0.01%  libpthread-2.28.so        [.] __pthread_mutex_lock
      	0.01%  [kernel]                  [k] __ll_sc_atomic64_sub_return
      	double free or corruption (!prev) perf top --sort comm,dso
      	timeout: the monitored command dumped core
      
      When we terminate "perf top" using sending signal method,
      SLsmg_reset_smg() called. SLsmg_reset_smg() resets the SLsmg screen
      management routines by freeing all memory allocated while it was active.
      
      However SLsmg_reinit_smg() maybe be called by another thread.
      
      SLsmg_reinit_smg() will free the same memory accessed by
      SLsmg_reset_smg(), thus it results in a double free.
      
      SLsmg_reinit_smg() is called already protected by ui__lock, so we fix
      the problem by adding pthread_mutex_trylock of ui__lock when calling
      SLsmg_reset_smg().
      Signed-off-by: NWenyu Liu <liuwenyu7@huawei.com>
      Tested-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: wuxu.wu@huawei.com
      Link: http://lore.kernel.org/lkml/a91e3943-7ddc-f5c0-a7f5-360f073c20e6@huawei.comSigned-off-by: NHewenliang <hewenliang4@huawei.com>
      Signed-off-by: Nyaowenbin <yaowenbin1@huawei.com>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      64f18d2d
    • J
      perf pmu: Fix alias events list · e0257a01
      John Garry 提交于
      Commit 0e0ae874 ("perf list: Display hybrid PMU events with cpu
      type") changes the event list for uncore PMUs or arm64 heterogeneous CPU
      systems, such that duplicate aliases are incorrectly listed per PMU
      (which they should not be), like:
      
        # perf list
        ...
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_es
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in E or S-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        unc_cbo_cache_lookup.any_i
        [Unit: uncore_cbox L3 Lookup any request that access cache and found
        line in I-state]
        ...
      
      Notice how the events are listed twice.
      
      The named commit changed how we remove duplicate events, in that events
      for different PMUs are not treated as duplicates. I suppose this is to
      handle how "Each hybrid pmu event has been assigned with a pmu name".
      
      Fix PMU alias listing by restoring behaviour to remove duplicates for
      non-hybrid PMUs.
      
      Fixes: 0e0ae874 ("perf list: Display hybrid PMU events with cpu type")
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Tested-by: NZhengjun Xing <zhengjun.xing@linux.intel.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/1640103090-140490-1-git-send-email-john.garry@huawei.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      e0257a01
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · 278218f6
      Linus Torvalds 提交于
      Pull input fixes from Dmitry Torokhov:
       "Two small fixups for spaceball joystick driver and appletouch touchpad
        driver"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
        Input: spaceball - fix parsing of movement data packets
        Input: appletouch - initialize work before device registration
      278218f6
  3. 01 1月, 2022 6 次提交
  4. 31 12月, 2021 8 次提交
    • D
      Docs: Fixes link to I2C specification · c116fe1e
      Deep Majumder 提交于
      The link to the I2C specification is broken. Although
      "https://www.nxp.com" hosts Rev 7 (2021) of this specification, it is
      behind a login-wall. Thus, an additional link has been added (which
      doesn't require a login) and the NXP official docs link has been
      updated.
      Signed-off-by: NDeep Majumder <deep@fastmail.in>
      [wsa: minor updates to text and commit message]
      Signed-off-by: NWolfram Sang <wsa@kernel.org>
      c116fe1e
    • P
      i2c: validate user data in compat ioctl · bb436283
      Pavel Skripkin 提交于
      Wrong user data may cause warning in i2c_transfer(), ex: zero msgs.
      Userspace should not be able to trigger warnings, so this patch adds
      validation checks for user data in compact ioctl to prevent reported
      warnings
      
      Reported-and-tested-by: syzbot+e417648b303855b91d8a@syzkaller.appspotmail.com
      Fixes: 7d5cb456 ("i2c compat ioctls: move to ->compat_ioctl()")
      Signed-off-by: NPavel Skripkin <paskripkin@gmail.com>
      Signed-off-by: NWolfram Sang <wsa@kernel.org>
      bb436283
    • L
      Input: spaceball - fix parsing of movement data packets · bc7ec917
      Leo L. Schwab 提交于
      The spaceball.c module was not properly parsing the movement reports
      coming from the device.  The code read axis data as signed 16-bit
      little-endian values starting at offset 2.
      
      In fact, axis data in Spaceball movement reports are signed 16-bit
      big-endian values starting at offset 3.  This was determined first by
      visually inspecting the data packets, and later verified by consulting:
      http://spacemice.org/pdf/SpaceBall_2003-3003_Protocol.pdf
      
      If this ever worked properly, it was in the time before Git...
      Signed-off-by: NLeo L. Schwab <ewhac@ewhac.org>
      Link: https://lore.kernel.org/r/20211221101630.1146385-1-ewhac@ewhac.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      bc7ec917
    • P
      Input: appletouch - initialize work before device registration · 9f3ccdc3
      Pavel Skripkin 提交于
      Syzbot has reported warning in __flush_work(). This warning is caused by
      work->func == NULL, which means missing work initialization.
      
      This may happen, since input_dev->close() calls
      cancel_work_sync(&dev->work), but dev->work initalization happens _after_
      input_register_device() call.
      
      So this patch moves dev->work initialization before registering input
      device
      
      Fixes: 5a6eb676 ("Input: appletouch - improve powersaving for Geyser3 devices")
      Reported-and-tested-by: syzbot+b88c5eae27386b252bbd@syzkaller.appspotmail.com
      Signed-off-by: NPavel Skripkin <paskripkin@gmail.com>
      Link: https://lore.kernel.org/r/20211230141151.17300-1-paskripkin@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NDmitry Torokhov <dmitry.torokhov@gmail.com>
      9f3ccdc3
    • L
      Merge tag 'drm-fixes-2021-12-31' of git://anongit.freedesktop.org/drm/drm · 4f3d93c6
      Linus Torvalds 提交于
      Pull drm fixes from Dave Airlie:
       "This is a bit bigger than I'd like, however it has two weeks of amdgpu
        fixes in it, since they missed last week, which was very small.
      
        The nouveau regression is probably the biggest fix in here, and it
        needs to go into 5.15 as well, two i915 fixes, and then a scattering
        of amdgpu fixes. The biggest fix in there is for a fencing NULL
        pointer dereference, the rest are pretty minor.
      
        For the misc team, I've pulled the two misc fixes manually since I'm
        not sure what is happening at this time of year!
      
        The amdgpu maintainers have the outstanding runpm regression to fix
        still, they are just working through the last bits of it now.
      
        Summary:
      
        nouveau:
         - fencing regression fix
      
        i915:
         - Fix possible uninitialized variable
         - Fix composite fence seqno icrement on each fence creation
      
        amdgpu:
         - Fencing fix
         - XGMI fix
         - VCN regression fix
         - IP discovery regression fixes
         - Fix runpm documentation
         - Suspend/resume fixes
         - Yellow Carp display fixes
         - MCLK power management fix
         - dma-buf fix"
      
      * tag 'drm-fixes-2021-12-31' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Changed pipe split policy to allow for multi-display pipe split
        drm/amd/display: Fix USB4 null pointer dereference in update_psp_stream_config
        drm/amd/display: Set optimize_pwr_state for DCN31
        drm/amd/display: Send s0i2_rdy in stream_count == 0 optimization
        drm/amd/display: Added power down for DCN10
        drm/amd/display: fix B0 TMDS deepcolor no dislay issue
        drm/amdgpu: no DC support for headless chips
        drm/amdgpu: put SMU into proper state on runpm suspending for BOCO capable platform
        drm/amdgpu: always reset the asic in suspend (v2)
        drm/amd/pm: skip setting gfx cgpg in the s0ix suspend-resume
        drm/i915: Increment composite fence seqno
        drm/i915: Fix possible uninitialized variable in parallel extension
        drm/amdgpu: fix runpm documentation
        drm/nouveau: wait for the exclusive fence after the shared ones v2
        drm/amdgpu: add support for IP discovery gc_info table v2
        drm/amdgpu: When the VCN(1.0) block is suspended, powergating is explicitly enabled
        drm/amd/pm: Fix xgmi link control on aldebaran
        drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence
        drm/amdgpu: fix dropped backing store handling in amdgpu_dma_buf_move_notify
      4f3d93c6
    • D
      Merge branch 'drm-misc-fixes' of ssh://git.freedesktop.org/git/drm/drm-misc into drm-fixes · ce9b333c
      Dave Airlie 提交于
      This merges two fixes that haven't been sent to me yet, but I wanted to get in.
      
      One amdgpu fix, but one nouveau regression fixer.
      Signed-off-by: NDave Airlie <airlied@redhat.com>
      ce9b333c
    • C
      fs/mount_setattr: always cleanup mount_kattr · 012e3322
      Christian Brauner 提交于
      Make sure that finish_mount_kattr() is called after mount_kattr was
      succesfully built in both the success and failure case to prevent
      leaking any references we took when we built it.  We returned early if
      path lookup failed thereby risking to leak an additional reference we
      took when building mount_kattr when an idmapped mount was requested.
      
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 9caccd41 ("fs: introduce MOUNT_ATTR_IDMAP")
      Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      012e3322
    • L
      Merge tag 'net-5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 74c78b42
      Linus Torvalds 提交于
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from.. Santa?
      
        No regressions on our radar at this point. The igc problem fixed here
        was the last one I was tracking but it was broken in previous
        releases, anyway. Mostly driver fixes and a couple of largish SMC
        fixes.
      
        Current release - regressions:
      
         - xsk: initialise xskb free_list_node, fixup for a -rc7 fix
      
        Current release - new code bugs:
      
         - mlx5: handful of minor fixes:
      
         - use first online CPU instead of hard coded CPU
      
         - fix some error handling paths in 'mlx5e_tc_add_fdb_flow()'
      
         - fix skb memory leak when TC classifier action offloads are disabled
      
         - fix memory leak with rules with internal OvS port
      
        Previous releases - regressions:
      
         - igc: do not enable crosstimestamping for i225-V models
      
        Previous releases - always broken:
      
         - udp: use datalen to cap ipv6 udp max gso segments
      
         - fix use-after-free in tw_timer_handler due to early free of stats
      
         - smc: fix kernel panic caused by race of smc_sock
      
         - smc: don't send CDC/LLC message if link not ready, avoid timeouts
      
         - sctp: use call_rcu to free endpoint, avoid UAF in sock diag
      
         - bridge: mcast: add and enforce query interval minimum
      
         - usb: pegasus: do not drop long Ethernet frames
      
         - mlx5e: fix ICOSQ recovery flow for XSK
      
         - nfc: uapi: use kernel size_t to fix user-space builds"
      
      * tag 'net-5.16-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
        fsl/fman: Fix missing put_device() call in fman_port_probe
        selftests: net: using ping6 for IPv6 in udpgro_fwd.sh
        Documentation: fix outdated interpretation of ip_no_pmtu_disc
        net/ncsi: check for error return from call to nla_put_u32
        net: bridge: mcast: fix br_multicast_ctx_vlan_global_disabled helper
        net: fix use-after-free in tw_timer_handler
        selftests: net: Fix a typo in udpgro_fwd.sh
        selftests/net: udpgso_bench_tx: fix dst ip argument
        net: bridge: mcast: add and enforce startup query interval minimum
        net: bridge: mcast: add and enforce query interval minimum
        ipv6: raw: check passed optlen before reading
        xsk: Initialise xskb free_list_node
        net/mlx5e: Fix wrong features assignment in case of error
        net/mlx5e: TC, Fix memory leak with rules with internal port
        ionic: Initialize the 'lif->dbid_inuse' bitmap
        igc: Fix TX timestamp support for non-MSI-X platforms
        igc: Do not enable crosstimestamping for i225-V models
        net/smc: fix kernel panic caused by race of smc_sock
        net/smc: don't send CDC/LLC message if link not ready
        NFC: st21nfca: Fix memory leak in device probe and remove
        ...
      74c78b42