1. 26 Mar, 2021 2 commits
  2. 25 Mar, 2021 1 commit
  3. 24 Mar, 2021 2 commits
  4. 21 Mar, 2021 10 commits
  5. 20 Mar, 2021 1 commit
    • S
      cifs: fix allocation size on newly created files · 65af8f01
      Authored by Steve French
      Applications that create, extend, and write to a file do not
      expect to see an allocation size of 0.  When a file is extended,
      set its allocation size to a plausible value until we have a
      chance to query the server for it.  When the file is cached
      this will prevent showing an impossible number of allocated
      blocks (like 0).  This fixes, e.g., xfstests 614, which does:
      
          1) create a file and set its size to 64K
          2) mmap write 64K to the file
          3) stat -c %b for the file (to query the number of allocated blocks)
      
      It was failing because we returned 0 blocks.  Even though we would
      return the correct cached file size, we returned an impossible
      allocation size.
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: <stable@vger.kernel.org>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
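
The cifs commit above (65af8f01) does not show how the "plausible value" is computed. A minimal sketch, assuming the value is simply the new file size rounded up to the 512-byte units that stat reports as blocks; `plausible_blocks` and `STAT_BLOCK_BYTES` are hypothetical names, not taken from the cifs code:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: block counts are reported in units of 512 bytes. */
#define STAT_BLOCK_BYTES 512u

/*
 * Hypothetical helper (not the cifs code): smallest number of 512-byte
 * blocks that can hold size_bytes. Divide first, then adjust, so the
 * computation cannot overflow for sizes near UINT64_MAX.
 */
static inline uint64_t plausible_blocks(uint64_t size_bytes)
{
    return size_bytes / STAT_BLOCK_BYTES +
           (size_bytes % STAT_BLOCK_BYTES != 0);
}
```

Under that assumption, the 64K file from the xfstests 614 scenario would report a non-zero block count instead of 0.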
  6. 19 Mar, 2021 3 commits
  7. 18 Mar, 2021 11 commits
    • F
      btrfs: fix sleep while in non-sleep context during qgroup removal · 0bb78830
      Authored by Filipe Manana
      While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
      through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
      producing the following trace:
      
        [821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
        [821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
        [821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G        W         5.11.6 #15
        [821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
        [821.843647] Call Trace:
        [821.843650]  dump_stack+0xa1/0xfb
        [821.843656]  ___might_sleep+0x144/0x160
        [821.843659]  mutex_lock+0x17/0x40
        [821.843662]  kernfs_remove_by_name_ns+0x1f/0x80
        [821.843666]  sysfs_remove_group+0x7d/0xe0
        [821.843668]  sysfs_remove_groups+0x28/0x40
        [821.843670]  kobject_del+0x2a/0x80
        [821.843672]  btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
        [821.843685]  __del_qgroup_rb+0x12/0x150 [btrfs]
        [821.843696]  btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
        [821.843707]  btrfs_ioctl+0x3129/0x36a0 [btrfs]
        [821.843717]  ? __mod_lruvec_page_state+0x5e/0xb0
        [821.843719]  ? page_add_new_anon_rmap+0xbc/0x150
        [821.843723]  ? kfree+0x1b4/0x300
        [821.843725]  ? mntput_no_expire+0x55/0x330
        [821.843728]  __x64_sys_ioctl+0x5a/0xa0
        [821.843731]  do_syscall_64+0x33/0x70
        [821.843733]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [821.843736] RIP: 0033:0x4cd3fb
        [821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
        [821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
        [821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
        [821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
        [821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
        [821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049
      
      Fix this by removing the qgroup sysfs entry while not holding the spinlock,
      since the spinlock is only meant for protection of the qgroup rbtree.
      Reported-by: Stuart Shelton <srcshelton@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/7A5485BB-0628-419D-A4D3-27B1AF47E25A@gmail.com/
      Fixes: 49e5fb46 ("btrfs: qgroup: export qgroups in sysfs")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
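
The ordering described in the qgroup commit above (0bb78830) can be illustrated with a small user-space model. This is only a sketch: every `toy_` name is hypothetical, the lock is modelled by a boolean flag rather than a real spinlock, and the rbtree is modelled as a singly linked list. The point is that the structure is unlinked under the lock, while the call that may sleep runs only after the lock is dropped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real btrfs or kernel types. */
struct toy_qgroup {
    struct toy_qgroup *next;
    bool sysfs_entry_present;
};

struct toy_fs_info {
    bool qgroup_lock_held;      /* models the fs_info->qgroup_lock spinlock */
    struct toy_qgroup *qgroups; /* models the qgroup rbtree as a list */
};

static inline void toy_spin_lock(struct toy_fs_info *fs)
{
    assert(!fs->qgroup_lock_held);
    fs->qgroup_lock_held = true;
}

static inline void toy_spin_unlock(struct toy_fs_info *fs)
{
    assert(fs->qgroup_lock_held);
    fs->qgroup_lock_held = false;
}

/* Models a call that may sleep (it takes a mutex in the real kernel). */
static inline void toy_sysfs_del(struct toy_fs_info *fs, struct toy_qgroup *qg)
{
    assert(!fs->qgroup_lock_held); /* plays the role of a might-sleep check */
    qg->sysfs_entry_present = false;
}

/* Unlink under the lock, then do the sleeping cleanup after unlocking. */
static inline void toy_remove_qgroup(struct toy_fs_info *fs,
                                     struct toy_qgroup *qg)
{
    struct toy_qgroup **link;

    toy_spin_lock(fs);
    for (link = &fs->qgroups; *link != NULL; link = &(*link)->next) {
        if (*link == qg) {
            *link = qg->next;
            qg->next = NULL;
            break;
        }
    }
    toy_spin_unlock(fs);

    toy_sysfs_del(fs, qg);
}
```

Calling `toy_sysfs_del()` before `toy_spin_unlock()` would trip the assertion, which mirrors the "sleeping function called from invalid context" report in the trace.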
    • P
      io_uring: don't leak creds on SQO attach error · de75a3d3
      Authored by Pavel Begunkov
      Attaching to an already dead/dying SQPOLL task is disallowed in
      io_sq_offload_create(), but cleanup is hand coded by calling
      io_put_sq_data() etc., which fails to put ctx->sq_creds.
      
      Defer everything to the error-path io_sq_thread_finish(), adding
      ctx->sqd_list in the error case as well, as finish will handle it.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • F
      btrfs: fix subvolume/snapshot deletion not triggered on mount · 8d488a8c
      Authored by Filipe Manana
      During the mount procedure we are calling btrfs_orphan_cleanup() against
      the root tree, which will find all orphan items in this tree. When an
      orphan item corresponds to a deleted subvolume/snapshot (instead of an
      inode space cache), it must not delete the orphan item, because that will
      cause btrfs_find_orphan_roots() to not find the orphan item and therefore
      not add the corresponding subvolume root to the list of dead roots, which
      results in the subvolume's tree never being deleted by the cleanup thread.
      
      The same applies to the remount from RO to RW path.
      
      Fix this by making btrfs_find_orphan_roots() run before calling
      btrfs_orphan_cleanup() against the root tree.
      
      A test case for fstests will follow soon.
      Reported-by: Robbie Ko <robbieko@synology.com>
      Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
      Fixes: 638331fa ("btrfs: fix transaction leak and crash after cleaning up orphans on RO mount")
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: fix build when using M=fs/btrfs · ebd99a6b
      Authored by David Sterba
      There are people building the module with M= that's supposed to be used
      for external modules. This got broken in e9aa7c28 ("btrfs: enable
      W=1 checks for btrfs").
      
        $ make M=fs/btrfs
        scripts/Makefile.lib:10: *** Recursive variable 'KBUILD_CFLAGS' references itself (eventually).  Stop.
        make: *** [Makefile:1755: modules] Error 2
      
      There's a difference compared to 'make fs/btrfs/btrfs.ko' which needs
      to rebuild a few more things and also the dependency modules need to be
      available. It could fail with, e.g.:
      
        WARNING: Symbol version dump "Module.symvers" is missing.
      	   Modules may not have dependencies or modversions.
      
      In some environments it's more convenient to rebuild just the btrfs
      module by M= so let's make it work.
      
      The problem is with recursive variable evaluation in += so the
      conditional C options are stored in a temporary variable to avoid the
      recursion.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: do not initialize dev replace for bad dev root · 3cb89497
      Authored by Josef Bacik
      While helping Neal fix his broken file system I added a debug patch to
      catch if we were calling btrfs_search_slot with a NULL root, and this
      stack trace popped:
      
        we tried to search with a NULL root
        CPU: 0 PID: 1760 Comm: mount Not tainted 5.11.0-155.nealbtrfstest.1.fc34.x86_64 #1
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
        Call Trace:
         dump_stack+0x6b/0x83
         btrfs_search_slot.cold+0x11/0x1b
         ? btrfs_init_dev_replace+0x36/0x450
         btrfs_init_dev_replace+0x71/0x450
         open_ctree+0x1054/0x1610
         btrfs_mount_root.cold+0x13/0xfa
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x131/0x3d0
         ? legacy_get_tree+0x27/0x40
         ? btrfs_show_options+0x640/0x640
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         path_mount+0x441/0xa80
         __x64_sys_mount+0xf4/0x130
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f644730352e
      
      Fix this by not starting the device replace stuff if we have a
      NULL dev root.
      Reported-by: Neal Gompa <ngompa13@gmail.com>
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: initialize device::fs_info always · 820a49da
      Authored by Josef Bacik
      Neal reported a panic trying to use -o rescue=all:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000030
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 696 Comm: mount Tainted: G        W         5.12.0-rc2+ #296
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200
        RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070
        RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400
        RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150
        R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000
        R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150
        FS:  00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0
        Call Trace:
         ? btrfs_init_dev_stats+0x1f/0xf0
         ? kmem_cache_alloc+0xef/0x1f0
         btrfs_init_dev_stats+0x5f/0xf0
         open_ctree+0x10cb/0x1720
         btrfs_mount_root.cold+0x12/0xea
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         path_mount+0x433/0xa00
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens because btrfs_init_dev_stats() dereferences
      device->fs_info->dev_root, but device->fs_info isn't initialized:
      we were only calling btrfs_init_devices_late() if we properly
      read the device root.  However, we don't actually need the device root
      to init the devices; this function simply assigns the devices their
      ->fs_info pointer, so it needs to be called unconditionally so that
      we can properly dereference device->fs_info in rescue cases.
      Reported-by: Neal Gompa <ngompa13@gmail.com>
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: do not initialize dev stats if we have no dev_root · 82d62d06
      Authored by Josef Bacik
      Neal reported a panic trying to use -o rescue=all:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000030
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1
        RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0
        RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10
        RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000
        RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00
        R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10
        FS:  00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0
        Call Trace:
        ? btrfs_init_dev_stats+0x1f/0xf0
        btrfs_init_dev_stats+0x62/0xf0
        open_ctree+0x1019/0x15ff
        btrfs_mount_root.cold+0x13/0xfa
        legacy_get_tree+0x27/0x40
        vfs_get_tree+0x25/0xb0
        vfs_kern_mount.part.0+0x71/0xb0
        btrfs_mount+0x131/0x3d0
        ? legacy_get_tree+0x27/0x40
        ? btrfs_show_options+0x640/0x640
        legacy_get_tree+0x27/0x40
        vfs_get_tree+0x25/0xb0
        path_mount+0x441/0xa80
        __x64_sys_mount+0xf4/0x130
        do_syscall_64+0x33/0x40
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f698c04e52e
      
      This happens because we unconditionally attempt to initialize device
      stats on mount, but we may not have been able to read the device root.
      Fix this by skipping initializing the device stats if we do not have a
      device root.
      Reported-by: Neal Gompa <ngompa13@gmail.com>
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
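
The two rescue-mount fixes above (820a49da and 82d62d06) describe complementary rules: the step that needs no device root runs unconditionally, and the step that reads from the device root is skipped when that root is missing. A toy model of that control flow, assuming simplified stand-in types; all `toy_` names are hypothetical and the bodies are not the btrfs implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fs_info;

struct toy_device {
    struct toy_fs_info *fs_info;
    bool stats_valid;
};

struct toy_root {
    int unused;
};

struct toy_fs_info {
    struct toy_root *dev_root;  /* NULL when the device root was not read */
    struct toy_device *devices;
    size_t nr_devices;
};

/* Needs no device root, so it runs unconditionally. */
static inline void toy_init_devices_late(struct toy_fs_info *fs)
{
    for (size_t i = 0; i < fs->nr_devices; i++)
        fs->devices[i].fs_info = fs;
}

/* Reads from the device root, so it is skipped when there is none. */
static inline int toy_init_dev_stats(struct toy_fs_info *fs)
{
    if (fs->dev_root == NULL)
        return 0;
    for (size_t i = 0; i < fs->nr_devices; i++)
        fs->devices[i].stats_valid = true;
    return 0;
}

/* Models the relevant part of the mount path. */
static inline int toy_open(struct toy_fs_info *fs)
{
    toy_init_devices_late(fs);
    return toy_init_dev_stats(fs);
}
```

With `dev_root` set to NULL, `toy_open()` still leaves every device with a valid `fs_info` pointer and simply does not load statistics.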
    • J
      btrfs: zoned: remove outdated WARN_ON in direct IO · f3da882e
      Authored by Johannes Thumshirn
      In btrfs_submit_direct() there's a WARN_ON_ONCE() that will trigger if
      we're submitting a DIO write on a zoned filesystem but are not using
      REQ_OP_ZONE_APPEND to submit the IO to the block device.
      
      This is a leftover from a previous version where btrfs_dio_iomap_begin()
      didn't use btrfs_use_zone_append() to check for sequential write only
      zones.
      
      It is an oversight from the development phase. In v11 (I think) I've
      added 08f45559 ("btrfs: zoned: cache if block group is on a
      sequential zone") and forgot to remove the WARN_ON_ONCE() for
      544d24f9 ("btrfs: zoned: enable zone append writing for direct IO").
      
      When developing auto relocation I got hit by the WARN, as block groups
      were relocated to a conventional zone and the dio code calls
      btrfs_use_zone_append(), introduced by 08f45559, to check whether it can
      use zone append (i.e. whether it's a sequential zone) and sets the
      appropriate flags for iomap.
      
      I've never hit it in testing before, as I was relying on emulation to
      test the conventional zones code but this one case wasn't hit, because
      on emulation fs_info->max_zone_append_size is 0 and the WARN doesn't
      trigger either.
      
      Fixes: 544d24f9 ("btrfs: zoned: enable zone append writing for direct IO")
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 17 Mar, 2021 5 commits
    • C
      zonefs: fix to update .i_wr_refcnt correctly in zonefs_open_zone() · 6980d29c
      Authored by Chao Yu
      In zonefs_open_zone(), if the opened zone count is larger than the
      .s_max_open_zones threshold, we failed to restore .i_wr_refcnt;
      fix this.
      
      Fixes: b5c00e97 ("zonefs: open/close zone on file open/close")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Chao Yu <yuchao0@huawei.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
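
The zonefs commit above (6980d29c) is about undoing a reference count when an open fails. A minimal sketch of that pattern, assuming the write reference is taken first and that only the first writer of a zone counts against the open-zone limit; both assumptions and all `toy_` names are illustrative, not taken from the zonefs source:

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-ins for the zonefs superblock and inode info. */
struct toy_sb_info {
    int open_zones;      /* models the count of currently open zones */
    int max_open_zones;  /* models .s_max_open_zones */
};

struct toy_inode_info {
    int wr_refcnt;       /* models .i_wr_refcnt */
};

/* Take a write reference; undo every increment if the limit is hit. */
static inline int toy_open_zone(struct toy_sb_info *sbi,
                                struct toy_inode_info *zi)
{
    zi->wr_refcnt++;
    if (zi->wr_refcnt == 1) {
        /* First writer of this zone: account one more open zone. */
        sbi->open_zones++;
        if (sbi->open_zones > sbi->max_open_zones) {
            sbi->open_zones--;
            zi->wr_refcnt--;  /* the restore step the commit adds */
            return -EBUSY;
        }
    }
    return 0;
}
```

Without the `zi->wr_refcnt--` on the failure path, a later open of the same zone would see a stale count of 1 and skip the limit check entirely.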
    • O
      kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data() · 5abbe51a
      Authored by Oleg Nesterov
      Preparation for fixing get_nr_restart_syscall() on X86 for COMPAT.
      
      Add a new helper which sets restart_block->fn and calls a dummy
      arch_set_restart_data() helper.
      
      Fixes: 609c19a3 ("x86/ptrace: Stop setting TS_COMPAT in ptrace code")
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210201174641.GA17871@redhat.com
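
The helper described in the commit above (5abbe51a) sets `restart_block->fn` and calls a dummy architecture hook. A user-space sketch of that shape, assuming a simplified restart block; the `toy_` types, the `arch_data` field, and the example restart function are hypothetical, and the real helper's signature and return value are not shown in the commit message:

```c
#include <assert.h>
#include <stddef.h>

/* Toy restart block; the real structure has more fields. */
struct toy_restart_block {
    long (*fn)(struct toy_restart_block *);
    int arch_data;  /* hypothetical field an architecture might fill in */
};

/* Dummy default hook: does nothing. An architecture could override it. */
static inline void toy_arch_set_restart_data(struct toy_restart_block *restart)
{
    (void)restart;
}

/* One helper sets ->fn and gives the architecture a chance to record data. */
static inline void toy_set_restart_fn(struct toy_restart_block *restart,
                                      long (*fn)(struct toy_restart_block *))
{
    restart->fn = fn;
    toy_arch_set_restart_data(restart);
}

/* Example restart function used only for the demonstration. */
static inline long toy_restart_example(struct toy_restart_block *restart)
{
    (void)restart;
    return 42;
}
```

Routing every assignment of `->fn` through one helper means a later change can make the hook store per-architecture data in a single place instead of at each call site.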
    • F
      btrfs: always pin deleted leaves when there are active tree mod log users · 485df755
      Authored by Filipe Manana
      When freeing a tree block we may end up adding its extent back to the
      free space cache/tree, as long as there are no more references for it,
      it was created in the current transaction and writeback for it never
      happened. This is generally fine, however when we have tree mod log
      operations it can result in inconsistent versions of a btree after
      unwinding extent buffers with the recorded tree mod log operations.
      
      This is because:
      
      * We only log operations for nodes (adding and removing key/pointers),
        for leaves we don't do anything;
      
      * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
        for a node that points to a leaf that was deleted;
      
      * Before we apply the logged operation to unwind a node, we can have
        that leaf's extent allocated again, either as a node or as a leaf, and
        possibly for another btree. This is possible if the leaf was created in
        the current transaction and writeback for it never started, in which
        case btrfs_free_tree_block() returns its extent back to the free space
        cache/tree;
      
      * Then, before applying the tree mod log operation, some task allocates
        the metadata extent just freed before, and uses it either as a leaf or
        as a node for some btree (can be the same or another one, it does not
        matter);
      
      * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
        get the target node with an item pointing to the metadata extent that
        now has content different from what it had before the leaf was deleted.
        It might now belong to a different btree and be a node and not a leaf
        anymore.
      
        As a consequence, the results of searches after the unwinding can be
        unpredictable and produce unexpected results.
      
      So make sure we pin extent buffers corresponding to leaves when there
      are tree mod log users.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: fix race when cloning extent buffer during rewind of an old root · dbcc7d57
      Authored by Filipe Manana
      While resolving backreferences, as part of a logical ino ioctl call or
      fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
      operations of a root, triggering a stack trace like the following:
      
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.c:1210!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
        Code: 05 48 8d 74 10 (...)
        RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
        RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
        RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
        RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
        R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
        R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
        FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
        Call Trace:
         btrfs_search_old_slot+0x265/0x10d0
         ? lock_acquired+0xbb/0x600
         ? btrfs_search_slot+0x1090/0x1090
         ? free_extent_buffer.part.61+0xd7/0x140
         ? free_extent_buffer+0x13/0x20
         resolve_indirect_refs+0x3e9/0xfc0
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? add_prelim_ref.part.11+0x150/0x150
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? do_raw_spin_unlock+0xa8/0x140
         ? rb_insert_color+0x30/0x360
         ? prelim_ref_insert+0x12d/0x430
         find_parent_nodes+0x5c3/0x1830
         ? resolve_indirect_refs+0xfc0/0xfc0
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x160/0x210
         ? lock_release+0xc8/0x620
         ? fs_reclaim_acquire+0x67/0xf0
         ? lock_acquire+0xc7/0x510
         ? poison_range+0x38/0x40
         ? unpoison_range+0x14/0x40
         ? trace_hardirqs_on+0x55/0x120
         btrfs_find_all_roots_safe+0x142/0x1e0
         ? find_parent_nodes+0x1830/0x1830
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         iterate_extent_inodes+0x20e/0x580
         ? tree_backref_for_extent+0x230/0x230
         ? lock_downgrade+0x3d0/0x3d0
         ? read_extent_buffer+0xdd/0x110
         ? lock_downgrade+0x3d0/0x3d0
         ? __kasan_check_read+0x11/0x20
         ? lock_acquired+0xbb/0x600
         ? __kasan_check_write+0x14/0x20
         ? _raw_spin_unlock+0x22/0x30
         ? __kasan_check_write+0x14/0x20
         iterate_inodes_from_logical+0x129/0x170
         ? iterate_inodes_from_logical+0x129/0x170
         ? btrfs_inode_flags_to_xflags+0x50/0x50
         ? iterate_extent_inodes+0x580/0x580
         ? __vmalloc_node+0x92/0xb0
         ? init_data_container+0x34/0xb0
         ? init_data_container+0x34/0xb0
         ? kvmalloc_node+0x60/0x80
         btrfs_ioctl_logical_to_ino+0x158/0x230
         btrfs_ioctl+0x205e/0x4040
         ? __might_sleep+0x71/0xe0
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? getrusage+0x4b6/0x9c0
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __might_fault+0x64/0xd0
         ? lock_acquire+0xc7/0x510
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? lock_downgrade+0x3d0/0x3d0
         ? lockdep_hardirqs_on_prepare+0x210/0x210
         ? __kasan_check_read+0x11/0x20
         ? lock_release+0xc8/0x620
         ? __task_pid_nr_ns+0xd3/0x250
         ? lock_acquire+0xc7/0x510
         ? __fget_files+0x160/0x230
         ? __fget_light+0xf2/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fd1976e2427
        Code: 00 00 90 48 8b 05 (...)
        RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
        RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
        RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
        R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
        R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
        Modules linked in:
        ---[ end trace ec8931a1c36e57be ]---
      
        (gdb) l *(__tree_mod_log_rewind+0x3b1)
        0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
        1205                     * the modification. as we're going backwards, we do the
        1206                     * opposite of each operation here.
        1207                     */
        1208                    switch (tm->op) {
        1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
        1210                            BUG_ON(tm->slot < n);
        1211                            fallthrough;
        1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
        1213                    case MOD_LOG_KEY_REMOVE:
        1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);
      
      Here's what happens to hit that BUG_ON():
      
      1) We have one tree mod log user (through fiemap or the logical ino ioctl),
         with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
      
      2) Another task is at ctree.c:balance_level() and we have eb X currently as
         the root of the tree, and we promote its single child, eb Y, as the new
         root.
      
         Then, at ctree.c:balance_level(), we call:
      
            tree_mod_log_insert_root(eb X, eb Y, 1);
      
      3) At tree_mod_log_insert_root() we create tree mod log elements for each
         slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
         with a ->logical pointing to ebX->start. These are placed in an array
         named tm_list.
         Let's assume there are N elements (N pointers in eb X);
      
      4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
         element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
         ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
         to the level of eb X and ->generation set to the generation of eb X;
      
      5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
         tm_list as argument. After that, tree_mod_log_free_eb() calls
         __tree_mod_log_insert() for each member of tm_list in reverse order,
         from highest slot in eb X, slot N - 1, to slot 0 of eb X;
      
      6) __tree_mod_log_insert() sets the sequence number of each given tree mod
         log operation - it increments fs_info->tree_mod_seq and sets
         fs_info->tree_mod_seq as the sequence number of the given tree mod log
         operation.
      
         This means that for the tm_list created at tree_mod_log_insert_root(),
         the element corresponding to slot 0 of eb X has the highest sequence
         number (1 + N), and the element corresponding to the last slot has the
         lowest sequence number (2);
      
      7) Then, after inserting tm_list's elements into the tree mod log rbtree,
         the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
         sequence number, which is N + 2;
      
      8) Back to ctree.c:balance_level(), we free eb X by calling
         btrfs_free_tree_block() on it. Because eb X was created in the current
         transaction, has no other references and writeback did not happen for
         it, we add it back to the free space cache/tree;
      
      9) Later some other task T allocates the metadata extent from eb X, since
         it is marked as free space in the space cache/tree, and uses it as a
         node for some other btree;
      
      10) The tree mod log user task calls btrfs_search_old_slot(), which calls
          get_old_root(), and finally that calls __tree_mod_log_oldest_root()
          with time_seq == 1 and eb_root == eb Y;
      
      11) First iteration of the while loop finds the tree mod log element with
          sequence number N + 2, for the logical address of eb Y and of type
          MOD_LOG_ROOT_REPLACE;
      
      12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
          of the loop, and set root_logical to point to tm->old_root.logical
          which corresponds to the logical address of eb X;
      
      13) On the next iteration of the while loop, the call to
          tree_mod_log_search_oldest() returns the smallest tree mod log element
          for the logical address of eb X, which has a sequence number of 2, an
          operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
          the old slot N - 1 of eb X (eb X had N items in it before being freed);
      
      14) We then break out of the while loop and return the tree mod log operation
          of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
          eb X, to get_old_root();
      
      15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
          and set "logical" to the logical address of eb X, which was the old
          root. We then call tree_mod_log_search() passing it the logical
          address of eb X and time_seq == 1;
      
      16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
          which results in adding a tree mod log operation of type
          MOD_LOG_KEY_ADD to the tree mod log - this is done at
          ctree.c:insert_ptr() - but after adding the tree mod log operation
          and before updating the number of items in eb X from 0 to 1...
      
      17) The task at get_old_root() calls tree_mod_log_search() and gets the
          tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
          Then it enters the following if branch:
      
          if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
             (...)
          } (...)
      
          Calls read_tree_block() for eb X, which gets a reference on eb X but
          does not lock it - task T has it locked.
          Then it clones eb X while it has nritems set to 0 in its header, before
          task T sets nritems to 1 in eb X's header. From here on we use the
          clone of eb X which no other task has access to;
      
      18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
          mod log operation we just got from tree_mod_log_search() in the
          previous step and the cloned version of eb X;
      
      19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
          of items set in eb X's clone, which is 0. Then we enter the while loop,
          and in its first iteration we process the MOD_LOG_KEY_ADD operation,
          which just decrements "n" from 0 to (u32)-1, since "n" is declared with
          a type of u32. At the end of this iteration we call rb_next() to find the
          next tree mod log operation for eb X, that gives us the mod log operation
          of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
          number of N + 1 (steps 3 to 6);
      
      20) Then we go back to the top of the while loop and trigger the following
          BUG_ON():
      
              (...)
              switch (tm->op) {
              case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                       BUG_ON(tm->slot < n);
                       fallthrough;
              (...)
      
          Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
      
      Fix this by taking a read lock on the extent buffer before cloning it at
      ctree.c:get_old_root(). This should be done regardless of the extent
      buffer having been freed and reused, as a concurrent task might be
      modifying it (while holding a write lock on it).
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
      Fixes: 834328a8 ("Btrfs: tree mod log's old roots could still be part of the tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
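
The arithmetic in the walk-through of commit dbcc7d57 above can be checked with a short sketch. It models only the numbers: the sequence numbers handed out in steps 5 to 7, and the u32 wraparound in steps 19 and 20. The function names are hypothetical and nothing here is the btrfs code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Steps 5 to 7: elements for slots N-1 down to 0 are inserted in that
 * order, each taking the next sequence number after start_seq.
 */
static inline uint64_t seq_for_slot(uint64_t start_seq, uint32_t nr_slots,
                                    uint32_t slot)
{
    assert(slot < nr_slots);
    return start_seq + (uint64_t)(nr_slots - slot);
}

/* The MOD_LOG_ROOT_REPLACE element is inserted after all slot elements. */
static inline uint64_t seq_for_root_replace(uint64_t start_seq,
                                            uint32_t nr_slots)
{
    return start_seq + nr_slots + 1;
}

/*
 * Steps 19 and 20: rewinding a MOD_LOG_KEY_ADD decrements n. With n == 0
 * the unsigned 32-bit value wraps to UINT32_MAX, so the later check
 * "slot < n" is true even for slot 0.
 */
static inline bool rewind_would_hit_bug_on(uint32_t nritems_in_clone,
                                           uint32_t freeing_slot)
{
    uint32_t n = nritems_in_clone;

    n--;                      /* undo MOD_LOG_KEY_ADD */
    return freeing_slot < n;  /* condition inside the BUG_ON() */
}
```

With a starting sequence of 1 and N slots, the last slot gets 2, slot 0 gets N + 1, and the root replacement gets N + 2, matching the text. A clone taken with nritems already set to 1 would decrement to 0 and not satisfy the condition for slot 0, which is why cloning under a read lock avoids the crash.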
    • D
      btrfs: fix slab cache flags for free space tree bitmap · 34e49994
      Authored by David Sterba
      The free space tree bitmap slab cache is created with SLAB_RED_ZONE, but
      that is a debugging flag and not always enabled. The other slabs are
      also created with at least SLAB_MEM_SPREAD, which we want here as well
      to average the memory placement cost.
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Fixes: 3acd4850 ("btrfs: fix allocation of free space cache v1 bitmap pages")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: David Sterba <dsterba@suse.com>
      34e49994
  9. 16 Mar, 2021 5 commits
    • A
      fuse: 32-bit user space ioctl compat for fuse device · f8425c93
      Alessio Balsini committed
      With a 64-bit kernel build the FUSE device cannot handle ioctl requests
      coming from 32-bit user space.  This is because the ioctl command
      translation generates different command identifiers, which therefore
      cannot be compared directly without proper manipulation.
      
      Explicitly extract type and number from the ioctl command to enable 32-bit
      user space compatibility on 64-bit kernel builds.
      Signed-off-by: Alessio Balsini <balsini@android.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      f8425c93
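
Why comparing only the type and number helps can be shown with a small sketch. It is not the FUSE code. The EX_ macros are illustrative copies of the asm-generic ioctl field layout (number in bits 0-7, type in bits 8-15, argument size in bits 16-29, direction in bits 30-31), and the command 'E'/1 is hypothetical. It also assumes that the two identifiers differ because the argument is 4 bytes for a 32-bit caller and 8 bytes for a 64-bit one, which is one way such a mismatch can arise.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copies of the asm-generic ioctl encoding (not the kernel
 * headers): nr in bits 0-7, type in 8-15, size in 16-29, dir in 30-31. */
#define EX_IOC(dir, type, nr, size) \
    (((uint32_t)(dir) << 30) | ((uint32_t)(size) << 16) | \
     ((uint32_t)(type) << 8) | (uint32_t)(nr))
#define EX_IOC_TYPE(cmd) (((cmd) >> 8) & 0xffu)
#define EX_IOC_NR(cmd)   ((cmd) & 0xffu)
#define EX_DIR_WRITE     1u

/* Hypothetical command: type 'E', number 1, argument size chosen by caller. */
uint32_t ex_cmd(uint32_t arg_size)
{
    return EX_IOC(EX_DIR_WRITE, 'E', 1, arg_size);
}

/* Direct comparison: fails when only the encoded size differs. */
int same_by_value(uint32_t a, uint32_t b)
{
    return a == b;
}

/* Comparison on the extracted type and number only. */
int same_by_type_and_nr(uint32_t a, uint32_t b)
{
    return EX_IOC_TYPE(a) == EX_IOC_TYPE(b) && EX_IOC_NR(a) == EX_IOC_NR(b);
}

/* The 4-byte and 8-byte variants differ as values but match on type/nr. */
void check_compat_match(void)
{
    assert(!same_by_value(ex_cmd(4), ex_cmd(8)));
    assert(same_by_type_and_nr(ex_cmd(4), ex_cmd(8)));
}
```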
    • Q
      btrfs: subpage: make readahead work properly · 60484cd9
      Qu Wenruo committed
      The readahead infrastructure uses a lot of hard-coded PAGE_SHIFT even
      though it does not do anything specific to PAGE_SIZE.
      
      One of the most affected parts is the radix tree operation on
      btrfs_fs_info::reada_tree.
      
      With PAGE_SHIFT, subpage metadata readahead is broken and does not
      help with reading metadata ahead.
      
      Fix the problem by using btrfs_fs_info::sectorsize_bits so that
      readahead could work for subpage.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      60484cd9
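
The effect of the shift on a radix tree index can be sketched as follows. This is an illustration, not the btrfs code: reada_index() is a hypothetical helper, and the values (64K pages, 4K sectors, two tree blocks 16K apart) are assumptions chosen to represent a subpage setup.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: derive a radix tree index from a logical address. */
unsigned long reada_index(uint64_t logical, unsigned int shift)
{
    return (unsigned long)(logical >> shift);
}

/* Assumed example values for a subpage configuration. */
#define EX_PAGE_SHIFT      16u          /* 64K pages */
#define EX_SECTORSIZE_BITS 12u          /* 4K sectors */
#define EX_BLOCK_A         0x100000ull  /* first tree block */
#define EX_BLOCK_B         0x104000ull  /* second tree block, 16K later */

/* Two blocks inside one 64K page collide on the page-based index but get
 * distinct slots on the sector-based index. */
void check_index_collision(void)
{
    assert(reada_index(EX_BLOCK_A, EX_PAGE_SHIFT) ==
           reada_index(EX_BLOCK_B, EX_PAGE_SHIFT));
    assert(reada_index(EX_BLOCK_A, EX_SECTORSIZE_BITS) !=
           reada_index(EX_BLOCK_B, EX_SECTORSIZE_BITS));
}
```

When the sector size equals the page size the two shifts are identical, so the sector-based index gives the same result as before on regular setups.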
    • Q
      btrfs: subpage: fix wild pointer access during metadata read failure · d9bb77d5
      Qu Wenruo committed
      [BUG]
      When running fstests for the btrfs subpage read-write test, generic/475
      has a very high chance of crashing with the following stack:
      
       BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10
       Unable to handle kernel paging request at virtual address ffff80001157e7c0
       CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G        WC        5.12.0-rc2-custom+ #5
       Hardware name: Khadas VIM3 (DT)
       Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs]
       pc : queued_spin_lock_slowpath+0x1a0/0x390
       lr : do_raw_spin_lock+0xc4/0x11c
       Call trace:
        queued_spin_lock_slowpath+0x1a0/0x390
        _raw_spin_lock+0x68/0x84
        btree_readahead_hook+0x38/0xc0 [btrfs]
        end_bio_extent_readpage+0x504/0x5f4 [btrfs]
        bio_endio+0x170/0x1a4
        end_workqueue_fn+0x3c/0x60 [btrfs]
        btrfs_work_helper+0x1b0/0x1b4 [btrfs]
        process_one_work+0x22c/0x430
        worker_thread+0x70/0x3a0
        kthread+0x13c/0x140
        ret_from_fork+0x10/0x30
       Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827)
      
      [CAUSE]
      In end_bio_extent_readpage(), if we hit an error during read, we will
      handle the error differently for data and metadata.
      For data we queue a repair, while for metadata, we record the error and
      let the caller choose what to do.
      
      But the code still uses page->private to grab the extent buffer, and
      that no longer points to an extent buffer for subpage metadata pages.
      
      This wild pointer access leads to the crash above.
      
      [FIX]
      Introduce a helper, find_extent_buffer_readpage(), to grab the extent
      buffer.
      
      The differences from find_extent_buffer_nospinlock() are:
      
      - It also handles the regular sectorsize == PAGE_SIZE case
      - It does not increase or decrease extent buffer refs
        An extent buffer under IO must have non-zero refs, so this is safe
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d9bb77d5
    • D
      zonefs: Fix O_APPEND async write handling · ebfd68cd
      Damien Le Moal committed
      zonefs updates the size of a sequential zone file inode only on
      completion of direct writes. When executing asynchronous append writes
      (with a file opened with O_APPEND or using RWF_APPEND), the use of the
      current inode size in generic_write_checks() to set an iocb offset thus
      leads to an unaligned write if an application issues an append write
      operation while another write is still being executed.
      
      Fix this problem by introducing zonefs_write_checks() as a modified
      version of generic_write_checks() using the file inode wp_offset for an
      append write iocb offset. Also introduce zonefs_write_check_limits() to
      replace the generic_write_check_limits() call. This zonefs-specific
      helper makes sure that the maximum file limit used is the maximum size
      of the file being accessed.
      
      Since zonefs_write_checks() already truncates the iov_iter, the calls
      to iov_iter_truncate() in zonefs_file_dio_write() and
      zonefs_file_buffered_write() are removed.
      
      Fixes: 8dcc1a9d ("fs: New zonefs file system")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      ebfd68cd
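
The race described in this commit can be modelled with two counters. The sketch below is a simplified model, not the zonefs implementation. The struct and function names are hypothetical, and it assumes that the write pointer offset advances as soon as a write is issued while the inode size advances only on completion.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of a sequential zone file (names are hypothetical). */
struct ex_zone_file {
    uint64_t i_size;     /* advanced only when a write completes */
    uint64_t wp_offset;  /* advanced as soon as a write is issued */
};

/* Append offset taken from the inode size: stale while writes are in flight. */
uint64_t issue_append_using_isize(struct ex_zone_file *f, uint64_t len)
{
    uint64_t pos = f->i_size;

    f->wp_offset += len;
    return pos;
}

/* Append offset taken from the write pointer: accounts for in-flight writes. */
uint64_t issue_append_using_wp(struct ex_zone_file *f, uint64_t len)
{
    uint64_t pos = f->wp_offset;

    f->wp_offset += len;
    return pos;
}

/* Completion of a write of len bytes. */
void complete_write(struct ex_zone_file *f, uint64_t len)
{
    f->i_size += len;
}

/* Two appends issued back to back, before either one completes. */
void check_back_to_back_appends(void)
{
    struct ex_zone_file a = { 0, 0 };
    struct ex_zone_file b = { 0, 0 };

    assert(issue_append_using_isize(&a, 4096) == 0);
    assert(issue_append_using_isize(&a, 4096) == 0);   /* same offset twice */

    assert(issue_append_using_wp(&b, 4096) == 0);
    assert(issue_append_using_wp(&b, 4096) == 4096);   /* sequential */
}
```

In the first case the second write targets an offset that is no longer the zone write pointer, which matches the unaligned write the commit message describes.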
    • D
      zonefs: prevent use of seq files as swap file · 1601ea06
      Damien Le Moal committed
      The sequential write constraint of sequential zone files prevents their
      use as swap files. Only allow conventional zone files to be used as swap
      files.
      
      Fixes: 8dcc1a9d ("fs: New zonefs file system")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      1601ea06