1. 24 Nov, 2020 5 commits
    • F
      btrfs: fix lockdep splat when enabling and disabling qgroups · a855fbe6
      Filipe Manana committed
      When running test case btrfs/017 from fstests, lockdep reported the
      following splat:
      
        [ 1297.067385] ======================================================
        [ 1297.067708] WARNING: possible circular locking dependency detected
        [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 1297.068322] ------------------------------------------------------
        [ 1297.068629] btrfs/189080 is trying to acquire lock:
        [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.069274]
      		 but task is already holding lock:
        [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.070219]
      		 which lock already depends on the new lock.
      
        [ 1297.071131]
      		 the existing dependency chain (in reverse order) is:
        [ 1297.071721]
      		 -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}:
        [ 1297.072375]        lock_acquire+0xd8/0x490
        [ 1297.072710]        __mutex_lock+0xa3/0xb30
        [ 1297.073061]        btrfs_qgroup_inherit+0x59/0x6a0 [btrfs]
        [ 1297.073421]        create_subvol+0x194/0x990 [btrfs]
        [ 1297.073780]        btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
        [ 1297.074133]        __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
        [ 1297.074498]        btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
        [ 1297.074872]        btrfs_ioctl+0x1a90/0x36f0 [btrfs]
        [ 1297.075245]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.075617]        do_syscall_64+0x33/0x80
        [ 1297.075993]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.076380]
      		 -> #0 (sb_internal#2){.+.+}-{0:0}:
        [ 1297.077166]        check_prev_add+0x91/0xc60
        [ 1297.077572]        __lock_acquire+0x1740/0x3110
        [ 1297.077984]        lock_acquire+0xd8/0x490
        [ 1297.078411]        start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.078853]        btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.079323]        btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.079789]        __x64_sys_ioctl+0x83/0xb0
        [ 1297.080232]        do_syscall_64+0x33/0x80
        [ 1297.080680]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.081139]
      		 other info that might help us debug this:
      
        [ 1297.082536]  Possible unsafe locking scenario:
      
        [ 1297.083510]        CPU0                    CPU1
        [ 1297.084005]        ----                    ----
        [ 1297.084500]   lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.084994]                                lock(sb_internal#2);
        [ 1297.085485]                                lock(&fs_info->qgroup_ioctl_lock);
        [ 1297.085974]   lock(sb_internal#2);
        [ 1297.086454]
      		  *** DEADLOCK ***
        [ 1297.087880] 3 locks held by btrfs/189080:
        [ 1297.088324]  #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs]
        [ 1297.088799]  #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.089284]  #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs]
        [ 1297.089771]
      		 stack backtrace:
        [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 1297.092123] Call Trace:
        [ 1297.092629]  dump_stack+0x8d/0xb5
        [ 1297.093115]  check_noncircular+0xff/0x110
        [ 1297.093596]  check_prev_add+0x91/0xc60
        [ 1297.094076]  ? kvm_clock_read+0x14/0x30
        [ 1297.094553]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.095029]  __lock_acquire+0x1740/0x3110
        [ 1297.095510]  lock_acquire+0xd8/0x490
        [ 1297.095993]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.096476]  start_transaction+0x3c5/0x760 [btrfs]
        [ 1297.096962]  ? btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097451]  btrfs_quota_enable+0xaf/0xa70 [btrfs]
        [ 1297.097941]  ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs]
        [ 1297.098429]  btrfs_ioctl+0x2c60/0x36f0 [btrfs]
        [ 1297.098904]  ? do_user_addr_fault+0x20c/0x430
        [ 1297.099382]  ? kvm_clock_read+0x14/0x30
        [ 1297.099854]  ? kvm_sched_clock_read+0x5/0x10
        [ 1297.100328]  ? sched_clock+0x5/0x10
        [ 1297.100801]  ? sched_clock_cpu+0x12/0x180
        [ 1297.101272]  ? __x64_sys_ioctl+0x83/0xb0
        [ 1297.101739]  __x64_sys_ioctl+0x83/0xb0
        [ 1297.102207]  do_syscall_64+0x33/0x80
        [ 1297.102673]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 1297.103148] RIP: 0033:0x7f773ff65d87
      
      This is because during the quota enable ioctl we first lock the mutex
      qgroup_ioctl_lock and then start a transaction, and starting a transaction
      acquires a fs freeze semaphore (at the VFS level). However, in every other
      code path, except for the quota disable ioctl path, we do the opposite:
      we start a transaction and then lock the mutex.
      
      So fix this by making the quota enable and disable paths start the
      transaction without having the mutex locked, and then, after starting the
      transaction, lock the mutex and check if some other task already enabled
      or disabled the quotas, bailing with success if that was the case.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a855fbe6
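
The ordering rule from the commit above can be modelled in user space. The sketch below is an illustration only, not the btrfs code: `trans_lock` is a hypothetical stand-in for the freeze protection taken when a transaction starts (simplified here to a plain mutex), `ioctl_lock` stands in for qgroup_ioctl_lock, and `quota_set()` is an invented helper. It takes the locks in the same order as every other path and rechecks the state once the mutex is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel objects. */
static pthread_mutex_t trans_lock = PTHREAD_MUTEX_INITIALIZER; /* "start transaction" */
static pthread_mutex_t ioctl_lock = PTHREAD_MUTEX_INITIALIZER; /* qgroup_ioctl_lock model */
static bool quota_enabled;

/*
 * Enable or disable quotas in the model.
 * Returns 0 on success; *did_work tells whether this call changed the state.
 */
static int quota_set(bool enable, bool *did_work)
{
	assert(did_work != NULL);
	*did_work = false;

	/* Step 1: "start the transaction" first, without the mutex held. */
	pthread_mutex_lock(&trans_lock);
	/* Step 2: only now take the mutex, matching the order of other paths. */
	pthread_mutex_lock(&ioctl_lock);

	/* Step 3: another task may have done the work already; bail with success. */
	if (quota_enabled != enable) {
		quota_enabled = enable;
		*did_work = true;
	}

	pthread_mutex_unlock(&ioctl_lock);
	pthread_mutex_unlock(&trans_lock);
	return 0;
}
```

Calling `quota_set(true, &did)` twice succeeds both times, but only the first call reports that it changed anything, which mirrors the "bail with success" behaviour the message describes.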
    • F
      btrfs: do nofs allocations when adding and removing qgroup relations · 7aa6d359
      Filipe Manana committed
      When adding or removing a qgroup relation we are doing a GFP_KERNEL
      allocation which is not safe because we are holding a transaction
      handle open and that can make us deadlock if the allocator needs to
      recurse into the filesystem. So just surround those calls with a
      nofs context.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7aa6d359
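
A nofs context is a scoped flag: while it is set, allocations made by the task must not recurse into the filesystem. In the kernel this is usually done with memalloc_nofs_save() and memalloc_nofs_restore(); the message above does not say which calls this commit wraps. The user-space model below only illustrates the save/restore pattern, and every name in it (`MODEL_NOFS`, `model_nofs_save()`, `model_nofs_restore()`, `alloc_may_enter_fs()`) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NOFS 0x1u

/* Models allocation-context flags; in the kernel this state is per task. */
static unsigned int task_flags;

/* Enter a nofs scope and return the previous state of the flag. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_NOFS;

	task_flags |= MODEL_NOFS;
	return old;
}

/* Leave the scope by restoring the state returned by model_nofs_save(). */
static void model_nofs_restore(unsigned int old)
{
	assert((old & ~MODEL_NOFS) == 0);
	task_flags = (task_flags & ~MODEL_NOFS) | old;
}

/* An allocation may recurse into the filesystem only outside a nofs scope. */
static bool alloc_may_enter_fs(void)
{
	return !(task_flags & MODEL_NOFS);
}
```

Returning the old state rather than simply clearing the flag is what makes the scopes nest: an inner restore leaves an outer nofs scope intact.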
    • F
      btrfs: fix lockdep splat when reading qgroup config on mount · 3d05cad3
      Filipe Manana committed
      Lockdep reported the following splat when running test btrfs/190 from
      fstests:
      
        [ 9482.126098] ======================================================
        [ 9482.126184] WARNING: possible circular locking dependency detected
        [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted
        [ 9482.126365] ------------------------------------------------------
        [ 9482.126456] mount/24187 is trying to acquire lock:
        [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.126647]
      		 but task is already holding lock:
        [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.126886]
      		 which lock already depends on the new lock.
      
        [ 9482.127078]
      		 the existing dependency chain (in reverse order) is:
        [ 9482.127213]
      		 -> #1 (btrfs-quota-00){++++}-{3:3}:
        [ 9482.127366]        lock_acquire+0xd8/0x490
        [ 9482.127436]        down_read_nested+0x45/0x220
        [ 9482.127528]        __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.127613]        btrfs_read_lock_root_node+0x41/0x130 [btrfs]
        [ 9482.127702]        btrfs_search_slot+0x514/0xc30 [btrfs]
        [ 9482.127788]        update_qgroup_status_item+0x72/0x140 [btrfs]
        [ 9482.127877]        btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs]
        [ 9482.127964]        btrfs_work_helper+0xf1/0x600 [btrfs]
        [ 9482.128039]        process_one_work+0x24e/0x5e0
        [ 9482.128110]        worker_thread+0x50/0x3b0
        [ 9482.128181]        kthread+0x153/0x170
        [ 9482.128256]        ret_from_fork+0x22/0x30
        [ 9482.128327]
      		 -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}:
        [ 9482.128464]        check_prev_add+0x91/0xc60
        [ 9482.128551]        __lock_acquire+0x1740/0x3110
        [ 9482.128623]        lock_acquire+0xd8/0x490
        [ 9482.130029]        __mutex_lock+0xa3/0xb30
        [ 9482.130590]        qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.131577]        btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.132175]        open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.132756]        btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.133325]        legacy_get_tree+0x30/0x60
        [ 9482.133866]        vfs_get_tree+0x28/0xe0
        [ 9482.134392]        fc_mount+0xe/0x40
        [ 9482.134908]        vfs_kern_mount.part.0+0x71/0x90
        [ 9482.135428]        btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.135942]        legacy_get_tree+0x30/0x60
        [ 9482.136444]        vfs_get_tree+0x28/0xe0
        [ 9482.136949]        path_mount+0x2d7/0xa70
        [ 9482.137438]        do_mount+0x75/0x90
        [ 9482.137923]        __x64_sys_mount+0x8e/0xd0
        [ 9482.138400]        do_syscall_64+0x33/0x80
        [ 9482.138873]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.139346]
      		 other info that might help us debug this:
      
        [ 9482.140735]  Possible unsafe locking scenario:
      
        [ 9482.141594]        CPU0                    CPU1
        [ 9482.142011]        ----                    ----
        [ 9482.142411]   lock(btrfs-quota-00);
        [ 9482.142806]                                lock(&fs_info->qgroup_rescan_lock);
        [ 9482.143216]                                lock(btrfs-quota-00);
        [ 9482.143629]   lock(&fs_info->qgroup_rescan_lock);
        [ 9482.144056]
      		  *** DEADLOCK ***
      
        [ 9482.145242] 2 locks held by mount/24187:
        [ 9482.145637]  #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400
        [ 9482.146061]  #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.146509]
      		 stack backtrace:
        [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1
        [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        [ 9482.148709] Call Trace:
        [ 9482.149169]  dump_stack+0x8d/0xb5
        [ 9482.149628]  check_noncircular+0xff/0x110
        [ 9482.150090]  check_prev_add+0x91/0xc60
        [ 9482.150561]  ? kvm_clock_read+0x14/0x30
        [ 9482.151017]  ? kvm_sched_clock_read+0x5/0x10
        [ 9482.151470]  __lock_acquire+0x1740/0x3110
        [ 9482.151941]  ? __btrfs_tree_read_lock+0x27/0x120 [btrfs]
        [ 9482.152402]  lock_acquire+0xd8/0x490
        [ 9482.152887]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.153354]  __mutex_lock+0xa3/0xb30
        [ 9482.153826]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154301]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.154768]  ? qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155226]  qgroup_rescan_init+0x43/0xf0 [btrfs]
        [ 9482.155690]  btrfs_read_qgroup_config+0x43a/0x550 [btrfs]
        [ 9482.156160]  open_ctree+0x1228/0x18a0 [btrfs]
        [ 9482.156643]  btrfs_mount_root.cold+0x13/0xed [btrfs]
        [ 9482.157108]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.157567]  ? kfree+0x31f/0x3e0
        [ 9482.158030]  legacy_get_tree+0x30/0x60
        [ 9482.158489]  vfs_get_tree+0x28/0xe0
        [ 9482.158947]  fc_mount+0xe/0x40
        [ 9482.159403]  vfs_kern_mount.part.0+0x71/0x90
        [ 9482.159875]  btrfs_mount+0x13b/0x3e0 [btrfs]
        [ 9482.160335]  ? rcu_read_lock_sched_held+0x5d/0x90
        [ 9482.160805]  ? kfree+0x31f/0x3e0
        [ 9482.161260]  ? legacy_get_tree+0x30/0x60
        [ 9482.161714]  legacy_get_tree+0x30/0x60
        [ 9482.162166]  vfs_get_tree+0x28/0xe0
        [ 9482.162616]  path_mount+0x2d7/0xa70
        [ 9482.163070]  do_mount+0x75/0x90
        [ 9482.163525]  __x64_sys_mount+0x8e/0xd0
        [ 9482.163986]  do_syscall_64+0x33/0x80
        [ 9482.164437]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [ 9482.164902] RIP: 0033:0x7f51e907caaa
      
      This happens because at btrfs_read_qgroup_config() we can call
      qgroup_rescan_init() while holding a read lock on a quota btree leaf,
      acquired by the previous call to btrfs_search_slot_for_read(), and
      qgroup_rescan_init() acquires the mutex qgroup_rescan_lock.
      
      A qgroup rescan worker does the opposite: it acquires the mutex
      qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to
      update the qgroup status item in the quota btree through the call to
      update_qgroup_status_item(). This inversion of locking order
      between the qgroup_rescan_lock mutex and quota btree locks causes the
      splat.
      
      Fix this simply by releasing and freeing the path before calling
      qgroup_rescan_init() at btrfs_read_qgroup_config().
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3d05cad3
    • D
      btrfs: tree-checker: add missing returns after data_ref alignment checks · 6d06b0ad
      David Sterba committed
      There are sectorsize alignment checks that are reported but then
      check_extent_data_ref continues. This was not intended, wrong alignment
      is not a minor problem and we should return with error.
      
      CC: stable@vger.kernel.org # 5.4+
      Fixes: 0785a9aa ("btrfs: tree-checker: Add EXTENT_DATA_REF check")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d06b0ad
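
The class of bug fixed above (a check that reports a problem but then falls through) can be shown with a small stand-alone sketch. This is not the tree-checker code: `check_data_ref_alignment()`, its parameters, and the use of -EINVAL as the error value are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical validator: both values must be multiples of sectorsize.
 * Each failed check returns immediately instead of only being reported.
 */
static int check_data_ref_alignment(uint64_t bytenr, uint64_t ref_offset,
				    uint32_t sectorsize)
{
	assert(sectorsize != 0);

	if (bytenr % sectorsize != 0)
		return -EINVAL;	/* the missing kind of return: stop here */

	if (ref_offset % sectorsize != 0)
		return -EINVAL;	/* same for the second alignment check */

	return 0;
}
```

With a 4096-byte sector size, `check_data_ref_alignment(4096, 8192, 4096)` passes, while a misaligned value in either position is rejected before any later check runs.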
    • J
      btrfs: don't access possibly stale fs_info data for printing duplicate device · 0697d9a6
      Johannes Thumshirn committed
      Syzbot reported a possible use-after-free when printing a duplicate device
      warning in device_list_add().
      
      At this point it can happen that a btrfs_device::fs_info is not correctly
      set up yet, so we're accessing stale data when printing the warning
      message using the btrfs_printk() wrappers.
      
        ==================================================================
        BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
        Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068
      
        CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0x1d6/0x29e lib/dump_stack.c:118
         print_address_description+0x66/0x620 mm/kasan/report.c:383
         __kasan_report mm/kasan/report.c:513 [inline]
         kasan_report+0x132/0x1d0 mm/kasan/report.c:530
         btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245
         device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943
         btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359
         btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x44840a
        RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5
        RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a
        RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630
        RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a
        R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003
      
        Allocated by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track mm/kasan/common.c:56 [inline]
         __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
         kmalloc_node include/linux/slab.h:577 [inline]
         kvmalloc_node+0x81/0x110 mm/util.c:574
         kvmalloc include/linux/mm.h:757 [inline]
         kvzalloc include/linux/mm.h:765 [inline]
         btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 6945:
         kasan_save_stack mm/kasan/common.c:48 [inline]
         kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
         kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
         __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
         __cache_free mm/slab.c:3418 [inline]
         kfree+0x113/0x200 mm/slab.c:3756
         deactivate_locked_super+0xa7/0xf0 fs/super.c:335
         btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         fc_mount fs/namespace.c:978 [inline]
         vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008
         btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732
         legacy_get_tree+0xea/0x180 fs/fs_context.c:592
         vfs_get_tree+0x88/0x270 fs/super.c:1547
         do_new_mount fs/namespace.c:2875 [inline]
         path_mount+0x179d/0x29e0 fs/namespace.c:3192
         do_mount fs/namespace.c:3205 [inline]
         __do_sys_mount fs/namespace.c:3413 [inline]
         __se_sys_mount+0x126/0x180 fs/namespace.c:3390
         do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        The buggy address belongs to the object at ffff8880878e0000
         which belongs to the cache kmalloc-16k of size 16384
        The buggy address is located 1704 bytes inside of
         16384-byte region [ffff8880878e0000, ffff8880878e4000)
        The buggy address belongs to the page:
        page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0
        head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0
        flags: 0xfffe0000010200(slab|head)
        raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00
        raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      				    ^
         ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
      
      The syzkaller reproducer for this use-after-free crafts a filesystem image
      and loop mounts it twice in a loop. The mount will fail as the crafted
      image has an invalid chunk tree. When this happens btrfs_mount_root() will
      call deactivate_locked_super(), which then cleans up fs_info and
      fs_info::sb. If a second thread now adds the same block-device to the
      filesystem, it will get detected as a duplicate device and
      device_list_add() will reject the duplicate and print a warning. But as
      the fs_info pointer passed in is non-NULL this will result in a
      use-after-free.
      
      Instead of printing possibly uninitialized or already freed memory in
      btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the
      device name will be skipped altogether.
      
      There was a slightly different approach discussed in
      https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u
      
      Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com
      Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0697d9a6
  2. 23 Nov, 2020 2 commits
    • D
      afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells committed
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
    • Y
      libfs: fix error cast of negative value in simple_attr_write() · 488dac0c
      Yicong Yang committed
      attr->set() receives a u64 value, but simple_strtoll() is used for
      the conversion.  This leads to a wrong cast if the user inputs a
      negative value.
      
      Use kstrtoull() instead of simple_strtoll() to convert a string obtained
      from the user to an unsigned value.  The former will return '-EINVAL' if it
      gets a negative value, but the latter can't handle the situation
      correctly.  Make 'val' unsigned long long, which is what kstrtoull() takes;
      this will eliminate the compile warning on non-64-bit architectures.
      
      Fixes: f7b88631 ("fs/libfs.c: fix simple_attr_write() on 32bit machines")
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lkml.kernel.org/r/1605341356-11872-1-git-send-email-yangyicong@hisilicon.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      488dac0c
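
The problem is easy to reproduce in user space. The sketch below is an analogy, not the kernel code: `parse_signed_then_cast()` mimics the old behaviour using the C library's strtoll(), and `parse_u64_strict()` is a hypothetical strict parser that, like the behaviour the message attributes to kstrtoull(), rejects negative input with -EINVAL.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Old pattern: parse as signed, then hand the result to a u64 consumer. */
static uint64_t parse_signed_then_cast(const char *s)
{
	return (uint64_t)strtoll(s, NULL, 0);
}

/*
 * Strict unsigned parse: optional '+', digits, optional single trailing
 * newline. Returns 0 on success, -EINVAL or -ERANGE on failure.
 */
static int parse_u64_strict(const char *s, uint64_t *out)
{
	char *end;
	unsigned long long v;

	assert(s != NULL && out != NULL);

	if (*s == '+')
		s++;
	if (!isdigit((unsigned char)*s))
		return -EINVAL;	/* rejects "-1", empty input, stray text */

	errno = 0;
	v = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*out = v;
	return 0;
}
```

With the old pattern the string "-1" silently becomes the largest 64-bit value; the strict parser refuses it, so the setter never sees a bogus number.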
  3. 20 Nov, 2020 5 commits
    • J
      ext4: fix bogus warning in ext4_update_dx_flag() · f902b216
      Jan Kara committed
      The idea of the warning in ext4_update_dx_flag() is that we should warn
      when we are clearing EXT4_INODE_INDEX on a filesystem with metadata
      checksums enabled since after clearing the flag, checksums for internal
      htree nodes will become invalid. So there's no need to warn (or actually
      do anything) when EXT4_INODE_INDEX is not set.
      
      Link: https://lore.kernel.org/r/20201118153032.17281-1-jack@suse.cz
      Fixes: 48a34311 ("ext4: fix checksum errors with indexed dirs")
      Reported-by: Eric Biggers <ebiggers@kernel.org>
      Reviewed-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      f902b216
    • M
      jbd2: fix kernel-doc markups · 2bf31d94
      Mauro Carvalho Chehab committed
      Kernel-doc markup should use this format:
              identifier - description
      
      They should not have any type before that, as otherwise
      the parser won't do the right thing.
      
      Also, some identifiers have different names between their
      prototypes and the kernel-doc markup.
      
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Link: https://lore.kernel.org/r/72f5c6628f5f278d67625f60893ffbc2ca28d46e.1605521731.git.mchehab+huawei@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      2bf31d94
    • D
      xfs: revert "xfs: fix rmap key and record comparison functions" · eb840907
      Darrick J. Wong committed
      This reverts commit 6ff646b2.
      
      Your maintainer committed a major braino in the rmap code by adding the
      attr fork, bmbt, and unwritten extent usage bits into rmap record key
      comparisons.  While XFS uses the usage bits *in the rmap records* for
      cross-referencing metadata in xfs_scrub and xfs_repair, it only needs
      the owner and offset information to distinguish between reverse mappings
      of the same physical extent into the data fork of a file at multiple
      offsets.  The other bits are not important for key comparisons for index
      lookups, and never have been.
      
      Eric Sandeen reports that this causes regressions in generic/299, so
      undo this patch before it does more damage.
      
      Reported-by: Eric Sandeen <sandeen@sandeen.net>
      Fixes: 6ff646b2 ("xfs: fix rmap key and record comparison functions")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      eb840907
    • T
      ext4: drop fast_commit from /proc/mounts · 704c2317
      Theodore Ts'o committed
      The options in /proc/mounts must be valid mount options --- and
      fast_commit is not a mount option.  Otherwise, command sequences like
      this will fail:
      
          # mount /dev/vdc /vdc
          # mkdir -p /vdc/phoronix_test_suite /pts
          # mount --bind /vdc/phoronix_test_suite /pts
          # mount -o remount,nodioread_nolock /pts
          mount: /pts: mount point not mounted or bad option.
      
      And in the system logs, you'll find:
      
          EXT4-fs (vdc): Unrecognized mount option "fast_commit" or missing value
      
      Fixes: 995a3ed6 ("ext4: add fast_commit feature and handling for extended mount options")
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      704c2317
    • D
      xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Dave Chinner committed
      Jens has reported a situation where partial direct IOs can be issued
      and completed yet still return -EAGAIN. We don't want this to report
      a short IO as we want XFS to complete user DIO entirely or not at
      all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN because we hit the allocate case trying
      to make the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappings under NOWAIT
      conditions, e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
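
A minimal user-space sketch of one reading of the rule above: a NOWAIT IO goes ahead only if the single mapping retrieved for it is allocated and covers the whole requested range. The struct and function names are hypothetical and are not the kernel's iomap or XFS interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an extent mapping (not struct iomap). */
struct sketch_mapping {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* number of bytes the mapping covers */
	bool allocated;		/* true if backed by already-allocated blocks */
};

/*
 * Return 0 if a NOWAIT IO over [pos, pos + count) can be served entirely by
 * this one mapping, or -EAGAIN if a second mapping call would be needed.
 */
int sketch_nowait_check(const struct sketch_mapping *m, uint64_t pos,
			uint64_t count)
{
	assert(count > 0);

	if (!m->allocated)
		return -EAGAIN;		/* allocation would be required */
	if (pos < m->offset)
		return -EAGAIN;		/* mapping starts after the IO */
	if (pos + count > m->offset + m->length)
		return -EAGAIN;		/* mapping ends before the IO does */
	return 0;
}
```

With the values from the trace above (a 4096-byte mapping at offset 0 and an 8192-byte write), the check fails up front instead of after the first 4kB has been issued.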
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      883a790a
  4. 19 Nov, 2020 6 commits
  5. 18 Nov, 2020 4 commits
    • B
      gfs2: Fix regression in freeze_go_sync · 20b32912
      Bob Peterson committed
      Patch 541656d3 ("gfs2: freeze should work on read-only mounts") changed
      the check for glock state in function freeze_go_sync() from "gl->gl_state
      == LM_ST_SHARED" to "gl->gl_req == LM_ST_EXCLUSIVE".  That's wrong and it
      regressed gfs2's freeze/thaw mechanism because it caused only the freezing
      node (which requests the glock in EX) to queue freeze work.
      
      All nodes go through this go_sync code path during the freeze to drop their
      SHared hold on the freeze glock, allowing the freezing node to acquire it
      in EXclusive mode. But all the nodes must freeze access to the file system
      locally, so they ALL must queue freeze work. The freeze_work calls
      freeze_func, which makes a request to reacquire the freeze glock in SH,
      effectively blocking until the thaw from the EX holder. Once thawed, the
      freezing node drops its EX hold on the freeze glock, then the (blocked)
      freeze_func reacquires the freeze glock in SH again (on all nodes, including
      the freezer) so all nodes go back to a thawed state.
      
      This patch changes the check back to gl_state == LM_ST_SHARED like it was
      prior to 541656d3.
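
The difference between the two checks can be modelled in a few lines. This is a simplified sketch with hypothetical names (the enum values stand in for LM_ST_UNLOCKED, LM_ST_SHARED and LM_ST_EXCLUSIVE); it is not the gfs2 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock states standing in for the LM_ST_* values. */
enum sketch_state { SKETCH_UNLOCKED, SKETCH_SHARED, SKETCH_EXCLUSIVE };

/* Simplified glock: the state currently held and the state being requested. */
struct sketch_glock {
	enum sketch_state gl_state;
	enum sketch_state gl_req;
};

/* Regressed check: true only on the node that is requesting EX. */
bool sketch_queue_freeze_regressed(const struct sketch_glock *gl)
{
	return gl->gl_req == SKETCH_EXCLUSIVE;
}

/* Restored check: true on every node that currently holds SH. */
bool sketch_queue_freeze_restored(const struct sketch_glock *gl)
{
	return gl->gl_state == SKETCH_SHARED;
}
```

For a node that holds SH and is merely asked to drop it, the regressed check returns false, so that node would never queue freeze work; the restored check returns true on all nodes.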
      
      Fixes: 541656d3 ("gfs2: freeze should work on read-only mounts")
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      20b32912
    • P
      io_uring: order refnode recycling · e297822b
      Pavel Begunkov committed
      Don't recycle a refnode until we're done with all requests of nodes
      ejected before.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e297822b
    • P
      io_uring: get an active ref_node from files_data · 1e5d770b
      Pavel Begunkov committed
      An active ref_node can always be found in ctx->files_data; it's much
      safer to get it this way instead of poking into files_data->ref_list.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org # v5.7+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e5d770b
    • J
      io_uring: don't double complete failed reissue request · c993df5a
      Jens Axboe committed
      Zorro reports that an xfstest test case is failing, and it turns out that
      for the reissue path we can potentially issue a double completion on the
      request for the failure path. There's an issue around the retry as well,
      but for now, at least just make sure that we handle the error path
      correctly.
      
      Cc: stable@vger.kernel.org
      Fixes: b63534c4 ("io_uring: re-issue block requests that failed because of resources")
      Reported-by: Zorro Lang <zlang@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c993df5a
  6. 16 Nov, 2020 4 commits
    • R
      smb3: Handle error case during offload read path · 12541000
      Rohith Surabattula committed
      Mid callback needs to be called only when valid data is
      read into pages.
      
      These patches address a problem found during decryption offload:
            CIFS: VFS: trying to dequeue a deleted mid
      that could cause a refcount use after free:
            Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      12541000
    • R
      smb3: Avoid Mid pending list corruption · ac873aa3
      Rohith Surabattula committed
      When a reconnect happens, the Mid queue can be corrupted if both the
      demultiplex and offload threads try to dequeue the MID from the
      pending list.
      
      These patches address a problem found during decryption offload:
               CIFS: VFS: trying to dequeue a deleted mid
      that could cause a refcount use after free:
               Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      ac873aa3
    • R
      smb3: Call cifs reconnect from demultiplex thread · de9ac0a6
      Rohith Surabattula committed
      cifs_reconnect needs to be called only from the demultiplex thread.
      Skip cifs_reconnect in the offload thread, so that cifs_reconnect will be
      called by the demultiplex thread in a subsequent request.
      
      These patches address a problem found during decryption offload:
           CIFS: VFS: trying to dequeue a deleted mid
      that can cause a refcount use after free:
      
      [ 1271.389453] Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
      [ 1271.389456] RIP: 0010:refcount_warn_saturate+0xae/0xf0
      [ 1271.389457] Code: fa 1d 6a 01 01 e8 c7 44 b1 ff 0f 0b 5d c3 80 3d e7 1d 6a 01 00 75 91 48 c7 c7 d8 be 1d a2 c6 05 d7 1d 6a 01 01 e8 a7 44 b1 ff <0f> 0b 5d c3 80 3d c5 1d 6a 01 00 0f 85 6d ff ff ff 48 c7 c7 30 bf
      [ 1271.389458] RSP: 0018:ffffa4cdc1f87e30 EFLAGS: 00010286
      [ 1271.389458] RAX: 0000000000000000 RBX: ffff9974d2809f00 RCX: ffff9974df898cc8
      [ 1271.389459] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9974df898cc0
      [ 1271.389460] RBP: ffffa4cdc1f87e30 R08: 0000000000000004 R09: 00000000000002c0
      [ 1271.389460] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9974b7fdb5c0
      [ 1271.389461] R13: ffff9974d2809f00 R14: ffff9974ccea0a80 R15: ffff99748e60db80
      [ 1271.389462] FS:  0000000000000000(0000) GS:ffff9974df880000(0000) knlGS:0000000000000000
      [ 1271.389462] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1271.389463] CR2: 000055c60f344fe4 CR3: 0000001031a3c002 CR4: 00000000003706e0
      [ 1271.389465] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1271.389465] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1271.389466] Call Trace:
      [ 1271.389483]  cifs_mid_q_entry_release+0xce/0x110 [cifs]
      [ 1271.389499]  smb2_decrypt_offload+0xa9/0x1c0 [cifs]
      [ 1271.389501]  process_one_work+0x1e8/0x3b0
      [ 1271.389503]  worker_thread+0x50/0x370
      [ 1271.389504]  kthread+0x12f/0x150
      [ 1271.389506]  ? process_one_work+0x3b0/0x3b0
      [ 1271.389507]  ? __kthread_bind_mask+0x70/0x70
      [ 1271.389509]  ret_from_fork+0x22/0x30
      Signed-off-by: Rohith Surabattula <rohiths@microsoft.com>
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      CC: Stable <stable@vger.kernel.org> #5.4+
      Signed-off-by: Steve French <stfrench@microsoft.com>
      de9ac0a6
    • N
      cifs: fix a memleak with modefromsid · 98128572
      Namjae Jeon committed
      kmemleak reported a memory leak allocated in query_info() when cifs is
      working with modefromsid.
      
        backtrace:
          [<00000000aeef6a1e>] slab_post_alloc_hook+0x58/0x510
          [<00000000b2f7a440>] __kmalloc+0x1a0/0x390
          [<000000006d470ebc>] query_info+0x5b5/0x700 [cifs]
          [<00000000bad76ce0>] SMB2_query_acl+0x2b/0x30 [cifs]
          [<000000001fa09606>] get_smb2_acl_by_path+0x2f3/0x720 [cifs]
          [<000000001b6ebab7>] get_smb2_acl+0x75/0x90 [cifs]
          [<00000000abf43904>] cifs_acl_to_fattr+0x13b/0x1d0 [cifs]
          [<00000000a5372ec3>] cifs_get_inode_info+0x4cd/0x9a0 [cifs]
          [<00000000388e0a04>] cifs_revalidate_dentry_attr+0x1cd/0x510 [cifs]
          [<0000000046b6b352>] cifs_getattr+0x8a/0x260 [cifs]
          [<000000007692c95e>] vfs_getattr_nosec+0xa1/0xc0
          [<00000000cbc7d742>] vfs_getattr+0x36/0x40
          [<00000000de8acf67>] vfs_statx_fd+0x4a/0x80
          [<00000000a58c6adb>] __do_sys_newfstat+0x31/0x70
          [<00000000300b3b4e>] __x64_sys_newfstat+0x16/0x20
          [<000000006d8e9c48>] do_syscall_64+0x37/0x80
      
      This patch adds the missing kfree() for pntsd when mounting with the
      modefromsid option.
      
      Cc: Stable <stable@vger.kernel.org> # v5.4+
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      98128572
  7. 15 Nov, 2020 3 commits
    • D
      afs: Fix afs_write_end() when called with copied == 0 [ver #3] · 3ad216ee
      David Howells committed
      When afs_write_end() is called with copied == 0, it tries to set the
      dirty region, but there's no way to actually encode a 0-length region in
      the encoding in page->private.
      
      "0,0", for example, indicates a 1-byte region at offset 0.  The maths
      miscalculates this and sets it incorrectly.
      
      Fix it to just do nothing but unlock and put the page in this case.  We
      don't actually need to mark the page dirty as nothing presumably
      changed.
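
To see why a 0-length region has no encoding, here is a small sketch that assumes the end of the range is stored as "to - 1" next to the start offset, which is consistent with "0,0" meaning one byte at offset 0. The shift, mask and function names are hypothetical and do not reproduce the afs helpers.

```c
#include <assert.h>

#define SKETCH_SHIFT	16		/* hypothetical field width */
#define SKETCH_MASK	0xffffUL

/* Pack a dirty range [from, to) with the end stored as to - 1. */
unsigned long sketch_encode(unsigned int from, unsigned int to)
{
	/* An empty range (to == from) would need to - 1 < from: not encodable. */
	assert(to > from);
	assert(from <= SKETCH_MASK && to - 1 <= SKETCH_MASK);
	return ((unsigned long)(to - 1) << SKETCH_SHIFT) | from;
}

/* Start offset of the encoded range. */
unsigned int sketch_from(unsigned long priv)
{
	return (unsigned int)(priv & SKETCH_MASK);
}

/* Exclusive end of the encoded range: the stored value plus one. */
unsigned int sketch_to(unsigned long priv)
{
	return (unsigned int)((priv >> SKETCH_SHIFT) & SKETCH_MASK) + 1;
}
```

Decoding the all-zero value gives from = 0 and to = 1, so the smallest representable region is one byte long, and a caller with copied == 0 has nothing valid to store.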
      
      Fixes: 65dd2d60 ("afs: Alter dirty range encoding in page->private")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad216ee
    • W
      ocfs2: initialize ip_next_orphan · f5785283
      Wengang Wang committed
      Though the problem was found on an older 4.1.12 kernel, I think upstream
      has the same issue.
      
      On one node in the cluster, there is the following call trace:
      
         # cat /proc/21473/stack
         __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
         ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
         ocfs2_evict_inode+0x152/0x820 [ocfs2]
         evict+0xae/0x1a0
         iput+0x1c6/0x230
         ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
         ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
         ocfs2_dir_foreach+0x29/0x30 [ocfs2]
         ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
         ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
         process_one_work+0x169/0x4a0
         worker_thread+0x5b/0x560
         kthread+0xcb/0xf0
         ret_from_fork+0x61/0x90
      
      The above stack is not reasonable, the final iput shouldn't happen in
      ocfs2_orphan_filldir() function.  Looking at the code,
      
        2067         /* Skip inodes which are already added to recover list, since dio may
        2068          * happen concurrently with unlink/rename */
        2069         if (OCFS2_I(iter)->ip_next_orphan) {
        2070                 iput(iter);
        2071                 return 0;
        2072         }
        2073
      
      The logic thinks the inode is already in the recover list on seeing that
      ip_next_orphan is non-NULL, so it skips this inode after dropping the
      reference which was incremented in ocfs2_iget().

      However, if the inode is already in the recover list, it should have
      another reference and the iput() at line 2070 should not be the final
      iput (dropping the last reference).  So I don't think the inode is really
      in the recover list (no vmcore to confirm).

      Note that ocfs2_queue_orphans(), though not shown in the call trace, is
      holding the cluster lock on the orphan directory when looking up
      unlinked inodes.  The on-disk inode eviction could involve a lot of
      IOs which may need a long time to finish.  That means this node could
      hold the cluster lock for a very long time, which can cause lock requests
      (from other nodes) on the orphan directory to hang for a long time.

      Looking further at ip_next_orphan, I found it's not initialized when
      allocating a new ocfs2_inode_info structure.

      This causes the reflink operations from some nodes to hang for a very
      long time waiting for the cluster lock on the orphan directory.
      
      Fix: initialize ip_next_orphan as NULL.
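
The effect of the missing initialization can be sketched in user space, assuming an allocator that hands back recycled, non-zeroed memory. All names here are hypothetical stand-ins, not the ocfs2 structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-inode info that carries the list link. */
struct sketch_inode_info {
	struct sketch_inode_info *ip_next_orphan;
};

/* Simulate an object handed back by an allocator that does not zero memory. */
void sketch_simulate_recycled(struct sketch_inode_info *oi)
{
	assert(oi != NULL);
	memset(oi, 0xff, sizeof(*oi));
}

/* The fix: explicitly clear the link when the object is set up. */
void sketch_inode_info_init(struct sketch_inode_info *oi)
{
	assert(oi != NULL);
	oi->ip_next_orphan = NULL;
}

/* The check from the listing above: non-NULL is read as "already queued". */
bool sketch_already_queued(const struct sketch_inode_info *oi)
{
	return oi->ip_next_orphan != NULL;
}
```

Without the init call, stale bytes in the link make the check report "already queued" for an inode that was never added to the recover list.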
      Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201109171746.27884-1-wen.gang.wang@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5785283
    • J
      io_uring: handle -EOPNOTSUPP on path resolution · 944d1444
      Jens Axboe committed
      Any attempt to do path resolution on /proc/self from an async worker will
      yield -EOPNOTSUPP. We can safely do that resolution from the task itself,
      and without blocking, so retry it from there.
      
      Ideally io_uring would know this upfront and not have to go through the
      worker thread to find out, but that doesn't currently seem feasible.
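
The retry described above can be sketched as a fallback on one specific error code. The resolver below is a hypothetical stand-in that only mimics the stated behaviour (async worker context plus a /proc/self path yields -EOPNOTSUPP); it is not io_uring or procfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical resolver: refuses /proc/self lookups from an async worker. */
int sketch_resolve(const char *path, bool from_async_worker)
{
	assert(path != NULL);

	if (from_async_worker && strncmp(path, "/proc/self", 10) == 0)
		return -EOPNOTSUPP;
	return 0;
}

/* Try from the worker first; on -EOPNOTSUPP retry from the task itself. */
int sketch_resolve_with_retry(const char *path)
{
	int ret = sketch_resolve(path, true);

	if (ret == -EOPNOTSUPP)
		ret = sketch_resolve(path, false);
	return ret;
}
```

Only -EOPNOTSUPP triggers the second attempt; any other result from the worker is returned unchanged.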
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      944d1444
  8. 14 Nov, 2020 4 commits
    • J
      proc: don't allow async path resolution of /proc/self components · 8d4c3e76
      Jens Axboe committed
      If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
      currently support that. Once we can get task_pid_ptr() doing the right
      thing, then this can go away again.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d4c3e76
    • D
      btrfs: tree-checker: add missing return after error in root_item · 1a49a97d
      Daniel Xu committed
      There's a missing return statement after an error is found in the
      root_item; this can cause further problems when a crafted image triggers
      the error.
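
The general pattern is sketched below with a hypothetical fixed-size item check: once the size check fails, the function must return before any later code reads the item. The names, the item size and the second check are illustrative only and are not taken from the tree-checker.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ITEM_SIZE 8	/* hypothetical fixed on-disk item size */

/* Validate an item: the size first, then a field inside it. */
int sketch_check_item(const unsigned char *buf, size_t len)
{
	unsigned char item[SKETCH_ITEM_SIZE];

	assert(buf != NULL);

	if (len != SKETCH_ITEM_SIZE) {
		/*
		 * An error was found. Without this return, execution would
		 * fall through and copy SKETCH_ITEM_SIZE bytes from a buffer
		 * that may be shorter than that.
		 */
		return -EINVAL;
	}

	memcpy(item, buf, SKETCH_ITEM_SIZE);
	if (item[0] > 7)	/* hypothetical range check on the first field */
		return -EINVAL;
	return 0;
}
```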
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
      Fixes: 259ee775 ("btrfs: tree-checker: Add ROOT_ITEM check")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1a49a97d
    • Q
      btrfs: qgroup: don't commit transaction when we already hold the handle · 6f23277a
      Qu Wenruo committed
      [BUG]
      When running the following script, btrfs will trigger an ASSERT():
      
        #!/bin/bash
        mkfs.btrfs -f $dev
        mount $dev $mnt
        xfs_io -f -c "pwrite 0 1G" $mnt/file
        sync
        btrfs quota enable $mnt
        btrfs quota rescan -w $mnt
      
        # Manually set the limit below current usage
        btrfs qgroup limit 512M $mnt $mnt
      
        # Crash happens
        touch $mnt/file
      
      The dmesg looks like this:
      
        assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3230!
        invalid opcode: 0000 [#1] SMP PTI
        RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
         btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
         try_flush_qgroup+0x67/0x100 [btrfs]
         __btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
         btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
         btrfs_update_inode+0x9d/0x110 [btrfs]
         btrfs_dirty_inode+0x5d/0xd0 [btrfs]
         touch_atime+0xb5/0x100
         iterate_dir+0xf1/0x1b0
         __x64_sys_getdents64+0x78/0x110
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fb5afe588db
      
      [CAUSE]
      In try_flush_qgroup(), we assume we don't hold a transaction handle at
      all.  This is true for data reservation and mostly true for metadata,
      since data space reservation always happens before we start a
      transaction, and for most metadata operations we reserve space in
      start_transaction().
      
      But there is an exception, btrfs_delayed_inode_reserve_metadata().
      It holds a transaction handle, while still trying to reserve extra
      metadata space.
      
      When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
      will join the current transaction and commit it, while we still have
      a transaction handle from qgroup code.
      
      [FIX]
      Let's check current->journal before we join the transaction.
      
      If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
      we are not holding a transaction, thus are able to join and then commit
      transaction.
      
      If current->journal is a valid transaction handle, we avoid committing
      the transaction and just end it.
      
      This is less effective than committing current transaction, as it won't
      free metadata reserved space, but we may still free some data space
      before new data writes.
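
The decision in the [FIX] section reduces to a small predicate. This sketch uses a hypothetical task struct and stub value; it mirrors the wording above and is not the btrfs implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical marker value playing the role of BTRFS_SEND_TRANS_STUB. */
#define SKETCH_SEND_TRANS_STUB ((void *)1)

/* Hypothetical stand-in for the per-task journal pointer. */
struct sketch_task {
	void *journal;
};

/*
 * True if the task holds no transaction handle, so it is safe to join and
 * then commit. False means a handle is already held, and the transaction
 * should only be ended rather than committed.
 */
bool sketch_can_commit(const struct sketch_task *task)
{
	assert(task != NULL);
	return task->journal == NULL ||
	       task->journal == SKETCH_SEND_TRANS_STUB;
}
```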
      
      Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
      Fixes: c53e9653 ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f23277a
    • F
      btrfs: fix missing delalloc new bit for new delalloc ranges · c3347309
      Filipe Manana committed
      When doing a buffered write, through one of the write family syscalls, we
      look for ranges which currently don't have allocated extents and set the
      'delalloc new' bit on them, so that we can report a correct number of used
      blocks to the stat(2) syscall until delalloc is flushed and ordered extents
      complete.
      
      However, there are a few other places where we can do a buffered write
      against a range that is mapped to a hole (no extent allocated) and where
      we do not set the 'delalloc new' bit. Those places are:
      
      - Doing a memory mapped write against a hole;
      
      - Cloning an inline extent into a hole starting at file offset 0;
      
      - Calling btrfs_cont_expand() when the i_size of the file is not aligned
        to the sector size and is located in a hole. For example when cloning
        to a destination offset beyond EOF.
      
      So after such cases, until the corresponding delalloc range is flushed and
      the respective ordered extents complete, we can report an incorrect number
      of blocks used through the stat(2) syscall.
      
      In some cases we can end up reporting 0 used blocks to stat(2), which is a
      particularly bad value to report as it may mislead tools to think a file is
      completely sparse when its i_size is not zero, making them skip reading
      any data, an undesired consequence for tools such as archivers and other
      backup tools, as reported a long time ago in the following thread (and
      other past threads):
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      Example reproducer:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -c "truncate 64K"   \
            -c "mmap -w 0 64K"        \
            -c "mwrite -S 0xab 0 64K" \
            -c "munmap"               \
            $MNT/foo
      
        blocks_used=$(stat -c %b $MNT/foo)
        echo "blocks used: $blocks_used"
      
        if [ $blocks_used -eq 0 ]; then
            echo "ERROR: blocks used is 0"
        fi
      
        umount $DEV
      
        $ ./reproducer.sh
        blocks used: 0
        ERROR: blocks used is 0
      
      So move the logic that decides to set the 'delalloc new' bit into the
      function btrfs_set_extent_delalloc(), since that is what we use for all
      those missing cases as well as for the cases that currently work well.
      
      This change is also preparatory work for an upcoming patch that fixes
      other problems related to tracking and reporting the number of bytes used
      by an inode.
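
The kind of heuristic that gets misled can be sketched as follows. This is an assumed, simplified version of what an archiver might do with the stat(2) result; the function names are hypothetical and no specific tool's logic is reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* A non-empty file with zero allocated blocks looks like one big hole. */
bool sketch_looks_fully_sparse(off_t size, blkcnt_t blocks)
{
	assert(size >= 0 && blocks >= 0);
	return size > 0 && blocks == 0;
}

/* Returns 1 if the file looks fully sparse, 0 if not, -1 if stat() fails. */
int sketch_path_looks_fully_sparse(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	return sketch_looks_fully_sparse(st.st_size, st.st_blocks) ? 1 : 0;
}
```

For the 64K file from the reproducer, a report of 0 blocks makes this predicate true even though the file holds data that has not been flushed yet, so a tool relying on it could skip reading the contents.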
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3347309
  9. 13 Nov, 2020 2 commits
    • B
      gfs2: Fix case in which ail writes are done to jdata holes · 4e79e3f0
      Bob Peterson committed
      Patch b2a846db ("gfs2: Ignore journal log writes for jdata holes")
      tried (unsuccessfully) to fix a case in which writes were done to jdata
      blocks, the blocks were sent to the ail list, then a punch_hole or truncate
      operation caused the blocks to be freed. In other words, the ail items
      are for jdata holes. Before b2a846db, the jdata hole caused function
      gfs2_block_map to return -EIO, which was eventually interpreted as an
      IO error to the journal, followed by a withdraw.
      
      This patch changes function gfs2_get_block_noalloc, which is only used
      for jdata writes, so it returns -ENODATA rather than -EIO, and when
      -ENODATA is returned to gfs2_ail1_start_one, the error is ignored.
      We can safely ignore it because gfs2_ail1_start_one is only called
      when the jdata pages have already been written and truncated, so the
      ail1 content no longer applies.
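
The error handling described above can be modelled in two small functions. The names are hypothetical stand-ins for the roles of gfs2_get_block_noalloc and gfs2_ail1_start_one; this is a sketch of the control flow only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Block lookup that never allocates: a hole is reported as -ENODATA. */
int sketch_get_block_noalloc(bool block_is_mapped)
{
	return block_is_mapped ? 0 : -ENODATA;
}

/* Filter the result of an ail1 write: a jdata hole is not a failure. */
int sketch_ail1_write_result(int err)
{
	assert(err <= 0);

	if (err == -ENODATA)
		return 0;	/* pages were already written and truncated */
	return err;		/* real errors such as -EIO still propagate */
}
```

Keeping -ENODATA distinct from -EIO is what lets the caller tell a harmless hole apart from a genuine IO error that should lead to a withdraw.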
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      4e79e3f0
    • B
      Revert "gfs2: Ignore journal log writes for jdata holes" · d3039c06
      Bob Peterson committed
      This reverts commit b2a846db.
      
      That commit changed the behavior of function gfs2_block_map to return
      -ENODATA in cases where a hole (IOMAP_HOLE) is encountered and create is
      false.  While that fixed the intended problem for jdata, it also broke
      other callers of gfs2_block_map such as some jdata block reads.  Before
      the patch, an encountered hole would be skipped and the buffer seen as
      unmapped by the caller.  The patch changed the behavior to return
      -ENODATA, which is interpreted as an error by the caller.
      
      The -ENODATA return code should be restricted to the specific case where
      jdata holes are encountered during ail1 writes.  That will be done in a
      later patch.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      d3039c06
  10. 12 Nov, 2020 5 commits