1. 26 10月, 2020 1 次提交
    • F
      btrfs: fix readahead hang and use-after-free after removing a device · 66d204a1
      Filipe Manana 提交于
      Very sporadically I had test case btrfs/069 from fstests hanging (for
      years, it is not a recent regression), with the following traces in
      dmesg/syslog:
      
        [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
        [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
        [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
        [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
        [162513.514751] Call Trace:
        [162513.514761]  __schedule+0x5ce/0xd00
        [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.514771]  schedule+0x46/0xf0
        [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.514850]  ? finish_wait+0x90/0x90
        [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
        [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
        [162513.514894]  kthread+0x153/0x170
        [162513.514897]  ? kthread_stop+0x2c0/0x2c0
        [162513.514902]  ret_from_fork+0x22/0x30
        [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
        [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
        [162513.515682] Call Trace:
        [162513.515688]  __schedule+0x5ce/0xd00
        [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.515697]  schedule+0x46/0xf0
        [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.515716]  ? finish_wait+0x90/0x90
        [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.515761]  iterate_supers+0x87/0xf0
        [162513.515765]  ksys_sync+0x60/0xb0
        [162513.515768]  __do_sys_sync+0xa/0x10
        [162513.515771]  do_syscall_64+0x33/0x80
        [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.515781] RIP: 0033:0x7f5238f50bd7
        [162513.515782] Code: Bad RIP value.
        [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
        [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
        [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
        [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
        [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
        [162513.516620] Call Trace:
        [162513.516625]  __schedule+0x5ce/0xd00
        [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.516634]  schedule+0x46/0xf0
        [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.516650]  ? finish_wait+0x90/0x90
        [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
        [162513.516686]  __vfs_setxattr+0x66/0x80
        [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
        [162513.516697]  vfs_setxattr+0x6b/0x120
        [162513.516703]  setxattr+0x125/0x240
        [162513.516709]  ? lock_acquire+0xb1/0x480
        [162513.516712]  ? mnt_want_write+0x20/0x50
        [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
        [162513.516723]  ? preempt_count_add+0x49/0xa0
        [162513.516725]  ? __sb_start_write+0x19b/0x290
        [162513.516727]  ? preempt_count_add+0x49/0xa0
        [162513.516732]  path_setxattr+0xba/0xd0
        [162513.516739]  __x64_sys_setxattr+0x27/0x30
        [162513.516741]  do_syscall_64+0x33/0x80
        [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.516745] RIP: 0033:0x7f5238f56d5a
        [162513.516746] Code: Bad RIP value.
        [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
        [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
        [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
        [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
        [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
        [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
        [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
        [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
        [162513.517780] Call Trace:
        [162513.517786]  __schedule+0x5ce/0xd00
        [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.517796]  schedule+0x46/0xf0
        [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.517814]  ? finish_wait+0x90/0x90
        [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
        [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
        [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
        [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
        [162513.517865]  iterate_supers+0x87/0xf0
        [162513.517869]  ksys_sync+0x60/0xb0
        [162513.517872]  __do_sys_sync+0xa/0x10
        [162513.517875]  do_syscall_64+0x33/0x80
        [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.517881] RIP: 0033:0x7f5238f50bd7
        [162513.517883] Code: Bad RIP value.
        [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
        [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
        [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
        [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
        [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
        [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
        [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
        [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
        [162513.519160] Call Trace:
        [162513.519165]  __schedule+0x5ce/0xd00
        [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.519174]  schedule+0x46/0xf0
        [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
        [162513.519193]  ? finish_wait+0x90/0x90
        [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
        [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
        [162513.519230]  lookup_open+0x522/0x650
        [162513.519246]  path_openat+0x2b8/0xa50
        [162513.519270]  do_filp_open+0x91/0x100
        [162513.519275]  ? find_held_lock+0x32/0x90
        [162513.519280]  ? lock_acquired+0x33b/0x470
        [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
        [162513.519287]  ? _raw_spin_unlock+0x29/0x40
        [162513.519295]  do_sys_openat2+0x20d/0x2d0
        [162513.519300]  do_sys_open+0x44/0x80
        [162513.519304]  do_syscall_64+0x33/0x80
        [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.519309] RIP: 0033:0x7f5238f4a903
        [162513.519310] Code: Bad RIP value.
        [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
        [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
        [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
        [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
        [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
        [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
        [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
        [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
        [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
        [162513.520511] Call Trace:
        [162513.520516]  __schedule+0x5ce/0xd00
        [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [162513.520525]  schedule+0x46/0xf0
        [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
        [162513.520548]  ? finish_wait+0x90/0x90
        [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
        [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
        [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
        [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
        [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
        [162513.520643]  ? do_sigaction+0xf3/0x240
        [162513.520645]  ? find_held_lock+0x32/0x90
        [162513.520648]  ? do_sigaction+0xf3/0x240
        [162513.520651]  ? lock_acquired+0x33b/0x470
        [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
        [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
        [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
        [162513.520662]  ? do_sigaction+0xf3/0x240
        [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
        [162513.520672]  __x64_sys_ioctl+0x83/0xb0
        [162513.520677]  do_syscall_64+0x33/0x80
        [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [162513.520681] RIP: 0033:0x7fc3cd307d87
        [162513.520682] Code: Bad RIP value.
        [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
        [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
        [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
        [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
        [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
        [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
        [162513.520703]
      		  Showing all locks held in the system:
        [162513.520712] 1 lock held by khungtaskd/54:
        [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
        [162513.520728] 1 lock held by in:imklog/596:
        [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
        [162513.520782] 1 lock held by btrfs-transacti/1356167:
        [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
        [162513.520798] 1 lock held by btrfs/1356190:
        [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
        [162513.520805] 1 lock held by fsstress/1356184:
        [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520811] 3 locks held by fsstress/1356185:
        [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
        [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520833] 1 lock held by fsstress/1356196:
        [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
        [162513.520838] 3 locks held by fsstress/1356197:
        [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
        [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
        [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
        [162513.520858] 2 locks held by btrfs/1356211:
        [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
        [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
      
      This was weird because the stack traces show that a transaction commit,
      triggered by a device replace operation, is blocking trying to pause any
      running scrubs but there are no stack traces of blocked tasks doing a
      scrub.
      
      After poking around with drgn, I noticed there was a scrub task that was
      constantly running and blocking for shorts periods of time:
      
        >>> t = find_task(prog, 1356190)
        >>> prog.stack_trace(t)
        #0  __schedule+0x5ce/0xcfc
        #1  schedule+0x46/0xe4
        #2  schedule_timeout+0x1df/0x475
        #3  btrfs_reada_wait+0xda/0x132
        #4  scrub_stripe+0x2a8/0x112f
        #5  scrub_chunk+0xcd/0x134
        #6  scrub_enumerate_chunks+0x29e/0x5ee
        #7  btrfs_scrub_dev+0x2d5/0x91b
        #8  btrfs_ioctl+0x7f5/0x36e7
        #9  __x64_sys_ioctl+0x83/0xb0
        #10 do_syscall_64+0x33/0x77
        #11 entry_SYSCALL_64+0x7c/0x156
      
      Which corresponds to:
      
      int btrfs_reada_wait(void *handle)
      {
          struct reada_control *rc = handle;
          struct btrfs_fs_info *fs_info = rc->fs_info;
      
          while (atomic_read(&rc->elems)) {
              if (!atomic_read(&fs_info->reada_works_cnt))
                  reada_start_machine(fs_info);
              wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                                (HZ + 9) / 10);
          }
      (...)
      
      So the counter "rc->elems" was set to 1 and never decreased to 0, causing
      the scrub task to loop forever in that function. Then I used the following
      script for drgn to check the readahead requests:
      
        $ cat dump_reada.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        def dump_re(re):
            nzones = re.nzones.value_()
            print(f're at {hex(re.value_())}')
            print(f'\t logical {re.logical.value_()}')
            print(f'\t refcnt {re.refcnt.value_()}')
            print(f'\t nzones {nzones}')
            for i in range(nzones):
                dev = re.zones[i].device
                name = dev.name.str.string_()
                print(f'\t\t dev id {dev.devid.value_()} name {name}')
            print()
      
        for _, e in radix_tree_for_each(fs_info.reada_tree):
            re = cast('struct reada_extent *', e)
            dump_re(re)
      
        $ drgn dump_reada.py
        re at 0xffff8f3da9d25ad8
                logical 38928384
                refcnt 1
                nzones 1
                       dev id 0 name b'/dev/sdd'
        $
      
      So there was one readahead extent with a single zone corresponding to the
      source device of that last device replace operation logged in dmesg/syslog.
      Also the ID of that zone's device was 0 which is a special value set in
      the source device of a device replace operation when the operation finishes
      (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
      confirming again that device /dev/sdd was the source of a device replace
      operation.
      
      Normally there should be as many zones in the readahead extent as there are
      devices, and I wasn't expecting the extent to be in a block group with a
      'single' profile, so I went and confirmed with the following drgn script
      that there weren't any single profile block groups:
      
        $ cat dump_block_groups.py
        import sys
        import drgn
        from drgn import NULL, Object, cast, container_of, execscript, \
            reinterpret, sizeof
        from drgn.helpers.linux import *
      
        mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"
      
        mnt = None
        for mnt in for_each_mount(prog, dst = mnt_path):
            pass
      
        if mnt is None:
            sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
            sys.exit(1)
      
        fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)
      
        BTRFS_BLOCK_GROUP_DATA = (1 << 0)
        BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
        BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
        BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
        BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
        BTRFS_BLOCK_GROUP_DUP = (1 << 5)
        BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
        BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
        BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
        BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
        BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)
      
        def bg_flags_string(bg):
            flags = bg.flags.value_()
            ret = ''
            if flags & BTRFS_BLOCK_GROUP_DATA:
                ret = 'data'
            if flags & BTRFS_BLOCK_GROUP_METADATA:
                if len(ret) > 0:
                    ret += '|'
                ret += 'meta'
            if flags & BTRFS_BLOCK_GROUP_SYSTEM:
                if len(ret) > 0:
                    ret += '|'
                ret += 'system'
            if flags & BTRFS_BLOCK_GROUP_RAID0:
                ret += ' raid0'
            elif flags & BTRFS_BLOCK_GROUP_RAID1:
                ret += ' raid1'
            elif flags & BTRFS_BLOCK_GROUP_DUP:
                ret += ' dup'
            elif flags & BTRFS_BLOCK_GROUP_RAID10:
                ret += ' raid10'
            elif flags & BTRFS_BLOCK_GROUP_RAID5:
                ret += ' raid5'
            elif flags & BTRFS_BLOCK_GROUP_RAID6:
                ret += ' raid6'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
                ret += ' raid1c3'
            elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
                ret += ' raid1c4'
            else:
                ret += ' single'
      
            return ret
      
        def dump_bg(bg):
            print()
            print(f'block group at {hex(bg.value_())}')
            print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
            print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')
      
        bg_root = fs_info.block_group_cache_tree.address_of_()
        for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
            dump_bg(bg)
      
        $ drgn dump_block_groups.py
      
        block group at 0xffff8f3d673b0400
               start 22020096 length 16777216
               flags 258 - system raid6
      
        block group at 0xffff8f3d53ddb400
               start 38797312 length 536870912
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4d9c00
               start 575668224 length 2147483648
               flags 257 - data raid6
      
        block group at 0xffff8f3d08189000
               start 2723151872 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3db70ff000
               start 2790260736 length 1073741824
               flags 260 - meta raid6
      
        block group at 0xffff8f3d5f4dd800
               start 3864002560 length 67108864
               flags 258 - system raid6
      
        block group at 0xffff8f3d67037000
               start 3931111424 length 2147483648
               flags 257 - data raid6
        $
      
      So there were only 2 reasons left for having a readahead extent with a
      single zone: reada_find_zone(), called when creating a readahead extent,
      returned NULL either because we failed to find the corresponding block
      group or because a memory allocation failed. With some additional and
      custom tracing I figured out that on every further ocurrence of the
      problem the block group had just been deleted when we were looping to
      create the zones for the readahead extent (at reada_find_extent()), so we
      ended up with only one zone in the readahead extent, corresponding to a
      device that ends up getting replaced.
      
      So after figuring that out it became obvious why the hang happens:
      
      1) Task A starts a scrub on any device of the filesystem, except for
         device /dev/sdd;
      
      2) Task B starts a device replace with /dev/sdd as the source device;
      
      3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
         starting to scrub a stripe from block group X. This call to
         btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
         calls reada_add_block(), it passes the logical address of the extent
         tree's root node as its 'logical' argument - a value of 38928384;
      
      4) Task A then enters reada_find_extent(), called from reada_add_block().
         It finds there isn't any existing readahead extent for the logical
         address 38928384, so it proceeds to the path of creating a new one.
      
         It calls btrfs_map_block() to find out which stripes exist for the block
         group X. On the first iteration of the for loop that iterates over the
         stripes, it finds the stripe for device /dev/sdd, so it creates one
         zone for that device and adds it to the readahead extent. Before getting
         into the second iteration of the loop, the cleanup kthread deletes block
         group X because it was empty. So in the iterations for the remaining
         stripes it does not add more zones to the readahead extent, because the
         calls to reada_find_zone() returned NULL because they couldn't find
         block group X anymore.
      
         As a result the new readahead extent has a single zone, corresponding to
         the device /dev/sdd;
      
      4) Before task A returns to btrfs_reada_add() and queues the readahead job
         for the readahead work queue, task B finishes the device replace and at
         btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
         device /dev/sdg;
      
      5) Task A returns to reada_add_block(), which increments the counter
         "->elems" of the reada_control structure allocated at btrfs_reada_add().
      
         Then it returns back to btrfs_reada_add() and calls
         reada_start_machine(). This queues a job in the readahead work queue to
         run the function reada_start_machine_worker(), which calls
         __reada_start_machine().
      
         At __reada_start_machine() we take the device list mutex and for each
         device found in the current device list, we call
         reada_start_machine_dev() to start the readahead work. However at this
         point the device /dev/sdd was already freed and is not in the device
         list anymore.
      
         This means the corresponding readahead for the extent at 38928384 is
         never started, and therefore the "->elems" counter of the reada_control
         structure allocated at btrfs_reada_add() never goes down to 0, causing
         the call to btrfs_reada_wait(), done by the scrub task, to wait forever.
      
      Note that the readahead request can be made either after the device replace
      started or before it started, however in pratice it is very unlikely that a
      device replace is able to start after a readahead request is made and is
      able to complete before the readahead request completes - maybe only on a
      very small and nearly empty filesystem.
      
      This hang however is not the only problem we can have with readahead and
      device removals. When the readahead extent has other zones other than the
      one corresponding to the device that is being removed (either by a device
      replace or a device remove operation), we risk having a use-after-free on
      the device when dropping the last reference of the readahead extent.
      
      For example if we create a readahead extent with two zones, one for the
      device /dev/sdd and one for the device /dev/sde:
      
      1) Before the readahead worker starts, the device /dev/sdd is removed,
         and the corresponding btrfs_device structure is freed. However the
         readahead extent still has the zone pointing to the device structure;
      
      2) When the readahead worker starts, it only finds device /dev/sde in the
         current device list of the filesystem;
      
      3) It starts the readahead work, at reada_start_machine_dev(), using the
         device /dev/sde;
      
      4) Then when it finishes reading the extent from device /dev/sde, it calls
         __readahead_hook() which ends up dropping the last reference on the
         readahead extent through the last call to reada_extent_put();
      
      5) At reada_extent_put() it iterates over each zone of the readahead extent
         and attempts to delete an element from the device's 'reada_extents'
         radix tree, resulting in a use-after-free, as the device pointer of the
         zone for /dev/sdd is now stale. We can also access the device after
         dropping the last reference of a zone, through reada_zone_release(),
         also called by reada_extent_put().
      
      And a device remove suffers the same problem, however since it shrinks the
      device size down to zero before removing the device, it is very unlikely to
      still have readahead requests not completed by the time we free the device,
      the only possibility is if the device has a very little space allocated.
      
      While the hang problem is exclusive to scrub, since it is currently the
      only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
      problem affects any path that triggers readhead, which includes
      btree_readahead_hook() and __readahead_hook() (a readahead worker can
      trigger readahed for the children of a node) for example - any path that
      ends up calling reada_add_block() can trigger the use-after-free after a
      device is removed.
      
      So fix this by waiting for any readahead requests for a device to complete
      before removing a device, ensuring that while waiting for existing ones no
      new ones can be made.
      
      This problem has been around for a very long time - the readahead code was
      added in 2011, device remove exists since 2008 and device replace was
      introduced in 2013, hard to pick a specific commit for a git Fixes tag.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      66d204a1
  2. 07 10月, 2020 19 次提交
    • N
      btrfs: remove struct extent_io_ops · 905eb88b
      Nikolay Borisov 提交于
      It's no longer used just remove the function and any related code which
      was initialising it for inodes. No functional changes.
      
      Removing 8 bytes from extent_io_tree in turn reduces size of other
      structures where it is embedded, notably btrfs_inode where it reduces
      size by 24 bytes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      905eb88b
    • N
      btrfs: stop calling submit_bio_hook for data inodes · 908930f3
      Nikolay Borisov 提交于
      Instead export and rename the function to btrfs_submit_data_bio and
      call it directly in submit_one_bio. This avoids paying the cost for
      speculative attacks mitigations and improves code readability.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      908930f3
    • N
      btrfs: replace readpage_end_io_hook with direct calls · 9a446d6a
      Nikolay Borisov 提交于
      Don't call readpage_end_io_hook for the btree inode.  Instead of relying
      on indirect calls to implement metadata buffer validation simply check
      if the inode whose page we are processing equals the btree inode. If it
      does call the necessary function.
      
      This is an improvement in 2 directions:
      
      1. We aren't paying the penalty of indirect calls in a post-speculation
         attacks world.
      
      2. The function is now named more explicitly so it's obvious what's
         going on
      
      This is in preparation to removing struct extent_io_ops altogether.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9a446d6a
    • D
      btrfs: use unaligned helpers for stack and header set/get helpers · e97659ce
      David Sterba 提交于
      In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct
      pointer assignment but we should use the helpers for unaligned access
      for clarity. It hasn't been a problem so far because of the natural
      alignment.
      
      Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure
      from stack that has an aligned start but some members may not be aligned
      due to packing. This as well hasn't caused problems so far.
      
      Move the put/get_unaligned_le8 stubs to ctree.h so we can use them.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e97659ce
    • N
      btrfs: sink total_data parameter in setup_items_for_insert · fc0d82e1
      Nikolay Borisov 提交于
      That parameter can easily be derived based on the "data_size" and "nr"
      parameters exploit this fact to simply the function's signature. No
      functional changes.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc0d82e1
    • N
      btrfs: eliminate total_size parameter from setup_items_for_insert · 3dc9dc89
      Nikolay Borisov 提交于
      The value of this argument can be derived from the total_data as it's
      simply the value of the data size + size of btrfs_items being touched.
      Move the parameter calculation inside the function. This results in a
      simpler interface and also a minor size reduction:
      
      ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
      Function                                     old     new   delta
      btrfs_duplicate_item                         260     259      -1
      setup_items_for_insert                      1200    1190     -10
      btrfs_insert_empty_items                     177     154     -23
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3dc9dc89
    • F
      btrfs: rename btrfs_punch_hole_range() to a more generic name · 306bfec0
      Filipe Manana 提交于
      The function btrfs_punch_hole_range() is now used to replace all the file
      extents in a given file range with an extent described in the given struct
      btrfs_replace_extent_info argument. This extent can either be an existing
      extent that is being cloned or it can be a new extent (namely a prealloc
      extent). When that argument is NULL it only punches a hole (drops all the
      existing extents) in the file range.
      
      So rename the function to btrfs_replace_file_extents().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      306bfec0
    • F
      btrfs: rename struct btrfs_clone_extent_info to a more generic name · bf385648
      Filipe Manana 提交于
      Now that we can use btrfs_clone_extent_info to convey information for a
      new prealloc extent as well, and not just for existing extents that are
      being cloned, rename it to btrfs_replace_extent_info, which reflects the
      fact that this is now more generic and it is used to replace all existing
      extents in a file range with the extent described by the structure.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bf385648
    • F
      btrfs: remove item_size member of struct btrfs_clone_extent_info · fb870f6c
      Filipe Manana 提交于
      The value of item_size of struct btrfs_clone_extent_info is always set to
      the size of a non-inline file extent item, and in fact the infrastructure
      that uses this structure (btrfs_punch_hole_range()) does not work with
      inline file extents at all (and it is not supposed to).
      
      So just remove that field from the structure and use directly
      sizeof(struct btrfs_file_extent_item) instead. Also assert that the
      file extent type is not inline at btrfs_insert_clone_extent().
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb870f6c
    • F
      btrfs: fix metadata reservation for fallocate that leads to transaction aborts · 8fccebfa
      Filipe Manana 提交于
      When doing an fallocate(), specially a zero range operation, we assume
      that reserving 3 units of metadata space is enough, that at most we touch
      one leaf in subvolume/fs tree for removing existing file extent items and
      inserting a new file extent item. This assumption is generally true for
      most common use cases. However when we end up needing to remove file extent
      items from multiple leaves, we can end up failing with -ENOSPC and abort
      the current transaction, turning the filesystem to RO mode. When this
      happens a stack trace like the following is dumped in dmesg/syslog:
      
      [ 1500.620934] ------------[ cut here ]------------
      [ 1500.620938] BTRFS: Transaction aborted (error -28)
      [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
      [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
      [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.621026] Code: 8b 40 50 f0 48 (...)
      [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
      [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
      [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
      [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
      [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
      [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
      [ 1500.621039] FS:  00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
      [ 1500.621041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
      [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1500.621049] Call Trace:
      [ 1500.621076]  btrfs_prealloc_file_range+0x10/0x20 [btrfs]
      [ 1500.621087]  btrfs_fallocate+0xccd/0x1280 [btrfs]
      [ 1500.621108]  vfs_fallocate+0x14d/0x290
      [ 1500.621112]  ksys_fallocate+0x3a/0x70
      [ 1500.621117]  __x64_sys_fallocate+0x1a/0x20
      [ 1500.621120]  do_syscall_64+0x33/0x80
      [ 1500.621123]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 1500.621126] RIP: 0033:0x7fb5b248c477
      [ 1500.621128] Code: 89 7c 24 08 (...)
      [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
      [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
      [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
      [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
      [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
      [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
      [ 1500.621151] irq event stamp: 1026217
      [ 1500.621154] hardirqs last  enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
      [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
      [ 1500.621159] softirqs last  enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
      [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
      [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
      [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
      
      When we use fallocate() internally, for reserving an extent for a space
      cache, inode cache or relocation, we can't hit this problem since either
      there aren't any file extent items to remove from the subvolume tree or
      there is at most one.
      
      When using plain fallocate() it's very unlikely, since that would require
      having many file extent items representing holes for the target range and
      crossing multiple leafs - we attempt to increase the range (merge) of such
      file extent items when punching holes, so at most we end up with 2 file
      extent items for holes at leaf boundaries.
      
      However when using the zero range operation of fallocate() for a large
      range (100+ MiB for example) that's fairly easy to trigger. The following
      example reproducer triggers the issue:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        umount /dev/sdj &> /dev/null
        mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
        mount /dev/sdj /mnt/sdj
      
        # Create a 100M file with many file extent items. Punch a hole every 8K
        # just to speedup the file creation - we could do 4K sequential writes
        # followed by fsync (or O_SYNC) as well, but that takes a lot of time.
        file_size=$((100 * 1024 * 1024))
        xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
        for ((i = 0; i < $file_size; i += 8192)); do
            xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
        done
      
        # Force a transaction commit, so the zero range operation will be forced
        # to COW all metadata extents it need to touch.
        sync
      
        xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
      
        umount /mnt/sdj
      
        $ ./reproducer.sh
        wrote 104857600/104857600 bytes at offset 0
        100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
        fallocate: No space left on device
      
        $ dmesg
        <shows the same stack trace pasted before>
      
      To fix this use the existing infrastructure that hole punching and
      extent cloning use for replacing a file range with another extent. This
      deals with doing the removal of file extent items and inserting the new
      one using an incremental approach, reserving more space when needed and
      always ensuring we don't leave an implicit hole in the range in case
      we need to do multiple iterations and a crash happens between iterations.
      
      A test case for fstests will follow up soon.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8fccebfa
    • G
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues 提交于
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running on why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c3e1f96c
    • N
      btrfs: convert btrfs_inode_sectorsize to take btrfs_inode · 6fee248d
      Nikolay Borisov 提交于
      It's counterintuitive to have a function named btrfs_inode_xxx which
      takes a generic inode. Also move the function to btrfs_inode.h so that
      it has access to the definition of struct btrfs_inode.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6fee248d
    • J
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Josef Bacik 提交于
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9631e4cc
    • J
      btrfs: introduce btrfs_path::recurse · 51899412
      Josef Bacik 提交于
      Our current tree locking stuff allows us to recurse with read locks if
      we're already holding the write lock.  This is necessary for the space
      cache inode, as we could be holding a lock on the root_tree root when we
      need to cache a block group, and thus need to be able to read down the
      root_tree to read in the inode cache.
      
      We can get away with this in our current locking, but we won't be able
      to with a rwsem.  Handle this by purposefully annotating the places
      where we require recursion, so that in the future we can maybe come up
      with a way to avoid the recursion.  In the case of the free space inode,
      this will be superseded by the free space tree.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      51899412
    • Q
      btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations · e85fde51
      Qu Wenruo 提交于
      [BUG]
      When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
      
        generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
      
      And with the following metadata leak:
      
        BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace a6cfd45ba80e4e06 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
        BTRFS info (device dm-3): disk space caching is enabled
        BTRFS info (device dm-3): has skinny extents
      
      [CAUSE]
      The qgroup preallocated meta rsv operations of that offending root are:
      
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
        btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
        btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
      
      It's pretty obvious that, we reserve qgroup meta rsv in
      btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
      release/convert calls in btrfs_subvolume_release_metadata().
      
      This leads to the leakage.
      
      [FIX]
      To fix this bug, we should follow what we're doing in
      btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
      add it to block_rsv->qgroup_rsv_reserved.
      
      And free the qgroup reserved metadata space when releasing the
      block_rsv.
      
      To do this, we need to change the btrfs_subvolume_release_metadata() to
      accept btrfs_root, and record the qgroup_to_release number, and call
      btrfs_qgroup_convert_reserved_meta() for it.
      
      Fixes: 733e03a0 ("btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e85fde51
    • G
      btrfs: switch to iomap for direct IO · f85781fb
      Goldwyn Rodrigues 提交于
      We're using direct io implementation based on buffer heads. This patch
      switches to the new iomap infrastructure.
      
      Switch from __blockdev_direct_IO() to iomap_dio_rw().  Rename
      btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as
      iomap_begin() for iomap direct I/O functions. This function allocates
      and locks all the blocks required for the I/O.  btrfs_submit_direct() is
      used as the submit_io() hook for direct I/O ops.
      
      Since we need direct I/O reads to go through iomap_dio_rw(), we change
      file_operations.read_iter() to a btrfs_file_read_iter() which calls
      btrfs_direct_IO() for direct reads and falls back to
      generic_file_buffered_read() for incomplete reads and buffered reads.
      
      We don't need address_space.direct_IO() anymore: set it to noop.
      
      Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is
      capable of direct I/O reads from a hole, so we don't need to return
      -ENOENT.
      
      Btrfs direct I/O is now done under i_rwsem, shared in case of reads and
      exclusive in case of writes. This guards against simultaneous truncates.
      
      Use iomap->iomap_end() to check for failed or incomplete direct I/O:
      
        - for writes, call __endio_write_update_ordered()
        - for reads, unlock extents
      
      btrfs_dio_data is now hooked in iomap->private and not
      current->journal_info. It carries the reservation variable and the
      amount of data submitted, so we can calculate the amount of data to call
      __endio_write_update_ordered in case of an error.
      
      This patch removes last use of struct buffer_head from btrfs.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f85781fb
    • J
      btrfs: do async reclaim for data reservations · 57056740
      Josef Bacik 提交于
      Now that we have the data ticketing stuff in place, move normal data
      reservations to use an async reclaim helper to satisfy tickets.  Before
      we could have multiple tasks race in and both allocate chunks, resulting
      in more data chunks than we would necessarily need.  Serializing these
      allocations and making a single thread responsible for flushing will
      only allocate chunks as needed, as well as cut down on transaction
      commits and other flush related activities.
      
      Priority reservations will still work as they have before, simply
      trying to allocate a chunk until they can make their reservation.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      57056740
    • J
      btrfs: add flushing states for handling data reservations · 058e6d1d
      Josef Bacik 提交于
      Currently the way we do data reservations is by seeing if we have enough
      space in our space_info.  If we do not and we're a normal inode we'll
      
      1) Attempt to force a chunk allocation until we can't anymore.
      2) If that fails we'll flush delalloc, then commit the transaction, then
         run the delayed iputs.
      
      If we are a free space inode we're only allowed to force a chunk
      allocation.  In order to use the normal flushing mechanism we need to
      encode this into a flush state array for normal inodes.  Since both will
      start with allocating chunks until the space info is full there is no
      need to add this as a flush state, this will be handled specially.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      058e6d1d
    • J
      btrfs: change nr to u64 in btrfs_start_delalloc_roots · b4912139
      Josef Bacik 提交于
      We have btrfs_wait_ordered_roots() which takes a u64 for nr, but
      btrfs_start_delalloc_roots() that takes an int for nr, which makes using
      them in conjunction, especially for something like (u64)-1, annoying and
      inconsistent.  Fix btrfs_start_delalloc_roots() to take a u64 for nr and
      adjust start_delalloc_inodes() and it's callers appropriately.
      
      This means we've adjusted start_delalloc_inodes() to take a pointer of
      nr since we want to preserve the ability for start-delalloc_inodes() to
      return an error, so simply make it do the nr adjusting as necessary.
      
      Part of adjusting the callers to this means changing
      btrfs_writeback_inodes_sb_nr() to take a u64 for items.  This may be
      confusing because it seems unrelated, but the caller of
      btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the
      function variable that needs to be changed.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Tested-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4912139
  3. 21 8月, 2020 1 次提交
    • B
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Boris Burkov 提交于
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is competely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must until 'btrfs subvolu sync' finished before
      activating the swapfile swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a84d5d42
  4. 20 8月, 2020 1 次提交
  5. 27 7月, 2020 18 次提交
    • J
      btrfs: don't WARN if we abort a transaction with EROFS · f95ebdbe
      Josef Bacik 提交于
      If we got some sort of corruption via a read and call
      btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and
      complain.  If a subsequent trans handle trips over this it'll get EROFS
      and then abort.  However at that point we're not aborting for the
      original reason, we're aborting because we've been flipped read only.
      We do not need to WARN_ON() here.
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f95ebdbe
    • Q
      btrfs: add comments for btrfs_reserve_flush_enum · fd7fb634
      Qu Wenruo 提交于
      This enum is the interface exposed to developers.
      
      Although we have a detailed comment explaining the whole idea of space
      flushing at the beginning of space-info.c, the exposed enum interface
      doesn't have any comment.
      
      Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and
      BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are
      not explained at all.
      
      So add some simple comments for these enums as a quick reference.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd7fb634
    • Q
      btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT · adca4d94
      Qu Wenruo 提交于
      commit a514d638 ("btrfs: qgroup: Commit transaction in advance to
      reduce early EDQUOT") tries to reduce the early EDQUOT problems by
      checking the qgroup free against threshold and tries to wake up commit
      kthread to free some space.
      
      The problem of that mechanism is, it can only free qgroup per-trans
      metadata space, can't do anything to data, nor prealloc qgroup space.
      
      Now since we have the ability to flush qgroup space, and implemented
      retry-after-EDQUOT behavior, such mechanism can be completely replaced.
      
      So this patch will cleanup such mechanism in favor of
      retry-after-EDQUOT.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      adca4d94
    • Q
      btrfs: qgroup: try to flush qgroup space when we get -EDQUOT · c53e9653
      Qu Wenruo 提交于
      [PROBLEM]
      There are known problem related to how btrfs handles qgroup reserved
      space.  One of the most obvious case is the the test case btrfs/153,
      which do fallocate, then write into the preallocated range.
      
        btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
            --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
            +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
            @@ -1,2 +1,5 @@
             QA output created by 153
            +pwrite: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
            +/mnt/scratch/testfile2: Disk quota exceeded
             Silence is golden
            ...
            (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)
      
      [CAUSE]
      Since commit c6887cd1 ("Btrfs: don't do nocow check unless we have to"),
      we always reserve space no matter if it's COW or not.
      
      Such behavior change is mostly for performance, and reverting it is not
      a good idea anyway.
      
      For preallcoated extent, we reserve qgroup data space for it already,
      and since we also reserve data space for qgroup at buffered write time,
      it needs twice the space for us to write into preallocated space.
      
      This leads to the -EDQUOT in buffered write routine.
      
      And we can't follow the same solution, unlike data/meta space check,
      qgroup reserved space is shared between data/metadata.
      The EDQUOT can happen at the metadata reservation, so doing NODATACOW
      check after qgroup reservation failure is not a solution.
      
      [FIX]
      To solve the problem, we don't return -EDQUOT directly, but every time
      we got a -EDQUOT, we try to flush qgroup space:
      
      - Flush all inodes of the root
        NODATACOW writes will free the qgroup reserved at run_dealloc_range().
        However we don't have the infrastructure to only flush NODATACOW
        inodes, here we flush all inodes anyway.
      
      - Wait for ordered extents
        This would convert the preallocated metadata space into per-trans
        metadata, which can be freed in later transaction commit.
      
      - Commit transaction
        This will free all per-trans metadata space.
      
      Also we don't want to trigger flush multiple times, so here we introduce
      a per-root wait list and a new root status, to ensure only one thread
      starts the flushing.
      
      Fixes: c6887cd1 ("Btrfs: don't do nocow check unless we have to")
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c53e9653
    • M
      btrfs: add multi-statement protection to btrfs_set/clear_and_info macros · 60f8667b
      Marcos Paulo de Souza 提交于
      Multi-statement macros should be enclosed in do/while(0) block to make
      their use safe in single statement if conditions. All current uses of
      the macros are safe, so this change is for future protection.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NMarcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      60f8667b
    • F
      btrfs: remove no longer needed use of log_writers for the log root tree · a93e0168
      Filipe Manana 提交于
      When syncing the log, we used to update the log root tree without holding
      neither the log_mutex of the subvolume root nor the log_mutex of log root
      tree.
      
      We used to have two critical sections delimited by the log_mutex of the
      log root tree, so in the first one we incremented the log_writers of the
      log root tree and on the second one we decremented it and waited for the
      log_writers counter to go down to zero. This was because the update of
      the log root tree happened between the two critical sections.
      
      The use of two critical sections allowed a little bit more of parallelism
      and required the use of the log_writers counter, necessary to make sure
      we didn't miss any log root tree update when we have multiple tasks trying
      to sync the log in parallel.
      
      However after commit 06989c79 ("Btrfs: fix race updating log root
      item during fsync") the log root tree update was moved into a critical
      section delimited by the subvolume's log_mutex. Later another commit
      moved the log tree update from that critical section into the second
      critical section delimited by the log_mutex of the log root tree. Both
      commits addressed different bugs.
      
      The end result is that the first critical section delimited by the
      log_mutex of the log root tree became pointless, since there's nothing
      done between it and the second critical section, we just have an unlock
      of the log_mutex followed by a lock operation. This means we can merge
      both critical sections, as the first one does almost nothing now, and we
      can stop using the log_writers counter of the log root tree, which was
      incremented in the first critical section and decremented in the second
      criticial section, used to make sure no one in the second critical section
      started writeback of the log root tree before some other task updated it.
      
      So just remove the mutex_unlock() followed by mutex_lock() of the log root
      tree, as well as the use of the log_writers counter for the log root tree.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a93e0168
    • F
      btrfs: stop incremening log_batch for the log root tree when syncing log · 28a95795
      Filipe Manana 提交于
      We are incrementing the log_batch atomic counter of the root log tree but
      we never use that counter, it's used only for the log trees of subvolume
      roots. We started doing it when we moved the log_batch and log_write
      counters from the global, per fs, btrfs_fs_info structure, into the
      btrfs_root structure in commit 7237f183 ("Btrfs: fix tree logs
      parallel sync").
      
      So just stop doing it for the log root tree and add a comment over the
      field declaration so inform it's used only for log trees of subvolume
      roots.
      
      This patch is part of a series that has the following patches:
      
      1/4 btrfs: only commit the delayed inode when doing a full fsync
      2/4 btrfs: only commit delayed items at fsync if we are logging a directory
      3/4 btrfs: stop incremening log_batch for the log root tree when syncing log
      4/4 btrfs: remove no longer needed use of log_writers for the log root tree
      
      After the entire patchset applied I saw about 12% decrease on max latency
      reported by dbench. The test was done on a qemu vm, with 8 cores, 16Gb of
      ram, using kvm and using a raw NVMe device directly (no intermediary fs on
      the host). The test was invoked like the following:
      
        mkfs.btrfs -f /dev/sdk
        mount -o ssd -o nospace_cache /dev/sdk /mnt/sdk
        dbench -D /mnt/sdk -t 300 8
        umount /mnt/dsk
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      28a95795
    • Q
      btrfs: qgroup: export qgroups in sysfs · 49e5fb46
      Qu Wenruo 提交于
      This patch will add the following sysfs interface:
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_referenced
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/max_exclusive
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/limit_flags
      
      Which is also available in output of "btrfs qgroup show".
      
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_data
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_pertrans
        /sys/fs/btrfs/<UUID>/qgroups/<qgroup_id>/rsv_meta_prealloc
      
      The last 3 rsv related members are not visible to users, but can be very
      useful to debug qgroup limit related bugs.
      
      Also, to avoid '/' used in <qgroup_id>, the separator between qgroup
      level and qgroup id is changed to '_'.
      
      The interface is not hidden behind 'debug' as we want this interface to
      be included into production build and to provide another way to read the
      qgroup information besides the ioctls.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49e5fb46
    • N
      btrfs: make btrfs_dirty_pages take btrfs_inode · 088545f6
      Nikolay Borisov 提交于
      There is a single use of the generic vfs_inode so let's take btrfs_inode
      as a parameter and remove couple of redundant BTRFS_I() calls.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      088545f6
    • N
      btrfs: make btrfs_set_extent_delalloc take btrfs_inode · c2566f22
      Nikolay Borisov 提交于
      Preparation to make btrfs_dirty_pages take btrfs_inode as parameter.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c2566f22
    • N
      btrfs: make btrfs_run_delalloc_range take btrfs_inode · 98456b9c
      Nikolay Borisov 提交于
      All children now take btrfs_inode so convert it to taking it as a
      parameter as well.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      98456b9c
    • Q
      btrfs: refactor btrfs_check_can_nocow() into two variants · 38d37aa9
      Qu Wenruo 提交于
      The function btrfs_check_can_nocow() now has two completely different
      call patterns.
      
      For nowait variant, callers don't need to do any cleanup.  While for
      wait variant, callers need to release the lock if they can do nocow
      write.
      
      This is somehow confusing, and is already a problem for the exported
      btrfs_check_can_nocow().
      
      So this patch will separate the different patterns into different
      functions.
      For nowait variant, the function will be called check_nocow_nolock().
      For wait variant, the function pair will be btrfs_check_nocow_lock()
      btrfs_check_nocow_unlock().
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      38d37aa9
    • Q
      btrfs: allow btrfs_truncate_block() to fallback to nocow for data space reservation · 6d4572a9
      Qu Wenruo 提交于
      [BUG]
      When the data space is exhausted, even if the inode has NOCOW attribute,
      we will still refuse to truncate unaligned range due to ENOSPC.
      
      The following script can reproduce it pretty easily:
        #!/bin/bash
      
        dev=/dev/test/test
        mnt=/mnt/btrfs
      
        umount $dev &> /dev/null
        umount $mnt &> /dev/null
      
        mkfs.btrfs -f $dev -b 1G
        mount -o nospace_cache $dev $mnt
        touch $mnt/foobar
        chattr +C $mnt/foobar
      
        xfs_io -f -c "pwrite -b 4k 0 4k" $mnt/foobar > /dev/null
        xfs_io -f -c "pwrite -b 4k 0 1G" $mnt/padding &> /dev/null
        sync
      
        xfs_io -c "fpunch 0 2k" $mnt/foobar
        umount $mnt
      
      Currently this will fail at the fpunch part.
      
      [CAUSE]
      Because btrfs_truncate_block() always reserves space without checking
      the NOCOW attribute.
      
      Since the writeback path follows NOCOW bit, we only need to bother the
      space reservation code in btrfs_truncate_block().
      
      [FIX]
      Make btrfs_truncate_block() follow btrfs_buffered_write() to try to
      reserve data space first, and fall back to NOCOW check only when we
      don't have enough space.
      
      Such always-try-reserve is an optimization introduced in
      btrfs_buffered_write(), to avoid expensive btrfs_check_can_nocow() call.
      
      This patch will export check_can_nocow() as btrfs_check_can_nocow(), and
      use it in btrfs_truncate_block() to fix the problem.
      Reported-by: NMartin Doucha <martin.doucha@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6d4572a9
    • D
      btrfs: remove unused btrfs_root::defrag_trans_start · a2570ef3
      David Sterba 提交于
      Last touched in 2013 by commit de78b51a ("btrfs: remove cache only
      arguments from defrag path") that was the only code that used the value.
      Now it's only set but never used for anything, so we can remove it.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a2570ef3
    • N
      btrfs: make __btrfs_drop_extents take btrfs_inode · 906c448c
      Nikolay Borisov 提交于
      It has only 4 uses of a vfs_inode for inode_sub_bytes but unifies the
      interface with the non  __ prefixed version. Will also makes converting
      its callers to btrfs_inode easier.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      906c448c
    • N
      btrfs: make btrfs_csum_one_bio takae btrfs_inode · bd242a08
      Nikolay Borisov 提交于
      Will enable converting btrfs_submit_compressed_write to btrfs_inode more
      easily.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bd242a08
    • N
      btrfs: make btrfs_reloc_clone_csums take btrfs_inode · 7bfa9535
      Nikolay Borisov 提交于
      It really wants btrfs_inode and not a vfs inode.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bfa9535
    • D
      btrfs: add little-endian optimized key helpers · ce6ef5ab
      David Sterba 提交于
      The CPU and on-disk keys are mapped to two different structures because
      of the endianness. There's an intermediate buffer used to do the
      conversion, but this is not necessary when CPU and on-disk endianness
      match.
      
      Add optimized versions of helpers that take disk_key and use the buffer
      directly for CPU keys or drop the intermediate buffer and conversion.
      
      This saves a lot of stack space accross many functions and removes about
      6K of generated binary code:
      
         text    data     bss     dec     hex filename
      1090439   17468   14912 1122819  112203 pre/btrfs.ko
      1084613   17456   14912 1116981  110b35 post/btrfs.ko
      
      Delta: -5826
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ce6ef5ab