1. 14 3月, 2016 2 次提交
  2. 12 3月, 2016 4 次提交
  3. 11 3月, 2016 1 次提交
  4. 02 3月, 2016 7 次提交
    • F
      Btrfs: do not collect ordered extents when logging that inode exists · 5e33a2bd
      Filipe Manana 提交于
      When logging that an inode exists, for example as part of a directory
      fsync operation, we were collecting any ordered extents for the inode but
      we ended up doing nothing with them except tagging them as processed, by
      setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
      subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
      collecting and processing them. This created a time window where a second
      fsync against the inode, using the fast path, ended up not logging the
      checksums for the new extents but it logged the extents since they were
      part of the list of modified extents. This happened because the ordered
      extents were not collected and checksums were not yet added to the csum
      tree - the ordered extents have not gone through btrfs_finish_ordered_io()
      yet (which is where we add them to the csum tree by calling
      inode.c:add_pending_csums()).
      
      So fix this by not collecting an inode's ordered extents if we are logging
      it with the LOG_INODE_EXISTS mode.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5e33a2bd
    • F
      Btrfs: fix race when checking if we can skip fsync'ing an inode · affc0ff9
      Filipe Manana 提交于
      If we're about to do a fast fsync for an inode and btrfs_inode_in_log()
      returns false, it's possible that we had an ordered extent in progress
      (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's
      last_trans field was not greater than the id of the last committed
      transaction, but shortly after, before we checked if there were any
      ongoing ordered extents, the ordered extent had just completed and
      removed itself from the inode's ordered tree, in which case we end up not
      logging the inode, losing some data if a power failure or crash happens
      after the fsync handler returns and before the transaction is committed.
      
      Fix this by checking first if there are any ongoing ordered extents
      before comparing the inode's last_trans with the id of the last committed
      transaction - when it completes, an ordered extent always updates the
      inode's last_trans before it removes itself from the inode's ordered
      tree (at btrfs_finish_ordered_io()).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      affc0ff9
    • F
      Btrfs: fix listxattrs not listing all xattrs packed in the same item · daac7ba6
      Filipe Manana 提交于
      In the listxattrs handler, we were not listing all the xattrs that are
      packed in the same btree item, which happens when multiple xattrs have
      a name that when crc32c hashed produce the same checksum value.
      
      Fix this by processing them all.
      
      The following test case for xfstests reproduces the issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/attr
      
        # real QA test starts here
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_attrs
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test file with a few xattrs. The first 3 xattrs have a name
        # that when given as input to a crc32c function result in the same checksum.
        # This made btrfs list only one of the xattrs through listxattrs system call
        # (because it packs xattrs with the same name checksum into the same btree
        # item).
        touch $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile
        $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile
      
        # Now call getfattr with --dump, which calls the listxattrs system call.
        # It should list all the xattrs we have set before.
        $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch
      
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      daac7ba6
    • F
      Btrfs: fix deadlock between direct IO reads and buffered writes · ade77029
      Filipe Manana 提交于
      While running a test with a mix of buffered IO and direct IO against
      the same files I hit a deadlock reported by the following trace:
      
      [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds.
      [11642.142452]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.146332] kworker/u32:3   D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification.
      [11642.149771]     0 15282      2 0x00000000
      [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs]
      [11642.154074]  ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0
      [11642.156722]  ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff
      [11642.159205]  0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541
      [11642.161403] Call Trace:
      [11642.162129]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.163396]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.164871]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.167020]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.167931]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.182320]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.183762]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.185308]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.186782]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.188217]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.189626]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.190803]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.192158]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.193379]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.194831]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.197068]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.199188]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.200723]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.202465]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.203836]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.205624]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.207057]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.208529]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.210375]  [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs]
      [11642.212132]  [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs]
      [11642.213837]  [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs]
      [11642.215457]  [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs]
      [11642.217095]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.218324]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.219466]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.220801]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.222032]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.223190]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.224394]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.226295] 2 locks held by kworker/u32:3/15282:
      [11642.227273]  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.229412]  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds.
      [11642.232872]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.235776] kworker/u32:8   D ffff88020de5f848     0 15289      2 0x00000000
      [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481)
      [11642.238670]  ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0
      [11642.240475]  ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff
      [11642.242154]  0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541
      [11642.243715] Call Trace:
      [11642.244390]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.245432]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.246392]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.247479]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.248551]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.249968]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.251043]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.252202]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.253210]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.254307]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.256118]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.257131]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.258200]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.259168]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.260516]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.261841]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.263531]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.264747]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.266148]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.267264]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.268280]  [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba
      [11642.269407]  [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d
      [11642.270476]  [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae
      [11642.271547]  [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c
      [11642.272588]  [<ffffffff81194821>] wb_workfn+0x201/0x341
      [11642.273523]  [<ffffffff81194821>] ? wb_workfn+0x201/0x341
      [11642.274479]  [<ffffffff8106483e>] process_one_work+0x256/0x48b
      [11642.275497]  [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7
      [11642.276518]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.277520]  [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289
      [11642.278517]  [<ffffffff8106a500>] kthread+0xd4/0xdc
      [11642.279371]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.280468]  [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70
      [11642.281607]  [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24
      [11642.282604] 3 locks held by kworker/u32:8/15289:
      [11642.283423]  #0:  ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.285629]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b
      [11642.287538]  #2:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b
      [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds.
      [11642.290547]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.292864] fdm-stress      D ffff88022c107c20     0 26848  26591 0x00000000
      [11642.294118]  ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0
      [11642.295602]  ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff
      [11642.297098]  ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541
      [11642.298433] Call Trace:
      [11642.298896]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.299738]  [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs]
      [11642.300833]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.301943]  [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs]
      [11642.303270]  [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs]
      [11642.304552]  [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs]
      [11642.305782]  [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs]
      [11642.306878]  [<ffffffff8116e298>] __vfs_write+0x7c/0xa5
      [11642.307729]  [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8
      [11642.308602]  [<ffffffff8116efbb>] SyS_write+0x50/0x7e
      [11642.309410]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.310403] 3 locks held by fdm-stress/26848:
      [11642.311108]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.312578]  #1:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.314170]  #2:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds.
      [11642.317842]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.319959] fdm-stress      D ffff8801964ffa68     0 26849  26591 0x00000000
      [11642.321312]  ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0
      [11642.322555]  ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002
      [11642.323715]  ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541
      [11642.325096] Call Trace:
      [11642.325532]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.326303]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.327180]  [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74
      [11642.328114]  [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a
      [11642.329051]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.330053]  [<ffffffff8147bceb>] __wait_for_common+0x109/0x147
      [11642.330952]  [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147
      [11642.331869]  [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a
      [11642.332925]  [<ffffffff81074075>] ? wake_up_q+0x47/0x47
      [11642.333736]  [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26
      [11642.334672]  [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs]
      [11642.335858]  [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs]
      [11642.336854]  [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44
      [11642.337820]  [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs]
      [11642.339026]  [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs]
      [11642.340214]  [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs]
      [11642.341123]  [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10
      [11642.341934]  [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4]
      [11642.342936]  [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57
      [11642.343772]  [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d
      [11642.344673]  [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc
      [11642.346024]  [<ffffffff81186bbe>] ? __fget_light+0x62/0x71
      [11642.346873]  [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79
      [11642.347720]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.350222] 4 locks held by fdm-stress/26849:
      [11642.350898]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0
      [11642.352375]  #1:  (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs]
      [11642.354072]  #2:  (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs]
      [11642.355647]  #3:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds.
      [11642.358508]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.368625] fdm-stress      D ffff88021f167688     0 26850  26591 0x00000000
      [11642.369716]  ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0
      [11642.370950]  ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff
      [11642.372210]  0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541
      [11642.373430] Call Trace:
      [11642.373853]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.374623]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.375948]  [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109
      [11642.376862]  [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f
      [11642.377637]  [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197
      [11642.378610]  [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf
      [11642.379457]  [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33
      [11642.380366]  [<ffffffff810b0f61>] ? ktime_get+0x41/0x52
      [11642.381353]  [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102
      [11642.382255]  [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102
      [11642.383162]  [<ffffffff8147b814>] bit_wait_io+0x1b/0x39
      [11642.383945]  [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90
      [11642.384875]  [<ffffffff8111829f>] __lock_page+0x66/0x68
      [11642.385749]  [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a
      [11642.386721]  [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs]
      [11642.387596]  [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs]
      [11642.389030]  [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs]
      [11642.389973]  [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69
      [11642.390939]  [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [11642.392271]  [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs]
      [11642.393305]  [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs]
      [11642.394239]  [<ffffffff811236bc>] do_writepages+0x23/0x2c
      [11642.395045]  [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61
      [11642.395991]  [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15
      [11642.397144]  [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs]
      [11642.398392]  [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs]
      [11642.399363]  [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs]
      [11642.400445]  [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54
      [11642.401309]  [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111
      [11642.402213]  [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24
      [11642.403139]  [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs]
      [11642.404360]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.406187]  [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33
      [11642.407070]  [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33
      [11642.407990]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.409192]  [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs]
      [11642.410146]  [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs]
      [11642.411291]  [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1
      [11642.412263]  [<ffffffff8108ac05>] ? mark_lock+0x24/0x201
      [11642.413057]  [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d
      [11642.413897]  [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2
      [11642.414708]  [<ffffffff8116ef3d>] SyS_read+0x50/0x7e
      [11642.415573]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.416572] 1 lock held by fdm-stress/26850:
      [11642.417345]  #0:  (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40
      [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds.
      [11642.419698]       Not tainted 4.4.0-rc6-btrfs-next-21+ #1
      [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [11642.421807] fdm-stress      D ffff880196483d28     0 26851  26591 0x00000000
      [11642.422878]  ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0
      [11642.424149]  ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740
      [11642.425374]  ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541
      [11642.426591] Call Trace:
      [11642.427013]  [<ffffffff8147b541>] schedule+0x82/0x9a
      [11642.427856]  [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24
      [11642.428852]  [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4
      [11642.429743]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.430911]  [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.432102]  [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs]
      [11642.433259]  [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      [11642.434431]  [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs]
      [11642.436079]  [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs]
      [11642.437009]  [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c
      [11642.437860]  [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22
      [11642.438723]  [<ffffffff81171435>] iterate_supers+0x75/0xc2
      [11642.439597]  [<ffffffff81197d00>] sys_sync+0x52/0x80
      [11642.440454]  [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [11642.441533] 3 locks held by fdm-stress/26851:
      [11642.442370]  #0:  (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2
      [11642.444043]  #1:  (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs]
      [11642.446010]  #2:  (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs]
      
      This happened because under specific timings the path for direct IO reads
      can deadlock with concurrent buffered writes. The diagram below shows how
      this happens for an example file that has the following layout:
      
           [  extent A  ]  [  extent B  ]  [ ....
           0K              4K              8K
      
           CPU 1                                               CPU 2                             CPU 3
      
      DIO read against range
       [0K, 8K[ starts
      
      btrfs_direct_IO()
        --> calls btrfs_get_blocks_direct()
            which finds the extent map for the
            extent A and leaves the range
            [0K, 4K[ locked in the inode's
            io tree
      
                                                         buffered write against
                                                         range [4K, 8K[ starts
      
                                                         __btrfs_buffered_write()
                                                           --> dirties page at 4K
      
                                                                                           a user space
                                                                                           task calls sync
                                                                                           for e.g or
                                                                                           writepages() is
                                                                                           invoked by mm
      
                                                                                           writepages()
                                                                                             run_delalloc_range()
                                                                                               cow_file_range()
                                                                                                 --> ordered extent X
                                                                                                     for the buffered
                                                                                                     write is created
                                                                                                     and
                                                                                                     writeback starts
      
        --> calls btrfs_get_blocks_direct()
            again, without submitting first
            a bio for reading extent A, and
            finds the extent map for extent B
      
        --> calls lock_extent_direct()
      
            --> locks range [4K, 8K[
            --> finds ordered extent X
                covering range [4K, 8K[
            --> unlocks range [4K, 8K[
      
                                                        buffered write against
                                                        range [0K, 8K[ starts
      
                                                        __btrfs_buffered_write()
                                                          prepare_pages()
                                                            --> locks pages with
                                                                offsets 0 and 4K
                                                          lock_and_cleanup_extent_if_need()
                                                            --> blocks attempting to
                                                                lock range [0K, 8K[ in
                                                                the inode's io tree,
                                                                because the range [0, 4K[
                                                                is already locked by the
                                                                direct IO task at CPU 1
      
            --> calls
                btrfs_start_ordered_extent(oe X)
      
                btrfs_start_ordered_extent(oe X)
      
                  --> At this point writeback for ordered
                      extent X has not finished yet
      
                  filemap_fdatawrite_range()
                    btrfs_writepages()
                      extent_writepages()
                        extent_write_cache_pages()
                          --> finds page with offset 0
                              with the writeback tag
                              (and not dirty)
                          --> tries to lock it
                               --> deadlock, task at CPU 2
                                   has the page locked and
                                   is blocked on the io range
                                   [0, 4K[ that was locked
                                   earlier by this task
      
      So fix this by falling back to a buffered read in the direct IO read path
      when an ordered extent for a buffered write is found.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ade77029
    • F
      Btrfs: fix extent_same allowing destination offset beyond i_size · f4dfe687
      Filipe Manana 提交于
      When using the same file as the source and destination for a dedup
      (extent_same ioctl) operation we were allowing it to dedup to a
      destination offset beyond the file's size, which doesn't make sense and
      it's not allowed for the case where the source and destination files are
      not the same file. This made de deduplication operation successful only
      when the source range corresponded to a hole, a prealloc extent or an
      extent with all bytes having a value of 0x00. This was also leaving a
      file hole (between i_size and destination offset) without the
      corresponding file extent items, which can be reproduced with the
      following steps for example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt/sdi
      
        $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
        wrote 404990/404990 bytes at offset 304457
        395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
      
        $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
        Deduping 2 total files
        (28672, 24576): /mnt/sdi/foobar
        (929792, 24576): /mnt/sdi/foobar
        1 files asked to be deduped
        i: 0, status: 0, bytes_deduped: 24576
        24576 total bytes deduped in this operation
      
        $ umount /mnt/sdi
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
                start: 712704, len: 217088
        found 540673 bytes used err is 1
        total csum bytes: 400
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123675
        file data blocks allocated: 671744
          referenced 671744
        btrfs-progs v4.2.3
      
      So fix this by not allowing the destination to go beyond the file's size,
      just as we do for the same where the source and destination files are not
      the same.
      
      A test for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f4dfe687
    • F
      Btrfs: fix file loss on log replay after renaming a file and fsync · 2be63d5c
      Filipe Manana 提交于
      We have two cases where we end up deleting a file at log replay time
      when we should not. For this to happen the file must have been renamed
      and a directory inode must have been fsynced/logged.
      
      Two examples that exercise these two cases are listed below.
      
        Case 1)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir -p /mnt/a/b
        $ mkdir /mnt/c
        $ touch /mnt/a/b/foo
        $ sync
        $ mv /mnt/a/b/foo /mnt/c/
        # Create file bar just to make sure the fsync on directory a/ does
        # something and it's not a no-op.
        $ touch /mnt/a/bar
        $ xfs_io -c "fsync" /mnt/a
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file foo.
      
        Case 2)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a
        $ mkdir /mnt/b
        $ mkdir /mnt/c
        $ touch /mnt/a/foo
        $ ln /mnt/a/foo /mnt/b/foo_link
        $ touch /mnt/b/bar
        $ sync
        $ unlink /mnt/b/foo_link
        $ mv /mnt/b/bar /mnt/c/
        $ xfs_io -c "fsync" /mnt/a/foo
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file bar.
      
      The reason why the files are deleted is because when we log inodes
      other then the fsync target inode, we ignore their last_unlink_trans
      value and leave the log without enough information to later replay the
      rename operations. So we need to look at the last_unlink_trans values
      and fallback to a transaction commit if they are greater than the
      id of the last committed transaction.
      
      So fix this by looking at the last_unlink_trans values and fallback to
      transaction commits when needed. Also, when logging other inodes (for
      case 1 we logged descendants of the fsync target inode while for case 2
      we logged ascendants) we need to care about concurrent tasks updating
      the last_unlink_trans of inodes we are logging (which was already an
      existing problem in check_parent_dirs_for_sync()). Since we can not
      acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
      deadlocks with other concurrent operations that acquire the i_mutex of
      2 inodes (other fsyncs or renames for example), we need to serialize on
      the log_mutex of the inode we are logging. A task setting a new value for
      an inode's last_unlink_trans must acquire the inode's log_mutex and it
      must do this update before doing the actual unlink operation (which is
      already the case except when deleting a snapshot). Conversely the task
      logging the inode must first log the inode and then check the inode's
      last_unlink_trans value while holding its log_mutex, as if its value is
      not greater then the id of the last committed transaction it means it
      logged a safe state of the inode's items, while if its value is not
      smaller then the id of the last committed transaction it means the inode
      state it has logged might not be safe (the concurrent task might have
      just updated last_unlink_trans but hasn't done yet the unlink operation)
      and therefore a transaction commit must be done.
      
      Test cases for xfstests follow in separate patches.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2be63d5c
    • F
      Btrfs: fix unreplayable log after snapshot delete + parent dir fsync · 1ec9a1ae
      Filipe Manana 提交于
      If we delete a snapshot, fsync its parent directory and crash/power fail
      before the next transaction commit, on the next mount when we attempt to
      replay the log tree of the root containing the parent directory we will
      fail and prevent the filesystem from mounting, which is solvable by wiping
      out the log trees with the btrfs-zero-log tool but very inconvenient as
      we will lose any data and metadata fsynced before the parent directory
      was fsynced.
      
      For example:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/testdir
        $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
        $ btrfs subvolume delete /mnt/testdir/snap
        $ xfs_io -c "fsync" /mnt/testdir
        < crash / power failure and reboot >
        $ mount /dev/sdc /mnt
        mount: mount(2) failed: No such file or directory
      
      And in dmesg/syslog we get the following message and trace:
      
      [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
      [192066.363010] ------------[ cut here ]------------
      [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
      [192066.367250] BTRFS: Transaction aborted (error -2)
      [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
      [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
      [192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
      [192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
      [192066.385827] Call Trace:
      [192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
      [192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
      [192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
      [192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
      [192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
      [192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
      [192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
      [192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
      [192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
      [192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
      [192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
      [192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
      [192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
      [192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
      [192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
      [192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
      [192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
      [192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
      [192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
      [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
      [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
      [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
      [192066.444613] BTRFS: open_ctree failed
      
      This happens because when we are replaying the log and processing the
      directory entry pointing to the snapshot in the subvolume tree, we treat
      its btrfs_dir_item item as having a location with a key type matching
      BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
      BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
      object id refers to a root number and not to an inode in the root
      containing the parent directory.
      
      So fix this by triggering a transaction commit if an fsync against the
      parent directory is requested after deleting a snapshot. This is the
      simplest approach for a rare use case. Some alternative that avoids the
      transaction commit would require more code to explicitly delete the
      snapshot at log replay time (factoring out common code from ioctl.c:
      btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
      log tree of the snapshot's root from the log root of the root of tree
      roots, amongst other steps.
      
      A test case for xfstests that triggers the issue follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create a snapshot at the root of our filesystem (mount point path), delete it,
        # fsync the mount point path, crash and mount to replay the log. This should
        # succeed and after the filesystem is mounted the snapshot should not be visible
        # anymore.
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/snap1 ] && \
            echo "Snapshot snap1 still exists after log replay"
      
        # Similar scenario as above, but this time the snapshot is created inside a
        # directory and not directly under the root (mount point path).
        mkdir $SCRATCH_MNT/testdir
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/testdir/snap2 ] && \
            echo "Snapshot snap2 still exists after log replay"
      
        _unmount_flakey
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Tested-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1ec9a1ae
  5. 28 2月, 2016 14 次提交
  6. 23 2月, 2016 8 次提交
    • L
      Btrfs: fix lockdep deadlock warning due to dev_replace · 73beece9
      Liu Bo 提交于
      Xfstests btrfs/011 complains about a deadlock warning,
      
      [ 1226.649039] =========================================================
      [ 1226.649039] [ INFO: possible irq lock inversion dependency detected ]
      [ 1226.649039] 4.1.0+ #270 Not tainted
      [ 1226.649039] ---------------------------------------------------------
      [ 1226.652955] kswapd0/46 just changed the state of lock:
      [ 1226.652955]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1226.652955]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1226.652955]
      other info that might help us debug this:
      [ 1226.652955] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1226.652955]  Possible interrupt unsafe locking scenario:
      
      [ 1226.652955]        CPU0                    CPU1
      [ 1226.652955]        ----                    ----
      [ 1226.652955]   lock(&fs_info->dev_replace.lock);
      [ 1226.652955]                                local_irq_disable();
      [ 1226.652955]                                lock(&delayed_node->mutex);
      [ 1226.652955]                                lock(&found->groups_sem);
      [ 1226.652955]   <Interrupt>
      [ 1226.652955]     lock(&delayed_node->mutex);
      [ 1226.652955]
       *** DEADLOCK ***
      
      Commit 084b6e7c ("btrfs: Fix a lockdep warning when running xfstest.") tried
      to fix a similar one that has the exactly same warning, but with that, we still
      run to this.
      
      The above lock chain comes from
      btrfs_commit_transaction
        ->btrfs_run_delayed_items
          ...
          ->__btrfs_update_delayed_inode
            ...
            ->__btrfs_cow_block
               ...
               ->find_free_extent
                  ->cache_block_group
                    ->load_free_space_cache
                      ->btrfs_readpages
                        ->submit_one_bio
                          ...
                          ->__btrfs_map_block
                            ->btrfs_dev_replace_lock
      
      However, with high memory pressure, tasks which hold dev_replace.lock can
      be interrupted by kswapd and then kswapd is intended to release memory occupied
      by superblock, inodes and dentries, where we may call evict_inode, and it comes
      to
      
      [ 1226.652955]  [<ffffffff81458735>] __btrfs_release_delayed_node+0x45/0x1d0
      [ 1226.652955]  [<ffffffff81459e74>] btrfs_remove_delayed_node+0x24/0x30
      [ 1226.652955]  [<ffffffff8140c5fe>] btrfs_evict_inode+0x34e/0x700
      
      delayed_node->mutex may be acquired in __btrfs_release_delayed_node(), and it leads
      to a ABBA deadlock.
      
      To fix this, we can use "blocking rwlock" used in the case of extent_buffer, but
      things are simpler here since we only needs read's spinlock to blocking lock.
      
      With this, btrfs/011 no more produces warnings in dmesg.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73beece9
    • D
    • D
      btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls · c5868f83
      David Sterba 提交于
      The control device is accessible when no filesystem is mounted and we
      may want to query features supported by the module. This is already
      possible using the sysfs files, this ioctl is for parity and
      convenience.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5868f83
    • D
      btrfs: change max_inline default to 2048 · f7e98a7f
      David Sterba 提交于
      The current practical default is ~4k on x86_64 (the logic is more complex,
      simplified for brevity), the inlined files land in the metadata group and
      thus consume space that could be needed for the real metadata.
      
      The inlining brings some usability surprises:
      
      1) total space consumption measured on various filesystems and btrfs
         with DUP metadata was quite visible because of the duplicated data
         within metadata
      
      2) inlined data may exhaust the metadata, which are more precious in case
         the entire device space is allocated to chunks (ie. balance cannot
         make the space more compact)
      
      3) performance suffers a bit as the inlined blocks are duplicate and
         stored far away on the device.
      
      Proposed fix: set the default to 2048
      
      This fixes namely 1), the total filesysystem space consumption will be on
      par with other filesystems.
      
      Partially fixes 2), more data are pushed to the data block groups.
      
      The characteristics of 3) are based on actual small file size
      distribution.
      
      The change is independent of the metadata blockgroup type (though it's
      most visible with DUP) or system page size as these parameters are not
      trival to find out, compared to file size.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f7e98a7f
    • D
      btrfs: remove error message from search ioctl for nonexistent tree · 11ea474f
      David Sterba 提交于
      Let's remove the error message that appears when the tree_id is not
      present. This can happen with the quota tree and has been observed in
      practice. The applications are supposed to handle -ENOENT and we don't
      need to report that in the system log as it's not a fatal error.
      Reported-by: NVlastimil Babka <vbabka@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11ea474f
    • A
      btrfs: avoid uninitialized variable warning · f827ba9a
      Arnd Bergmann 提交于
      With CONFIG_SMP and CONFIG_PREEMPT both disabled, gcc decides
      to partially inline the get_state_failrec() function but cannot
      figure out that means the failrec pointer is always valid
      if the function returns success, which causes a harmless
      warning:
      
      fs/btrfs/extent_io.c: In function 'clean_io_failure':
      fs/btrfs/extent_io.c:2131:4: error: 'failrec' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This marks get_state_failrec() and set_state_failrec() both
      as 'noinline', which avoids the warning in all cases for me,
      and seems less ugly than adding a fake initialization.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Fixes: 47dc196a ("btrfs: use proper type for failrec in extent_state")
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f827ba9a
    • T
      NFSv4.x/pnfs: Fix a race between layoutget and bulk recalls · 9fd4b9fc
      Trond Myklebust 提交于
      Replace another case where the layout 'plh_block_lgets' can trigger
      infinite loops in send_layoutget().
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      9fd4b9fc
    • T
      NFSv4.x/pnfs: Fix a race between layoutget and pnfs_destroy_layout · 2454dfea
      Trond Myklebust 提交于
      If the server reboots while there is a layoutget outstanding, then
      the call to pnfs_choose_layoutget_stateid() will fail with an EAGAIN
      error, which causes an infinite loop in send_layoutget(). The reason
      why we never break out of the loop is that the layout 'plh_block_lgets'
      field is never cleared.
      
      Fix is to replace plh_block_lgets with NFS_LAYOUT_INVALID_STID, which
      can be reset after a new layoutget.
      
      Fixes: ab7d763e ("pNFS: Ensure nfs4_layoutget_prepare returns...")
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      2454dfea
  7. 20 2月, 2016 4 次提交
    • M
      fs/pnode.c: treat zero mnt_group_id-s as unequal · 7ae8fd03
      Maxim Patlasov 提交于
      propagate_one(m) calculates "type" argument for copy_tree() like this:
      
      >    if (m->mnt_group_id == last_dest->mnt_group_id) {
      >        type = CL_MAKE_SHARED;
      >    } else {
      >        type = CL_SLAVE;
      >        if (IS_MNT_SHARED(m))
      >           type |= CL_MAKE_SHARED;
      >   }
      
      The "type" argument then governs clone_mnt() behavior with respect to flags
      and mnt_master of new mount. When we iterate through a slave group, it is
      possible that both current "m" and "last_dest" are not shared (although,
      both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
      above erroneously makes new mount shared and sets its mnt_master to
      last_source->mnt_master. The patch fixes the problem by handling zero
      mnt_group_id-s as though they are unequal.
      
      The similar problem exists in the implementation of "else" clause above
      when we have to ascend upward in the master/slave tree by calling:
      
      >    last_source = last_source->mnt_master;
      >    last_dest = last_source->mnt_parent;
      
      proper number of times. The last step is governed by
      "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
      both are zero. The patch fixes this case in the same way as the former one.
      
      [AV: don't open-code an obvious helper...]
      Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7ae8fd03
    • A
      affs_do_readpage_ofs(): just use kmap_atomic() around memcpy() · 0bacbe52
      Al Viro 提交于
      It forgets kunmap() on a failure exit, but there's really no point keeping
      the page kmapped at all - after all, what we are doing is a bunch of memcpy()
      into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
      memcpy() is enough.
      Spotted-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0bacbe52
    • M
      xattr handlers: plug a lock leak in simple_xattr_list · 0e9a7da5
      Mateusz Guzik 提交于
      The code could leak xattrs->lock on error.
      
      Problem introduced with 786534b9 "tmpfs: listxattr should
      include POSIX ACL xattrs".
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0e9a7da5
    • W
      fs: allow no_seek_end_llseek to actually seek · 2feb55f8
      Wouter van Kesteren 提交于
      The user-visible impact of the issue is for example that without this
      patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.
      
      '~0ULL' is a 'unsigned long long' that when converted to a loff_t,
      which is signed, gets turned into -1. later in vfs_setpos we have
      'if (offset > maxsize)', which makes it always return EINVAL.
      
      Fixes: b25472f9 ("new helpers: no_seek_end_llseek{,_size}()")
      Signed-off-by: NWouter van Kesteren <woutershep@gmail.com>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2feb55f8