1. 20 Jan, 2016 2 commits
    • Z
      btrfs: merge functions for wait snapshot creation · 0bc19f90
      Zhao Lei committed
      wait_for_snapshot_creation() is in the same group as the other two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it to the same place
      as the other two.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana committed
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction commit time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because a task that acquires a semaphore
      multiple times should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
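
The recursion described above can be sketched with a toy model. It assumes a reader/writer semaphore in which new readers wait behind a queued writer; that queueing rule, the class `ToyRwsem` and the task names are assumptions made for illustration, not something stated in the commit message or taken from the kernel sources.

```python
from collections import deque


class ToyRwsem:
    """Toy reader/writer semaphore where new readers wait behind a queued writer."""

    def __init__(self):
        self.readers = []        # tasks currently holding the lock for read
        self.writer = None       # task currently holding the lock for write
        self.waiters = deque()   # (task, mode) pairs blocked on the lock

    def down_read(self, task):
        writer_queued = any(mode == "write" for _, mode in self.waiters)
        if self.writer is not None or writer_queued:
            self.waiters.append((task, "read"))
            return False         # caller would block here
        self.readers.append(task)
        return True

    def down_write(self, task):
        if self.readers or self.writer is not None:
            self.waiters.append((task, "write"))
            return False         # caller would block here
        self.writer = task
        return True

    def self_blocked(self):
        # Tasks that wait on the lock while already holding it for read:
        # they can never reach their own release, so nobody makes progress.
        return [task for task, _ in self.waiters if task in self.readers]


sem = ToyRwsem()
outer = sem.down_read("task_a")    # outer transaction commit runs delayed iputs
queued = sem.down_write("task_b")  # another task asks for the lock in write mode
nested = sem.down_read("task_a")   # eviction commits again and re-enters
print(outer, queued, nested, sem.self_blocked())
```

In this model the nested read request from `task_a` queues behind `task_b`, while `task_b` waits for `task_a` to drop the read lock it already holds, so neither can proceed.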
      
      Fix this by simplifying the implementation: use a mutex instead, which
      is acquired by the cleaner kthread before it runs the delayed iputs,
      rather than always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 07 Jan, 2016 10 commits
  3. 01 Jan, 2016 2 commits
    • F
      Btrfs: fix number of transaction units required to create symlink · 9269d12b
      Filipe Manana committed
      We weren't accounting for the insertion of an inline extent item for the
      symlink inode, nor for the update of the parent inode item (through
      the call to btrfs_add_nondir()). Fix this by including two more
      transaction units.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • F
      Btrfs: don't leave dangling dentry if symlink creation failed · d50866d0
      Filipe Manana committed
      When we are creating a symlink we might fail with an error after we
      created its inode and added the corresponding directory indexes to its
      parent inode. In this case we end up never removing the directory indexes
      because the inode eviction handler, called for our symlink inode on the
      final iput(), only removes items associated with the symlink inode and
      not with the parent inode.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt
        $ touch /mnt/foo
        $ ln -s /mnt/foo /mnt/bar
        ln: failed to create symbolic link ‘bar’: Cannot allocate memory
        $ umount /mnt
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
      	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
        found 131073 bytes used err is 1
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124305
        file data blocks allocated: 262144
         referenced 262144
        btrfs-progs v4.2.3
      
      So fix this by adding the directory index entries as the very last
      step of symlink creation.
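
The effect of the ordering change can be sketched with a toy model. The `ToyFs` class, the `create_symlink` helper and its `fail_late` / `link_last` switches are hypothetical names used only for this illustration; they are not btrfs functions.

```python
class ToyFs:
    def __init__(self):
        self.inodes = {}      # inode number -> kind
        self.dir_index = {}   # name -> inode number (entries in the parent dir)

    def evict(self, ino):
        # Eviction drops only the inode's own items, never the parent's entries.
        self.inodes.pop(ino, None)

    def dangling(self):
        # Directory entries that point at an inode that no longer exists.
        return [name for name, ino in self.dir_index.items()
                if ino not in self.inodes]


def create_symlink(fs, name, ino, fail_late, link_last):
    fs.inodes[ino] = "symlink"
    if not link_last:
        fs.dir_index[name] = ino   # old order: link into the parent early
    if fail_late:                  # e.g. a memory allocation failure
        fs.evict(ino)
        return False
    if link_last:
        fs.dir_index[name] = ino   # new order: link as the very last step
    return True


old = ToyFs()
create_symlink(old, "bar", 258, fail_late=True, link_last=False)
new = ToyFs()
create_symlink(new, "bar", 258, fail_late=True, link_last=True)
print(old.dangling(), new.dangling())
```

With the old ordering the failed creation leaves the entry `bar` pointing at a missing inode; with the directory entry added last, nothing is left behind.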
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  4. 17 Dec, 2015 3 commits
    • F
      Btrfs: fix leaking of ordered extents after direct IO write error · f28a4928
      Filipe Manana committed
      When doing a direct IO write, __blockdev_direct_IO() can call the
      btrfs_get_blocks_direct() callback one or more times before it calls the
      btrfs_submit_direct() callback. However it can fail after calling the
      first callback and before calling the second callback, which is a problem
      because the first one creates ordered extents and the second one is the
      one that submits bios that cover the ordered extents created by the first
      one. That means the ordered extents will never complete nor have any of
      the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
      subsequent operations (such as other direct IO writes, buffered writes or
      hole punching) that lock the same IO range and look up ordered extents
      in the range hanging forever waiting for those ordered extents, which
      can never complete since no bio was submitted.
      
      Fix this by tracking a range of created ordered extents that do not yet
      have corresponding bios submitted and completing the ordered extents in
      the range if __blockdev_direct_IO() fails with an error.
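
A minimal sketch of the tracking idea follows. It is a toy model, not the kernel code: `OrderedExtent`, `direct_write` and the `fail_before_submit` switch are hypothetical, and setting both flags on the error path is an assumption of the model.

```python
class OrderedExtent:
    def __init__(self, start, end):
        self.start, self.end = start, end
        self.flags = set()


def direct_write(ranges, fail_before_submit):
    """Toy direct IO write over a list of (start, end) ranges."""
    ordered = []
    unsubmitted = None   # (start, end) covering extents that have no bio yet
    for start, end in ranges:   # stands in for the get-blocks callback
        ordered.append(OrderedExtent(start, end))
        first = start if unsubmitted is None else unsubmitted[0]
        unsubmitted = (first, end)
    if fail_before_submit:
        # Error path: complete every ordered extent inside the tracked range
        # so that later waiters are not left hanging.
        if unsubmitted is not None:
            for oe in ordered:
                if unsubmitted[0] <= oe.start and oe.end <= unsubmitted[1]:
                    oe.flags.update({"BTRFS_ORDERED_IOERR",
                                     "BTRFS_ORDERED_IO_DONE"})
        return ordered
    for oe in ordered:          # stands in for the submit callback and end_io
        oe.flags.add("BTRFS_ORDERED_IO_DONE")
    return ordered


extents = direct_write([(4096, 8192), (8192, 16384)], fail_before_submit=True)
print([sorted(oe.flags) for oe in extents])
```

Every ordered extent created before the failure ends up marked as done, which is what lets other tasks waiting on the same range make progress in this model.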
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • F
      Btrfs: fix deadlock between direct IO write and defrag/readpages · b850ae14
      Filipe Manana committed
      If readpages() (triggered by defrag or buffered reads) is called while a
      direct IO write is in progress, we have a small time window where we can
      deadlock, resulting in traces like the following being generated:
      
      [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
      [84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
      [84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
      [84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
      [84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
      [84723.232085] Call Trace:
      [84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
      [84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
      [84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
      [84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
      [84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
      [84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
      [84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
      [84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
      [84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
      [84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
      [84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
      [84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
      [84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
      [84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
      [84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
      [84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
      [84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
      [84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
      [84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
      [84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
      [84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
      [84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
      [84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
      [84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
      [84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
      [84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
      [84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
      [84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
      [84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [84723.279595] INFO: lockdep is turned off.
      [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
      [84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
      [84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
      [84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
      [84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
      [84723.388620] Call Trace:
      [84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
      [84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
      [84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
      [84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
      [84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
      [84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
      [84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
      [84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
      [84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
      [84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
      [84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
      [84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
      [84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
      [84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      Consider the following logical and physical file layout:
      
      logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                      4K                    8K                    16K
      
      physical:   ... 12853248              12857344              1103101952   ...
                                            (= 12853248 + 4K)
      
      Extents A and B are physically adjacent. The following diagram shows a
      sequence of events that leads to the deadlock when we attempt to do a
      direct IO write against the file range [4K, 16K[ and a defrag is triggered
      simultaneously.
      
                 CPU 1                                               CPU 2
      
       btrfs_direct_IO()
      
         btrfs_get_blocks_direct()
           creates ordered extent A, covering
           the 4k prealloc extent A (range [4K, 8K[)
      
                                                          btrfs_defrag_file()
                                                            page_cache_sync_readahead([0K, 1M[)
                                                              btrfs_readpages()
                                                                extent_readpages()
      
                                                                  locks all pages in the file
                                                                  range [0K, 128K[ through calls
                                                                  to add_to_page_cache_lru()
      
                                                                  __do_contiguous_readpages()
      
                                                                     finds ordered extent A
      
                                                                     waits for it to complete
      
         btrfs_get_blocks_direct() called again
      
           lock_extent_direct(range [8K, 16K[)
      
             finds a page in range [8K, 16K[ through
             btrfs_page_exists_in_range()
      
             invalidate_inode_pages2_range([8K, 16K[)
      
               --> tries to lock pages that are already
                   locked by the task at CPU 2
      
               --> our task, running __blockdev_direct_IO(),
                   hangs waiting to lock the pages and the
                   submit bio callback, btrfs_submit_direct(),
                   ends up never being called, resulting in the
                   ordered extent A never completing (because a
                   corresponding bio is never submitted) and
                   CPU 2 will wait for it forever while holding
                   the pages locked
                    ---> deadlock!
      
      Fix this by removing the page invalidation approach when attempting to
      lock the range for IO from the callback btrfs_get_blocks_direct() and
      falling back to buffered IO instead. This was a rare case anyway, and
      well behaved applications do not mix concurrent direct IO writes with
      buffered reads, a concurrent defrag being the only normal case that
      could lead to the deadlock.
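
The decision described above can be sketched as follows. This is a simplified model under stated assumptions: `choose_write_path` and the page offsets are hypothetical, and the real locking logic in btrfs has more cases than this.

```python
def choose_write_path(cached_page_offsets, start, end):
    """Pick the IO path for a direct write to [start, end) in this toy model."""
    page_in_range = any(start <= off < end for off in cached_page_offsets)
    # With the fix there is no page invalidation (and so no page locking)
    # here: a cached page in the range simply means "use buffered IO".
    return "buffered" if page_in_range else "direct"


K = 1024
pages_from_readahead = [8 * K, 12 * K]   # hypothetical cached page offsets
print(choose_write_path(pages_from_readahead, 4 * K, 8 * K))
print(choose_write_path(pages_from_readahead, 8 * K, 16 * K))
```

Because the write path no longer tries to lock pages that a concurrent readpages() may hold, the wait cycle from the diagram cannot form in this model.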
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • F
      Btrfs: fix error path when failing to submit bio for direct IO write · 14543774
      Filipe Manana committed
      Commit 61de718f ("Btrfs: fix memory corruption on failure to submit
      bio for direct IO") fixed problems with the error handling code after we
      fail to submit a bio for direct IO. However there were 2 problems that it
      did not address when the failure is due to memory allocation failures for
      direct IO writes:
      
      1) We considered that there could be only one ordered extent for the whole
         IO range, which is not always true, as we can have multiple;
      
      2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
         which can make other tasks running btrfs_wait_logged_extents() hang
         forever, since they wait for that bit to be set. The general assumption
         is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
         and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
      
      Fix these issues by moving part of the btrfs_endio_direct_write() handler
      into a new helper function and having that new helper function called when
      we fail to allocate memory to submit the bio (and its private object) for
      a direct IO write.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  5. 03 Dec, 2015 3 commits
  6. 25 Nov, 2015 1 commit
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana committed
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Unmounting and remounting the filesystem didn't help
      either. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
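
The fallback can be illustrated with a small reservation model. The names `ToySpaceInfo`, `start_transaction` and `fallback_to_global_reserve`, and the byte counts, are hypothetical; the sketch only shows the order in which the two pools are tried.

```python
import errno


class ToySpaceInfo:
    def __init__(self, free_metadata, global_reserve):
        self.free_metadata = free_metadata    # bytes free in metadata block groups
        self.global_reserve = global_reserve  # bytes held back in the global reserve


def start_transaction(space, needed, fallback_to_global_reserve):
    """Reserve `needed` bytes; return 0 on success or -ENOSPC."""
    if space.free_metadata >= needed:
        space.free_metadata -= needed
        return 0
    if fallback_to_global_reserve and space.global_reserve >= needed:
        space.global_reserve -= needed
        return 0
    return -errno.ENOSPC


space = ToySpaceInfo(free_metadata=0, global_reserve=10)
print(start_transaction(space, 3, fallback_to_global_reserve=False))
print(start_transaction(space, 3, fallback_to_global_reserve=True),
      space.global_reserve)
```

Without the fallback the cleaner's transaction fails every time; with it, the transaction starts and the unused block groups can be deleted, which in turn frees space for later operations.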
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 10 Nov, 2015 1 commit
  8. 09 Nov, 2015 1 commit
    • F
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 1d512cb7
      Filipe Manana committed
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow delalloc range [4K, 8K[ for inode 257 and we have the following
      neighbouring leaves:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_delalloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_delalloc_nocow()
         does not break out of the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less than our delalloc
         range's end (8192)
      
         the item pointed to by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1);
         }
      
      The same can happen if an xattr is added concurrently and ends up having
      a key with an offset smaller than the delalloc range's end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
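
The key check can be sketched as below. The `classify` helper and its return strings are hypothetical and the loop conditions are simplified relative to the kernel code; the numeric key type values are those of the btrfs on-disk format.

```python
BTRFS_INODE_REF_KEY = 12
BTRFS_XATTR_ITEM_KEY = 24
BTRFS_EXTENT_DATA_KEY = 108


def classify(key, ino, range_end, skip_smaller_types):
    """Decide what the delalloc loop does with a (objectid, type, offset) key."""
    objectid, key_type, offset = key
    if (skip_smaller_types and objectid == ino
            and key_type < BTRFS_EXTENT_DATA_KEY):
        return "skip to next slot"        # behaviour with the fix applied
    if (objectid > ino or key_type > BTRFS_EXTENT_DATA_KEY
            or offset > range_end):
        return "stop"
    return "treat as file extent item"    # wrong for a non-extent key


new_link = (257, BTRFS_INODE_REF_KEY, 500)   # hard link added by the racing task
print(classify(new_link, 257, 8192, skip_smaller_types=False))
print(classify(new_link, 257, 8192, skip_smaller_types=True))
```

Without the extra check the inode reference key falls through and is read as a file extent item; with it, the key is skipped and the loop moves on to the next slot.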
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  9. 05 Nov, 2015 1 commit
    • F
      Btrfs: fix extent accounting for partial direct IO writes · 9c9464cc
      Filipe Manana committed
      When doing a write using direct IO we can end up not doing the whole write
      operation using the direct IO path, in which case we fall back to a buffered
      write to do the remaining IO. This happens for example if the range we are
      writing to contains a compressed extent.
      When we do a partial write and fall back to buffered IO, due to the
      existence of a compressed extent for example, we end up not adjusting the
      outstanding extents counter of our inode, which ends up getting decremented
      twice, once by the DIO ordered extent for the partial write and once again
      by btrfs_direct_IO(), resulting in an arithmetic underflow at
      extent-tree.c:drop_outstanding_extent(). For example if we have:
      
        extents        [ prealloc extent ] [ compressed extent ]
        offsets        A        B          C       D           E
      
      and at the moment our inode's outstanding extents counter is 0, if we do a
      direct IO write against the range [B, D[ (which has a length smaller than
      128MiB), we end up bumping our inode's outstanding extents counter to 1, we
      create a DIO ordered extent for the range [B, C[ and then fall back to a
      buffered write for the range [C, D[. The direct IO handler
      (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
      1, leaving it with a value of 0, through a call to
      btrfs_delalloc_release_space(). Shortly after, the DIO ordered extent
      finishes and calls btrfs_delalloc_release_metadata(), which ends up
      attempting to decrement the inode's outstanding extents counter by 1,
      resulting in an assertion failure at drop_outstanding_extent() because
      the operation would result in an arithmetic underflow (0 - 1). This
      produces the following trace:
      
        [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
        [125471.338844] ------------[ cut here ]------------
        [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
        [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
        [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
        [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
        [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
        [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
        [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
        [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
        [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
        [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
        [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
        [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
        [125471.340745] Stack:
        [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
        [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
        [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
        [125471.340745] Call Trace:
        [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
        [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
        [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
        [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
        [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
        [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
        [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
        [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745]  RSP <ffff88040a11bc78>
        [125471.539620] ---[ end trace 144259f7838b4aa4 ]---
      
      So fix this by ensuring we adjust the outstanding extents counter when we
      do the fallback just like we do for the case where the whole write can be
      done through the direct IO path.
      
      We were also adjusting the outstanding extents counter by a constant value
      of 1, which is incorrect because we were ignoring that we account extents
      in BTRFS_MAX_EXTENT_SIZE units, so fix that as well.
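
As a rough illustration of the accounting described above, here is a toy Python model. It assumes BTRFS_MAX_EXTENT_SIZE is 128 MiB (the "128Mb" mentioned in this message); the names `count_max_extents` and `InodeCounter` are hypothetical helpers for this sketch and not the kernel's API.

```python
# Toy model (not kernel code) of an inode's outstanding extents counter.
BTRFS_MAX_EXTENT_SIZE = 128 * 1024 * 1024  # assumed: 128 MiB


def count_max_extents(num_bytes):
    """Number of BTRFS_MAX_EXTENT_SIZE units needed to cover num_bytes."""
    return (num_bytes + BTRFS_MAX_EXTENT_SIZE - 1) // BTRFS_MAX_EXTENT_SIZE


class InodeCounter:
    """Hypothetical stand-in for the inode's outstanding_extents field."""

    def __init__(self):
        self.outstanding_extents = 0

    def add(self, num_bytes):
        self.outstanding_extents += count_max_extents(num_bytes)

    def drop(self, num_bytes):
        num_extents = count_max_extents(num_bytes)
        # Mirrors the spirit of the assertion in drop_outstanding_extent().
        if self.outstanding_extents < num_extents:
            raise AssertionError("outstanding_extents underflow")
        self.outstanding_extents -= num_extents


K, M = 1024, 1024 * 1024
small_write_units = count_max_extents(80 * K)    # a single unit
large_write_units = count_max_extents(256 * M)   # two units, not a constant 1
```

Under these assumptions a 256 MiB write accounts for two units, so adjusting the counter by a constant 1 is wrong for large writes, and dropping twice for one reserved unit trips the underflow check.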
      
      The following test case for fstests reproduces this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "falloc"
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create a compressed extent covering the range [700K, 800K[.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Create prealloc extent covering the range [600K, 700K[.
        $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
      
        # Write 80K of data to the range [640K, 720K[ using direct IO. This
        # range covers both the prealloc extent and the compressed extent.
        # Because there's a compressed extent in the range we are writing to,
        # the DIO write code path ends up only writing the first 60K of data,
        # which goes to the prealloc extent, and then falls back to buffered IO
        # for writing the remaining 20K of data - because that remaining data
        # maps to a file range containing a compressed extent.
        # When falling back to buffered IO, we used to trigger an assertion when
        # releasing reserved space due to bad accounting of the inode's
        # outstanding extents counter, which was set to 1 but we ended up
        # decrementing it by 1 twice, once through the ordered extent for the
        # 60K of data we wrote using direct IO, and once through the main direct
        # IO handler (inode.c:btrfs_direct_IO()) because the direct IO write
        # wrote less than 80K of data (60K).
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now similar test as above but for very large write operations. This
        # triggers special cases for an inode's outstanding extents accounting,
        # as internally btrfs logically splits extents into 128MiB units.
        $XFS_IO_PROG -f -s \
            -c "pwrite -S 0xaa -b 128M 258M 128M" \
            -c "falloc 0 258M" \
            $SCRATCH_MNT/bar | _filter_xfs_io
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
            | _filter_xfs_io
      
        # Now verify the file contents are correct and that they are the same
        # even after unmounting and mounting the fs again (or evicting the page
        # cache).
        #
        # For file foo, all bytes in the range [0, 640K[ must have a value of
        # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
        # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
        #
        # For file bar, all bytes in the range [0, 3M[ must have a value of 0x00,
        # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
        # bytes in the range [259M, 386M[ must have a value of 0xaa.
        #
        echo "File digests before remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
        _scratch_remount
        echo "File digests after remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
      
        status=0
        exit
      
      Fixes: e1cbbfa5 ("Btrfs: fix outstanding_extents accounting in DIO")
      Fixes: 3e05bde8 ("Btrfs: only adjust outstanding_extents when we do a short write")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      9c9464cc
  10. 27 Oct, 2015 1 commit
  11. 26 Oct, 2015 1 commit
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana committed
      In the kernel 4.2 merge window we had big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn causes the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens in a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2, is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3, is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota == Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota == Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->ref_mod != 1 (equals 2).
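
The scenario above can be replayed with a small toy model in Python. This is only a sketch of the merge rule as described in this message, not the kernel's btrfs_merge_delayed_refs(); the `Ref` class, the `merge_refs` function and its `honor_no_quota` flag are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass, replace

ADD, DROP = "BTRFS_ADD_DELAYED_REF", "BTRFS_DROP_DELAYED_REF"


@dataclass
class Ref:
    name: str
    action: str
    no_quota: int
    ref_mod: int = 1


def merge_refs(refs, honor_no_quota):
    """Merge refs for one bytenr; optionally treat no_quota as a merge barrier."""
    merged = []
    for ref in refs:
        for head in merged:
            if honor_no_quota and head.no_quota != ref.no_quota:
                continue  # incompatible, try the next candidate
            if head.action == ref.action:
                head.ref_mod += ref.ref_mod
            else:
                head.ref_mod -= ref.ref_mod
                if head.ref_mod < 0:
                    head.action, head.ref_mod = ref.action, -head.ref_mod
            break
        else:
            merged.append(replace(ref))  # copy, so the input list is untouched
    return [r for r in merged if r.ref_mod != 0]


def run_delayed_tree_ref(ref):
    # Stand-in for BUG_ON(node->ref_mod != 1).
    if ref.ref_mod != 1:
        raise AssertionError(f"{ref.name}: ref_mod == {ref.ref_mod}")


SCENARIO = [Ref("Ref1", ADD, 1), Ref("Ref2", DROP, 0),
            Ref("Ref3", ADD, 1), Ref("Ref4", DROP, 0)]

with_field = merge_refs(SCENARIO, honor_no_quota=True)
without_field = merge_refs(SCENARIO, honor_no_quota=False)
```

With the no_quota barrier, the model ends with Ref1 and Ref2 both at ref_mod 2, so running Ref1 fails the check. Without it, the add and drop references cancel each other and nothing with ref_mod != 1 is left to run.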
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debug this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes a deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst on the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eliminating check_ref(), which is no longer
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: Elias Probst <mail@eliasprobst.eu>
      Reported-by: Peter Becker <floyd.net@gmail.com>
      Reported-by: Malte Schröder <malte@tnxip.de>
      Reported-by: Derek Dongray <derek@valedon.co.uk>
      Reported-by: Erkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
  12. 22 Oct, 2015 10 commits
  13. 17 Oct, 2015 1 commit
    • F
      Btrfs: fix truncation of compressed and inlined extents · 0305cd5f
      Filipe Manana committed
      When truncating a file to a smaller size which consists of an inline
      extent that is compressed, we did not discard (or make unusable) the
      data between the new file size and the old file size, wasting metadata
      space and allowing for the truncated data to be leaked and for the data
      corruption/loss mentioned below.
      We were also not correctly decrementing the number of bytes used by the
      inode; we were setting it to zero, giving a wrong report for callers of
      the stat(2) syscall. The fsck tool also reported an error about a mismatch
      between the nbytes of the file versus the real space used by the file.
      
      Now because we weren't discarding the truncated region of the file, it
      was possible for a caller of the clone ioctl to actually read the data
      that was truncated, allowing for a security breach without requiring root
      access to the system, using only standard filesystem operations. The
      scenario is the following:
      
         1) User A creates a file which consists of an inline and compressed
            extent with a size of 2000 bytes - the file is not accessible to
            any other users (no read, write or execution permission for anyone
            else);
      
         2) User A truncates the file to a size of 1000 bytes;
      
         3) User A makes the file world readable;
      
         4) User B creates a file consisting of an inline extent of 2000 bytes;
      
         5) User B issues a clone operation from user A's file into its own
            file (using a length argument of 0, which clones the whole range);
      
         6) User B now gets to see the 1000 bytes that user A truncated from
            its file before it made its file world readable. User B also lost
            the bytes in the range [1000, 2000[ from its own file, but
            that might be ok if his/her intention was reading stale data from
            user A that was never supposed to be public.
      
      Note that this contrasts with the case where we truncate a file from 2000
      bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
      this case reading any byte from the range [1000, 2000[ will return a value
      of 0x00, instead of the original data.
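
The expected behaviour described here (shrinking and then growing a file must expose zeroes, never the old bytes) can be checked from user space. The following is a small sketch using only the Python standard library; `truncate_roundtrip` is a hypothetical helper name, and the sizes mirror the 2000/1000 byte example above.

```python
import os
import tempfile


def truncate_roundtrip(original: bytes, small_size: int) -> bytes:
    """Write `original`, shrink the file to `small_size`, grow it back, read it."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(original)
        path = f.name
    try:
        os.truncate(path, small_size)      # e.g. 2000 -> 1000 bytes
        os.truncate(path, len(original))   # e.g. 1000 -> 2000 bytes
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)


data = truncate_roundtrip(b"\xaa" * 2000, 1000)
head, tail = data[:1000], data[1000:]
# head keeps the original bytes; tail must be all zero bytes.
```

This only demonstrates the contract that applications rely on; it does not exercise the clone ioctl path through which the stale data could be read.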
      
      This problem exists since the clone ioctl was added and happens both with
      and without my recent data loss and file corruption fixes for the clone
      ioctl (patch "Btrfs: fix file corruption and data loss after cloning
      inline extents").
      
      So fix this by truncating the compressed inline extents as we do for the
      non-compressed case, which involves decompressing, if the data isn't already
      in the page cache, compressing the truncated version of the extent, writing
      the compressed content into the inline extent and then truncating it.
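
Conceptually, the fix amounts to a decompress, cut, recompress cycle. The sketch below models that with Python's zlib module as a stand-in for the filesystem's compression; the function name `truncate_compressed_inline` is an assumption for this illustration and the code says nothing about how the kernel lays out the inline extent item.

```python
import zlib


def truncate_compressed_inline(compressed: bytes, new_size: int) -> bytes:
    """Return a compressed payload holding only the first new_size bytes."""
    data = zlib.decompress(compressed)     # recover the uncompressed content
    return zlib.compress(data[:new_size])  # recompress only what survives


# Same shape as file foo in the test case below: 128 bytes of 0xa1 followed
# by 384 bytes of 0x2a, truncated to 128 bytes.
old_extent = zlib.compress(b"\xa1" * 128 + b"\x2a" * 384)
new_extent = truncate_compressed_inline(old_extent, 128)
remaining = zlib.decompress(new_extent)
```

After the cycle the truncated bytes are no longer recoverable from the stored payload, which is the property that the old behaviour (only updating i_size) did not provide.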
      
      The following test case for fstests reproduces the problem. In order for
      the test to pass both this fix and my previous fix for the clone ioctl
      that forbids cloning a smaller inline extent into a larger one,
      which is titled "Btrfs: fix file corruption and data loss after cloning
      inline extents", are needed. Without that other fix the test fails in a
      different way that does not leak the truncated data; instead, part of the
      destination file gets replaced with zeroes (because the destination file
      has a larger inline extent than the source).
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create our test files. File foo is going to be the source of a clone operation
        # and consists of a single inline extent with an uncompressed size of 512 bytes,
        # while file bar consists of a single inline extent with an uncompressed size of
        # 256 bytes. For our test's purpose, it's important that file bar has an inline
        # extent with a size smaller than foo's inline extent.
        $XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128"   \
                -c "pwrite -S 0x2a 128 384" \
                $SCRATCH_MNT/foo | _filter_xfs_io
        $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io
      
        # Now durably persist all metadata and data. We do this to make sure that we get
        # on disk an inline extent with a size of 512 bytes for file foo.
        sync
      
        # Now truncate our file foo to a smaller size. Because it consists of a
        # compressed and inline extent, btrfs did not shrink the inline extent to the
        # new size (if the extent was not compressed, btrfs would shrink it to 128
        # bytes), it only updates the inode's i_size to 128 bytes.
        $XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo
      
        # Now clone foo's inline extent into bar.
        # This clone operation should fail with errno EOPNOTSUPP because the source
        # file consists only of an inline extent and the file's size is smaller than
        # the inline extent of the destination (128 bytes < 256 bytes). However the
        # clone ioctl was not prepared to deal with a file that has a size smaller
        # than the size of its inline extent (something that happens only for compressed
        # inline extents), resulting in copying the full inline extent from the source
        # file into the destination file.
        #
        # Note that btrfs' clone operation for inline extents consists of removing the
        # inline extent from the destination inode and copy the inline extent from the
        # source inode into the destination inode, meaning that if the destination
        # inode's inline extent is larger (N bytes) than the source inode's inline
        # extent (M bytes), some bytes (N - M bytes) will be lost from the destination
        # file. Btrfs could copy the source inline extent's data into the destination's
        # inline extent so that we would not lose any data, but that's currently not
        # done due to the complexity that would be needed to deal with such cases
        # (especially when one or both extents are compressed), returning EOPNOTSUPP, as
        # it's normally not a very common case to clone very small files (only case
        # where we get inline extents) and copying inline extents does not save any
        # space (unlike for normal, non-inlined extents).
        $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar
      
        # Now because the above clone operation used to succeed, and due to foo's inline
        # extent not being shrunk by the truncate operation, our file bar got the whole
        # inline extent copied from foo, making us lose the last 128 bytes from bar
        # which got replaced by the bytes in range [128, 256[ from foo before foo was
        # truncated - in other words, data loss from bar and being able to read old and
        # stale data from foo that should not be possible to read anymore through normal
        # filesystem operations. Contrast with the case where we truncate a file from a
        # size N to a smaller size M, truncate it back to size N and then read the range
        # [M, N[, we should always get the value 0x00 for all the bytes in that range.
      
        # We expected the clone operation to fail with errno EOPNOTSUPP and therefore
        # not modify our file bar's data/metadata. So its content should be 256 bytes
        # long with all bytes having the value 0xbb.
        #
        # Without the btrfs bug fix, the clone operation succeeded and resulted in
        # leaking truncated data from foo, the bytes that belonged to its range
        # [128, 256[, and losing data from bar in that same range. So reading the
        # file gave us the following content:
        #
        # 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
        # *
        # 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
        # *
        # 0000400
        echo "File bar's content after the clone operation:"
        od -t x1 $SCRATCH_MNT/bar
      
        # Also because the foo's inline extent was not shrunk by the truncate
        # operation, btrfs' fsck, which is run by the fstests framework every time a
        # test completes, failed reporting the following error:
        #
        #  root 5 inode 257 errors 400, nbytes wrong
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      0305cd5f
  14. 11 Oct, 2015 1 commit
  15. 08 Oct, 2015 1 commit
  16. 22 Sep, 2015 1 commit
    • C
      Btrfs: Direct I/O: Fix space accounting · 50745b0a
      chandan committed
      The following call trace is seen when the generic/095 test is executed:
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O, a task which successfully
      used up its reserved data space sets the BTRFS_INODE_DIO_READY bit and does
      not clear the bit after finishing the DIO write. A future DIO write could
      actually fail and the unused reserved space won't be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.

      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue:
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused "reserve space" can now be freed
      reliably.
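
A toy Python model of why per-call bookkeeping helps. It is only a sketch under stated assumptions: `SpacePool`, `DioData` and `direct_write` are hypothetical names, and `DioData` is merely inspired by the "struct btrfs_dio_data" named above, without claiming to match its fields.

```python
import errno


class SpacePool:
    """Data space reserved but not yet owned by an ordered extent."""

    def __init__(self):
        self.reserved = 0


class DioData:
    """Per-call state: how much of this call's reservation is still unused."""

    def __init__(self, reserve):
        self.reserve = reserve


def direct_write(pool, length, chunks, fail_at=None):
    """Model one DIO write; `chunks` are the ordered extent sizes it creates."""
    pool.reserved += length
    dio = DioData(length)
    result = 0
    for index, chunk in enumerate(chunks):
        if index == fail_at:
            result = -errno.EFAULT   # e.g. the splice() case in the table
            break
        # The ordered extent takes ownership of `chunk` bytes of reservation.
        pool.reserved -= chunk
        dio.reserve -= chunk
        result += chunk
    # Release exactly what this call left unused, independent of other tasks.
    pool.reserved -= dio.reserve
    return result


pool = SpacePool()
task_a = direct_write(pool, 8192, [4096, 4096])            # succeeds fully
task_b = direct_write(pool, 4096, [4096], fail_at=0)       # fails at once
```

Because each call carries its own state, a failure in one task releases its own unused reservation regardless of what a concurrent task on the same inode did, which a single per-inode flag could not express.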
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      50745b0a