1. 18 2月, 2016 2 次提交
  2. 26 1月, 2016 1 次提交
    • F
      Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Filipe Manana 提交于
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and lead to a deadlock when doing
      AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de0ee0ed
  3. 23 1月, 2016 1 次提交
    • A
      wrappers for ->i_mutex access · 5955102c
      Al Viro 提交于
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  4. 20 1月, 2016 2 次提交
    • Z
      btrfs: merge functions for wait snapshot creation · 0bc19f90
      Zhao Lei 提交于
      wait_for_snapshot_creation() is in same group with oher two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it into same place
      with other two.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0bc19f90
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana 提交于
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires multiple times a
      semaphore it should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
      
      Fix this by simplifying the implementation and use a mutex instead that
      is acquired by the cleaner kthread before it runs the delayed iputs
      instead of always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c2d6cb16
  5. 15 1月, 2016 1 次提交
    • V
      kmemcg: account certain kmem allocations to memcg · 5d097056
      Vladimir Davydov 提交于
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d097056
  6. 07 1月, 2016 10 次提交
  7. 01 1月, 2016 2 次提交
    • F
      Btrfs: fix number of transaction units required to create symlink · 9269d12b
      Filipe Manana 提交于
      We weren't accounting for the insertion of an inline extent item for the
      symlink inode nor that we need to update the parent inode item (through
      the call to btrfs_add_nondir()). So fix this by including two more
      transaction units.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      9269d12b
    • F
      Btrfs: don't leave dangling dentry if symlink creation failed · d50866d0
      Filipe Manana 提交于
      When we are creating a symlink we might fail with an error after we
      created its inode and added the corresponding directory indexes to its
      parent inode. In this case we end up never removing the directory indexes
      because the inode eviction handler, called for our symlink inode on the
      final iput(), only removes items associated with the symlink inode and
      not with the parent inode.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt
        $ touch /mnt/foo
        $ ln -s /mnt/foo /mnt/bar
        ln: failed to create symbolic link ‘bar’: Cannot allocate memory
        $ umount /mnt
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
      	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
        found 131073 bytes used err is 1
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124305
        file data blocks allocated: 262144
         referenced 262144
        btrfs-progs v4.2.3
      
      So fix this by adding the directory index entries as the very last
      step of symlink creation.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      d50866d0
  8. 31 12月, 2015 1 次提交
  9. 17 12月, 2015 3 次提交
    • F
      Btrfs: fix leaking of ordered extents after direct IO write error · f28a4928
      Filipe Manana 提交于
      When doing a direct IO write, __blockdev_direct_IO() can call the
      btrfs_get_blocks_direct() callback one or more times before it calls the
      btrfs_submit_direct() callback. However it can fail after calling the
      first callback and before calling the second callback, which is a problem
      because the first one creates ordered extents and the second one is the
      one that submits bios that cover the ordered extents created by the first
      one. That means the ordered extents will never complete nor have any of
      the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, resulting in
      subsequent operations (such as other direct IO writes, buffered writes or
      hole punching) that lock the same IO range and lookup for ordered extents
      in the range to hang forever waiting for those ordered extents because
      they can not complete ever, since no bio was submitted.
      
      Fix this by tracking a range of created ordered extents that don't have
      yet corresponding bios submitted and completing the ordered extents in
      the range if __blockdev_direct_IO() fails with an error.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      f28a4928
    • F
      Btrfs: fix deadlock between direct IO write and defrag/readpages · b850ae14
      Filipe Manana 提交于
      If readpages() (triggered by defrag or buffered reads) is called while a
      direct IO write is in progress, we have a small time window where we can
      deadlock, resulting in traces like the following being generated:
      
      [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
      [84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
      [84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
      [84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
      [84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
      [84723.232085] Call Trace:
      [84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
      [84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
      [84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
      [84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
      [84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
      [84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
      [84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
      [84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
      [84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
      [84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
      [84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
      [84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
      [84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
      [84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
      [84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
      [84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
      [84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
      [84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
      [84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
      [84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
      [84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
      [84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
      [84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
      [84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
      [84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
      [84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
      [84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
      [84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
      [84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [84723.279595] INFO: lockdep is turned off.
      [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
      [84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
      [84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
      [84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
      [84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
      [84723.388620] Call Trace:
      [84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
      [84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
      [84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
      [84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
      [84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
      [84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
      [84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
      [84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
      [84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
      [84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
      [84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
      [84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
      [84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
      [84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      Consider the following logical and physical file layout:
      
      logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                      4K                    8K                    16K
      
      physical:   ... 12853248              12857344              1103101952   ...
                                            (= 12853248 + 4K)
      
      Extents A and B are physically adjacent. The following diagram shows a
      sequence of events that lead to the deadlock when we attempt to do a
      direct IO write against the file range [4K, 16K[ and a defrag is triggered
      simultaneously.
      
                 CPU 1                                               CPU 2
      
       btrfs_direct_IO()
      
         btrfs_get_blocks_direct()
           creates ordered extent A, covering
           the 4k prealloc extent A (range [4K, 8K[)
      
                                                          btrfs_defrag_file()
                                                            page_cache_sync_readahead([0K, 1M[)
                                                              btrfs_readpages()
                                                                extent_readpages()
      
                                                                  locks all pages in the file
                                                                  range [0K, 128K[ through calls
                                                                  to add_to_page_cache_lru()
      
                                                                  __do_contiguous_readpages()
      
                                                                     finds ordered extent A
      
                                                                     waits for it to complete
      
         btrfs_get_blocks_direct() called again
      
           lock_extent_direct(range [8K, 16K[)
      
             finds a page in range [8K, 16K[ through
             btrfs_page_exists_in_range()
      
             invalidate_inode_pages2_range([8K, 16K[)
      
               --> tries to lock pages that are already
                   locked by the task at CPU 2
      
               --> our task, running __blockdev_direct_IO(),
                   hangs waiting to lock the pages and the
                   submit bio callback, btrfs_submit_direct(),
                   ends up never being called, resulting in the
                   ordered extent A never completing (because a
                   corresponding bio is never submitted) and
                   CPU 2 will wait for it forever while holding
                   the pages locked
                    ---> deadlock!
      
      Fix this by removing the page invalidation approach when attempting to
      lock the range for IO from the callback btrfs_get_blocks_direct() and
      falling back buffered IO. This was a rare case anyway and well behaved
      applications do not mix concurrent direct IO writes with buffered reads
      anyway, being a concurrent defrag the only normal case that could lead
      to the deadlock.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      b850ae14
    • F
      Btrfs: fix error path when failing to submit bio for direct IO write · 14543774
      Filipe Manana 提交于
      Commit 61de718f ("Btrfs: fix memory corruption on failure to submit
      bio for direct IO") fixed problems with the error handling code after we
      fail to submit a bio for direct IO. However there were 2 problems that it
      did not address when the failure is due to memory allocation failures for
      direct IO writes:
      
      1) We considered that there could be only one ordered extent for the whole
         IO range, which is not always true, as we can have multiple;
      
      2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
         which can make other tasks running btrfs_wait_logged_extents() hang
         forever, since they wait for that bit to be set. The general assumption
         is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
         and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
      
      Fix these issues by moving part of the btrfs_endio_direct_write() handler
      into a new helper function and having that new helper function called when
      we fail to allocate memory to submit the bio (and its private object) for
      a direct IO write.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      14543774
  10. 09 12月, 2015 2 次提交
    • A
      replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Al Viro 提交于
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
      	* inode and dentry are passed separately
      	* might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
      	* when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; said that, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b255391
    • A
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro 提交于
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  11. 07 12月, 2015 2 次提交
  12. 03 12月, 2015 3 次提交
  13. 25 11月, 2015 1 次提交
    • F
      Btrfs: use global reserve when deleting unused block group after ENOSPC · 8eab77ff
      Filipe Manana 提交于
      It's possible to reach a state where the cleaner kthread isn't able to
      start a transaction to delete an unused block group due to lack of enough
      free metadata space and due to lack of unallocated device space to allocate
      a new metadata block group as well. If this happens try to use space from
      the global block group reserve just like we do for unlink operations, so
      that we don't reach a permanent state where starting a transaction for
      filesystem operations (file creation, renames, etc) keeps failing with
      -ENOSPC. Such an unfortunate state was observed on a machine where over
      a dozen unused data block groups existed and the cleaner kthread was
      failing to delete them due to ENOSPC error when attempting to start a
      transaction, and even running balance with a -dusage=0 filter failed with
      ENOSPC as well. Also unmounting and mounting again the filesystem didn't
      help. Allowing the cleaner kthread to use the global block reserve to
      delete the unused data block groups fixed the problem.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8eab77ff
  14. 10 11月, 2015 1 次提交
  15. 09 11月, 2015 1 次提交
    • F
      Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow · 1d512cb7
      Filipe Manana 提交于
      If we are using the NO_HOLES feature, we have a tiny time window when
      running delalloc for a nodatacow inode where we can race with a concurrent
      link or xattr add operation leading to a BUG_ON.
      
      This happens because at run_delalloc_nocow() we end up casting a leaf item
      of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
      file extent item (struct btrfs_file_extent_item) and then analyse its
      extent type field, which won't match any of the expected extent types
      (values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
      explicit BUG_ON(1).
      
      The following sequence diagram shows how the race happens when running a
      no-cow dellaloc range [4K, 8K[ for inode 257 and we have the following
      neighbour leafs:
      
                   Leaf X (has N items)                    Leaf Y
      
       [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
                    slot N - 2         slot N - 1              slot 0
      
       (Note the implicit hole for inode 257 regarding the [0, 8K[ range)
      
             CPU 1                                         CPU 2
      
       run_dealloc_nocow()
         btrfs_lookup_file_extent()
           --> searches for a key with value
               (257 EXTENT_DATA 4096) in the
               fs/subvol tree
           --> returns us a path with
               path->nodes[0] == leaf X and
               path->slots[0] == N
      
         because path->slots[0] is >=
         btrfs_header_nritems(leaf X), it
         calls btrfs_next_leaf()
      
         btrfs_next_leaf()
           --> releases the path
      
                                                    hard link added to our inode,
                                                    with key (257 INODE_REF 500)
                                                    added to the end of leaf X,
                                                    so leaf X now has N + 1 keys
      
           --> searches for the key
               (257 INODE_REF 256), because
               it was the last key in leaf X
               before it released the path,
               with path->keep_locks set to 1
      
           --> ends up at leaf X again and
               it verifies that the key
               (257 INODE_REF 256) is no longer
               the last key in the leaf, so it
               returns with path->nodes[0] ==
               leaf X and path->slots[0] == N,
               pointing to the new item with
               key (257 INODE_REF 500)
      
         the loop iteration of run_dealloc_nocow()
         does not break out the loop and continues
         because the key referenced in the path
         at path->nodes[0] and path->slots[0] is
         for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
         and its offset (500) is less then our delalloc
         range's end (8192)
      
         the item pointed by the path, an inode reference item,
         is (incorrectly) interpreted as a file extent item and
         we get an invalid extent type, leading to the BUG_ON(1):
      
         if (extent_type == BTRFS_FILE_EXTENT_REG ||
            extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
             (...)
         } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
             (...)
         } else {
             BUG_ON(1)
         }
      
      The same can happen if a xattr is added concurrently and ends up having
      a key with an offset smaller then the delalloc's range end.
      
      So fix this by skipping keys with a type smaller than
      BTRFS_EXTENT_DATA_KEY.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      1d512cb7
  16. 05 11月, 2015 1 次提交
    • F
      Btrfs: fix extent accounting for partial direct IO writes · 9c9464cc
      Filipe Manana 提交于
      When doing a write using direct IO we can end up not doing the whole write
      operation using the direct IO path, in that case we fallback to a buffered
      write to do the remaining IO. This happens for example if the range we are
      writing to contains a compressed extent.
      When we do a partial write and fallback to buffered IO, due to the
      existence of a compressed extent for example, we end up not adjusting the
      outstanding extents counter of our inode which ends up getting decremented
      twice, once by the DIO ordered extent for the partial write and once again
      by btrfs_direct_IO(), resulting in an arithmetic underflow at
      extent-tree.c:drop_outstanding_extent(). For example if we have:
      
        extents        [ prealloc extent ] [ compressed extent ]
        offsets        A        B          C       D           E
      
      and at the moment our inode's outstanding extents counter is 0, if we do a
      direct IO write against the range [B, D[ (which has a length smaller than
      128Mb), we end up bumping our inode's outstanding extents counter to 1, we
      create a DIO ordered extent for the range [B, C[ and then fallback to a
      buffered write for the range [C, D[. The direct IO handler
      (inode.c:btrfs_direct_IO()) decrements the outstanding extents counter by
      1, leaving it with a value of 0, through a call to
      btrfs_delalloc_release_space() and then shortly after the DIO ordered
      extent finishes and calls btrfs_delalloc_release_metadata() which ends
      up to attempt to decrement the inode's outstanding extents counter by 1,
      resulting in an assertion failure at drop_outstanding_extent() because
      the operation would result in an arithmetic underflow (0 - 1). This
      produces the following trace:
      
        [125471.336838] BTRFS: assertion failed: BTRFS_I(inode)->outstanding_extents >= num_extents, file: fs/btrfs/extent-tree.c, line: 5526
        [125471.338844] ------------[ cut here ]------------
        [125471.340745] kernel BUG at fs/btrfs/ctree.h:4173!
        [125471.340745] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        [125471.340745] Modules linked in: btrfs f2fs xfs libcrc32c dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpufreq psmouse i2c_piix4 parport pcspkr serio_raw microcode processor evdev i2c_core button ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci virtio_ring floppy libata virtio e1000 scsi_mod [last unloaded: btrfs]
        [125471.340745] CPU: 10 PID: 23649 Comm: kworker/u32:1 Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
        [125471.340745] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
        [125471.340745] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
        [125471.340745] task: ffff8804244fcf80 ti: ffff88040a118000 task.ti: ffff88040a118000
        [125471.340745] RIP: 0010:[<ffffffffa0550da1>]  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745] RSP: 0018:ffff88040a11bc78  EFLAGS: 00010296
        [125471.340745] RAX: 0000000000000075 RBX: 0000000000005000 RCX: 0000000000000000
        [125471.340745] RDX: ffffffff81098f93 RSI: ffffffff8147c619 RDI: 00000000ffffffff
        [125471.340745] RBP: ffff88040a11bc78 R08: 0000000000000001 R09: 0000000000000000
        [125471.340745] R10: ffff88040a11bc08 R11: ffffffff81651000 R12: ffff8803efb4a000
        [125471.340745] R13: ffff8803efb4a000 R14: 0000000000000000 R15: ffff8802f8e33c88
        [125471.340745] FS:  0000000000000000(0000) GS:ffff88043dd40000(0000) knlGS:0000000000000000
        [125471.340745] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        [125471.340745] CR2: 00007fae7ca86095 CR3: 0000000001a0b000 CR4: 00000000000006e0
        [125471.340745] Stack:
        [125471.340745]  ffff88040a11bc88 ffffffffa04ca0cd ffff88040a11bcc8 ffffffffa04ceeb1
        [125471.340745]  ffff8802f8e33940 ffff8802c93eadb0 ffff8802f8e0bf50 ffff8803efb4a000
        [125471.340745]  0000000000000000 ffff8802f8e33c88 ffff88040a11bd38 ffffffffa04eccfa
        [125471.340745] Call Trace:
        [125471.340745]  [<ffffffffa04ca0cd>] drop_outstanding_extent+0x3d/0x6d [btrfs]
        [125471.340745]  [<ffffffffa04ceeb1>] btrfs_delalloc_release_metadata+0x51/0xdd [btrfs]
        [125471.340745]  [<ffffffffa04eccfa>] btrfs_finish_ordered_io+0x420/0x4eb [btrfs]
        [125471.340745]  [<ffffffffa04ecdda>] finish_ordered_fn+0x15/0x17 [btrfs]
        [125471.340745]  [<ffffffffa050e6e8>] normal_work_helper+0x14c/0x32a [btrfs]
        [125471.340745]  [<ffffffffa050e9c8>] btrfs_endio_write_helper+0x12/0x14 [btrfs]
        [125471.340745]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
        [125471.340745]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
        [125471.340745]  [<ffffffff8106904d>] kthread+0xef/0xf7
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
        [125471.340745]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
        [125471.340745] Code: a5 55 a0 48 89 e5 e8 42 50 bc e0 0f 0b 55 89 f1 48 c7 c2 f0 a8 55 a0 48 89 fe 31 c0 48 c7 c7 14 aa 55 a0 48 89 e5 e8 22 50 bc e0 <0f> 0b 0f 1f 44 00 00 55 31 c9 ba 18 00 00 00 48 89 e5 41 56 41
        [125471.340745] RIP  [<ffffffffa0550da1>] assfail.constprop.46+0x1e/0x20 [btrfs]
        [125471.340745]  RSP <ffff88040a11bc78>
        [125471.539620] ---[ end trace 144259f7838b4aa4 ]---
      
      So fix this by ensuring we adjust the outstanding extents counter when we
      do the fallback just like we do for the case where the whole write can be
      done through the direct IO path.
      
      We were also adjusting the outstanding extents counter by a constant value
      of 1, which is incorrect because we were ignorning that we account extents
      in BTRFS_MAX_EXTENT_SIZE units, o fix that as well.
      
      The following test case for fstests reproduces this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_xfs_io_command "falloc"
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount "-o compress"
      
        # Create a compressed extent covering the range [700K, 800K[.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa -b 100K 700K 100K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Create prealloc extent covering the range [600K, 700K[.
        $XFS_IO_PROG -c "falloc 600K 100K" $SCRATCH_MNT/foo
      
        # Write 80K of data to the range [640K, 720K[ using direct IO. This
        # range covers both the prealloc extent and the compressed extent.
        # Because there's a compressed extent in the range we are writing to,
        # the DIO write code path ends up only writing the first 60k of data,
        # which goes to the prealloc extent, and then falls back to buffered IO
        # for writing the remaining 20K of data - because that remaining data
        # maps to a file range containing a compressed extent.
        # When falling back to buffered IO, we used to trigger an assertion when
        # releasing reserved space due to bad accounting of the inode's
        # outstanding extents counter, which was set to 1 but we ended up
        # decrementing it by 1 twice, once through the ordered extent for the
        # 60K of data we wrote using direct IO, and once through the main direct
        # IO handler (inode.cbtrfs_direct_IO()) because the direct IO write
        # wrote less than 80K of data (60K).
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 80K 640K 80K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now similar test as above but for very large write operations. This
        # triggers special cases for an inode's outstanding extents accounting,
        # as internally btrfs logically splits extents into 128Mb units.
        $XFS_IO_PROG -f -s \
            -c "pwrite -S 0xaa -b 128M 258M 128M" \
            -c "falloc 0 258M" \
            $SCRATCH_MNT/bar | _filter_xfs_io
        $XFS_IO_PROG -d -c "pwrite -S 0xbb -b 256M 3M 256M" $SCRATCH_MNT/bar \
            | _filter_xfs_io
      
        # Now verify the file contents are correct and that they are the same
        # even after unmounting and mounting the fs again (or evicting the page
        # cache).
        #
        # For file foo, all bytes in the range [0, 640K[ must have a value of
        # 0x00, all bytes in the range [640K, 720K[ must have a value of 0xbb
        # and all bytes in the range [720K, 800K[ must have a value of 0xaa.
        #
        # For file bar, all bytes in the range [0, 3M[ must havea value of 0x00,
        # all bytes in the range [3M, 259M[ must have a value of 0xbb and all
        # bytes in the range [259M, 386M[ must have a value of 0xaa.
        #
        echo "File digests before remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
        _scratch_remount
        echo "File digests after remounting the file system:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
      
        status=0
        exit
      
      Fixes: e1cbbfa5 ("Btrfs: fix outstanding_extents accounting in DIO")
      Fixes: 3e05bde8 ("Btrfs: only adjust outstanding_extents when we do a short write")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      9c9464cc
  17. 27 10月, 2015 1 次提交
  18. 26 10月, 2015 1 次提交
    • F
      Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Filipe Manana 提交于
      In the kernel 4.2 merge window we had a big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn cause the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens on a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.
      
        7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->red_mod != 1 (equals 2).
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debugging this issue
      and testing this fix on his multi terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc).
      
      Also, this fixes deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst in the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item we have our path
      holding a write lock on a leaf of the fs/subvol tree and then before
      releasing the path we called check_ref() which did backref walking, when
      qgroups are enabled, and tried to read lock the same leaf. The trace for
      this case is the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eleminating check_ref(), which no longer is
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
      Reported-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: NStéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: NElias Probst <mail@eliasprobst.eu>
      Reported-by: NPeter Becker <floyd.net@gmail.com>
      Reported-by: NMalte Schröder <malte@tnxip.de>
      Reported-by: NDerek Dongray <derek@valedon.co.uk>
      Reported-by: NErkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      b06c4bf5
  19. 22 10月, 2015 4 次提交
    • J
      Btrfs: fix prealloc under heavy fragmentation conditions · 0b670dc4
      Josef Bacik 提交于
      If we are heavily fragmented we will continually try to prealloc the largest
      extent size we can every time we call btrfs_reserve_extent.  This can be very
      expensive when we are heavily fragmented, burning lots of CPU cycles and loops
      through the allocator.  So instead notice when we get a smaller chunk from the
      allocator than what we specified and use this as the new maximum size we try to
      allocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0b670dc4
    • Q
      btrfs: qgroup: Check if qgroup reserved space leaked · 56fa9d07
      Qu Wenruo 提交于
      Add check at btrfs_destroy_inode() time to detect qgroup reserved space
      leak.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      56fa9d07
    • Q
      btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook · 51773bec
      Qu Wenruo 提交于
      In clear_bit_hook, qgroup reserved data is already handled quite well,
      either released by finish_ordered_io or invalidatepage.
      
      So calling btrfs_qgroup_free_data() here is completely meaningless, and
      since btrfs_qgroup_free_data() will lock io_tree, so it can't be called
      with io_tree lock hold.
      
      This patch will add a new function
      btrfs_free_reserved_data_space_noquota() for clear_bit_hook() to cease
      the lockdep warning.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      51773bec
    • Q
      btrfs: Add handler for invalidate page · b9d0b389
      Qu Wenruo 提交于
      For btrfs_invalidatepage() and its variant evict_inode_truncate_page(),
      there will be pages don't reach disk.
      In that case, their reserved space won't be release nor freed by
      finish_ordered_io() nor delayed_ref handler.
      
      So we must free their qgroup reserved space, or we will leaking reserved
      space again.
      
      So this will patch will call btrfs_qgroup_free_data() for
      invalidatepage() and its variant evict_inode_truncate_page().
      
      And due to the nature of new btrfs_qgroup_reserve/free_data() reserved
      space will only be reserved or freed once, so for pages which are
      already flushed to disk, their reserved space will be released and freed
      by delayed_ref handler.
      
      Double free won't be a problem.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b9d0b389