1. 18 Feb, 2016 3 commits
  2. 16 Feb, 2016 1 commit
    • Btrfs: fix direct IO requests not reporting IO error to user space · 1636d1d7
      Committed by Filipe Manana
      If a bio for a direct IO request fails, we were not setting the error in
      the parent bio (the main DIO bio), making us not return the error to
      user space in btrfs_direct_IO(), that is, it made __blockdev_direct_IO()
      return the number of bytes issued for IO and not the error a bio created
      and submitted by btrfs_submit_direct() got from the block layer.
      This essentially happens because when we call:
      
         dio_end_io(dio_bio, bio->bi_error);
      
      It does not set dio_bio->bi_error to the value of the second argument.
      So just add this missing assignment in endio callbacks, just as we do in
      the error path at btrfs_submit_direct() when we fail to clone the dio bio
      or allocate its private object. This follows the convention of what is
      done with other similar APIs such as bio_endio() where the caller is
      responsible for setting the bi_error field in the bio it passes as an
      argument to bio_endio().
      
      This was detected by the new generic test cases 271, 272, 276 and 278
      in xfstests, which essentially set up a dm error target, load the
      error table, do a direct IO write and unload the error table. They
      expect the write to fail with -EIO, which was not getting reported
      when testing against btrfs.
      
      Cc: stable@vger.kernel.org  # 4.3+
      Fixes: 4246a0b6 ("block: add a bi_error field to struct bio")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
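The missing assignment described in this commit can be illustrated with a small userspace model. This is a minimal sketch, not the kernel code: `struct bio_model`, `dio_end_io_model()` and the two `endio_*` functions are hypothetical stand-ins, and the only behaviour modelled is the one the commit message states, namely that the completion helper does not store its second argument in the parent bio.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct bio; only the field the commit talks about. */
struct bio_model {
    int bi_error;
};

/*
 * Toy stand-in for dio_end_io(). As the commit message says, the error
 * passed as the second argument is not stored in dio_bio->bi_error.
 */
static void dio_end_io_model(struct bio_model *dio_bio, int error)
{
    (void)dio_bio;
    (void)error;
}

/* Endio callback before the fix: the parent bio never sees the error. */
static int endio_before_fix(struct bio_model *dio_bio,
                            const struct bio_model *bio)
{
    dio_end_io_model(dio_bio, bio->bi_error);
    return dio_bio->bi_error;
}

/* Endio callback after the fix: the caller sets bi_error itself first. */
static int endio_after_fix(struct bio_model *dio_bio,
                           const struct bio_model *bio)
{
    dio_bio->bi_error = bio->bi_error;
    dio_end_io_model(dio_bio, bio->bi_error);
    return dio_bio->bi_error;
}

/* Self-check: a failed child bio is only visible in the parent after the fix. */
static void check_error_propagation(void)
{
    struct bio_model failed = { .bi_error = -EIO };
    struct bio_model parent_old = { 0 };
    struct bio_model parent_new = { 0 };

    assert(endio_before_fix(&parent_old, &failed) == 0);
    assert(endio_after_fix(&parent_new, &failed) == -EIO);
}
```

Calling `check_error_propagation()` exercises both variants. The design point is the convention the commit message cites for bio_endio(): the caller, not the completion helper, is responsible for filling in `bi_error`.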
  3. 11 Feb, 2016 2 commits
    • btrfs: properly set the termination value of ctx->pos in readdir · bc4ef759
      Committed by David Sterba
      The value of ctx->pos in the last readdir call is supposed to be set to
      INT_MAX for 32-bit compatibility, unless 'pos' is intentionally set to a
      larger value, in which case it's LLONG_MAX.
      
      There's a report from PaX SIZE_OVERFLOW plugin that "ctx->pos++"
      overflows (https://forums.grsecurity.net/viewtopic.php?f=1&t=4284), on a
      64bit arch, where the value is 0x7fffffffffffffff ie. LLONG_MAX before
      the increment.
      
      We can get into that situation like this:
      
      * emit all regular readdir entries
      * still in the same call to readdir, bump the last pos to INT_MAX
      * next call to readdir will not emit any entries, but will reach the
        bump code again, finds pos to be INT_MAX and sets it to LLONG_MAX
      
      Normally this is not a problem, but if we call readdir again, we'll find
      'pos' set to LLONG_MAX and the unconditional increment will overflow.
      
      The report from Victor at
      (http://thread.gmane.org/gmane.comp.file-systems.btrfs/49500) with debugging
      print shows that pattern:
      
       Overflow: e
       Overflow: 7fffffff
       Overflow: 7fffffffffffffff
       PAX: size overflow detected in function btrfs_real_readdir
         fs/btrfs/inode.c:5760 cicus.935_282 max, count: 9, decl: pos; num: 0;
         context: dir_context;
       CPU: 0 PID: 2630 Comm: polkitd Not tainted 4.2.3-grsec #1
       Hardware name: Gigabyte Technology Co., Ltd. H81ND2H/H81ND2H, BIOS F3 08/11/2015
        ffffffff81901608 0000000000000000 ffffffff819015e6 ffffc90004973d48
        ffffffff81742f0f 0000000000000007 ffffffff81901608 ffffc90004973d78
        ffffffff811cb706 0000000000000000 ffff8800d47359e0 ffffc90004973ed8
       Call Trace:
        [<ffffffff81742f0f>] dump_stack+0x4c/0x7f
        [<ffffffff811cb706>] report_size_overflow+0x36/0x40
        [<ffffffff812ef0bc>] btrfs_real_readdir+0x69c/0x6d0
        [<ffffffff811dafc8>] iterate_dir+0xa8/0x150
        [<ffffffff811e6d8d>] ? __fget_light+0x2d/0x70
        [<ffffffff811dba3a>] SyS_getdents+0xba/0x1c0
       Overflow: 1a
        [<ffffffff811db070>] ? iterate_dir+0x150/0x150
        [<ffffffff81749b69>] entry_SYSCALL_64_fastpath+0x12/0x83
      
      The jump from 7fffffff to 7fffffffffffffff happens when new dir entries
      are not yet synced and are processed from the delayed list. Then the code
      could go to the bump section again even though it might not emit any new
      dir entries from the delayed list.
      
      The fix avoids entering the "bump" section again once we've finished
      emitting the entries, both for synced and delayed entries.
      
      References: https://forums.grsecurity.net/viewtopic.php?f=1&t=4284
      Reported-by: Victor <services@swwu.com>
      CC: stable@vger.kernel.org
      Signed-off-by: David Sterba <dsterba@suse.com>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
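The three-call sequence from this commit message (emit entries, bump to INT_MAX, bump to LLONG_MAX, then overflow) can be replayed in a small userspace model. This is a simplified sketch under stated assumptions: `struct dir_ctx_model` and `readdir_model()` are hypothetical, the synced and delayed entry lists are collapsed into a single counter, and the `fixed` flag only approximates the idea of skipping the bump section when nothing was emitted. Because signed overflow is undefined behaviour in C, the model reports the would-be overflow instead of performing it.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy stand-in for struct dir_context; only the position is modelled. */
struct dir_ctx_model {
    long long pos;
};

/*
 * One readdir call over a directory with nr_entries entries.
 * Returns false if the unconditional "ctx->pos++" would overflow.
 */
static bool readdir_model(struct dir_ctx_model *ctx, long long nr_entries,
                          bool fixed)
{
    bool emitted = false;

    /* Emit all remaining entries. */
    while (ctx->pos < nr_entries) {
        ctx->pos++;
        emitted = true;
    }

    /* Fixed variant: do not touch pos again once entries were finished. */
    if (fixed && ctx->pos > 2 && !emitted)
        return true;

    /* "Bump" section: increment, then set the termination value. */
    if (ctx->pos == LLONG_MAX)
        return false; /* ctx->pos++ would overflow here */
    ctx->pos++;
    ctx->pos = (ctx->pos >= INT_MAX) ? LLONG_MAX : INT_MAX;
    return true;
}

/* Self-check: the old flow overflows on the third call, the fixed one does not. */
static void check_readdir_model(void)
{
    struct dir_ctx_model old_ctx = { 0 };
    struct dir_ctx_model new_ctx = { 0 };

    assert(readdir_model(&old_ctx, 14, false) && old_ctx.pos == INT_MAX);
    assert(readdir_model(&old_ctx, 14, false) && old_ctx.pos == LLONG_MAX);
    assert(!readdir_model(&old_ctx, 14, false));

    assert(readdir_model(&new_ctx, 14, true) && new_ctx.pos == INT_MAX);
    assert(readdir_model(&new_ctx, 14, true) && new_ctx.pos == INT_MAX);
    assert(readdir_model(&new_ctx, 14, true) && new_ctx.pos == INT_MAX);
}
```

In the fixed variant the position stays at INT_MAX no matter how often readdir is called again, so the increment is never reached with LLONG_MAX.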
    • btrfs: readdir: use GFP_KERNEL · 49e350a4
      Committed by David Sterba
      Readdir is initiated from userspace and is not on the critical
      writeback path, so we don't need to use GFP_NOFS for allocations.
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 02 Feb, 2016 8 commits
  5. 26 Jan, 2016 1 commit
    • Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Committed by Filipe Manana
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and lead to a deadlock when doing
      AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 23 Jan, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
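The wrapper pattern from this commit, inode_foo(inode) being mutex_foo(&inode->i_mutex), can be sketched in userspace as follows. This is only an illustration: `struct mutex` and the `mutex_*` functions here are toy, single-threaded stand-ins written for this example (the kernel's real mutex API has different types and semantics), and `struct inode` is reduced to the one field that matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-ins for the kernel mutex API (assumption). */
struct mutex {
    bool locked;
};

static void mutex_lock(struct mutex *m)
{
    assert(!m->locked);
    m->locked = true;
}

static void mutex_lock_nested(struct mutex *m, unsigned int subclass)
{
    (void)subclass; /* the lockdep subclass has no meaning in this toy model */
    mutex_lock(m);
}

static void mutex_unlock(struct mutex *m)
{
    assert(m->locked);
    m->locked = false;
}

static bool mutex_trylock(struct mutex *m)
{
    if (m->locked)
        return false;
    m->locked = true;
    return true;
}

static bool mutex_is_locked(struct mutex *m)
{
    return m->locked;
}

/* Toy inode with only the lock. */
struct inode {
    struct mutex i_mutex;
};

/* The wrappers: callers no longer touch ->i_mutex directly. */
static inline void inode_lock(struct inode *inode)
{
    mutex_lock(&inode->i_mutex);
}

static inline void inode_lock_nested(struct inode *inode, unsigned int subclass)
{
    mutex_lock_nested(&inode->i_mutex, subclass);
}

static inline void inode_unlock(struct inode *inode)
{
    mutex_unlock(&inode->i_mutex);
}

static inline bool inode_trylock(struct inode *inode)
{
    return mutex_trylock(&inode->i_mutex);
}

static inline bool inode_is_locked(struct inode *inode)
{
    return mutex_is_locked(&inode->i_mutex);
}
```

Routing every access through such wrappers means the lock type behind them can change later (the commit message announces a switch to an rwsem) without touching each caller again.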
  7. 20 Jan, 2016 2 commits
    • btrfs: merge functions for wait snapshot creation · 0bc19f90
      Committed by Zhao Lei
      wait_for_snapshot_creation() is in the same group as the other two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it to the same place
      as the other two.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Committed by Filipe Manana
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires multiple times a
      semaphore it should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
      
      Fix this by simplifying the implementation and using a mutex instead,
      which is acquired by the cleaner kthread before it runs the delayed iputs,
      instead of always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 15 Jan, 2016 1 commit
    • kmemcg: account certain kmem allocations to memcg · 5d097056
      Committed by Vladimir Davydov
      Mark those kmem allocations that are known to be easily triggered from
      userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
      memcg.  For the list, see below:
      
       - threadinfo
       - task_struct
       - task_delay_info
       - pid
       - cred
       - mm_struct
       - vm_area_struct and vm_region (nommu)
       - anon_vma and anon_vma_chain
       - signal_struct
       - sighand_struct
       - fs_struct
       - files_struct
       - fdtable and fdtable->full_fds_bits
       - dentry and external_name
       - inode for all filesystems. This is the most tedious part, because
         most filesystems overwrite the alloc_inode method.
      
      The list is far from complete, so feel free to add more objects.
      Nevertheless, it should be close to "account everything" approach and
      keep most workloads within bounds.  Malevolent users will be able to
      breach the limit, but this was possible even with the former "account
      everything" approach (simply because it did not account everything in
      fact).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 07 Jan, 2016 10 commits
  10. 01 Jan, 2016 2 commits
    • Btrfs: fix number of transaction units required to create symlink · 9269d12b
      Committed by Filipe Manana
      We weren't accounting for the insertion of an inline extent item for the
      symlink inode nor that we need to update the parent inode item (through
      the call to btrfs_add_nondir()). So fix this by including two more
      transaction units.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: don't leave dangling dentry if symlink creation failed · d50866d0
      Committed by Filipe Manana
      When we are creating a symlink we might fail with an error after we
      created its inode and added the corresponding directory indexes to its
      parent inode. In this case we end up never removing the directory indexes
      because the inode eviction handler, called for our symlink inode on the
      final iput(), only removes items associated with the symlink inode and
      not with the parent inode.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt
        $ touch /mnt/foo
        $ ln -s /mnt/foo /mnt/bar
        ln: failed to create symbolic link ‘bar’: Cannot allocate memory
        $ umount /mnt
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: d5acb5ba-31bd-42da-b456-89dca2e716e1
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
      	unresolved ref dir 256 index 3 namelen 3 name bar filetype 7 errors 4, no inode ref
        found 131073 bytes used err is 1
        total csum bytes: 0
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 124305
        file data blocks allocated: 262144
         referenced 262144
        btrfs-progs v4.2.3
      
      So fix this by adding the directory index entries as the very last
      step of symlink creation.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  11. 31 Dec, 2015 1 commit
  12. 17 Dec, 2015 3 commits
    • Btrfs: fix leaking of ordered extents after direct IO write error · f28a4928
      Committed by Filipe Manana
      When doing a direct IO write, __blockdev_direct_IO() can call the
      btrfs_get_blocks_direct() callback one or more times before it calls the
      btrfs_submit_direct() callback. However it can fail after calling the
      first callback and before calling the second callback, which is a problem
      because the first one creates ordered extents and the second one is the
      one that submits bios that cover the ordered extents created by the first
      one. That means the ordered extents will never complete nor have any of
      the flags BTRFS_ORDERED_IO_DONE / BTRFS_ORDERED_IOERR set, causing
      subsequent operations (such as other direct IO writes, buffered writes or
      hole punching) that lock the same IO range and look up ordered extents
      in the range to hang forever waiting for those ordered extents, which
      can never complete since no bio was submitted.
      
      Fix this by tracking a range of created ordered extents that don't yet
      have corresponding bios submitted and completing the ordered extents in
      the range if __blockdev_direct_IO() fails with an error.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
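The bookkeeping idea behind this fix (remember which ordered extents have been created but not yet covered by a submitted bio, and complete them on error) can be sketched with a simple range tracker. This is a hypothetical userspace model: `struct unsubmitted_range` and the `model_*` functions are invented for illustration, the calls are assumed to cover contiguous, increasing file ranges, and "completing" an ordered extent is reduced to returning a byte count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Half-open byte range [start, end) of ordered extents that were created
 * but for which no bio has been submitted yet. Empty when start == end.
 */
struct unsubmitted_range {
    uint64_t start;
    uint64_t end;
};

/* Models the "get blocks" step: an ordered extent is created. */
static void model_get_blocks(struct unsubmitted_range *r,
                             uint64_t start, uint64_t len)
{
    if (r->start == r->end)
        r->start = start; /* range was empty, begin a new one */
    r->end = start + len;
}

/* Models the "submit" step: bios now cover the first len bytes. */
static void model_submit(struct unsubmitted_range *r, uint64_t len)
{
    r->start += len;
}

/*
 * Models the error path: returns how many bytes worth of ordered extents
 * still need to be completed by hand, and empties the range.
 */
static uint64_t model_cleanup_on_error(struct unsubmitted_range *r)
{
    uint64_t left = r->end - r->start;

    r->start = r->end;
    return left;
}

/* Self-check for the two cases described in the commit message. */
static void check_unsubmitted_range(void)
{
    struct unsubmitted_range failed = { 0, 0 };
    struct unsubmitted_range ok = { 0, 0 };

    /* Two ordered extents created, failure before any submit. */
    model_get_blocks(&failed, 4096, 4096);
    model_get_blocks(&failed, 8192, 8192);
    assert(model_cleanup_on_error(&failed) == 12288);

    /* Ordered extent created and fully submitted: nothing to clean up. */
    model_get_blocks(&ok, 0, 4096);
    model_submit(&ok, 4096);
    assert(model_cleanup_on_error(&ok) == 0);
}
```

With such a tracker, the error path knows exactly which range would otherwise leave other tasks waiting forever on ordered extents that no bio will ever complete.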
    • Btrfs: fix deadlock between direct IO write and defrag/readpages · b850ae14
      Committed by Filipe Manana
      If readpages() (triggered by defrag or buffered reads) is called while a
      direct IO write is in progress, we have a small time window where we can
      deadlock, resulting in traces like the following being generated:
      
      [84723.212993] INFO: task fio:2849 blocked for more than 120 seconds.
      [84723.214310]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.215640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.217313] fio        D ffff88023ec75218     0  2849   2835 0x00000000
      [84723.218778]  ffff880122dfb6e8 0000000000000092 0000000000000000 ffff88023ec75200
      [84723.220458]  ffff88000e05d2c0 ffff880122dfc000 ffff88023ec75200 7fffffffffffffff
      [84723.230597]  0000000000000002 ffffffff8147891a ffff880122dfb700 ffffffff8147856a
      [84723.232085] Call Trace:
      [84723.232625]  [<ffffffff8147891a>] ? bit_wait+0x3c/0x3c
      [84723.233529]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.234398]  [<ffffffff8147baa3>] schedule_timeout+0x43/0x10b
      [84723.235384]  [<ffffffff810f82eb>] ? time_hardirqs_on+0x15/0x28
      [84723.236426]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.237502]  [<ffffffff810af8a3>] ? read_seqcount_begin.constprop.20+0x57/0x6d
      [84723.238807]  [<ffffffff8108a09b>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [84723.242012]  [<ffffffff8108a23d>] ? trace_hardirqs_on+0xd/0xf
      [84723.243064]  [<ffffffff810af2ad>] ? timekeeping_get_ns+0xe/0x33
      [84723.244116]  [<ffffffff810afa2e>] ? ktime_get+0x41/0x52
      [84723.245029]  [<ffffffff81477cff>] io_schedule_timeout+0xb7/0x12b
      [84723.245942]  [<ffffffff81477cff>] ? io_schedule_timeout+0xb7/0x12b
      [84723.246596]  [<ffffffff81478953>] bit_wait_io+0x39/0x45
      [84723.247503]  [<ffffffff81478b93>] __wait_on_bit_lock+0x49/0x8d
      [84723.248540]  [<ffffffff8111684f>] __lock_page+0x66/0x68
      [84723.249558]  [<ffffffff81081c9b>] ? autoremove_wake_function+0x3a/0x3a
      [84723.250844]  [<ffffffff81124a04>] lock_page+0x2c/0x2f
      [84723.251871]  [<ffffffff81124afc>] invalidate_inode_pages2_range+0xf5/0x2aa
      [84723.253274]  [<ffffffff81117c34>] ? filemap_fdatawait_range+0x12d/0x146
      [84723.254757]  [<ffffffff81118191>] ? filemap_fdatawrite_range+0x13/0x15
      [84723.256378]  [<ffffffffa05139a2>] btrfs_get_blocks_direct+0x1b0/0x664 [btrfs]
      [84723.258556]  [<ffffffff8119e3f9>] ? submit_page_section+0x7b/0x111
      [84723.260064]  [<ffffffff8119eb90>] do_blockdev_direct_IO+0x658/0xbdb
      [84723.261479]  [<ffffffffa05137f2>] ? btrfs_page_exists_in_range+0x1a9/0x1a9 [btrfs]
      [84723.262961]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.264449]  [<ffffffff8119f144>] __blockdev_direct_IO+0x31/0x33
      [84723.265614]  [<ffffffff8119f144>] ? __blockdev_direct_IO+0x31/0x33
      [84723.266769]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.268264]  [<ffffffffa050935d>] btrfs_direct_IO+0x1b9/0x259 [btrfs]
      [84723.270954]  [<ffffffffa050a8a6>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs]
      [84723.272465]  [<ffffffff8111878c>] generic_file_direct_write+0xb3/0x128
      [84723.273734]  [<ffffffffa051955c>] btrfs_file_write_iter+0x228/0x404 [btrfs]
      [84723.275101]  [<ffffffff8116ca6f>] __vfs_write+0x7c/0xa5
      [84723.276200]  [<ffffffff8116cfab>] vfs_write+0xa0/0xe4
      [84723.277298]  [<ffffffff8116d79d>] SyS_write+0x50/0x7e
      [84723.278327]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [84723.279595] INFO: lockdep is turned off.
      [84723.379035] INFO: task btrfs:2923 blocked for more than 120 seconds.
      [84723.380323]       Tainted: G        W       4.3.0-rc5-btrfs-next-17+ #1
      [84723.381608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [84723.383003] btrfs           D ffff88023ed75218     0  2923   2859 0x00000000
      [84723.384277]  ffff88001311f860 0000000000000082 ffff88001311f840 ffff88023ed75200
      [84723.385748]  ffff88012c6751c0 ffff880013120000 ffff88012042fe68 ffff88012042fe30
      [84723.387152]  ffff880221571c88 0000000000000001 ffff88001311f878 ffffffff8147856a
      [84723.388620] Call Trace:
      [84723.389105]  [<ffffffff8147856a>] schedule+0x7d/0x95
      [84723.391882]  [<ffffffffa051da32>] btrfs_start_ordered_extent+0x161/0x1fa [btrfs]
      [84723.393718]  [<ffffffff81081c61>] ? signal_pending_state+0x31/0x31
      [84723.395659]  [<ffffffffa0522c5b>] __do_contiguous_readpages.constprop.21+0x81/0xdc [btrfs]
      [84723.397383]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.398852]  [<ffffffffa0522da3>] __extent_readpages.constprop.20+0xed/0x100 [btrfs]
      [84723.400561]  [<ffffffff81123f6c>] ? __lru_cache_add+0x5d/0x72
      [84723.401787]  [<ffffffffa0523896>] extent_readpages+0x111/0x1a7 [btrfs]
      [84723.403121]  [<ffffffffa050ac96>] ? btrfs_submit_direct+0x3f0/0x3f0 [btrfs]
      [84723.404583]  [<ffffffffa05088fa>] btrfs_readpages+0x1f/0x21 [btrfs]
      [84723.406007]  [<ffffffff811226df>] __do_page_cache_readahead+0x168/0x1f4
      [84723.407502]  [<ffffffff81122988>] ondemand_readahead+0x21d/0x22e
      [84723.408937]  [<ffffffff81122988>] ? ondemand_readahead+0x21d/0x22e
      [84723.410487]  [<ffffffff81122af1>] page_cache_sync_readahead+0x3d/0x3f
      [84723.411710]  [<ffffffffa0535388>] btrfs_defrag_file+0x419/0xaaf [btrfs]
      [84723.413007]  [<ffffffffa0531db0>] ? kzalloc+0xf/0x11 [btrfs]
      [84723.414085]  [<ffffffffa0535b43>] btrfs_ioctl_defrag+0x125/0x14e [btrfs]
      [84723.415307]  [<ffffffffa0536753>] btrfs_ioctl+0x746/0x24c6 [btrfs]
      [84723.416532]  [<ffffffff81087481>] ? arch_local_irq_save+0x9/0xc
      [84723.417731]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.418699]  [<ffffffff8113ad61>] ? __might_fault+0x4c/0xa7
      [84723.421532]  [<ffffffff8113adba>] ? __might_fault+0xa5/0xa7
      [84723.422629]  [<ffffffff81171139>] ? cp_new_stat+0x15d/0x174
      [84723.423712]  [<ffffffff8117c610>] do_vfs_ioctl+0x427/0x4e6
      [84723.424801]  [<ffffffff81171175>] ? SYSC_newfstat+0x25/0x2e
      [84723.425968]  [<ffffffff8118574d>] ? __fget_light+0x4d/0x71
      [84723.427063]  [<ffffffff8117c726>] SyS_ioctl+0x57/0x79
      [84723.428138]  [<ffffffff8147cd97>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      Consider the following logical and physical file layout:
      
      logical:    ... [ prealloc extent A ] [ prealloc extent B ] [ extent C ] ...
                      4K                    8K                    16K
      
      physical:   ... 12853248              12857344              1103101952   ...
                                            (= 12853248 + 4K)
      
      Extents A and B are physically adjacent. The following diagram shows a
      sequence of events that lead to the deadlock when we attempt to do a
      direct IO write against the file range [4K, 16K[ and a defrag is triggered
      simultaneously.
      
                 CPU 1                                               CPU 2
      
       btrfs_direct_IO()
      
         btrfs_get_blocks_direct()
           creates ordered extent A, covering
           the 4k prealloc extent A (range [4K, 8K[)
      
                                                          btrfs_defrag_file()
                                                            page_cache_sync_readahead([0K, 1M[)
                                                              btrfs_readpages()
                                                                extent_readpages()
      
                                                                  locks all pages in the file
                                                                  range [0K, 128K[ through calls
                                                                  to add_to_page_cache_lru()
      
                                                                  __do_contiguous_readpages()
      
                                                                     finds ordered extent A
      
                                                                     waits for it to complete
      
         btrfs_get_blocks_direct() called again
      
           lock_extent_direct(range [8K, 16K[)
      
             finds a page in range [8K, 16K[ through
             btrfs_page_exists_in_range()
      
             invalidate_inode_pages2_range([8K, 16K[)
      
               --> tries to lock pages that are already
                   locked by the task at CPU 2
      
               --> our task, running __blockdev_direct_IO(),
                   hangs waiting to lock the pages and the
                   submit bio callback, btrfs_submit_direct(),
                   ends up never being called, resulting in the
                   ordered extent A never completing (because a
                   corresponding bio is never submitted) and
                   CPU 2 will wait for it forever while holding
                   the pages locked
                    ---> deadlock!
      
      Fix this by removing the page invalidation approach when attempting to
      lock the range for IO from the callback btrfs_get_blocks_direct() and
      falling back to buffered IO. This was a rare case anyway, and well behaved
      applications do not mix concurrent direct IO writes with buffered reads,
      a concurrent defrag being the only normal case that could lead to the
      deadlock.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix error path when failing to submit bio for direct IO write · 14543774
      Committed by Filipe Manana
      Commit 61de718f ("Btrfs: fix memory corruption on failure to submit
      bio for direct IO") fixed problems with the error handling code after we
      fail to submit a bio for direct IO. However there were 2 problems that it
      did not address when the failure is due to memory allocation failures for
      direct IO writes:
      
      1) We considered that there could be only one ordered extent for the whole
         IO range, which is not always true, as we can have multiple;
      
      2) It did not set the bit BTRFS_ORDERED_IO_DONE in the ordered extent,
         which can make other tasks running btrfs_wait_logged_extents() hang
         forever, since they wait for that bit to be set. The general assumption
         is that regardless of an error, the BTRFS_ORDERED_IO_DONE is always set
         and it precedes setting the bit BTRFS_ORDERED_COMPLETE.
      
      Fix these issues by moving part of the btrfs_endio_direct_write() handler
      into a new helper function and having that new helper function called when
      we fail to allocate memory to submit the bio (and its private object) for
      a direct IO write.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
  13. 09 Dec, 2015 2 commits
    • replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Committed by Al Viro
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
        * inode and dentry are passed separately
        * might be called both in RCU and non-RCU mode;
          the former is indicated by passing it a NULL dentry.
        * when called that way it isn't allowed to block
          and should return ERR_PTR(-ECHILD) if it needs to be called
          in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; that said, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
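The calling convention described above (NULL dentry means RCU mode, where the method must not block and returns ERR_PTR(-ECHILD) if it would need to) can be sketched in userspace. This is an illustration under assumptions: the `ERR_PTR`/`PTR_ERR`/`IS_ERR` helpers are local re-creations of the kernel idiom, `struct inode` and `struct dentry` are toy types, and `example_get_link()` uses a simplified two-argument signature that is not claimed to match the real inode operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace re-creations of the kernel's error-pointer idiom (assumption). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Toy types: the inode caches the link target once it has been read. */
struct inode {
    const char *cached_target;
};

struct dentry {
    struct inode *d_inode;
};

/* Hypothetical ->get_link() instance following the described convention. */
static const char *example_get_link(struct dentry *dentry, struct inode *inode)
{
    if (!dentry) {
        /* RCU mode: must not block. Use the cache or ask for a retry. */
        if (inode->cached_target)
            return inode->cached_target;
        return ERR_PTR(-ECHILD);
    }

    /* Non-RCU mode: "reading" the target may block; here it is a constant. */
    if (!inode->cached_target)
        inode->cached_target = "/mnt/foo";
    return inode->cached_target;
}

/* Self-check: RCU call bails out first, succeeds after a non-RCU call. */
static void check_get_link(void)
{
    struct inode ino = { NULL };
    struct dentry d = { &ino };
    const char *p;

    p = example_get_link(NULL, &ino);
    assert(IS_ERR(p) && PTR_ERR(p) == -ECHILD);

    p = example_get_link(&d, &ino);
    assert(!IS_ERR(p) && strcmp(p, "/mnt/foo") == 0);

    p = example_get_link(NULL, &ino);
    assert(!IS_ERR(p) && strcmp(p, "/mnt/foo") == 0);
}
```

The caller is expected to treat -ECHILD as "retry outside RCU mode" rather than as a failure, which is why the method can bail out cheaply whenever it cannot answer without blocking.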
    • don't put symlink bodies in pagecache into highmem · 21fc61c7
      Committed by Al Viro
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlock
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlink inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 07 Dec, 2015 2 commits
  15. 03 Dec, 2015 1 commit