1. 26 1月, 2016 3 次提交
    • D
      Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" · 80ad623e
      David Sterba 提交于
      This reverts commit 69624913. The
      cleaner thread can block freezing when there's a snapshot cleaning in
      progress and the other threads get suspended first. From the logs
      provided by Martin we're waiting for reading extent pages:
      
      kernel: PM: Syncing filesystems ... done.
      kernel: Freezing user space processes ... (elapsed 0.015 seconds) done.
      kernel: Freezing remaining freezable tasks ...
      kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
      kernel: btrfs-cleaner   D ffff88033dd13bc0     0   152      2 0x00000000
      kernel: ffff88032ebc2e00 ffff88032e750000 ffff88032e74fa50 7fffffffffffffff
      kernel: ffffffff814a58df 0000000000000002 ffffea000934d580 ffffffff814a5451
      kernel: 7fffffffffffffff ffffffff814a6e8f 0000000000000000 0000000000000020
      kernel: Call Trace:
      kernel: [<ffffffff814a58df>] ? bit_wait+0x2c/0x2c
      kernel: [<ffffffff814a5451>] ? schedule+0x6f/0x7c
      kernel: [<ffffffff814a6e8f>] ? schedule_timeout+0x2f/0xd8
      kernel: [<ffffffff81076f94>] ? timekeeping_get_ns+0xa/0x2e
      kernel: [<ffffffff81077603>] ? ktime_get+0x36/0x44
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a4f6c>] ? io_schedule_timeout+0x94/0xf2
      kernel: [<ffffffff814a590b>] ? bit_wait_io+0x2c/0x30
      kernel: [<ffffffff814a5694>] ? __wait_on_bit+0x41/0x73
      kernel: [<ffffffff8109eba8>] ? wait_on_page_bit+0x6d/0x72
      kernel: [<ffffffff8105d718>] ? autoremove_wake_function+0x2a/0x2a
      kernel: [<ffffffff811a02d7>] ? read_extent_buffer_pages+0x1bd/0x203
      kernel: [<ffffffff8117d9e9>] ? free_root_pointers+0x4c/0x4c
      kernel: [<ffffffff8117e831>] ? btree_read_extent_buffer_pages.constprop.57+0x5a/0xe9
      kernel: [<ffffffff8117f4f3>] ? read_tree_block+0x2d/0x45
      kernel: [<ffffffff8116782a>] ? read_block_for_search.isra.34+0x22a/0x26b
      kernel: [<ffffffff811656c3>] ? btrfs_set_path_blocking+0x1e/0x4a
      kernel: [<ffffffff8116919b>] ? btrfs_search_slot+0x648/0x736
      kernel: [<ffffffff81170559>] ? btrfs_lookup_extent_info+0xb7/0x2c7
      kernel: [<ffffffff81170ee5>] ? walk_down_proc+0x9c/0x1ae
      kernel: [<ffffffff81171c9d>] ? walk_down_tree+0x40/0xa4
      kernel: [<ffffffff8117375f>] ? btrfs_drop_snapshot+0x2da/0x664
      kernel: [<ffffffff8104ff21>] ? finish_task_switch+0x126/0x167
      kernel: [<ffffffff811850f8>] ? btrfs_clean_one_deleted_snapshot+0xa6/0xb0
      kernel: [<ffffffff8117eaba>] ? cleaner_kthread+0x13e/0x17b
      kernel: [<ffffffff8117e97c>] ? btrfs_item_end+0x33/0x33
      kernel: [<ffffffff8104d256>] ? kthread+0x95/0x9d
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      kernel: [<ffffffff814a7b5f>] ? ret_from_fork+0x3f/0x70
      kernel: [<ffffffff8104d1c1>] ? kthread_parkme+0x16/0x16
      
      As this affects a released kernel (4.4) we need a minimal fix for
      stable kernels.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108361Reported-by: NMartin Ziegler <ziegler@uni-freiburg.de>
      CC: stable@vger.kernel.org # 4.4
      CC: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      80ad623e
    • Q
      btrfs: async-thread: Fix a use-after-free error for trace · 0a95b851
      Qu Wenruo 提交于
      Parameter of trace_btrfs_work_queued() can be freed in its workqueue.
      So no one use use that pointer after queue_work().
      
      Fix the user-after-free bug by move the trace line before queue_work().
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0a95b851
    • F
      Btrfs: fix race between fsync and lockless direct IO writes · de0ee0ed
      Filipe Manana 提交于
      An fsync, using the fast path, can race with a concurrent lockless direct
      IO write and end up logging a file extent item that points to an extent
      that wasn't written to yet. This is because the fast fsync path collects
      ordered extents into a local list and then collects all the new extent
      maps to log file extent items based on them, while the direct IO write
      path creates the new extent map before it creates the corresponding
      ordered extent (and submitting the respective bio(s)).
      
      So fix this by making the direct IO write path create ordered extents
      before the extent maps and make the fast fsync path collect any new
      ordered extents after it collects the extent maps.
      Note that making the fsync handler call inode_dio_wait() (after acquiring
      the inode's i_mutex) would not work and lead to a deadlock when doing
      AIO, as through AIO we end up in a path where the fsync handler is called
      (through dio_aio_complete_work() -> dio_complete() -> vfs_fsync_range())
      before the inode's dio counter is decremented (inode_dio_wait() waits
      for this counter to have a value of zero).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      de0ee0ed
  2. 22 1月, 2016 3 次提交
  3. 21 1月, 2016 1 次提交
  4. 20 1月, 2016 14 次提交
    • Z
      btrfs: raid56: Use raid_write_end_io for scrub · a6111d11
      Zhao Lei 提交于
      No need to create additional end_io function for scrub, it increased
      code size and introduced some un-unified lines, as:
      raid_write_parity_end_io():
              int err = bio->bi_error;
              if (bio->bi_error)
      raid_write_end_io():
              int err = bio->bi_error;
              if (err)
      
      This patch combines them.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a6111d11
    • Z
      btrfs: Remove unnecessary ClearPageUptodate for raid56 · 748f4ef4
      Zhao Lei 提交于
      PageUptodate flag already initialized to 0 for new page,
      no need to set it again.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      748f4ef4
    • Z
      btrfs: use rbio->nr_pages to reduce calculation · 915e2290
      Zhao Lei 提交于
      We can use rbio->stripe_npages to reduce unnecessary calculation in
      many code place.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      915e2290
    • Z
      btrfs: Use unified stripe_page's index calculation · b7178a5f
      Zhao Lei 提交于
      We are using different index calculation method for stripe_page in
      current code:
      1: (rbio->stripe_len / PAGE_CACHE_SIZE) * stripe_index + page_index
      2: DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE) * stripe_index + page_index
      3: DIV_ROUND_UP(rbio->stripe_len * stripe_index, PAGE_CACHE_SIZE) + page_index
      ...
      
      They can get same result when stripe_len align to PAGE_CACHE_SIZE,
      this is why current code can work, intruduce and use a common function
      for calculation is a better choose.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b7178a5f
    • Z
      btrfs: Fix calculation of rbio->dbitmap's size calculation · bfca9a6d
      Zhao Lei 提交于
      Current code is trying to calculate rbio->dbitmap's size to make it
      align to sizeof(long), but implement haven't achived this object,
      it is align to sizeof(char) instead.
      This patch fixed above calculation, and use sizeof(long) instead of
      fixed "8" to increate compatibility.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      bfca9a6d
    • Z
      btrfs: Fix no_space in write and rm loop · e1746e83
      Zhao Lei 提交于
      I see no_space in v4.4-rc1 again in xfstests generic/102.
      It happened randomly in some node only.
      (one of 4 phy-node, and a kvm with non-virtio block driver)
      
      By bisect, we can found the first-bad is:
       commit bdced438 ("block: setup bi_phys_segments after splitting")'
      But above patch only triggered the bug by making bio operation
      faster(or slower).
      
      Main reason is in our space_allocating code, we need to commit
      page writeback before wait it complish, this patch fixed above
      bug.
      
      BTW, there is another reason for generic/102 fail, caused by
      disable default mixed-blockgroup, I'll fix it in xfstests.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e1746e83
    • Z
      btrfs: merge functions for wait snapshot creation · 0bc19f90
      Zhao Lei 提交于
      wait_for_snapshot_creation() is in same group with oher two:
       btrfs_start_write_no_snapshoting()
       btrfs_end_write_no_snapshoting()
      
      Rename wait_for_snapshot_creation() and move it into same place
      with other two.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0bc19f90
    • Z
      btrfs: delete unused argument in btrfs_copy_from_user · ee22f0c4
      Zhao Lei 提交于
      size_t write_bytes is not necessary for btrfs_copy_from_user(),
      delete it.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ee22f0c4
    • Z
      btrfs: Use direct way to determine raid56 write/recover mode · ad1ba2a0
      Zhao Lei 提交于
      Old code used bbio->raid_map to determine whether in raid56
      write/recover operation, because we didn't't have bbio->map_type.
      
      Now we have direct way for this condition, rid of using
      the function-relative data, and make the code more readable.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ad1ba2a0
    • Z
      btrfs: Small cleanup for get index_srcdev loop · 94a97dfe
      Zhao Lei 提交于
      1: Adjust condition in loop to make less TAB
      2: Move btrfs_put_bbio()'s line for combine, and makes logic clean.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      94a97dfe
    • Q
      btrfs: Enhance chunk validation check · f04b772b
      Qu Wenruo 提交于
      Enhance chunk validation:
      1) Num_stripes
         We already have such check but it's only in super block sys chunk
         array.
         Now check all on-disk chunks.
      
      2) Chunk logical
         It should be aligned to sector size.
         This behavior should be *DOUBLE CHECKED* for 64K sector size like
         PPC64 or AArch64.
         Maybe we can found some hidden bugs.
      
      3) Chunk length
         Same as chunk logical, should be aligned to sector size.
      
      4) Stripe length
         It should be power of 2.
      
      5) Chunk type
         Any bit out of TYPE_MAS | PROFILE_MASK is invalid.
      
      With all these much restrict rules, several fuzzed image reported in
      mail list should no longer cause kernel panic.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f04b772b
    • Q
      btrfs: Enhance super validation check · 319e4d06
      Qu Wenruo 提交于
      Enhance btrfs_check_super_valid() function by the following points:
      1) Restrict sector/node size check
         Not the old max/min valid check, but also check if it's a power of 2.
         So some bogus number like 12K node size won't pass now.
      
      2) Super flag check
         For now, there is still some inconsistency between kernel and
         btrfs-progs super flags.
         And considering btrfs-progs may add new flags for super block, this
         check will only output warning.
      
      3) Better root alignment check
         Now root bytenr is checked against sector size.
      
      4) Move some check into btrfs_check_super_valid().
         Like node size vs leaf size check, and PAGESIZE vs sectorsize check.
         And magic number check.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      319e4d06
    • F
      Btrfs: fix deadlock running delayed iputs at transaction commit time · c2d6cb16
      Filipe Manana 提交于
      While running a stress test I ran into a deadlock when running the delayed
      iputs at transaction time, which produced the following report and trace:
      
      [  886.399989] =============================================
      [  886.400871] [ INFO: possible recursive locking detected ]
      [  886.401663] 4.4.0-rc6-btrfs-next-18+ #1 Not tainted
      [  886.402384] ---------------------------------------------
      [  886.403182] fio/8277 is trying to acquire lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] but task is already holding lock:
      [  886.403568]  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] other info that might help us debug this:
      [  886.403568]  Possible unsafe locking scenario:
      [  886.403568]
      [  886.403568]        CPU0
      [  886.403568]        ----
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]   lock(&fs_info->delayed_iput_sem);
      [  886.403568]
      [  886.403568]  *** DEADLOCK ***
      [  886.403568]
      [  886.403568]  May be due to missing lock nesting notation
      [  886.403568]
      [  886.403568] 3 locks held by fio/8277:
      [  886.403568]  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81174c4c>] __sb_start_write+0x5f/0xb0
      [  886.403568]  #1:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa054620d>] btrfs_file_write_iter+0x73/0x408 [btrfs]
      [  886.403568]  #2:  (&fs_info->delayed_iput_sem){++++..}, at: [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.403568]
      [  886.403568] stack backtrace:
      [  886.403568] CPU: 6 PID: 8277 Comm: fio Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [  886.403568] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [  886.403568]  0000000000000000 ffff88009f80f770 ffffffff8125d4fd ffffffff82af1fc0
      [  886.403568]  ffff88009f80f830 ffffffff8108e5f9 0000000200000000 ffff88009fd92290
      [  886.403568]  0000000000000000 ffffffff82af1fc0 ffffffff829cfb01 00042b216d008804
      [  886.403568] Call Trace:
      [  886.403568]  [<ffffffff8125d4fd>] dump_stack+0x4e/0x79
      [  886.403568]  [<ffffffff8108e5f9>] __lock_acquire+0xd42/0xf0b
      [  886.403568]  [<ffffffff810c22db>] ? __module_address+0xdf/0x108
      [  886.403568]  [<ffffffff8108eb77>] lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffff8108eb77>] ? lock_acquire+0x10d/0x194
      [  886.403568]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffff8148556b>] down_read+0x3e/0x4d
      [  886.489542]  [<ffffffffa0538823>] ? btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0538823>] btrfs_run_delayed_iputs+0x36/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffffa0521d7a>] flush_space+0x435/0x44a [btrfs]
      [  886.489542]  [<ffffffffa052218b>] ? reserve_metadata_bytes+0x26a/0x384 [btrfs]
      [  886.489542]  [<ffffffffa05221ae>] reserve_metadata_bytes+0x28d/0x384 [btrfs]
      [  886.489542]  [<ffffffffa052256c>] ? btrfs_block_rsv_refill+0x58/0x96 [btrfs]
      [  886.489542]  [<ffffffffa0522584>] btrfs_block_rsv_refill+0x70/0x96 [btrfs]
      [  886.489542]  [<ffffffffa053d747>] btrfs_evict_inode+0x394/0x55a [btrfs]
      [  886.489542]  [<ffffffff81188e31>] evict+0xa7/0x15c
      [  886.489542]  [<ffffffff81189878>] iput+0x1d3/0x266
      [  886.489542]  [<ffffffffa053887c>] btrfs_run_delayed_iputs+0x8f/0xbf [btrfs]
      [  886.489542]  [<ffffffffa0533953>] btrfs_commit_transaction+0x8f5/0x96e [btrfs]
      [  886.489542]  [<ffffffff81085096>] ? signal_pending_state+0x31/0x31
      [  886.489542]  [<ffffffffa0521191>] btrfs_alloc_data_chunk_ondemand+0x1d7/0x288 [btrfs]
      [  886.489542]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [  886.489542]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [  886.489542]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [  886.489542]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [  886.489542]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [  886.489542]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [  886.489542]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [  886.489542]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [  886.489542]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [  886.489542]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.852335] INFO: task fio:8244 blocked for more than 120 seconds.
      [ 1081.854348]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.857560] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.863227] fio        D ffff880213f9bb28     0  8244   8240 0x00000000
      [ 1081.868719]  ffff880213f9bb28 00ffffff810fc6b0 ffffffff0000000a ffff88023ed55240
      [ 1081.872499]  ffff880206b5d400 ffff880213f9c000 ffff88020a4d5318 ffff880206b5d400
      [ 1081.876834]  ffffffff00000001 ffff880206b5d400 ffff880213f9bb40 ffffffff81482ba4
      [ 1081.880782] Call Trace:
      [ 1081.881793]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.883340]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.895525]  [<ffffffff8108d48d>] ? trace_hardirqs_on_caller+0x16/0x1ab
      [ 1081.897419]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1081.899251]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1081.901063]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1081.902365]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1081.903846]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.906078]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1081.908846]  [<ffffffff8108d461>] ? mark_held_locks+0x56/0x6c
      [ 1081.910409]  [<ffffffffa0521282>] btrfs_check_data_free_space+0x40/0x59 [btrfs]
      [ 1081.912482]  [<ffffffffa05228f5>] btrfs_delalloc_reserve_space+0x1e/0x4e [btrfs]
      [ 1081.914597]  [<ffffffffa053620a>] btrfs_direct_IO+0x10c/0x27e [btrfs]
      [ 1081.919037]  [<ffffffff8111d9a1>] generic_file_direct_write+0xb3/0x128
      [ 1081.920754]  [<ffffffffa05463c3>] btrfs_file_write_iter+0x229/0x408 [btrfs]
      [ 1081.922496]  [<ffffffff8108ae38>] ? __lock_is_held+0x38/0x50
      [ 1081.923922]  [<ffffffff8117279e>] __vfs_write+0x7c/0xa5
      [ 1081.925275]  [<ffffffff81172cda>] vfs_write+0xa0/0xe4
      [ 1081.926584]  [<ffffffff811734cc>] SyS_write+0x50/0x7e
      [ 1081.927968]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [ 1081.985293] INFO: lockdep is turned off.
      [ 1081.986132] INFO: task fio:8249 blocked for more than 120 seconds.
      [ 1081.987434]       Not tainted 4.4.0-rc6-btrfs-next-18+ #1
      [ 1081.988534] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1081.990147] fio        D ffff880218febbb8     0  8249   8240 0x00000000
      [ 1081.991626]  ffff880218febbb8 00ffffff81486b8e ffff88020000000b ffff88023ed75240
      [ 1081.993258]  ffff8802120a9a00 ffff880218fec000 ffff88020a4d5318 ffff8802120a9a00
      [ 1081.994850]  ffffffff00000001 ffff8802120a9a00 ffff880218febbd0 ffffffff81482ba4
      [ 1081.996485] Call Trace:
      [ 1081.997037]  [<ffffffff81482ba4>] schedule+0x7f/0x97
      [ 1081.998017]  [<ffffffff81485eb5>] rwsem_down_write_failed+0x2d5/0x325
      [ 1081.999241]  [<ffffffff810852a5>] ? finish_wait+0x6d/0x76
      [ 1082.000306]  [<ffffffff81269723>] call_rwsem_down_write_failed+0x13/0x20
      [ 1082.001533]  [<ffffffff81269723>] ? call_rwsem_down_write_failed+0x13/0x20
      [ 1082.002776]  [<ffffffff81089fae>] ? __down_write_nested.isra.0+0x1f/0x21
      [ 1082.003995]  [<ffffffff814855bd>] down_write+0x43/0x57
      [ 1082.005000]  [<ffffffffa05211b0>] ? btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.007403]  [<ffffffffa05211b0>] btrfs_alloc_data_chunk_ondemand+0x1f6/0x288 [btrfs]
      [ 1082.008988]  [<ffffffffa0545064>] btrfs_fallocate+0x7c1/0xc2f [btrfs]
      [ 1082.010193]  [<ffffffff8108a1ba>] ? percpu_down_read+0x4e/0x77
      [ 1082.011280]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.012265]  [<ffffffff81174c4c>] ? __sb_start_write+0x5f/0xb0
      [ 1082.013021]  [<ffffffff811712e4>] vfs_fallocate+0x170/0x1ff
      [ 1082.013738]  [<ffffffff81181ebb>] ioctl_preallocate+0x89/0x9b
      [ 1082.014778]  [<ffffffff811822d7>] do_vfs_ioctl+0x40a/0x4ea
      [ 1082.015778]  [<ffffffff81176ea7>] ? SYSC_newfstat+0x25/0x2e
      [ 1082.016806]  [<ffffffff8118b4de>] ? __fget_light+0x4d/0x71
      [ 1082.017789]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [ 1082.018706]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      This happens because we can recursively acquire the semaphore
      fs_info->delayed_iput_sem when attempting to allocate space to satisfy
      a file write request as shown in the first trace above - when committing
      a transaction we acquire (down_read) the semaphore before running the
      delayed iputs, and when running a delayed iput() we can end up calling
      an inode's eviction handler, which in turn commits another transaction
      and attempts to acquire (down_read) again the semaphore to run more
      delayed iput operations.
      This results in a deadlock because if a task acquires multiple times a
      semaphore it should invoke down_read_nested() with a different lockdep
      class for each level of recursion.
      
      Fix this by simplifying the implementation and use a mutex instead that
      is acquired by the cleaner kthread before it runs the delayed iputs
      instead of always acquiring a semaphore before delayed references are
      run from anywhere.
      
      Fixes: d7c15171 (btrfs: Fix NO_SPACE bug caused by delayed-iput)
      Cc: stable@vger.kernel.org   # 4.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c2d6cb16
    • F
      Btrfs: fix typo in log message when starting a balance · fedc0045
      Filipe Manana 提交于
      The recent change titled "Btrfs: Check metadata redundancy on balance"
      (already in linux-next) left a typo in a message for users:
      metatdata -> metadata.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fedc0045
  5. 19 1月, 2016 1 次提交
  6. 16 1月, 2016 6 次提交
    • S
      btrfs: initialize the seq counter in struct btrfs_device · 546bed63
      Sebastian Andrzej Siewior 提交于
      I managed to trigger this:
      | INFO: trying to register non-static key.
      | the code is fine but needs lockdep annotation.
      | turning off the locking correctness validator.
      | CPU: 1 PID: 781 Comm: systemd-gpt-aut Not tainted 4.4.0-rt2+ #14
      | Hardware name: ARM-Versatile Express
      | [<80307cec>] (dump_stack)
      | [<80070e98>] (__lock_acquire)
      | [<8007184c>] (lock_acquire)
      | [<80287800>] (btrfs_ioctl)
      | [<8012a8d4>] (do_vfs_ioctl)
      | [<8012ac14>] (SyS_ioctl)
      
      so I think that btrfs_device_data_ordered_init() is not invoked behind
      a macro somewhere.
      
      Fixes: 7cc8e58d ("Btrfs: fix unprotected device's variants on 32bits machine")
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      546bed63
    • D
      Btrfs: clean up an error code in btrfs_init_space_info() · 0dc924c5
      Dan Carpenter 提交于
      If we return 1 here, then the caller treats it as an error and returns
      -EINVAL.  It causes a static checker warning to treat positive returns
      as an error.
      
      Fixes: 1aba86d6 ('Btrfs: fix easily get into ENOSPC in mixed case')
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0dc924c5
    • G
      btrfs: fix iterator with update error in backref.c · 8e217858
      Geliang Tang 提交于
      Fix the following error:
      
      fs/btrfs/backref.c:565:1-20: iterator with update on line 577
      
      Fixes: a7ca4225('btrfs: use list_for_each_entry* in backref.c')
      Signed-off-by: NGeliang Tang <geliangtang@163.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8e217858
    • T
      Btrfs: fix output of compression message in btrfs_parse_options() · b7c47bbb
      Tsutomu Itoh 提交于
      The compression message might not be correctly output.
      Fix it.
      
      [[before fix]]
      
      # mount -o compress /dev/sdb3 /test3
      [  996.874264] BTRFS info (device sdb3): disk space caching is enabled
      [  996.874268] BTRFS: has skinny extents
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
      
      # mount -o remount,compress-force /dev/sdb3 /test3
      [ 1035.075017] BTRFS info (device sdb3): force zlib compression
      [ 1035.075021] BTRFS info (device sdb3): disk space caching is enabled
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
      
      # mount -o remount,compress /dev/sdb3 /test3
      [ 1053.679092] BTRFS info (device sdb3): disk space caching is enabled
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
      
      [[after fix]]
      
      # mount -o compress /dev/sdb3 /test3
      [  401.021753] BTRFS info (device sdb3): use zlib compression
      [  401.021758] BTRFS info (device sdb3): disk space caching is enabled
      [  401.021760] BTRFS: has skinny extents
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
      
      # mount -o remount,compress-force /dev/sdb3 /test3
      [  439.824624] BTRFS info (device sdb3): force zlib compression
      [  439.824629] BTRFS info (device sdb3): disk space caching is enabled
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress-force=zlib,space_cache,subvolid=5,subvol=/)
      
      # mount -o remount,compress /dev/sdb3 /test3
      [  459.918430] BTRFS info (device sdb3): use zlib compression
      [  459.918434] BTRFS info (device sdb3): disk space caching is enabled
      # mount | grep /test3
      /dev/sdb3 on /test3 type btrfs (rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/)
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b7c47bbb
    • C
      Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots · f32e48e9
      Chandan Rajendra 提交于
      The following call trace is seen when btrfs/031 test is executed in a loop,
      
      [  158.661848] ------------[ cut here ]------------
      [  158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
      [  158.664102] BTRFS: Transaction aborted (error -2)
      [  158.664774] Modules linked in:
      [  158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711af #2
      [  158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  158.667392]  ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
      [  158.668515]  ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
      [  158.669647]  ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
      [  158.670769] Call Trace:
      [  158.671153]  [<ffffffff81431fc8>] dump_stack+0x44/0x5c
      [  158.671884]  [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
      [  158.672769]  [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
      [  158.673620]  [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
      [  158.674440]  [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
      [  158.675376]  [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
      [  158.676235]  [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
      [  158.677268]  [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
      [  158.678183]  [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
      [  158.678975]  [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
      [  158.679751]  [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
      [  158.680599]  [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
      [  158.681686]  [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
      [  158.682581]  [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
      [  158.683399]  [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
      [  158.684297]  [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
      [  158.685051]  [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [  158.685958] ---[ end trace 4b63312de5a2cb76 ]---
      [  158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
      [  158.709508] BTRFS info (device loop0): forced readonly
      [  158.737113] BTRFS info (device loop0): disk space caching is enabled
      [  158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
      [  158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
      
      This occurs because,
      
      Mount filesystem
      Create subvol with ID 257
      Unmount filesystem
      Mount filesystem
      Delete subvol with ID 257
        btrfs_drop_snapshot()
          Add root corresponding to subvol 257 into
          btrfs_transaction->dropped_roots list
      Create new subvol (i.e. create_subvol())
        257 is returned as the next free objectid
        btrfs_read_fs_root_no_name()
          Finds the btrfs_root instance corresponding to the old subvol with ID 257
          in btrfs_fs_info->fs_roots_radix.
          Returns error since btrfs_root_item->refs has the value of 0.
      
      To fix the issue the commit initializes tree root's and subvolume root's
      highest_objectid when loading the roots from disk.
      Signed-off-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f32e48e9
    • J
      btrfs: cleanup, stop casting for extent_map->lookup everywhere · 95617d69
      Jeff Mahoney 提交于
      Overloading extent_map->bdev to struct map_lookup * might have started out
      as a means to an end, but it's a pattern that's used all over the place
      now. Let's get rid of the casting and just add a union instead.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      95617d69
  7. 08 1月, 2016 1 次提交
    • F
      Btrfs: fix fitrim discarding device area reserved for boot loader's use · 8cdc7c5b
      Filipe Manana 提交于
      As of the 4.3 kernel release, the fitrim ioctl can now discard any region
      of a disk that is not allocated to any chunk/block group, including the
      first megabyte which is used for our primary superblock and by the boot
      loader (grub for example).
      
      Fix this by not allowing to trim/discard any region in the device starting
      with an offset not greater than min(alloc_start_mount_option, 1Mb), just
      as it was not possible before 4.3.
      
      A reproducer test case for xfstests follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
      
        # Write to the [0, 64Kb[ and [68Kb, 1Mb[ ranges of the device. These ranges are
        # reserved for a boot loader to use (GRUB for example) and btrfs should never
        # use them - neither for allocating metadata/data nor should trim/discard them.
        # The range [64Kb, 68Kb[ is used for the primary superblock of the filesystem.
        $XFS_IO_PROG -c "pwrite -S 0xfd 0 64K" $SCRATCH_DEV | _filter_xfs_io
        $XFS_IO_PROG -c "pwrite -S 0xfd 68K 956K" $SCRATCH_DEV | _filter_xfs_io
      
        # Now mount the filesystem and perform a fitrim against it.
        _scratch_mount
        _require_batched_discard $SCRATCH_MNT
        $FSTRIM_PROG $SCRATCH_MNT
      
        # Now unmount the filesystem and verify the content of the ranges was not
        # modified (no trim/discard happened on them).
        _scratch_unmount
        echo "Content of the ranges [0, 64Kb] and [68Kb, 1Mb[ after fitrim:"
        od -t x1 -N $((64 * 1024)) $SCRATCH_DEV
        od -t x1 -j $((68 * 1024)) -N $((956 * 1024)) $SCRATCH_DEV
      
        status=0
        exit
      Reported-by: NVincent Petry  <PVince81@yahoo.fr>
      Reported-by: NAndrei Borzenkov <arvidjaar@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109341
      Fixes: 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      8cdc7c5b
  8. 07 1月, 2016 11 次提交
    • S
      Btrfs: Check metadata redundancy on balance · ee592d07
      Sam Tygier 提交于
      When converting a filesystem via balance check that metadata mode
      is at least as redundant as the data mode. For example give warning
      when:
      -dconvert=raid1 -mconvert=single
      Signed-off-by: NSam Tygier <samtygier@yahoo.co.uk>
      [ minor message reformatting ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ee592d07
    • D
      btrfs: statfs: report zero available if metadata are exhausted · ca8a51b3
      David Sterba 提交于
      There is one ENOSPC case that's very confusing. There's Available
      greater than zero but no file operation succeds (besides removing
      files). This happens when the metadata are exhausted and there's no
      possibility to allocate another chunk.
      
      In this scenario it's normal that there's still some space in the data
      chunk and the calculation in df reflects that in the Avail value.
      
      To at least give some clue about the ENOSPC situation, let statfs report
      zero value in Avail, even if there's still data space available.
      
      Current:
        /dev/sdb1             4.0G  3.3G  719M  83% /mnt/test
      
      New:
        /dev/sdb1             4.0G  3.3G     0 100% /mnt/test
      
      We calculate the remaining metadata space minus global reserve. If this
      is (supposedly) smaller than zero, there's no space. But this does not
      hold in practice, the exhausted state happens where's still some
      positive delta. So we apply some guesswork and compare the delta to a 4M
      threshold. (Practically observed delta was 2M.)
      
      We probably cannot calculate the exact threshold value because this
      depends on the internal reservations requested by various operations, so
      some operations that consume a few metadata will succeed even if the
      Avail is zero. But this is better than the other way around.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ca8a51b3
    • D
      btrfs: preallocate path for snapshot creation at ioctl time · 8546b570
      David Sterba 提交于
      We can also preallocate btrfs_path that's used during pending snapshot
      creation and avoid another late ENOMEM failure.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8546b570
    • D
      btrfs: allocate root item at snapshot ioctl time · b0c0ea63
      David Sterba 提交于
      The actual snapshot creation is delayed until transaction commit. If we
      cannot get enough memory for the root item there, we have to fail the
      whole transaction commit which is bad. So we'll allocate the memory at
      the ioctl call and pass it along with the pending_snapshot struct. The
      potential ENOMEM will be returned to the caller of snapshot ioctl.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b0c0ea63
    • D
      btrfs: do an allocation earlier during snapshot creation · a1ee7362
      David Sterba 提交于
      We can allocate pending_snapshot earlier and do not have to do cleanup
      in case of failure.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a1ee7362
    • D
      btrfs: use smaller type for btrfs_path locks · 4fb72bf2
      David Sterba 提交于
      The values of btrfs_path::locks are 0 to 4, fit into a u8. Let's see:
      
      * overall size of btrfs_path drops down from 136 to 112 (-24 bytes),
      * better packing in a slab page +6 objects
      * the whole structure now fits to 2 cachelines
      * slight decrease in code size:
      
         text    data     bss     dec     hex filename
       938731   43670   23144 1005545   f57e9 fs/btrfs/btrfs.ko.before
       938203   43670   23144 1005017   f55d9 fs/btrfs/btrfs.ko.after
      
      (and the generated assembly does not change much)
      
      The main purpose is to decrease the size of the structure without
      affecting performance. The byte access is usually well behaving accross
      arches, the locks are not accessed frequently and sometimes just
      compared to zero.
      
      Note for further size reduction attempts: the slots could be made u16
      but this might generate worse code on some arches (non-byte and non-int
      access). Also the range of operations on slots is wider compared to
      locks and the potential performance drop should be evaluated first.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fb72bf2
    • D
      btrfs: use smaller type for btrfs_path lowest_level · 7853f15b
      David Sterba 提交于
      The level is 0..7, we can use smaller type. The size of btrfs_path is now
      136 bytes from 144, which is +2 objects that fit into a 4k slab.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7853f15b
    • D
      btrfs: use smaller type for btrfs_path reada · dccabfad
      David Sterba 提交于
      The possible values for reada are all positive and bounded, we can later
      save some bytes by storing it in u8.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dccabfad
    • D
      btrfs: cleanup, use enum values for btrfs_path reada · e4058b54
      David Sterba 提交于
      Replace the integers by enums for better readability. The value 2 does
      not have any meaning since a7175319
      "Btrfs: do less aggressive btree readahead" (2009-01-22).
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e4058b54
    • D
      btrfs: constify static arrays · 4d4ab6d6
      David Sterba 提交于
      There are a few statically initialized arrays that can be made const.
      The remaining (like file_system_type, sysfs attributes or prop handlers)
      do not allow that due to type mismatch when passed to the APIs or
      because the structures are modified through other members.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4d4ab6d6
    • D
      btrfs: constify remaining structs with function pointers · 20e5506b
      David Sterba 提交于
      * struct extent_io_ops
      * struct btrfs_free_space_op
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      20e5506b