1. 11 Mar, 2014 40 commits
    • Btrfs: make defrag not fragment files when using prealloc extents · e2127cf0
      Committed by Filipe Manana
      When using prealloc extents, a file defragment operation may actually
      fragment the file and increase the amount of data space used by the file.
      This change fixes that behaviour.
      
      Example:
      
      $ mkfs.btrfs -f /dev/sdb3
      $ mount /dev/sdb3 /mnt
      $ cd /mnt
      $ xfs_io -f -c 'falloc 0 1048576' foobar && sync
      $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
      $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync
      
      Before defragmenting the file:
      
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.25MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 0 nr 4096
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 4096 nr 102400 ram 1048576
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 106496 nr 90112
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 196608 nr 106496 ram 1048576
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 897024 nr 106496 ram 1048576
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 1003520 nr 45056
      (...)
      
      Now defragmenting the file results in more data space used than before:
      
      $ btrfs filesystem defragment -f foobar && sync
      $ btrfs filesystem df /mnt
      Data, single: total=8.00MiB, used=1.55MiB
      System, DUP: total=8.00MiB, used=16.00KiB
      System, single: total=4.00MiB, used=0.00
      Metadata, DUP: total=1.00GiB, used=112.00KiB
      Metadata, single: total=8.00MiB, used=0.00
      
      And the corresponding file extent items are now no longer perfectly sequential
      as before, and we're now needlessly using more space from data block groups:
      
      $ btrfs-debug-tree /dev/sdb3
      (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 0 nr 4096 ram 1048576
      		extent compression 0
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
      		extent data disk byte 13893632 nr 102400
      		extent data offset 0 nr 102400 ram 102400
      		extent compression 0
      	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 106496 nr 90112 ram 1048576
      		extent compression 0
      	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
      		extent data disk byte 13996032 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
      		prealloc data disk byte 12845056 nr 1048576
      		prealloc data offset 303104 nr 593920
      	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
      		extent data disk byte 14102528 nr 106496
      		extent data offset 0 nr 106496 ram 106496
      		extent compression 0
      	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
      		extent data disk byte 12845056 nr 1048576
      		extent data offset 1003520 nr 45056 ram 1048576
      		extent compression 0
      (...)
      
      With this change, the above example no longer allocates new data space or
      changes the sequentiality of the file extents. That is, the defragment
      operation has no effect, leaving all extent items pointing to the extent
      starting at disk byte 12845056.
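
The extra space reported by `btrfs filesystem df` can be cross-checked against the extent items in the second dump. The following is a small sketch, not part of the patch: it copies the (disk byte, nr) pairs from the post-defrag dump above and sums the extents that do not belong to the original prealloc extent. The helper name `extra_data_bytes` is made up for illustration.

```python
MIB = 1024 * 1024

# The single 1 MiB prealloc extent created by falloc (disk byte, nr).
PREALLOC_EXTENT = (12845056, 1048576)

# (disk byte, nr) of every file extent item in the post-defrag dump above.
extents_after_defrag = [
    (12845056, 1048576),
    (13893632, 102400),
    (12845056, 1048576),
    (13996032, 106496),
    (12845056, 1048576),
    (14102528, 106496),
    (12845056, 1048576),
]


def extra_data_bytes(extents, original):
    """Sum the sizes of on-disk extents other than the original one."""
    unique = set(extents)        # items sharing one extent count once
    unique.discard(original)
    return sum(num_bytes for _, num_bytes in unique)


extra = extra_data_bytes(extents_after_defrag, PREALLOC_EXTENT)
print(extra, round(extra / MIB, 2))
```

The three newly allocated extents add up to 315392 bytes, roughly 0.30MiB, which agrees with the growth from 1.25MiB to 1.55MiB shown above.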
      
      On a 20GB filesystem I had, mounted with the autodefrag option and holding
      20 files of 400MB each, every file initially consisted of a single 400MB
      prealloc extent. Random writes happening at a low rate led to a total of
      over 17GB of data space used, not far from eventually reaching an ENOSPC state.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: correctly flush data on defrag when compression is enabled · dec8ef90
      Committed by Filipe Manana
      When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression is
      enabled, we weren't flushing completely, because writing compressed extents
      is a two-step process: one step to compress the data and another to write
      the compressed data to disk.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Cleanup the "_struct" suffix in btrfs_workequeue · d458b054
      Committed by Qu Wenruo
      The "_struct" suffix was mainly used to distinguish the newly created
      btrfs_work from the original one. Now that all btrfs_workers have been
      converted to btrfs_workqueue, the suffix is no longer needed.

      This patch also fixes some code whose style had been changed to
      accommodate the overly long "_struct" suffix.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Cleanup the old btrfs_worker. · a046e9c8
      Committed by Qu Wenruo
      Since every btrfs_worker has been replaced with the newly created
      btrfs_workqueue, the old code can be removed.
      Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue. · 0339ef2f
      Committed by Qu Wenruo
      Replace the fs_info->scrub_* with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue. · fc97fab0
      Committed by Qu Wenruo
      Replace the fs_info->qgroup_rescan_worker with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue. · 5b3bc44e
      Committed by Qu Wenruo
      Replace the fs_info->delayed_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue. · dc6e3209
      Committed by Qu Wenruo
      Replace the fs_info->fixup_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue. · 736cfa15
      Committed by Qu Wenruo
      Replace the fs_info->readahead_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue. · e66f0bb1
      Committed by Qu Wenruo
      Replace the fs_info->cache_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue. · d05a33ac
      Committed by Qu Wenruo
      Replace the fs_info->rmw_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue. · fccb5d86
      Committed by Qu Wenruo
      Replace the fs_info->endio_* workqueues with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->flush_workers with btrfs_workqueue. · a44903ab
      Committed by Qu Wenruo
      Replace the fs_info->flush_workers with the newly created
      btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->submit_workers with btrfs_workqueue. · a8c93d4e
      Committed by Qu Wenruo
      Much like fs_info->workers, replace fs_info->submit_workers
      with the same btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue · afe3d242
      Committed by Qu Wenruo
      Much like fs_info->workers, replace fs_info->delalloc_workers
      with the same btrfs_workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Replace fs_info->workers with btrfs_workqueue. · 5cdc7ad3
      Committed by Qu Wenruo
      Use the newly created btrfs_workqueue_struct to replace the original
      fs_info->workers
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Add threshold workqueue based on kernel workqueue · 0bd9289c
      Committed by Qu Wenruo
      The original btrfs_workers has thresholding functions to dynamically
      create or destroy kthreads.
      
      The kernel workqueue has no such function, because its workers are not
      created manually, but we can still use workqueue_set_max_active to
      simulate the behavior, mainly to achieve better HDD performance by
      setting a high threshold on submit_workers.
      (Sadly, no resources can be saved.)

      So in this patch, extra workqueue pending counters are introduced to
      dynamically change the max active of each btrfs_workqueue_struct, hoping
      to restore the behavior of the original thresholding function.

      Also, workqueue_set_max_active uses a mutex to protect workqueue_struct
      and is not meant to be called too frequently, so a new interval
      mechanism is applied that only calls workqueue_set_max_active after
      a certain number of works have been queued, hoping to balance random and
      sequential performance on HDDs.
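
The interval-based thresholding described above can be pictured with a small user-space model. This is only a sketch under assumptions, not the kernel code: the class `ThresholdQueue`, its fields, and the exact grow/shrink rules (grow when pending exceeds the threshold, shrink when it falls below half of it) are hypothetical choices made for illustration.

```python
class ThresholdQueue:
    """Toy model: adjust max_active from a pending counter, but only
    once every `interval` queued works (all names are hypothetical)."""

    def __init__(self, limit, thresh, interval):
        self.limit = limit            # upper bound for max_active
        self.thresh = thresh          # pending count that triggers growth
        self.interval = interval      # queued works between adjustments
        self.pending = 0
        self.queued = 0
        self.current_max = 1
        self.set_max_active_calls = 0  # stands in for the costly call

    def _maybe_adjust(self):
        self.queued += 1
        if self.queued % self.interval:
            return                    # not at an interval boundary yet
        new_max = self.current_max
        if self.pending > self.thresh:
            new_max += 1
        elif self.pending < self.thresh // 2:
            new_max -= 1
        new_max = max(1, min(new_max, self.limit))
        if new_max != self.current_max:
            self.current_max = new_max
            self.set_max_active_calls += 1

    def queue_work(self):
        self.pending += 1
        self._maybe_adjust()

    def work_done(self):
        self.pending -= 1


q = ThresholdQueue(limit=8, thresh=4, interval=4)
for _ in range(16):          # burst of work: pending grows past the threshold
    q.queue_work()
grown = q.current_max
for _ in range(16):          # the burst drains
    q.work_done()
for _ in range(4):           # light load: pending stays low
    q.queue_work()
    q.work_done()
print(grown, q.current_max, q.set_max_active_calls)
```

In this model the expensive adjustment runs at most once per `interval` queued works, which is the trade-off the message describes: reacting to load without taking the mutex on every queue operation.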
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Add high priority workqueue support for btrfs_workqueue_struct · 1ca08976
      Committed by Qu Wenruo
      Add high priority function to btrfs_workqueue.
      
      This is implemented by embedding a btrfs_workqueue into a
      btrfs_workqueue and using some helper functions to distinguish the normal
      priority wq from the high priority wq.
      So the high priority wq is completely independent of the normal
      workqueue.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue · 08a9ff32
      Committed by Qu Wenruo
      Use the kernel workqueue to implement a new btrfs_workqueue_struct, which
      has the ordered execution feature like the btrfs_worker.

      The func is executed concurrently, and the ordered_func/ordered_free
      are executed in the sequence in which the works were queued, after the
      corresponding func is done.

      The new btrfs_workqueue works much like the original one: one workqueue
      for normal work and a list for ordered work.
      When a work is queued, the ordered work is added to the list and a helper
      function is queued into the workqueue.
      The helper function executes a normal work and then checks and executes as
      many ordered works as possible in the sequence they were queued.

      In this patch, neither the high priority work queue nor thresholding is added yet.
      Both features will be added in the following patches.
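
The ordering rule described above (func in any order, ordered_func strictly in queueing order) can be sketched in a few lines. This is a single-threaded toy model, not the kernel implementation: `OrderedQueue`, `queue` and `run_helper` are hypothetical names, and the out-of-order completion is simulated by calling the helper in a shuffled order.

```python
from collections import deque


class OrderedQueue:
    """Toy model of one workqueue plus an ordered list (hypothetical names)."""

    def __init__(self):
        self.ordered = deque()   # works in the order they were queued
        self.log = []            # records what ran, in execution order

    def queue(self, name):
        work = {"name": name, "done": False}
        self.ordered.append(work)
        return work

    def run_helper(self, work):
        # The normal func may run at any time relative to other works.
        self.log.append(("func", work["name"]))
        work["done"] = True
        # Then run as many ordered works as possible, strictly from the
        # head of the list, stopping at the first unfinished one.
        while self.ordered and self.ordered[0]["done"]:
            head = self.ordered.popleft()
            self.log.append(("ordered_func", head["name"]))
            self.log.append(("ordered_free", head["name"]))


q = OrderedQueue()
a, b, c = q.queue("A"), q.queue("B"), q.queue("C")
for w in (c, a, b):              # funcs finish out of order: C, A, B
    q.run_helper(w)
ordered = [name for kind, name in q.log if kind == "ordered_func"]
print(ordered)
```

Even though C's func finishes first, its ordered_func is held back until A and B have completed theirs, so the ordered part always follows queueing order.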
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Cleanup the unused struct async_sched. · f5961d41
      Committed by Qu Wenruo
      The struct async_sched is not used by any code and can be removed.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Josef Bacik <jbacik@fusionio.com>
      Tested-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip search tree for REG files · 644d1940
      Committed by Liu Bo
      It is unnecessary to search the tree again for @gen, @mode and @rdev
      when creating REG inodes, as we already have the btrfs_inode_item in sctx,
      from which @gen, @mode and @rdev can easily be fetched.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Committed by Miao Xie
      We cannot release the reserved metadata space for the first write if we
      find that the write position is pre-allocated, because the kernel might write
      the data to disk before we do the second write but after the can-nocow
      check. If we released the space for the first write, we might fail to update
      the metadata because of a lack of space.

      Fix this problem by ending the nocow write if there is dirty data in the
      range whose space is pre-allocated.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Committed by Miao Xie
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But the old code used the size of the write range to calculate the lock
      range directly, without considering the offset, so we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
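
The two diagrams above can be reproduced numerically. The sketch below is an illustration under assumptions, not the kernel code: it assumes a 4096-byte sector size, and `old_lock_range` is one reading of "used the size of the write range directly" while `new_lock_range` rounds the end of the write up to a sector boundary. Both function names are hypothetical.

```python
SECTORSIZE = 4096  # assumed sector size for the example


def round_down(x, align):
    return x - (x % align)


def round_up(x, align):
    return round_down(x + align - 1, align)


def old_lock_range(pos, write_bytes):
    """Lock range derived from the write size only (offset ignored)."""
    start = round_down(pos, SECTORSIZE)
    end = start + round_up(write_bytes, SECTORSIZE) - 1
    return start, end


def new_lock_range(pos, write_bytes):
    """Lock range covering every sector the write touches."""
    start = round_down(pos, SECTORSIZE)
    end = round_up(pos + write_bytes, SECTORSIZE) - 1
    return start, end


# A 2-block write that starts in the middle of a sector.
pos, length = 2048, 8192
last_byte = pos + length - 1
print(old_lock_range(pos, length), new_lock_range(pos, length), last_byte)
```

With these inputs the old calculation locks two blocks and stops before the last written byte, while the new one locks three blocks and covers the whole write.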
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • D
      9c9ca00b
    • btrfs: send: fix old buffer length in fs_path_ensure_buf · 1b2782c8
      Committed by David Sterba
      In "btrfs: send: lower memory requirements in common case" the code to
      save the old_buf_len was incorrectly moved to a wrong place and broke
      the original logic.
      Reported-by: Filipe David Manana <fdmanana@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Reviewed-by: Filipe David Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Committed by Filipe Manana
      While dropping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram of the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1MB/sec
      After this change:  125.07MB/sec
      (averages of 5 test runs)

      Test machine: quad-core Intel i5-3570K, 32GB of RAM, SSD
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: more efficient split extent state insertion · f2071b21
      Committed by Filipe Manana
      When we split an extent state there's no need to start the rbtree search
      from the root node - we can start it from the original extent state node,
      since we would end up in its subtree if we do the search starting at the
      root node anyway.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: remove unneeded field / smaller extent_map structure · cbc0e928
      Committed by Filipe Manana
      We don't need to have an unsigned int field in the extent_map struct
      to tell us whether the extent map is in the inode's extent_map tree or
      not. We can use the rb_node struct field and the RB_CLEAR_NODE and
      RB_EMPTY_NODE macros to achieve the same task.
      
      This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a
      64-bit system).
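
The idea of using the rb_node itself as the "in tree" flag can be modelled outside the kernel. This is a rough Python analogy, not the C code: it assumes the usual rbtree convention that a cleared node points to itself as its parent, and the class and function names are illustrative only.

```python
class RbNode:
    """Minimal stand-in for struct rb_node: only the parent link."""

    def __init__(self):
        self.parent = None


def rb_clear_node(node):
    # Analogy of RB_CLEAR_NODE: a node that is its own parent is in no tree.
    node.parent = node


def rb_empty_node(node):
    # Analogy of RB_EMPTY_NODE.
    return node.parent is node


class ExtentMap:
    def __init__(self, start, length):
        self.start = start
        self.len = length
        self.rb_node = RbNode()
        rb_clear_node(self.rb_node)   # a new map is not in any tree


def extent_map_in_tree(em):
    """No separate flag is needed: ask the rb_node."""
    return not rb_empty_node(em.rb_node)


em = ExtentMap(0, 4096)
before = extent_map_in_tree(em)
em.rb_node.parent = None              # linked as the root of a tree
linked = extent_map_in_tree(em)
rb_clear_node(em.rb_node)             # removed from the tree again
after = extent_map_in_tree(em)
print(before, linked, after)
```

The point of the design is that removal must always clear the node, otherwise the self-parent test would keep reporting a stale "in tree" state.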
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: skip locking when searching commit root · e84752d4
      Committed by Wang Shilong
      The commit root is never changed, so skip the locking dance on it
      when walking backrefs. This can speed up btrfs send operations.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: wake up @scrub_pause_wait as much as we can · 32a44789
      Committed by Wang Shilong
      Checking the @scrubs_running=@scrubs_paused condition inside wait_event()
      is not an atomic operation, which means @scrub_running and @scrub_paused
      may be incremented or decremented at any time. Let's wake up
      @scrub_pause_wait as often as we can so that transaction commit is blocked less.
      
      An example below:
      
      Thread1				Thread2
      |->scrub_blocked_if_needed()	|->scrub_pending_trans_workers_inc
        |->increase @scrub_paused
                                             |->increase @scrub_running
        |->wake up scrub_pause_wait list
                                             |->scrub blocked
                                             |->increase @scrub_paused
      
      Thread3 is committing a transaction, which is blocked at btrfs_scrub_pause().
      After Thread2 increases @scrub_paused, the condition
      @scrub_paused=@scrub_running is met, but the transaction will still be
      blocked until another call wakes up @scrub_pause_wait.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: cancel scrub on transaction abortion · c0af8f0b
      Committed by Wang Shilong
      If we fail to commit transaction, we'd better
      cancel scrub operations.
      Suggested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: device_replace: fix deadlock for nocow case · 12cf9372
      Committed by Wang Shilong
      Commit cb7ab021 causes the following deadlock, found by
      xfstests btrfs/011:

      Thread1 is committing a transaction and is blocked at
      btrfs_scrub_pause().

      Thread2 is calling btrfs_file_aio_write(), which holds the
      inode's @i_mutex and tries to commit a transaction (blocked because
      Thread1 is committing the transaction).

      Thread3 is the copy_nocow_page worker, which also tries to
      take the inode's @i_mutex, so Thread3 ends up waiting for Thread1 to finish.

      Thread4 is waiting for pending workers to finish, which means waiting for
      Thread3 to finish. So the problem looks like this:

      Thread1--->Thread4--->Thread3--->Thread2---->Thread1

      Deadlock happens! We fix it by letting Thread1 go first,
      which means we won't block the transaction commit while we are
      waiting for pending workers to finish.
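
The wait chain above is a cycle in a wait-for graph, and the fix amounts to removing one edge. The sketch below only illustrates that view and is unrelated to the actual kernel change: `find_cycle` is a hypothetical helper, and each thread is assumed to wait on at most one other thread.

```python
def find_cycle(waits_for, start):
    """Follow the wait-for edges from `start`; return the cycle if any."""
    path, seen = [start], {start}
    node = start
    while node in waits_for:
        node = waits_for[node]
        path.append(node)
        if node in seen:
            return path[path.index(node):]
        seen.add(node)
    return None


# Edges taken from the chain in the message: X waits for waits_for[X].
before = {
    "Thread1": "Thread4",
    "Thread4": "Thread3",
    "Thread3": "Thread2",
    "Thread2": "Thread1",
}

# "Letting Thread1 go first": the commit no longer waits for the workers.
after = dict(before)
del after["Thread1"]

print(find_cycle(before, "Thread1"))
print(find_cycle(after, "Thread2"))
```

Once Thread1 no longer waits for Thread4, every remaining chain ends at Thread1, which can make progress and release the others.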
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix a possible deadlock between scrub and transaction committing · 6cf7f77e
      Committed by Wang Shilong
      btrfs_scrub_continue() will be called when cleaning up a transaction. However,
      it may only be called if btrfs_scrub_pause() was called before.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • btrfs: Use PTR_ERR_OR_ZERO · 886322e8
      Committed by Sachin Kamat
      PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it
      also include missing err.h header.
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix send issuing outdated paths for utimes, chown and chmod · bf0d1f44
      Committed by Filipe Manana
      When doing an incremental send, if we had a directory pending a move/rename
      operation and none of its parents, except for the immediate parent, were
      pending a move/rename, then after processing the directory's references we would
      issue utimes, chown and chmod instructions against an outdated path - a
      path which matched the one in the parent root.
      
      This change also slightly simplifies the code that builds a path
      for a directory which has a move/rename operation delayed.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/d/e
          $ mkdir /mnt/btrfs/a/b/c/f
          $ chmod 0777 /mnt/btrfs/a/b/c/d/e
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2
          $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2
          $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2
          $ chmod 0700 /mnt/btrfs/a/b/f2/e2
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: chmod a/b/c/d/e failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: correctly determine if blocks are shared in btrfs_compare_trees · 6baa4293
      Committed by Filipe Manana
      Just comparing the pointers (logical disk addresses) of the btree nodes is
      not completely bulletproof; we have to check whether their generation numbers
      match too.
      
      It is guaranteed that a COW operation will result in a block with a different
      logical disk address than the original block's address, but over time we can
      reuse that former logical disk address.
      
      For example, creating a 2GB filesystem on a loop device, and having a script
      running in a loop always updating the access timestamp of a file, resulted in
      the same logical disk address being reused for the same fs btree block in
      only about 4 minutes.
      
      This could make us skip entire subtrees when doing an incremental send (which
      is currently the only user of btrfs_compare_trees). However the odds of getting
      2 blocks at the same tree level, with the same logical disk address, equal first
      slot keys and different generations, should hopefully be very low.
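
The difference between the two checks can be shown with a tiny model. This is a sketch, not the kernel's comparison code: `BlockPtr` and both predicates are hypothetical, and the address and generation values are invented for the example.

```python
from typing import NamedTuple


class BlockPtr(NamedTuple):
    bytenr: int       # logical disk address of the btree block
    generation: int   # transaction generation that wrote the block


def blocks_shared_by_address(left, right):
    """Address-only check: fooled when an address is reused."""
    return left.bytenr == right.bytenr


def blocks_shared(left, right):
    """Address and generation must both match."""
    return left.bytenr == right.bytenr and left.generation == right.generation


# Hypothetical values: the block was COWed away and, later on, a newer
# block was allocated at the same logical address.
in_parent_snapshot = BlockPtr(bytenr=29376512, generation=7)
in_send_snapshot = BlockPtr(bytenr=29376512, generation=42)

print(blocks_shared_by_address(in_parent_snapshot, in_send_snapshot))
print(blocks_shared(in_parent_snapshot, in_send_snapshot))
```

With the address-only check the two blocks look shared and the subtree would be skipped; adding the generation comparison makes the reuse visible.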
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix send attempting to rmdir non-empty directories · 9dc44214
      Committed by Filipe Manana
      The incremental send algorithm assumed that it was possible to issue
      a directory remove (rmdir) if the inode number it was currently
      processing was greater than (or equal to) any inode that referenced
      the directory's inode. This wasn't a valid assumption because any such
      inode might be a child directory that is pending a move/rename operation,
      because it was moved into a directory that has a higher inode number and
      was moved/renamed too - in other words, the case the following commit
      addressed:
      
          9f03740a
          (Btrfs: fix infinite path build loops in incremental send)
      
      This made an incremental send issue an rmdir operation before the
      target directory was actually empty, which made btrfs receive fail.
      Therefore it needs to wait for all pending child directory inodes to
      be moved/renamed before sending an rmdir operation.
      
      Simple steps to reproduce this issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c/x
          $ mkdir /mnt/btrfs/a/b/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY
          $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. Directory not empty
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: send, don't send rmdir for same target multiple times · 29d6d30f
      Committed by Filipe Manana
      When doing an incremental send, if we delete a directory that has N > 1
      hardlinks for the same file and that file has the highest inode number
      inside the directory contents, an incremental send would send N times an
      rmdir operation against the directory. This made the btrfs receive command
      fail on the second rmdir instruction, as the target directory didn't exist
      anymore.
      
      Steps to reproduce the issue:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b/c
          $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt
          $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ rm -f /mnt/btrfs/a/b/c/foo.txt
          $ rm -f /mnt/btrfs/a/b/c/bar.txt
          $ rmdir /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
      
          ERROR: rmdir o259-6-0 failed. No such file or directory
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: incremental send, fix invalid path after dir rename · 2b863a13
      Committed by Filipe Manana
      This fixes yet one more case not caught by the commit titled:
      
         Btrfs: fix infinite path build loops in incremental send
      
      In this case, even before the initial full send, we have a directory
      which is a child of a directory with a higher inode number. Then we
      perform the initial send, and afterwards we rename both the child and the
      parent, without moving them around. After doing these 2 renames, an
      incremental send sent a rename instruction for the child directory
      which contained an invalid "from" path (referenced the parent's old
      name, not the new one), which made the btrfs receive command fail.
      
      Steps to reproduce:
      
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ mkdir -p /mnt/btrfs/a/b
          $ mkdir /mnt/btrfs/d
          $ mkdir /mnt/btrfs/a/b/c
          $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1
          $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send
          $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x
          $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y
          $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2
          $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send
      
          $ umount /mnt/btrfs
          $ mkfs.btrfs -f /dev/sdb3
          $ mount /dev/sdb3 /mnt/btrfs
          $ btrfs receive /mnt/btrfs -f /tmp/base.send
          $ btrfs receive /mnt/btrfs -f /tmp/incremental.send
      
      The second btrfs receive command failed with:
        "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory"
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Committed by Filipe Manana
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
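
Why the last item is useless can be checked with the numbers from the dump above. The following sketch is an illustration only, not the check the patch adds: it assumes a 4096-byte sector size, and `hole_is_useless` is a hypothetical helper that treats a hole as useless when it starts at or beyond the inode size rounded up to a sector boundary.

```python
SECTORSIZE = 4096  # assumed sector size for the example


def round_up(x, align):
    return -(-x // align) * align


def hole_is_useless(hole_start, i_size):
    """A hole item starting past the sector-aligned size covers no file data."""
    return hole_start >= round_up(i_size, SECTORSIZE)


i_size = 132254                       # "size 132254" from the INODE_ITEM
aligned_size = round_up(i_size, SECTORSIZE)

# Key offsets of the two hole items (disk byte 0) in the dump above.
first_hole_useless = hole_is_useless(0, i_size)
last_hole_useless = hole_is_useless(585728, i_size)
print(aligned_size, first_hole_useless, last_hole_useless)
```

The hole at offset 0 lies inside the file and is needed, while the one at offset 585728 starts well past the aligned size of 135168 bytes and describes nothing the file can read.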
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>