1. 08 Jul 2016 (7 commits)
    • Btrfs: introduce ticketed enospc infrastructure · 957780eb
      Josef Bacik committed
      Our enospc flushing sucks.  It was born from a time when we were early
      enospc'ing constantly because multiple threads would race in for the same
      reservation and randomly starve other ones out.  So I came up with this solution
      to block any other reservations from happening while one guy tried to flush
      stuff to satisfy his reservation.  This gives us pretty good correctness, but
      completely crap latency.
      
      The solution I've come up with is ticketed reservations.  Basically we try to
      make our reservation, and if we can't we put a ticket on a list in order and
      kick off an async flusher thread.  This async flusher thread does the same old
      flushing we always did, just asynchronously.  As space is freed and added back
      to the space_info it checks and sees if we have any tickets that need
      satisfying, and adds space to the tickets and wakes up anything we've satisfied.
      
      Once the flusher thread stops making progress it wakes up all the current
      tickets and tells them to take a hike.
      
      There is a priority list for things that can't flush, since the async flusher
      could do anything and we need to avoid deadlocks.  These guys get priority for
      having their reservation made, and will still do manual flushing themselves in
      case the async flusher isn't running.
      
      This patch gives us significantly better latencies (a minimal userspace sketch
      of the ticket flow follows this entry).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      957780eb
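      Below is a minimal userspace sketch of the ticketed flow described in the entry
      above.  It is an illustration only: the struct names, fields, and helpers are
      assumptions modelled on the commit message, not the btrfs code, the async
      flusher is reduced to whoever calls add_space(), and the error path where a
      stalled flusher fails all waiting tickets is omitted.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdint.h>

        struct ticket {
            uint64_t bytes;             /* bytes this reservation still needs */
            bool granted;
            pthread_cond_t wait;
            struct ticket *next;
        };

        struct space_info {
            pthread_mutex_t lock;
            uint64_t free_bytes;
            struct ticket *head, *tail; /* FIFO of blocked reservations */
        };

        /* Called as flushed space comes back: satisfy the oldest tickets first
         * and wake each waiter once its ticket is fully covered. */
        static void add_space(struct space_info *si, uint64_t bytes)
        {
            pthread_mutex_lock(&si->lock);
            si->free_bytes += bytes;
            while (si->head && si->free_bytes >= si->head->bytes) {
                struct ticket *t = si->head;

                si->free_bytes -= t->bytes;
                si->head = t->next;
                if (!si->head)
                    si->tail = NULL;
                t->granted = true;
                pthread_cond_signal(&t->wait);
            }
            pthread_mutex_unlock(&si->lock);
        }

        /* Fast path reserves directly; otherwise queue a ticket in arrival
         * order and sleep until enough freed space has been handed to it. */
        static void reserve_bytes(struct space_info *si, uint64_t bytes)
        {
            pthread_mutex_lock(&si->lock);
            if (!si->head && si->free_bytes >= bytes) {
                si->free_bytes -= bytes;
                pthread_mutex_unlock(&si->lock);
                return;
            }

            struct ticket t = { .bytes = bytes };
            pthread_cond_init(&t.wait, NULL);
            if (si->tail)
                si->tail->next = &t;
            else
                si->head = &t;
            si->tail = &t;
            /* a real implementation would kick the async flusher here */
            while (!t.granted)
                pthread_cond_wait(&t.wait, &si->lock);
            pthread_mutex_unlock(&si->lock);
            pthread_cond_destroy(&t.wait);
        }

      The priority list mentioned in the commit message would simply be a second
      queue that add_space() drains before the ordinary one.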
    • Btrfs: add tracepoint for adding block groups · c83f8eff
      Josef Bacik committed
      I'm writing a tool to visualize the enospc system inside btrfs, and I need this
      tracepoint in order to keep track of the block groups in the system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c83f8eff
    • Btrfs: warn_on for unaccounted spaces · d555b6c3
      Josef Bacik committed
      These were hidden behind enospc_debug, which isn't helpful as they indicate
      actual bugs, unlike the rest of the enospc_debug stuff, which is really debug
      information.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d555b6c3
    • Btrfs: change delayed reservation fallback behavior · c48f49d6
      Josef Bacik committed
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However, there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is that we can easily
      just return ENOSPC and fall back to updating the inode item directly.  In the
      worst case (assuming 4k nodesize) we'd steal 64 KiB from the global reserve if we
      fell all the way through; however, if we just fall back and update the inode
      directly we'd only steal 4k * BTRFS_PATH_MAX, which in the worst case is 32 KiB.
      
      We would have also just added the extent item for the inode, so we likely will
      have already cow'ed down most of the way to the leaf containing the inode item,
      and more often than not we will only need one or two nodesizes' worth of
      reservations.  Given that the reservation for the extent itself is also a worst
      case, we will likely already have space to cover the inode update.
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c48f49d6
    • Btrfs: always reserve metadata for delalloc extents · 48c3d480
      Josef Bacik committed
      There are a few races in the metadata reservation stuff.  First we add the bytes
      to the block_rsv well after we've set the bit on the inode saying that we have
      space for it and after we've reserved the bytes.  So use the normal
      btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
      extents when we try to reserve space for our write, which means that we could
      have used up the space for the inode and we wouldn't know because we only check
      before the reservation.  So instead make sure we are always reserving space for
      the inode update, and then if we don't need it release those bytes afterward
      (a small sketch of this reserve-then-release pattern follows this entry).
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      48c3d480
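      A small self-contained sketch of the reserve-then-release rule described above.
      The toy_rsv type and helpers are made up for illustration and stand in for the
      real block reserve helpers.

        #include <errno.h>
        #include <stdint.h>

        struct toy_rsv {
            uint64_t free;       /* space still available to reserve */
            uint64_t reserved;   /* space already promised to callers */
        };

        static int rsv_add(struct toy_rsv *r, uint64_t bytes)
        {
            if (r->free < bytes)
                return -ENOSPC;
            r->free -= bytes;
            r->reserved += bytes;
            return 0;
        }

        static void rsv_release(struct toy_rsv *r, uint64_t bytes)
        {
            r->reserved -= bytes;
            r->free += bytes;
        }

        /* Take the inode-update bytes up front, before any "we have space"
         * state is published; if the update turns out to be unnecessary, hand
         * the bytes straight back instead of leaving a stale reservation. */
        static int reserve_for_write(struct toy_rsv *r, uint64_t inode_bytes,
                                     int inode_needs_update)
        {
            int ret = rsv_add(r, inode_bytes);

            if (ret)
                return ret;            /* fail before the write starts */
            /* ... write path runs here ... */
            if (!inode_needs_update)
                rsv_release(r, inode_bytes);
            return 0;
        }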
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik committed
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsvs look funny
      because every time we migrate bytes over, the size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size, for
      consistency's sake (a sketch of the combined helper follows this entry).
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      25d609f8
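      A sketch of what a single migrate helper with an update_size argument can look
      like; the types and the signature here are illustrative assumptions, not the
      kernel's.

        #include <errno.h>
        #include <stdbool.h>
        #include <stdint.h>

        struct toy_block_rsv {
            uint64_t size;       /* how big this reserve is supposed to be */
            uint64_t reserved;   /* bytes actually held right now */
        };

        /* One migrate helper instead of two: callers that only want to move
         * bytes around (truncate, evict) pass update_size == false so the
         * destination's target size stays fixed instead of growing on every
         * migration. */
        static int rsv_migrate_bytes(struct toy_block_rsv *src,
                                     struct toy_block_rsv *dst,
                                     uint64_t bytes, bool update_size)
        {
            if (src->reserved < bytes)
                return -ENOSPC;
            src->reserved -= bytes;
            dst->reserved += bytes;
            if (update_size)
                dst->size += bytes;
            return 0;
        }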
    • Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d
      Josef Bacik committed
      For some reason we're adding bytes_readonly to the space info after we update
      the space info with the block group info.  This creates a tiny race where we
      could over-reserve space because we haven't yet taken out the bytes_readonly
      bit.  Since we already know this information at the time we call
      update_space_info, just pass it along so it can be updated all at once (see
      the sketch after this entry).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e40edf2d
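      A toy illustration of the "update everything under one lock hold" idea; the
      structure and function here are assumptions, not the btrfs update_space_info()
      signature.

        #include <pthread.h>
        #include <stdint.h>

        struct toy_space_info {
            pthread_mutex_t lock;
            uint64_t total_bytes;
            uint64_t bytes_used;
            uint64_t bytes_readonly;
        };

        /* All of the block group's numbers, including the read-only bytes, are
         * applied in one critical section, so no reservation can slip in
         * between "space added" and "read-only space accounted". */
        static void update_space_info(struct toy_space_info *si, uint64_t total,
                                      uint64_t used, uint64_t readonly)
        {
            pthread_mutex_lock(&si->lock);
            si->total_bytes += total;
            si->bytes_used += used;
            si->bytes_readonly += readonly;
            pthread_mutex_unlock(&si->lock);
        }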
  2. 25 Jun 2016 (1 commit)
    • Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval committed
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix that upgrades the shared inode lock to an exclusive
      lock only when we have delayed items, until we come up with a more complete
      solution (a sketch of the lock-upgrade idea follows this entry).  While we're
      here, rename the btrfs_{get,put}_delayed_items functions to make it very clear
      that they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean -dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      02dbfc99
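      A userspace sketch of the lock-upgrade idea, using a plain rwlock in place of
      the inode's i_rwsem; the names and the has_delayed_items flag are illustrative
      stand-ins, not the btrfs or VFS API.

        #include <pthread.h>
        #include <stdbool.h>

        struct dir_inode {
            pthread_rwlock_t i_rwsem;   /* stand-in for the VFS inode lock */
            bool has_delayed_items;     /* stand-in for the delayed-node check */
        };

        static void readdir_locked(struct dir_inode *dir)
        {
            pthread_rwlock_rdlock(&dir->i_rwsem);      /* ->iterate_shared() path */
            if (dir->has_delayed_items) {
                /* An rwlock cannot be upgraded in place, so drop the shared
                 * lock and re-take it exclusive; any state read before the
                 * drop has to be re-checked afterwards. */
                pthread_rwlock_unlock(&dir->i_rwsem);
                pthread_rwlock_wrlock(&dir->i_rwsem);
            }
            /* ... emit directory entries ... */
            pthread_rwlock_unlock(&dir->i_rwsem);
        }

      The common case (no delayed items) keeps the lock shared, which is what lets
      the parallel readdir workload above run concurrently.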
  3. 24 Jun 2016 (4 commits)
    • Btrfs: Force stripesize to the value of sectorsize · b7f67055
      Chandan Rajendra committed
      Btrfs code currently assumes stripesize to be the same as
      sectorsize. However Btrfs-progs (until commit
      df05c7ed455f519e6e15e46196392e4757257305) has been setting
      btrfs_super_block->stripesize to a value of 4096.
      
      This commit makes sure that the value of btrfs_super_block->stripesize
      is a power of 2. Later, it unconditionally sets btrfs_root->stripesize
      to sectorsize.
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      b7f67055
    • btrfs: fix disk_i_size update bug when fallocate() fails · c0d2f610
      Wang Xiaoguang committed
      When doing a truncate operation, btrfs_setsize() will first call
      truncate_setsize() to set the new inode->i_size, but if btrfs_truncate()
      later fails, btrfs_setsize() will call
      "i_size_write(inode, BTRFS_I(inode)->disk_i_size)" to reset the
      in-memory inode size, and that is where the bug occurs: for the truncate
      case btrfs_ordered_update_i_size() directly uses inode->i_size
      to update BTRFS_I(inode)->disk_i_size, when it should instead use the
      "offset" argument to update disk_i_size.  Here is the call graph:
      ==>btrfs_truncate()
      ====>btrfs_truncate_inode_items()
      ======>btrfs_ordered_update_i_size(inode, last_size, NULL);
      Here btrfs_ordered_update_i_size()'s offset argument is last_size.
      
      The following test case can reveal this bug:
      
      dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=100
      dev=$(losetup --show -f fs.img)
      mkdir -p /mnt/mntpoint
      mkfs.btrfs  -f $dev
      mount $dev /mnt/mntpoint
      cd /mnt/mntpoint
      
      echo "workdir is: /mnt/mntpoint"
      blocksize=$((128 * 1024))
      dd if=/dev/zero of=testfile bs=$blocksize count=1
      sync
      count=$((17*1024*1024*1024/blocksize))
      echo "file size is:" $((count*blocksize))
      for ((i = 1; i <= $count; i++)); do
      	i=$((i + 1))
      	dst_offset=$((blocksize * i))
      	xfs_io -f -c "reflink testfile 0 $dst_offset $blocksize"\
      		testfile > /dev/null
      done
      sync
      
      truncate --size 0 testfile
      ls -l testfile
      du -sh testfile
      exit
      
      In this case the truncate operation will fail with ENOSPC, and
      "du -sh testfile" returns a value greater than 0 even though testfile's
      size is 0, so we need to reflect the correct inode->i_size (a simplified
      sketch of the idea behind the fix follows this entry).
      Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c0d2f610
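      A deliberately simplified sketch of the idea behind the fix: on the truncate
      path, trust the offset the caller passes in rather than the already-reset
      in-memory i_size.  The types and the truncating flag are made up; the real
      function also takes an ordered extent argument, as the call graph above shows.

        #include <stdbool.h>
        #include <stdint.h>

        struct toy_inode {
            uint64_t i_size;        /* in-memory size, already reset by truncate */
            uint64_t disk_i_size;   /* size that should end up on disk */
        };

        /* Toy version of the idea: fold the caller's offset into disk_i_size
         * instead of the possibly stale in-memory i_size. */
        static void ordered_update_disk_i_size(struct toy_inode *inode,
                                               uint64_t offset, bool truncating)
        {
            uint64_t new_size = truncating ? offset : inode->i_size;

            if (inode->disk_i_size != new_size)
                inode->disk_i_size = new_size;
        }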
    • Btrfs: fix error handling in map_private_extent_buffer · 415b35a5
      Liu Bo committed
      map_private_extent_buffer() can return -EINVAL in two different cases,
      1. when the requested contents span two pages if nodesize is larger
         than pagesize,
      2. when it detects something insane.
      
      The 2nd one used to be only a WARN_ON(1), and we decided to return an error
      to callers, but we didn't fix up all of its callers, which is what this
      patch addresses.
      
      Without this, btrfs may end up with a 'general protection' fault, i.e.
      reading invalid memory.
      Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      415b35a5
    • Btrfs: fix error return code in btrfs_init_test_fs() · 04e1b65a
      Wei Yongjun committed
      Fix to return a negative error code from the kern_mount() error handling
      case instead of 0 (ret was left at 0 by the earlier register_filesystem()
      call), as done elsewhere in this function (a small illustration of this bug
      class follows this entry).
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      04e1b65a
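      A generic userspace illustration of the bug class being fixed here (a stale
      ret left at 0 by an earlier successful call); the stub functions are
      placeholders, not the btrfs test code.

        #include <errno.h>

        static int register_step(void)   /* succeeds, leaving ret at 0 */
        {
            return 0;
        }

        static int mount_step(void)      /* fails, like the kern_mount() call */
        {
            return -ENOMEM;
        }

        static int init_sketch(void)
        {
            int ret = register_step();   /* ret is now 0 */
            int err;

            if (ret)
                return ret;

            err = mount_step();
            if (err < 0)
                return err;   /* the fix: return the mount error instead of
                               * falling out with the stale 0 from above */
            return 0;
        }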
  4. 23 Jun 2016 (3 commits)
    • Btrfs: don't do nocow check unless we have to · c6887cd1
      Josef Bacik committed
      Before we write into prealloc/nocow space we have to make sure that there are no
      references to the extents we are writing into, which means checking the extent
      tree and csum tree in the case of nocow.  So we don't want to do the nocow dance
      unless we can't reserve data space, since it's a serious drag on performance.
      With the following sequence
      
      fallocate -l10737418240 /mnt/btrfs-test/file
      cp --reflink /mnt/btrfs-test/file /mnt/btrfs-test/link
      fio --name=randwrite --rw=randwrite --bs=4k --filename=/mnt/btrfs-test/file \
      	--end_fsync=1
      
      we get the worst-case scenario where we have to fall back to doing the check
      anyway.
      
      Without this patch
      lat (usec): min=5, max=111598, avg=27.65, stdev=124.51
      write: io=10240MB, bw=126876KB/s, iops=31718, runt= 82646msec
      
      With this patch
      lat (usec): min=3, max=91210, avg=14.09, stdev=110.62
      write: io=10240MB, bw=212753KB/s, iops=53188, runt= 49286msec
      
      We get twice the throughput, half of the runtime, and half of the average
      latency (a sketch of the reordered check follows this entry).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      [ PAGE_CACHE_ removal related fixups ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      c6887cd1
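      A sketch of the reordering described above: attempt the cheap data-space
      reservation first and only do the expensive nocow/prealloc lookup when that
      reservation fails.  Both helpers are stubs invented for the illustration.

        #include <errno.h>
        #include <stdbool.h>
        #include <stdint.h>

        /* Stub: cheap in the common case, -ENOSPC when space is tight. */
        static int try_reserve_data_space(uint64_t bytes)
        {
            (void)bytes;
            return 0;
        }

        /* Stub: stands in for the expensive extent tree and csum tree checks. */
        static bool extent_is_safe_to_nocow(uint64_t start, uint64_t bytes)
        {
            (void)start;
            (void)bytes;
            return true;
        }

        /* Only fall back to the nocow dance when the reservation fails. */
        static int prepare_write(uint64_t start, uint64_t bytes)
        {
            int ret = try_reserve_data_space(bytes);

            if (!ret)
                return 0;                 /* common fast path: CoW as usual */
            if (ret == -ENOSPC && extent_is_safe_to_nocow(start, bytes))
                return 0;                 /* write in place, no reservation needed */
            return ret;
        }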
    • btrfs: fix deadlock in delayed_ref_async_start · 0f873eca
      Chris Mason committed
      "Btrfs: track transid for delayed ref flushing" was deadlocking on
      btrfs_attach_transaction because it's not safe to call from the async
      delayed ref start code.  This commit brings back btrfs_join_transaction
      instead and checks for a blocked commit.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      0f873eca
    • Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik committed
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different, then we can just bail out (a small sketch of this check
      follows this entry).  This resulted in a 50% speedup in my fs_mark test, and
      reduced the amount of time spent throttling by 60 seconds over the entire run
      (which is about 30 minutes).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      31b9655f
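      A tiny sketch of the transid check described above; the structures here are
      invented stand-ins for the transaction generation bookkeeping.

        #include <stdbool.h>
        #include <stdint.h>

        struct fs_state {
            uint64_t cur_transid;   /* generation of the running transaction */
        };

        struct throttle_ctx {
            uint64_t transid;       /* generation we started throttling in */
        };

        /* If the filesystem has moved on to a newer transaction, the one we
         * were throttling for has already committed, so waiting is pointless. */
        static bool keep_throttling(const struct fs_state *fs,
                                    const struct throttle_ctx *ctx)
        {
            return fs->cur_transid == ctx->transid;
        }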
  5. 18 Jun 2016 (8 commits)
  6. 06 Jun 2016 (9 commits)
  7. 04 Jun 2016 (1 commit)
    • Btrfs: deal with duplicates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason committed
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger it with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST
      Signed-off-by: Chris Mason <clm@fb.com>
      8dff9c85
  8. 03 Jun 2016 (3 commits)
  9. 01 Jun 2016 (1 commit)
  10. 31 May 2016 (2 commits)
    • Btrfs: fix race between device replace and read repair · b5de8d0d
      Filipe Manana committed
      While we are finishing a device replace operation we can have a concurrent
      task trying to do a read repair operation, in which case it will call
      btrfs_map_block() to get a struct btrfs_bio which can have a stripe that
      points to the source device of the device replace operation. This allows
      for the read repair task to dereference the stripe's device pointer after
      the device replace operation has freed the source device, resulting in
      an invalid memory access. This is similar to the problem solved by my
      previous patch in the same series and named "Btrfs: fix race between
      device replace and discard".
      
      So fix this by surrounding the call to btrfs_map_block() and the code
      that uses the returned struct btrfs_bio with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), giving the
      proper serialization with the finishing phase of the device replace
      operation (a userspace model of this counter pattern follows this entry).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      b5de8d0d
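      A userspace model of the counter pattern referenced above: mappers bump a
      counter before using the mapping and drop it afterwards, while the
      replace-finishing path blocks new mappers and waits for the counter to reach
      zero before freeing the source device.  This mirrors the described
      btrfs_bio_counter_inc_blocked()/btrfs_bio_counter_dec() usage in spirit only;
      it is not the kernel implementation.

        #include <pthread.h>

        struct bio_counter {
            pthread_mutex_t lock;
            pthread_cond_t  cond;
            unsigned long   count;     /* in-flight users of the mapping */
            int             blocked;   /* set while the replace is finishing */
        };

        /* Reader side: bump the counter before mapping, waiting out a replace
         * that is in its finishing phase. */
        static void bio_counter_inc_blocked(struct bio_counter *bc)
        {
            pthread_mutex_lock(&bc->lock);
            while (bc->blocked)
                pthread_cond_wait(&bc->cond, &bc->lock);
            bc->count++;
            pthread_mutex_unlock(&bc->lock);
        }

        static void bio_counter_dec(struct bio_counter *bc)
        {
            pthread_mutex_lock(&bc->lock);
            if (--bc->count == 0)
                pthread_cond_broadcast(&bc->cond);
            pthread_mutex_unlock(&bc->lock);
        }

        /* Finishing side: block new mappers, drain the in-flight ones, free
         * the source device, then let everyone through again. */
        static void replace_finish(struct bio_counter *bc)
        {
            pthread_mutex_lock(&bc->lock);
            bc->blocked = 1;
            while (bc->count)
                pthread_cond_wait(&bc->cond, &bc->lock);
            pthread_mutex_unlock(&bc->lock);

            /* ... safe to free the source device here ... */

            pthread_mutex_lock(&bc->lock);
            bc->blocked = 0;
            pthread_cond_broadcast(&bc->cond);
            pthread_mutex_unlock(&bc->lock);
        }

      The discard fix in the next entry applies the same pattern around the
      btrfs_map_block() call in btrfs_discard_extent().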
    • Btrfs: fix race between device replace and discard · 2999241d
      Filipe Manana committed
      While we are finishing a device replace operation, we can make a discard
      operation (fs mounted with -o discard) do an invalid memory access like
      the one reported by the following trace:
      
      [ 3206.384654] general protection fault: 0000 [#1] PREEMPT SMP
      [ 3206.387520] Modules linked in: dm_mod btrfs crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis psmouse tpm ppdev sg parport_pc evdev i2c_piix4 parport
      processor serio_raw i2c_core pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci
      virtio_ring scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [ 3206.388595] CPU: 14 PID: 29194 Comm: fsstress Not tainted 4.6.0-rc7-btrfs-next-29+ #1
      [ 3206.388595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [ 3206.388595] task: ffff88017ace0100 ti: ffff880171b98000 task.ti: ffff880171b98000
      [ 3206.388595] RIP: 0010:[<ffffffff8124d233>]  [<ffffffff8124d233>] blkdev_issue_discard+0x5c/0x2a7
      [ 3206.388595] RSP: 0018:ffff880171b9bb80  EFLAGS: 00010246
      [ 3206.388595] RAX: ffff880171b9bc28 RBX: 000000000090d000 RCX: 0000000000000000
      [ 3206.388595] RDX: ffffffff82fa1b48 RSI: ffffffff8179f46c RDI: ffffffff82fa1b48
      [ 3206.388595] RBP: ffff880171b9bcc0 R08: 0000000000000000 R09: 0000000000000001
      [ 3206.388595] R10: ffff880171b9bce0 R11: 000000000090f000 R12: ffff880171b9bbe8
      [ 3206.388595] R13: 0000000000000010 R14: 0000000000004868 R15: 6b6b6b6b6b6b6b6b
      [ 3206.388595] FS:  00007f6182e4e700(0000) GS:ffff88023fdc0000(0000) knlGS:0000000000000000
      [ 3206.388595] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3206.388595] CR2: 00007f617c2bbb18 CR3: 000000017ad9c000 CR4: 00000000000006e0
      [ 3206.388595] Stack:
      [ 3206.388595]  0000000000004878 0000000000000000 0000000002400040 0000000000000000
      [ 3206.388595]  0000000000000000 ffff880171b9bbe8 ffff880171b9bbb0 ffff880171b9bbb0
      [ 3206.388595]  ffff880171b9bbc0 ffff880171b9bbc0 ffff880171b9bbd0 ffff880171b9bbd0
      [ 3206.388595] Call Trace:
      [ 3206.388595]  [<ffffffffa042899e>] btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042899e>] ? btrfs_issue_discard+0x12f/0x143 [btrfs]
      [ 3206.388595]  [<ffffffffa042e862>] btrfs_discard_extent+0x87/0xde [btrfs]
      [ 3206.388595]  [<ffffffffa04303b5>] btrfs_finish_extent_commit+0xb2/0x1df [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa04464c4>] btrfs_commit_transaction+0x7fc/0x980 [btrfs]
      [ 3206.388595]  [<ffffffff8149c246>] ? __mutex_unlock_slowpath+0x150/0x15b
      [ 3206.388595]  [<ffffffffa0459af6>] btrfs_sync_file+0x38f/0x428 [btrfs]
      [ 3206.388595]  [<ffffffff811a8292>] vfs_fsync_range+0x8c/0x9e
      [ 3206.388595]  [<ffffffff811a82c0>] vfs_fsync+0x1c/0x1e
      [ 3206.388595]  [<ffffffff811a8417>] do_fsync+0x31/0x4a
      [ 3206.388595]  [<ffffffff811a8637>] SyS_fsync+0x10/0x14
      [ 3206.388595]  [<ffffffff8149e025>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [ 3206.388595]  [<ffffffff81100c6b>] ? time_hardirqs_off+0x9/0x14
      [ 3206.388595]  [<ffffffff8108e87d>] ? trace_hardirqs_off_caller+0x1f/0xaa
      
      This happens because when we call btrfs_map_block() from
      btrfs_discard_extent() to get a btrfs_bio structure, the device replace
      operation has not finished yet, but before we use the device of one of the
      stripes from the returned btrfs_bio structure, the device object is freed.
      
      This is illustrated by the following diagram.
      
                  CPU 1                                                  CPU 2
      
       btrfs_dev_replace_start()
      
       (...)
      
       btrfs_dev_replace_finishing()
      
         btrfs_start_transaction()
         btrfs_commit_transaction()
      
         (...)
      
                                                                  btrfs_sync_file()
                                                                    btrfs_start_transaction()
      
                                                                    (...)
      
                                                                    btrfs_commit_transaction()
                                                                      btrfs_finish_extent_commit()
                                                                        btrfs_discard_extent()
                                                                          btrfs_map_block()
                                                                            --> returns a struct btrfs_bio
                                                                                with a stripe that has a
                                                                                device field pointing to
                                                                                source device of the replace
                                                                                operation (the device that
                                                                                is being replaced)
      
         mutex_lock(&uuid_mutex)
         mutex_lock(&fs_info->fs_devices->device_list_mutex)
         mutex_lock(&fs_info->chunk_mutex)
      
         btrfs_dev_replace_update_device_in_mapping_tree()
           --> iterates the mapping tree and for each
               extent map that has a stripe pointing to
               the source device, it updates the stripe
               to point to the target device instead
      
         btrfs_rm_dev_replace_blocked()
           --> waits for fs_info->bio_counter to go down to 0
      
         btrfs_rm_dev_replace_remove_srcdev()
           --> removes source device from the list of devices
      
         mutex_unlock(&fs_info->chunk_mutex)
         mutex_unlock(&fs_info->fs_devices->device_list_mutex)
         mutex_unlock(&uuid_mutex)
      
         btrfs_rm_dev_replace_free_srcdev()
           --> frees the source device
      
                                                                          --> iterates over all stripes
                                                                              of the returned struct
                                                                              btrfs_bio
                                                                          --> for each stripe it
                                                                              dereferences its device
                                                                              pointer
                                                                              --> it ends up finding a
                                                                                  pointer to the device
                                                                                  used as the source
                                                                                  device for the replace
                                                                                  operation and that was
                                                                                  already freed
      
      So fix this by surrounding the call to btrfs_map_block(), and the code
      that uses the returned struct btrfs_bio, with calls to
      btrfs_bio_counter_inc_blocked() and btrfs_bio_counter_dec(), so that
      the finishing phase of the device replace operation blocks until the
      bio counter decreases to zero before it frees the source device.
      This is the same approach we do at btrfs_map_bio() for example.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      2999241d
  11. 30 May 2016 (1 commit)
    • Btrfs: fix race between device replace and chunk allocation · 22ab04e8
      Filipe Manana committed
      While iterating and copying extents from the source device, the device
      replace code keeps adjusting a left cursor that is used to make sure that
      once we finish processing a device extent, any future writes to extents
      from the corresponding block group will get into both the source and
      target devices. This left cursor is also used for resuming the device
      replace operation at mount time.
      
      However using this left cursor to decide whether writes go into both
      devices or only the source device is not enough to guarantee we don't
      miss copying extents into the target device. There are two cases where
      the current approach fails. The first one is related to when there are
      holes in the device and they get allocated for new block groups while
      the device replace operation is iterating the device extents (more on
      this explained below). The second one is that when that loop over the
      device extents finishes, we start delalloc, wait for all ordered extents
      and then commit the current transaction, we might have got new block
      groups allocated that are now using a device extent that has an offset
      greater than or equal to the value of the left cursor, in which case
      writes to extents belonging to these new block groups will get issued
      only to the source device.
      
      For the first case where the current approach of using a left cursor
      fails, consider the source device currently has the following layout:
      
        [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
        3Gb             4Gb                         5Gb
      
      While we are iterating the device extents from the source device using
      the commit root of the device tree, the following happens:
      
              CPU 1                                            CPU 2
      
                            <we are at transaction N>
      
        scrub_enumerate_chunks()
          --> searches the device tree for
              extents belonging to the source
              device using the device tree's
              commit root
          --> 1st iteration finds extent belonging to
              block group A
      
              --> sets block group A to RO mode
                  (btrfs_inc_block_group_ro)
      
              --> sets cursor left to found_key.offset
                  which is 3Gb
      
              --> scrub_chunk() starts
                  copies all allocated extents from
                  block group's A stripe at source
                  device into target device
      
                                                                 btrfs_alloc_chunk()
                                                                   --> allocates device extent
                                                                       in the range [4Gb, 5Gb[
                                                                       from the source device for
                                                                       a new block group C
      
                                                                 extent allocated from block
                                                                 group C for a direct IO,
                                                                 buffered write or btree node/leaf
      
                                                                 extent is written to, perhaps
                                                                 in response to a writepages()
                                                                 call from the VM or directly
                                                                 through direct IO
      
                                                                 the write is made only against
                                                                 the source device and not against
                                                                 the target device because the
                                                                 extent's offset is in the interval
                                                                 [4Gb, 5Gb[ which is larger then
                                                                 the value of cursor_left (3Gb)
      
              --> scrub_chunks() finishes
      
              --> updates left cursor from 3Gb to
                  4Gb
      
              --> btrfs_dec_block_group_ro() sets
                  block group A back to RW mode
      
                                   <we are still at transaction N>
      
          --> 2nd iteration finds extent belonging to
              block group B - it did not find the new
              extent in the range [4Gb, 5Gb[ for block
              group C because we are using the device
              tree's commit root or even because the
              block group's items are not all yet
              inserted in the respective btrees, that is,
              the block group is still attached to some
              transaction handle's new_bgs list and
              btrfs_create_pending_block_groups() was
              not called yet against that transaction
              handle, so the device extent items were
              not yet inserted into the devices tree
      
                                   <we are still at transaction N>
      
              --> so we end not copying anything from the newly
                  allocated device extent from the source device
                  to the target device
      
      So fix this by making __btrfs_map_block() always redirect writes to the
      target device as well, independently of the left cursor's value.  With
      this change the left cursor is now used only for the purpose of tracking
      progress and allowing a mount operation to resume a device replace (a toy
      sketch of the write mirroring follows this entry).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      22ab04e8
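      A toy sketch of the new rule: while a replace is running, any write stripe
      aimed at the source device gets an extra stripe aimed at the target, with no
      cursor comparison at all.  The names and types are invented for illustration
      and do not match the kernel's mapping structures.

        #include <stdbool.h>
        #include <stdint.h>

        struct stripe {
            int devid;
            uint64_t physical;
        };

        /* Toy decision: previously the extra stripe was added only when the
         * write fell below the replace cursor; now it is unconditional. */
        static int map_write_stripes(struct stripe in, bool replace_running,
                                     int src_devid, int tgt_devid,
                                     struct stripe out[2])
        {
            int n = 0;

            out[n++] = in;
            if (replace_running && in.devid == src_devid) {
                struct stripe dup = in;

                dup.devid = tgt_devid;  /* mirror the write to the target */
                out[n++] = dup;
            }
            return n;
        }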