1. 11 Mar 2014, 1 commit
    • Btrfs: device_replace: fix deadlock for nocow case · 12cf9372
      Wang Shilong authored
      commit cb7ab021 caused the following deadlock, found by
      xfstests btrfs/011:
      
      Thread1 is committing a transaction and is blocked at
      btrfs_scrub_pause().
      
      Thread2 is calling btrfs_file_aio_write(), holds the inode's
      @i_mutex and tries to commit a transaction (blocked because
      Thread1 is already committing one).
      
      Thread3 is the copy_nocow_page worker, which also tries to take
      the inode's @i_mutex, so Thread3 waits for Thread2 to finish.
      
      Thread4 is waiting for the pending workers to finish, which means
      waiting for Thread3. So the problem looks like this:
      
      Thread1--->Thread4--->Thread3--->Thread2---->Thread1
      
      Deadlock! We fix it by letting Thread1 go first, i.e. we no longer
      block the transaction commit while we are waiting for the pending
      workers to finish.
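      Below is a hypothetical sketch of the idea behind the fix, not the actual
      patch: the scrub_pause_on()/scrub_pause_off() helpers and the scrub_ctx
      fields are assumptions used for illustration. The point is to let a blocked
      transaction commit proceed before we wait on the nocow workers, so the
      cycle above cannot form.

      /* Illustrative sketch only; helper and field names are assumed. */
      static void wait_for_nocow_workers(struct scrub_ctx *sctx,
                                         struct btrfs_fs_info *fs_info)
      {
              /* let a transaction commit blocked in btrfs_scrub_pause() go first */
              scrub_pause_on(fs_info);

              wait_event(sctx->list_wait,
                         atomic_read(&sctx->workers_pending) == 0);

              scrub_pause_off(fs_info);
      }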
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      12cf9372
  2. 29 Jan 2014, 6 commits
  3. 25 Nov 2013, 1 commit
  4. 21 Nov 2013, 1 commit
  5. 12 Nov 2013, 3 commits
  6. 21 Sep 2013, 1 commit
    • Btrfs: improve replacing nocow extents · 652f25a2
      Josef Bacik authored
      Various people have hit a deadlock when running btrfs/011.  This is because when
      replacing nocow extents we will take the i_mutex to make sure nobody messes with
      the file while we are replacing the extent.  The problem is we are already
      holding a transaction open, which is a locking inversion, so instead we need to
      save these inodes we find and then process them outside of the transaction.
      
      Further we can't just lock the inode and assume we are good to go.  We need to
      lock the extent range and then read back the extent cache for the inode to make
      sure the extent really still points at the physical block we want.  If it
      doesn't, we don't have to copy it.  Thanks,
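      A rough sketch of the "save and process later" idea follows; the struct and
      helper names here are made up for illustration and are not the ones used in
      the actual patch.

      /* Illustrative only: remember the inode, do not touch i_mutex yet. */
      struct nocow_inode_ref {
              struct list_head list;
              u64 root;
              u64 inum;
              u64 offset;
      };

      static int record_nocow_inode(struct list_head *pending, u64 root,
                                    u64 inum, u64 offset)
      {
              struct nocow_inode_ref *ref = kmalloc(sizeof(*ref), GFP_NOFS);

              if (!ref)
                      return -ENOMEM;
              ref->root = root;
              ref->inum = inum;
              ref->offset = offset;
              list_add_tail(&ref->list, pending);
              return 0;
      }

      The recorded entries would then be walked after the transaction has been
      ended, where taking i_mutex, locking the extent range and re-checking the
      extent cache is safe.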
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      652f25a2
  7. 01 Sep 2013, 5 commits
  8. 20 Jul 2013, 1 commit
    • Btrfs: fix wrong write offset when replacing a device · 115930cb
      Stefan Behrens authored
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that the write offset was changed by
      mistake, introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      115930cb
  9. 02 Jul 2013, 4 commits
    • Btrfs: fix several potential problems in copy_nocow_pages_for_inode · edd1400b
      Miao Xie authored
      - It makes no sense to deal with an inode in a dead tree.
      - fix the race between dio and the page copy by waiting for dio completion
      - avoid the race between the page copy and truncate/punch hole
      - check whether the page is in the page cache or not (see the sketch below)
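      A minimal sketch of the last three points, using the standard kernel
      primitives (inode_dio_wait(), find_get_page(), lock_page()); this is an
      illustration of the checks, not the actual patch.

      /* Illustrative only; btrfs specifics and error handling are elided. */
      inode_dio_wait(inode);          /* wait for in-flight direct IO to finish */

      page = find_get_page(inode->i_mapping, index);
      if (!page)
              return 0;               /* not in the page cache */

      lock_page(page);
      if (page->mapping != inode->i_mapping) {
              /* the page was truncated or punched out while we slept */
              unlock_page(page);
              page_cache_release(page);       /* put_page() on newer kernels */
              return 0;
      }
      /* ... safe to copy the page contents ... */
      unlock_page(page);
      page_cache_release(page);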
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      edd1400b
    • Btrfs: cleanup the code of copy_nocow_pages_for_inode() · 826aa0a8
      Miao Xie authored
      - It makes no sense to continue after an error has happened; with this
        patch we just bail out.
      - remove some checks in copy_nocow_pages_for_inode(), such as the page
        check after the write and the inode check at the end of the function,
        because we are sure they exist.
      - remove the unnecessary goto in the return value check of the write
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      826aa0a8
    • Btrfs: fix oops when recovering the file data by scrub function · 26b25891
      Miao Xie authored
      We get an oops while running the btrfs replace start test:
      ------------[ cut here ]------------
      kernel BUG at mm/filemap.c:608!
      [SNIP]
      Call Trace:
        [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
        [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
        [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
        [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
        [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
        [<ffffffff8106f2f0>] kthread+0xc0/0xd0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
        [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
      [SNIP]
       RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
        RSP <ffff88010316bb98>
       ---[ end trace 421e79ad0dd72c7d ]---
      
      It is because we forgot to lock the page again after we read data into
      the page. Fix it.
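      A minimal sketch of the pattern the fix restores (illustrative, assuming the
      page was read in through the address-space ->readpage() path): ->readpage()
      unlocks the page when the read completes, so the page has to be locked again
      before the final unlock_page().

      /* Illustrative sketch, not the actual patch. */
      static int copy_one_nocow_page(struct page *page)
      {
              int ret = 0;

              /* caller passes the page locked; ->readpage() unlocks it on I/O completion */
              if (!PageUptodate(page)) {
                      ret = page->mapping->a_ops->readpage(NULL, page);
                      if (ret)
                              return ret;
                      lock_page(page);        /* re-take the lock; also waits for the read */
                      if (!PageUptodate(page)) {
                              unlock_page(page);
                              return -EIO;
                      }
              }

              /* ... copy the page contents to the target device ... */

              unlock_page(page);      /* BUGs in unlock_page() if we forgot lock_page() above */
              return ret;
      }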
      Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      26b25891
    • Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Miao Xie authored
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary: the extents that btrfs_sector_sum points to are contiguous,
      so we can find the expected checksum from btrfs_ordered_sum's bytenr and
      the offset, and btrfs_sector_sum's bytenr can be dropped. After removing
      bytenr there is only one member left in the structure, so it makes no
      sense to keep the structure; remove it and store the checksum values in
      a u32 array.
      
      With this change we no longer use a while loop to fetch the checksums one
      by one; we can get several checksum values at a time, which improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
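      A minimal sketch of the lookup this enables (the helper itself is
      illustrative; the u32 array follows the description above): because the
      checksummed range is contiguous, the checksum of a given byte is found by a
      simple index computed from its distance to btrfs_ordered_sum's bytenr.

      /* Illustrative helper, not taken from the patch. */
      static u32 csum_for_bytenr(const struct btrfs_ordered_sum *sums,
                                 u64 bytenr, u32 sectorsize)
      {
              u64 index = div_u64(bytenr - sums->bytenr, sectorsize);

              return sums->sums[index];
      }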
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      f51a4a18
  10. 01 Jul 2013, 1 commit
  11. 18 May 2013, 1 commit
    • Btrfs: use a btrfs bioset instead of abusing bio internals · 9be3395b
      Chris Mason authored
      Btrfs has been pointer tagging bi_private and using bi_bdev
      to store the stripe index and mirror number of failed IOs.
      
      As bios bubble back up through the call chain, we use these
      to decide if and how to retry our IOs.  They are also used
      to count IO failures on a per device basis.
      
      A recently added bio tracepoint led to crashes because
      we were abusing bi_bdev.
      
      This commit adds a btrfs bioset, and creates explicit fields
      for the mirror number and stripe index.  The plan is to
      extend this structure for all of the fields currently in
      struct btrfs_bio, which will mean one less kmalloc in
      our IO path.
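      A minimal sketch of the wrapper-plus-bioset pattern described above; the
      structure layout and names are written from memory for illustration and may
      differ from the real btrfs code.

      /* Illustrative sketch only. */
      struct btrfs_io_bio {
              unsigned int mirror_num;        /* instead of tagging bi_private */
              unsigned int stripe_index;      /* instead of abusing bi_bdev */
              struct bio bio;                 /* must stay last: allocated by the bioset */
      };

      static struct bio_set *btrfs_bioset;

      static int __init btrfs_bioset_setup(void)
      {
              /* front-pad every bio with the btrfs-private fields above */
              btrfs_bioset = bioset_create(BIO_POOL_SIZE,
                                           offsetof(struct btrfs_io_bio, bio));
              return btrfs_bioset ? 0 : -ENOMEM;
      }

      static inline struct btrfs_io_bio *btrfs_io_bio(struct bio *bio)
      {
              return container_of(bio, struct btrfs_io_bio, bio);
      }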
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Tejun Heo <tj@kernel.org>
      9be3395b
  12. 07 May 2013, 4 commits
    • Btrfs: improve the loop of scrub_stripe · 625f1c8d
      Liu Bo authored
      1) Right now scrub_stripe() keeps looping in some unnecessary cases:
      * when the found extent item's objectid is already out of the dev extent's
        range but we haven't finished scanning the whole range within the dev extent
      * when all the items have been processed but we haven't finished scanning
        the whole range within the dev extent
      
      In both cases, we can just finish the loop to save costs.
      
      2) Besides, when the found extent item's length is larger than the stripe
      length (64k), we don't have to release the path and search again, as that
      would land on the same key used in the last loop; instead we can advance the
      logical cursor in place until all space of the extent is scanned.
      
      3) We currently use 0 as the key's offset to search the btree, then step to
      the previous item to find a smaller one, and then have to move to the next
      item again to get the right one.  Setting offset=-1 and using previous_item()
      is the correct way (see the sketch after this list).
      
      4) As we won't find any checksum at an offset unless that offset is inside a
      data extent, we can look up checksums only when we are really going to scrub
      an extent.
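      A minimal sketch of the search idiom from point 3, assuming the usual
      btrfs_search_slot()/btrfs_previous_item() helpers; the key choice and error
      handling are simplified for illustration.

      /* Illustrative fragment only. */
      key.objectid = logical;                 /* start of the range being scrubbed */
      key.type = BTRFS_EXTENT_ITEM_KEY;
      key.offset = (u64)-1;                   /* largest offset: lands just past any match */

      ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
      if (ret < 0)
              goto out;
      if (ret > 0) {
              /* step back to the last extent item at or before 'logical' */
              ret = btrfs_previous_item(extent_root, path, key.objectid,
                                        BTRFS_EXTENT_ITEM_KEY);
              if (ret < 0)
                      goto out;
      }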
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      625f1c8d
    • btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen authored
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      48a3b636
    • Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Josef Bacik authored
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
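      A rough byte accounting behind the 51 -> 33 numbers and the new key layout;
      the per-structure sizes below are my reading of the on-disk format and are
      meant as an illustration, not a quote from the patch.

      /*
       * Old tree block reference (per extent), approximate on-disk layout:
       *   struct btrfs_extent_item        24 bytes (refs, generation, flags)
       *   struct btrfs_tree_block_info    18 bytes (first key 17 + level 1)
       *   struct btrfs_extent_inline_ref   9 bytes (type 1 + offset 8)
       *                                   --------
       *                                   51 bytes
       *
       * New skinny metadata item: no btrfs_tree_block_info, 24 + 9 = 33 bytes,
       * with the level carried in the key itself:
       */
      struct btrfs_key key;

      key.objectid = bytenr;                  /* tree block start */
      key.type = BTRFS_METADATA_ITEM_KEY;     /* the new key type */
      key.offset = level;                     /* level replaces btrfs_tree_block_info */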
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3173a18f
    • Btrfs: cleanup unused arguments of btrfs_csum_data · b0496686
      Liu Bo authored
      Argument 'root' is no longer used in btrfs_csum_data().
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      b0496686
  13. 29 Mar 2013, 1 commit
  14. 21 Feb 2013, 1 commit
    • Btrfs: use bit operation for ->fs_state · 87533c47
      Miao Xie authored
      There is no lock protecting fs_info->fs_state, which introduces
      problems such as one task's update being overwritten by another
      when several tasks modify it concurrently. For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact, it is 2.
      
      Though this problem doesn't happen now (there is only one flag
      currently), the code is error prone; if we add other flags,
      the above problem is certain to happen.
      
      Now we use bit operations for it to fix the above problem.
      This makes the code more robust and makes it easy to
      add new flags.
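      A minimal before/after sketch, assuming flag names along the lines of the
      ones used in this area (treat them as illustrative): set_bit()/test_bit()
      are atomic, so concurrent updates cannot overwrite each other.

      /* Before: a plain read-modify-write on fs_state, racy without a lock. */
      fs_info->fs_state |= BTRFS_SUPER_FLAG_ERROR;

      /* After: atomic bit operations on an unsigned long, no lock needed. */
      set_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state);

      if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
              return -EROFS;          /* e.g. refuse to start a new transaction */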
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      87533c47
  15. 06 Feb 2013, 1 commit
    • Btrfs: fix race between snapshot deletion and getting inode · 6f1c3605
      Liu Bo authored
      While running the snapshot test script created by Mitch and David,
      the race between autodefrag and snapshot deletion can corrupt the
      dead_root list, so we can crash in btrfs_clean_old_snapshots().
      
      Besides autodefrag, scrub also does the same thing, i.e. it reads
      the root first and then gets the inode.
      
      Here is the story (taking autodefrag as an example):
      (1) When we delete a snapshot or subvolume, we set its root's refs to
      zero and do an iput() on its own inode; if this inode happens to be the
      only active in-memory one in the root's inode rbtree, the root adds
      itself to the global dead_roots list for later cleanup.
      
      (2) After (1), the autodefrag thread may read another inode for defrag,
      and that inode happens to be in the deleted snapshot/subvolume, but none
      of this checks whether the root is still valid (refs > 0).  So the end
      result is adding the deleted snapshot/subvolume's root to the global
      dead_roots list AGAIN.
      
      Fortunately, we already have an srcu lock to avoid the race, i.e. subvol_srcu.
      
      So all we need to do is take the lock to protect 'read root and get inode',
      since we synchronize and wait for the rcu grace period before adding
      something to the global dead_roots list.
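      A minimal sketch of the locking pattern, assuming fs_info->subvol_srcu and
      the root/inode lookup helpers of that era; names are illustrative.

      /* Illustrative fragment only. */
      index = srcu_read_lock(&fs_info->subvol_srcu);

      root = btrfs_read_fs_root_no_name(fs_info, &key);       /* read root */
      if (IS_ERR(root)) {
              srcu_read_unlock(&fs_info->subvol_srcu, index);
              return PTR_ERR(root);
      }

      inode = btrfs_iget(fs_info->sb, &key, root, NULL);      /* get inode */

      srcu_read_unlock(&fs_info->subvol_srcu, index);
      /*
       * The deletion side calls synchronize_srcu(&fs_info->subvol_srcu) before
       * putting the root on dead_roots, so the lookup above cannot race with it.
       */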
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      6f1c3605
  16. 02 Feb 2013, 1 commit
    • Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse authored
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet)
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
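      As a generic reminder of why a torn read/modify/write is dangerous (this is
      not btrfs code): RAID5 parity is the XOR of the data blocks, so if a data
      block is rewritten but a crash prevents the matching parity update, the
      stripe no longer reconstructs correctly.

      /* Generic illustration only. */
      static void raid5_recompute_parity(u8 *parity, u8 * const data[],
                                         int ndata, size_t stripe_len)
      {
              size_t i;
              int d;

              memset(parity, 0, stripe_len);
              for (d = 0; d < ndata; d++)
                      for (i = 0; i < stripe_len; i++)
                              parity[i] ^= data[d][i];
      }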
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      53b381b3
  17. 17 Dec 2012, 2 commits
  18. 13 Dec 2012, 5 commits
    • Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() · 29a8d9a0
      Stefan Behrens authored
      Before this commit, btrfs_map_block() was called with REQ_WRITE
      in order to retrieve the list of mirrors for a disk block.
      This needs to be changed for the device replace procedure since
      it makes a difference whether you are asking for read mirrors
      or for locations to write to.
      GET_READ_MIRRORS is introduced as a new interface to call
      btrfs_map_block().
      In the current commit, the functionality is not yet changed,
      only the interface for GET_READ_MIRRORS is introduced and all
      the places that should use this new interface are adapted.
      
      The reason that REQ_WRITE cannot be abused anymore to retrieve
      a list of read mirrors is that during a running dev replace
      operation all write requests to the live filesystem are
      duplicated to also write to the target drive.
      Keep in mind that the target disk is only partially a valid
      copy of the source disk while the operation is ongoing. All
      writes go to the target disk, but not all reads would return
      valid data on the target disk. Therefore it is not possible
      anymore to abuse a REQ_WRITE interface to find valid mirrors
      for a REQ_READ.
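      A minimal sketch of the call-site difference, assuming the
      REQ_GET_READ_MIRRORS flag name and the btrfs_map_block() signature of that
      era (fs_info, rw, logical, length, bbio, mirror_num); details are
      illustrative.

      /* Illustrative fragment only. */
      struct btrfs_bio *bbio = NULL;
      u64 length = PAGE_SIZE;

      /* old (abused) way: ask with REQ_WRITE just to see all the mirrors */
      ret = btrfs_map_block(fs_info, REQ_WRITE, logical, &length, &bbio, 0);

      /* new way: state explicitly that only the read mirrors are wanted */
      ret = btrfs_map_block(fs_info, REQ_GET_READ_MIRRORS, logical, &length,
                            &bbio, 0);
      if (!ret && bbio)
              num_copies = bbio->num_stripes; /* mirrors we may read from */
      kfree(bbio);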
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      29a8d9a0
    • Btrfs: change core code of btrfs to support the device replace operations · 8dabb742
      Stefan Behrens authored
      This commit contains all the essential changes to the core code
      of Btrfs for support of the device replace procedure.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      8dabb742
    • Btrfs: add code to scrub to copy read data to another disk · ff023aac
      Stefan Behrens authored
      The device replace procedure makes use of the scrub code. The scrub
      code is the most efficient code to read the allocated data of a disk,
      i.e. it reads sequentially in order to avoid disk head movements, it
      skips unallocated blocks, it uses read ahead mechanisms, and it
      contains all the code to detect and repair defects.
      This commit adds code to scrub to allow the scrub code to copy read
      data to another disk.
      One goal is to be able to perform as fast as possible. Therefore the
      write requests are collected until huge bios are built, and the
      write process is decoupled from the read process with some kind of
      flow control, of course, in order to limit the allocated memory.
      The best performance on spinning disks can be reached when the
      head movements are avoided as much as possible. Therefore a single
      worker is used to interface the read process with the write process.
      The regular scrub operation works as fast as before; it is not
      negatively influenced and is actually more or less unchanged.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      ff023aac
    • Btrfs: disallow some operations on the device replace target device · 63a212ab
      Stefan Behrens authored
      This patch adds some code to disallow operations on the device that
      is used as the target for the device replace operation.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      63a212ab
    • Btrfs: pass fs_info instead of root · aa1b8cd4
      Stefan Behrens authored
      A small number of functions that are used in a device replace
      procedure when the operation is resumed at mount time are unable
      to pass the same root pointer that would be used in the regular
      (ioctl) context. Since the root pointer is not required (only the
      fs_info is), the root pointer argument is replaced with an
      fs_info pointer argument.
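      A minimal before/after sketch of what this means for a prototype, using
      btrfs_scrub_cancel() as an example; which prototypes were actually changed
      is defined by the patch itself, so treat this one as illustrative.

      /* before: callers had to find some root just to reach the fs_info */
      int btrfs_scrub_cancel(struct btrfs_root *root);

      /* after: take the fs_info directly, usable from the mount-time resume path */
      int btrfs_scrub_cancel(struct btrfs_fs_info *fs_info);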
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      aa1b8cd4