1. 01 9月, 2013 1 次提交
  2. 20 7月, 2013 1 次提交
    • S
      Btrfs: fix wrong write offset when replacing a device · 115930cb
      Stefan Behrens 提交于
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that we changed the write offset by mistake,
      introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      115930cb
  3. 02 7月, 2013 4 次提交
    • M
      Btrfs: fix several potential problems in copy_nocow_pages_for_inode · edd1400b
      Miao Xie 提交于
      - It makes no sense that we deal with a inode in the dead tree.
      - fix the race between dio and page copy by waiting the dio completion
      - avoid the page copy vs truncate/punch hole
      - check if the page is in the page cache or not
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      edd1400b
    • M
      Btrfs: cleanup the code of copy_nocow_pages_for_inode() · 826aa0a8
      Miao Xie 提交于
      - It make no sense that we continue to do something after the error
        happened, just go back with this patch.
      - remove some check of copy_nocow_pages_for_inode(), such as page check
        after write, inode check in the end of the function, because we are
        sure they exist.
      - remove the unnecessary goto in the return value check of the write
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      826aa0a8
    • M
      Btrfs: fix oops when recovering the file data by scrub function · 26b25891
      Miao Xie 提交于
      We get oops while running btrfs replace start test,
      ------------[ cut here ]------------
      kernel BUG at mm/filemap.c:608!
      [SNIP]
      Call Trace:
        [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
        [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
        [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
        [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
        [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
        [<ffffffff8106f2f0>] kthread+0xc0/0xd0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
        [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
      [SNIP]
       RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
        RSP <ffff88010316bb98>
       ---[ end trace 421e79ad0dd72c7d ]---
      
      it is because we forgot to lock the page again after we read data to
      the page. Fix it.
      Signed-off-by: NLin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      26b25891
    • M
      Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Miao Xie 提交于
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary, because the extents that btrfs_sector_sum points to are
      continuous, we can find out the expected checksums by btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member in the structure, so it makes
      no sense to keep the structure, just remove it, and use a u32 array to
      store the checksum value.
      
      By this change, we don't use the while loop to get the checksums one by
      one. Now, we can get several checksum value at one time, it improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f51a4a18
  4. 01 7月, 2013 1 次提交
  5. 18 5月, 2013 1 次提交
    • C
      Btrfs: use a btrfs bioset instead of abusing bio internals · 9be3395b
      Chris Mason 提交于
      Btrfs has been pointer tagging bi_private and using bi_bdev
      to store the stripe index and mirror number of failed IOs.
      
      As bios bubble back up through the call chain, we use these
      to decide if and how to retry our IOs.  They are also used
      to count IO failures on a per device basis.
      
      Recently a bio tracepoint was added lead to crashes because
      we were abusing bi_bdev.
      
      This commit adds a btrfs bioset, and creates explicit fields
      for the mirror number and stripe index.  The plan is to
      extend this structure for all of the fields currently in
      struct btrfs_bio, which will mean one less kmalloc in
      our IO path.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NTejun Heo <tj@kernel.org>
      9be3395b
  6. 07 5月, 2013 4 次提交
    • L
      Btrfs: improve the loop of scrub_stripe · 625f1c8d
      Liu Bo 提交于
      1) Right now scrub_stripe() is looping in some unnecessary cases:
      * when the found extent item's objectid has been out of the dev extent's range
        but we haven't finish scanning all the range within the dev extent
      * when all the items has been processed but we haven't finish scanning all the
        range within the dev extent
      
      In both cases, we can just finish the loop to save costs.
      
      2) Besides, when the found extent item's length is larger than the stripe
      len(64k), we don't have to release the path and search again as it'll get at the
      same key used in the last loop, we can instead increase the logical cursor in
      place till all space of the extent is scanned.
      
      3) And we use 0 as the key's offset to search btree, then get to previous item
      to find a smaller item, and again have to move to the next one to get the right
      item.  Setting offset=-1 and previous_item() is the correct way.
      
      4) As we won't find any checksum at offset unless this 'offset' is in a data
      extent, we can just find checksum when we're really going to scrub an extent.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      625f1c8d
    • E
      btrfs: make static code static & remove dead code · 48a3b636
      Eric Sandeen 提交于
      Big patch, but all it does is add statics to functions which
      are in fact static, then remove the associated dead-code fallout.
      
      removed functions:
      
      btrfs_iref_to_path()
      __btrfs_lookup_delayed_deletion_item()
      __btrfs_search_delayed_insertion_item()
      __btrfs_search_delayed_deletion_item()
      find_eb_for_page()
      btrfs_find_block_group()
      range_straddles_pages()
      extent_range_uptodate()
      btrfs_file_extent_length()
      btrfs_scrub_cancel_devid()
      btrfs_start_transaction_lflush()
      
      btrfs_print_tree() is left because it is used for debugging.
      btrfs_start_transaction_lflush() and btrfs_reada_detach() are
      left for symmetry.
      
      ulist.c functions are left, another patch will take care of those.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      48a3b636
    • J
      Btrfs: add a incompatible format change for smaller metadata extent refs · 3173a18f
      Josef Bacik 提交于
      We currently store the first key of the tree block inside the reference for the
      tree block in the extent tree.  This takes up quite a bit of space.  Make a new
      key type for metadata which holds the level as the offset and completely removes
      storing the btrfs_tree_block_info inside the extent ref.  This reduces the size
      from 51 bytes to 33 bytes per extent reference for each tree block.  In practice
      this results in a 30-35% decrease in the size of our extent tree, which means we
      COW less and can keep more of the extent tree in memory which makes our heavy
      metadata operations go much faster.  This is not an automatic format change, you
      must enable it at mkfs time or with btrfstune.  This patch deals with having
      metadata stored as either the old format or the new format so it is easy to
      convert.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      3173a18f
    • L
      Btrfs: cleanup unused arguments of btrfs_csum_data · b0496686
      Liu Bo 提交于
      Argument 'root' is no more used in btrfs_csum_data().
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      b0496686
  7. 29 3月, 2013 1 次提交
  8. 21 2月, 2013 1 次提交
    • M
      Btrfs: use bit operation for ->fs_state · 87533c47
      Miao Xie 提交于
      There is no lock to protect fs_info->fs_state, it will introduce
      some problems, such as the value may be covered by the other task
      when several tasks modify it. For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact, it is 2.
      
      Though this problem doesn't happen now (because there is only one
      flag currently), the code is error prone, if we add other flags,
      the above problem will happen to a certainty.
      
      Now we use bit operation for it to fix the above problem.
      In this way, we can make the code more robust and be easy to
      add new flags.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      87533c47
  9. 06 2月, 2013 1 次提交
    • L
      Btrfs: fix race between snapshot deletion and getting inode · 6f1c3605
      Liu Bo 提交于
      While running snapshot testscript created by Mitch and David,
      the race between autodefrag and snapshot deletion can lead to
      corruption of dead_root list so that we can get crash on
      btrfs_clean_old_snapshots().
      
      And besides autodefrag, scrub also does the same thing, ie. read
      root first and get inode.
      
      Here is the story(take autodefrag as an example):
      (1) when we delete a snapshot or subvolume, it will set its root's
      refs to zero and do a iput() on its own inode, and if this inode happens
      to be the only active in-meory one in root's inode rbtree, it will add
      itself to the global dead_roots list for later cleanup.
      
      (2) after (1), the autodefrag thread may read another inode for defrag
      and the inode is just in the deleted snapshot/subvolume, but all of these
      are without checking if the root is still valid(refs > 0).  So the end up
      result is adding the deleted snapshot/subvolume's root to the global
      dead_roots list AGAIN.
      
      Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
      
      So all we need to do is to take the lock to protect 'read root and get inode',
      since we synchronize to wait for the rcu grace period before adding something
      to the global dead_roots list.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6f1c3605
  10. 02 2月, 2013 1 次提交
    • D
      Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse 提交于
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet)
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      53b381b3
  11. 17 12月, 2012 2 次提交
  12. 13 12月, 2012 12 次提交
    • S
      Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() · 29a8d9a0
      Stefan Behrens 提交于
      Before this commit, btrfs_map_block() was called with REQ_WRITE
      in order to retrieve the list of mirrors for a disk block.
      This needs to be changed for the device replace procedure since
      it makes a difference whether you are asking for read mirrors
      or for locations to write to.
      GET_READ_MIRRORS is introduced as a new interface to call
      btrfs_map_block().
      In the current commit, the functionality is not yet changed,
      only the interface for GET_READ_MIRRORS is introduced and all
      the places that should use this new interface are adapted.
      
      The reason that REQ_WRITE cannot be abused anymore to retrieve
      a list of read mirrors is that during a running dev replace
      operation all write requests to the live filesystem are
      duplicated to also write to the target drive.
      Keep in mind that the target disk is only partially a valid
      copy of the source disk while the operation is ongoing. All
      writes go to the target disk, but not all reads would return
      valid data on the target disk. Therefore it is not possible
      anymore to abuse a REQ_WRITE interface to find valid mirrors
      for a REQ_READ.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      29a8d9a0
    • S
      Btrfs: change core code of btrfs to support the device replace operations · 8dabb742
      Stefan Behrens 提交于
      This commit contains all the essential changes to the core code
      of Btrfs for support of the device replace procedure.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      8dabb742
    • S
      Btrfs: add code to scrub to copy read data to another disk · ff023aac
      Stefan Behrens 提交于
      The device replace procedure makes use of the scrub code. The scrub
      code is the most efficient code to read the allocated data of a disk,
      i.e. it reads sequentially in order to avoid disk head movements, it
      skips unallocated blocks, it uses read ahead mechanisms, and it
      contains all the code to detect and repair defects.
      This commit adds code to scrub to allow the scrub code to copy read
      data to another disk.
      One goal is to be able to perform as fast as possible. Therefore the
      write requests are collected until huge bios are built, and the
      write process is decoupled from the read process with some kind of
      flow control, of course, in order to limit the allocated memory.
      The best performance on spinning disks could by reached when the
      head movements are avoided as much as possible. Therefore a single
      worker is used to interface the read process with the write process.
      The regular scrub operation works as fast as before, it is not
      negatively influenced and actually it is more or less unchanged.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      ff023aac
    • S
      Btrfs: disallow some operations on the device replace target device · 63a212ab
      Stefan Behrens 提交于
      This patch adds some code to disallow operations on the device that
      is used as the target for the device replace operation.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      63a212ab
    • S
      Btrfs: pass fs_info instead of root · aa1b8cd4
      Stefan Behrens 提交于
      A small number of functions that are used in a device replace
      procedure when the operation is resumed at mount time are unable
      to pass the same root pointer that would be used in the regular
      (ioctl) context. And since the root pointer is not required, only
      the fs_info is, the root pointer argument is replaced with the
      fs_info pointer argument.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      aa1b8cd4
    • S
      Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree · 3ec706c8
      Stefan Behrens 提交于
      This is required for the device replace procedure in a later step.
      Two calling functions also had to be changed to have the fs_info
      pointer: repair_io_failure() and scrub_setup_recheck_block().
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      3ec706c8
    • S
      Btrfs: cleanup scrub bio and worker wait code · b6bfebc1
      Stefan Behrens 提交于
      Just move some code into functions to make everything more readable.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      b6bfebc1
    • S
      Btrfs: in scrub repair code, simplify alloc error handling · 34f5c8e9
      Stefan Behrens 提交于
      In the scrub repair code, the code is changed to handle memory
      allocation errors a little bit smarter. The change is to handle
      it just like a read error. This simplifies the code and removes
      a couple of lines of code, since the code to handle read errors
      is there anyway.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      34f5c8e9
    • S
      Btrfs: in scrub repair code, optimize the reading of mirrors · cb2ced73
      Stefan Behrens 提交于
      In case that disk blocks need to be repaired (rewritten), the
      current code at first (for simplicity reasons) reads all alternate
      mirrors in the first step, afterwards selects the best one in a
      second step. This is now changed to read one alternate mirror
      after the other and to leave the loop early when a perfect mirror
      is found.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      cb2ced73
    • S
      Btrfs: make the scrub page array dynamically allocated · 7a9e9987
      Stefan Behrens 提交于
      With the modified design (in order to support the devive replace
      procedure) it is necessary to alloc the page array dynamically.
      The reason is that pages are reused. At first a page is used for
      the bio to read the data from the filesystem, then the same page
      is reused for the bio that writes the data to the target disk.
      Since the read process and the write process are completely
      decoupled, this requires a new concept of refcounts and get/put
      functions for pages, and it requires to use newly created pages
      for each read bio which are freed after the write operation
      is finished.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      7a9e9987
    • S
      Btrfs: remove the block device pointer from the scrub context struct · a36cf8b8
      Stefan Behrens 提交于
      The block device is removed from the scrub context state structure.
      The scrub code as it is used for the device replace procedure reads
      the source data from whereever it is optimal. The source device might
      even be gone (disconnected, for instance due to a hardware failure).
      Or the drive can be so faulty so that the device replace procedure
      tries to avoid access to the faulty source drive as much as possible,
      and only if all other mirrors are damaged, as a last resort, the
      source disk is accessed.
      The modified scrub code operates as if it would handle the source
      drive and thereby generates an exact copy of the source disk on the
      target disk, even if the source disk is not present at all. Therefore
      the block device pointer to the source disk is removed in the scrub
      context struct and moved into the lower level scope of scrub_bio,
      fixup and page structures where the block device context is known.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      a36cf8b8
    • S
      Btrfs: rename the scrub context structure · d9d181c1
      Stefan Behrens 提交于
      The device replace procedure makes use of the scrub code. The scrub
      code is the most efficient code to read the allocated data of a disk,
      i.e. it reads sequentially in order to avoid disk head movements, it
      skips unallocated blocks, it uses read ahead mechanisms, and it
      contains all the code to detect and repair defects.
      This commit is a first preparation step to adapt the scrub code to
      be shareable for the device replace procedure.
      The block device will be removed from the scrub context state
      structure in a later step. It used to be the source block device.
      The scrub code as it is used for the device replace procedure reads
      the source data from whereever it is optimal. The source device might
      even be gone (disconnected, for instance due to a hardware failure).
      Or the drive can be so faulty so that the device replace procedure
      tries to avoid access to the faulty source drive as much as possible,
      and only if all other mirrors are damaged, as a last resort, the
      source disk is accessed.
      The modified scrub code operates as if it would handle the source
      drive and thereby generates an exact copy of the source disk on the
      target disk, even if the source disk is not present at all. Therefore
      the block device pointer to the source disk is removed in a later
      patch, and therefore the context structure is renamed (this is the
      goal of the current patch) to reflect that no source block device
      scope is there anymore.
      
      Summary:
      This first preparation step consists of a textual substitution of the
      term "dev" to the term "ctx" whereever the scrub context is used.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      d9d181c1
  13. 02 10月, 2012 3 次提交
  14. 15 6月, 2012 1 次提交
    • J
      Btrfs: use rcu to protect device->name · 606686ee
      Josef Bacik 提交于
      Al pointed out that we can just toss out the old name on a device and add a
      new one arbitrarily, so anybody who uses device->name in printk could
      possibly use free'd memory.  Instead of adding locking around all of this he
      suggested doing it with RCU, so I've introduced a struct rcu_string that
      does just that and have gone through and protected all accesses to
      device->name that aren't under the uuid_mutex with rcu_read_lock().  This
      protects us and I will use it for dealing with removing the device that we
      used to mount the file system in a later patch.  Thanks,
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      606686ee
  15. 30 5月, 2012 1 次提交
  16. 05 5月, 2012 1 次提交
  17. 19 4月, 2012 1 次提交
  18. 13 4月, 2012 1 次提交
  19. 28 3月, 2012 2 次提交