1. 22 10月, 2015 2 次提交
  2. 17 2月, 2015 1 次提交
  3. 22 1月, 2015 1 次提交
    • D
      btrfs: switch extent_state state to unsigned · 9ee49a04
      David Sterba 提交于
      Currently there's a 4B hole in the structure between refs and state and there
      are only 16 bits used so we can make it unsigned. This will get a better
      packing and may save some stack space for local variables.
      
      The size of extent_state gets reduced by 8B and there are usually a lot
      of slab objects.
      
      struct extent_state {
      	u64                        start;                /*     0     8 */
      	u64                        end;                  /*     8     8 */
      	struct rb_node             rb_node;              /*    16    24 */
      	wait_queue_head_t          wq;                   /*    40    24 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	atomic_t                   refs;                 /*    64     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	long unsigned int          state;                /*    72     8 */
      	u64                        private;              /*    80     8 */
      
      	/* size: 88, cachelines: 2, members: 7 */
      	/* sum members: 84, holes: 1, sum holes: 4 */
      	/* last cacheline: 24 bytes */
      };
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      9ee49a04
  4. 13 12月, 2014 2 次提交
  5. 21 11月, 2014 1 次提交
    • F
      Btrfs: set page and mapping error on compressed write failure · 704de49d
      Filipe Manana 提交于
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
      for e.g.) know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      704de49d
  6. 08 10月, 2014 1 次提交
  7. 04 10月, 2014 1 次提交
    • F
      Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana 提交于
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the new 3 flags for the btree inode also makes us achieve the
      goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
      writeback for all dirty pages and before filemap_fdatawait_range() is
      called, the writeback for all dirty pages had already finished with
      errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      656f30db
  8. 02 10月, 2014 2 次提交
  9. 18 9月, 2014 7 次提交
    • M
      Btrfs: cleanup the read failure record after write or when the inode is freeing · f612496b
      Miao Xie 提交于
      After the data is written successfully, we should cleanup the read failure record
      in that range because
      - If we set data COW for the file, the range that the failure record pointed to is
        mapped to a new place, so it is invalid.
      - If we set no data COW for the file, and if there is no error during writting,
        the corrupted data is corrected, so the failure record can be removed. And if
        some errors happen on the mirrors, we also needn't worry about it because the
        failure record will be recreated if we read the same place again.
      
      Sometimes, we may fail to correct the data, so the failure records will be left
      in the tree, we need free them when we free the inode or the memory leak happens.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f612496b
    • M
      Btrfs: implement repair function when direct read fails · 8b110e39
      Miao Xie 提交于
      This patch implement data repair function when direct read fails.
      
      The detail of the implementation is:
      - When we find the data is not right, we try to read the data from the other
        mirror.
      - When the io on the mirror ends, we will insert the endio work into the
        dedicated btrfs workqueue, not common read endio workqueue, because the
        original endio work is still blocked in the btrfs endio workqueue, if we
        insert the endio work of the io on the mirror into that workqueue, deadlock
        would happen.
      - After we get right data, we write it back to the corrupted mirror.
      - And if the data on the new mirror is still corrupted, we will try next
        mirror until we read right data or all the mirrors are traversed.
      - After the above work, we set the uptodate flag according to the result.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8b110e39
    • M
      Btrfs: modify clean_io_failure and make it suit direct io · 1203b681
      Miao Xie 提交于
      We could not use clean_io_failure in the direct IO path because it got the
      filesystem information from the page structure, but the page in the direct
      IO bio didn't have the filesystem information in its structure. So we need
      modify it and pass all the information it need by parameters.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1203b681
    • M
      Btrfs: modify repair_io_failure and make it suit direct io · ffdd2018
      Miao Xie 提交于
      The original code of repair_io_failure was just used for buffered read,
      because it got some filesystem data from page structure, it is safe for
      the page in the page cache. But when we do a direct read, the pages in bio
      are not in the page cache, that is there is no filesystem data in the page
      structure. In order to implement direct read data repair, we need modify
      repair_io_failure and pass all filesystem data it need by function
      parameters.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ffdd2018
    • M
      Btrfs: split bio_readpage_error into several functions · 2fe6303e
      Miao Xie 提交于
      The data repair function of direct read will be implemented later, and some code
      in bio_readpage_error will be reused, so split bio_readpage_error into
      several functions which will be used in direct read repair later.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2fe6303e
    • F
      Btrfs: shrink further sizeof(struct extent_buffer) · 2a39e598
      Filipe Manana 提交于
      The map_start and map_len fields aren't used anywhere, so just remove
      them. On a x86_64 system, this reduced sizeof(struct extent_buffer)
      from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can
      now fit into a page instead of 13.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      2a39e598
    • F
      Btrfs: reduce size of struct extent_state · 27a3507d
      Filipe Manana 提交于
      The tree field of struct extent_state was only used to figure out if
      an extent state was connected to an inode's io tree or not. For this
      we can just use the rb_node field itself.
      
      On a x86_64 system with this change the sizeof(struct extent_state) is
      reduced from 96 bytes down to 88 bytes, meaning that with a page size
      of 4096 bytes we can now store 46 extent states per page instead of 42.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      27a3507d
  10. 20 6月, 2014 1 次提交
  11. 13 6月, 2014 1 次提交
  12. 10 6月, 2014 1 次提交
    • J
      Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Josef Bacik 提交于
      This exercises the various parts of the new qgroup accounting code.  We do some
      basic stuff and do some things with the shared refs to make sure all that code
      works.  I had to add a bunch of infrastructure because I needed to be able to
      insert items into a fake tree without having to do all the hard work myself,
      hopefully this will be usefull in the future.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      faa2dbf0
  13. 07 4月, 2014 1 次提交
    • J
      Btrfs: don't clear uptodate if the eb is under IO · a26e8c9f
      Josef Bacik 提交于
      So I have an awful exercise script that will run snapshot, balance and
      send/receive in parallel.  This sometimes would crash spectacularly and when it
      came back up the fs would be completely hosed.  Turns out this is because of a
      bad interaction of balance and send/receive.  Send will hold onto its entire
      path for the whole send, but its blocks could get relocated out from underneath
      it, and because it doesn't old tree locks theres nothing to keep this from
      happening.  So it will go to read in a slot with an old transid, and we could
      have re-allocated this block for something else and it could have a completely
      different transid.  But because we think it is invalid we clear uptodate and
      re-read in the block.  If we do this before we actually write out the new block
      we could write back stale data to the fs, and boom we're screwed.
      
      Now we definitely need to fix this disconnect between send and balance, but we
      really really need to not allow ourselves to accidently read in stale data over
      new data.  So make sure we check if the extent buffer is not under io before
      clearing uptodate, this will kick back EIO to the caller instead of reading in
      stale data and keep us from corrupting the fs.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a26e8c9f
  14. 29 1月, 2014 2 次提交
    • J
      Btrfs: move the extent buffer radix tree into the fs_info · f28491e0
      Josef Bacik 提交于
      I need to create a fake tree to test qgroups and I don't want to have to setup a
      fake btree_inode.  The fact is we only use the radix tree for the fs_info, so
      everybody else who allocates an extent_io_tree is just wasting the space anyway.
      This patch moves the radix tree and its lock into btrfs_fs_info so there is less
      stuff I have to fake to do qgroup sanity tests.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f28491e0
    • J
      Btrfs: use a bit to track if we're in the radix tree · 34b41ace
      Josef Bacik 提交于
      For creating a dummy in-memory btree I need to be able to use the radix tree to
      keep track of the buffers like normal extent buffers.  With dummy buffers we
      skip the radix tree step, and we still want to do that for the tree mod log
      dummy buffers but for my test buffers we need to be able to remove them from the
      radix tree like normal.  This will give me a way to do that.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      34b41ace
  15. 12 11月, 2013 2 次提交
  16. 01 9月, 2013 4 次提交
  17. 02 7月, 2013 1 次提交
    • J
      Btrfs: check if we can nocow if we don't have data space · 7ee9e440
      Josef Bacik 提交于
      We always just try and reserve data space when we write, but if we are out of
      space but have prealloc'ed extents we should still successfully write.  This
      patch will try and see if we can write to prealloc'ed space and if we can go
      ahead and allow the write to continue.  With this patch we now pass xfstests
      generic/274.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7ee9e440
  18. 18 5月, 2013 1 次提交
    • C
      Btrfs: use a btrfs bioset instead of abusing bio internals · 9be3395b
      Chris Mason 提交于
      Btrfs has been pointer tagging bi_private and using bi_bdev
      to store the stripe index and mirror number of failed IOs.
      
      As bios bubble back up through the call chain, we use these
      to decide if and how to retry our IOs.  They are also used
      to count IO failures on a per device basis.
      
      Recently a bio tracepoint was added lead to crashes because
      we were abusing bi_bdev.
      
      This commit adds a btrfs bioset, and creates explicit fields
      for the mirror number and stripe index.  The plan is to
      extend this structure for all of the fields currently in
      struct btrfs_bio, which will mean one less kmalloc in
      our IO path.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NTejun Heo <tj@kernel.org>
      9be3395b
  19. 07 5月, 2013 6 次提交
  20. 27 3月, 2013 1 次提交
    • C
      Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason 提交于
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fallback to
      uncompressed IO as well.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      Reported-by: NAlexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
      4adaa611
  21. 01 3月, 2013 1 次提交