1. 21 Nov, 2013 · 2 commits
  2. 12 Nov, 2013 · 4 commits
  3. 21 Sep, 2013 · 1 commit
  4. 01 Sep, 2013 · 3 commits
    • Btrfs: allow partial ordered extent completion · 77cef2ec
      Josef Bacik committed
      We currently have this problem where you can truncate pages that have not yet
      been written for an ordered extent.  We do this because the truncate will be
      coming behind to clean us up anyway so what's the harm right?  Well if truncate
      fails for whatever reason we leave an orphan item around for the file to be
      cleaned up later.  But if the user goes and truncates up the file and tries to
      read from the area that had been discarded previously they will get a csum error
      because we never actually wrote that data out.
      
      This patch fixes this by allowing us to either discard the ordered extent
      completely, by which I mean we just free up the space we had allocated and not
      add the file extent, or adjust the length of the file extent we write.  We do
      this by setting the length we truncated down to in the ordered extent, and then
      we set the file extent length and ram bytes to this length.  The total disk
      space stays unchanged since we may be compressed and we can't just chop off the
      disk space, but at least this way the file extent only points to the valid data.
      Then when the file extent is freed the extent and csums will be freed normally.
      
      This patch is needed for the next series which will give us more graceful
      recovery of failed truncates.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
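The two outcomes described in this commit message (discard the ordered extent entirely, or shorten the file extent while leaving the disk space alone) can be sketched as a toy model. This is not kernel code: the class names, field names, and `complete_ordered_extent` are hypothetical and only mirror the wording of the message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderedExtent:
    """Hypothetical stand-in for an ordered extent."""
    file_offset: int                      # where the write starts in the file
    length: int                           # bytes the write originally covered
    disk_bytes: int                       # space allocated on disk (smaller if compressed)
    truncated_len: Optional[int] = None   # length a truncate cut us down to, if any


@dataclass
class FileExtent:
    """Hypothetical stand-in for the file extent item that gets inserted."""
    file_offset: int
    num_bytes: int
    ram_bytes: int
    disk_num_bytes: int


def complete_ordered_extent(oe: OrderedExtent) -> Optional[FileExtent]:
    """Return the file extent to insert, or None if the extent is discarded."""
    if oe.truncated_len is None:
        logical_len = oe.length
    else:
        logical_len = min(oe.length, oe.truncated_len)
    if logical_len == 0:
        # Nothing valid remains: free the allocated space, add no file extent.
        return None
    # Length and ram bytes shrink; the on-disk size is left unchanged.
    return FileExtent(oe.file_offset, logical_len, logical_len, oe.disk_bytes)
```

For example, an 8192-byte write that was truncated down to 4096 bytes yields a file extent with `num_bytes == ram_bytes == 4096`, while `disk_num_bytes` keeps its original value.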
    • Btrfs: Remove superfluous casts from u64 to unsigned long long · c1c9ff7c
      Geert Uytterhoeven committed
      u64 is "unsigned long long" on all architectures now, so there's no need to
      cast it when formatting it using the "ll" length modifier.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix heavy delalloc related deadlock · 9ffba8cd
      Josef Bacik committed
      I added a patch where we started taking the ordered operations mutex when we
      waited on ordered extents.  We need this because we splice the list and process
      it, so if a flusher came in during this scenario it would think the list was
      empty and we'd usually get an early ENOSPC.  The problem with this is that this
      lock is used in transaction committing.  So we end up with something like this:
      
      Transaction commit
      	-> wait on writers
      
      Delalloc flusher
      	-> run_ordered_operations (holds mutex)
      		-> wait for filemap-flush to do its thing
      
      flush task
      	-> cow_file_range
      		-> wait on btrfs_join_transaction because we're committing
      
      some other task
      	-> commit_transaction because we notice trans->transaction->flush is set
      		-> run_ordered_operations (hang on mutex)
      
      We need to disentangle the ordered operations flushing from the delalloc
      flushing, since they are separate things.  This solves the deadlock issue I was
      seeing.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  5. 02 Jul, 2013 · 1 commit
    • Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Miao Xie committed
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary. Because the extents that btrfs_sector_sum points to are
      contiguous, we can find the expected checksums from btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member left in the structure, so it makes
      no sense to keep it; just remove it and use a u32 array to store the
      checksum values.
      
      With this change, we no longer use a while loop to get the checksums one by
      one. We can now get several checksum values at a time, which improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
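The idea in this commit message, locating a checksum from the starting bytenr plus an offset instead of storing a bytenr per sector, can be illustrated with a small model. The `OrderedSum` class, the 4096-byte sector size, and the helper names are assumptions made for this sketch and are not taken from the kernel source.

```python
from dataclasses import dataclass, field
from typing import List

SECTOR_SIZE = 4096  # assumed sector size for this sketch


@dataclass
class OrderedSum:
    """Hypothetical model: one starting bytenr plus a flat checksum array."""
    bytenr: int                                    # disk byte number of the first sector
    sums: List[int] = field(default_factory=list)  # one checksum per contiguous sector


def find_sum(osum: OrderedSum, disk_bytenr: int) -> int:
    """Look up the checksum for one sector by computing its array index."""
    offset = disk_bytenr - osum.bytenr
    if offset < 0 or offset % SECTOR_SIZE != 0:
        raise ValueError("bytenr is not a sector boundary inside this range")
    index = offset // SECTOR_SIZE
    if index >= len(osum.sums):
        raise ValueError("bytenr is past the end of this range")
    return osum.sums[index]


def find_sums(osum: OrderedSum, disk_bytenr: int, count: int) -> List[int]:
    """Fetch several consecutive checksums with one slice instead of a loop."""
    first = (disk_bytenr - osum.bytenr) // SECTOR_SIZE
    return osum.sums[first:first + count]
```

Because the sectors are contiguous, the index is a single subtraction and division, and a run of checksums is a single slice.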
  6. 14 Jun, 2013 · 1 commit
  7. 07 May, 2013 · 1 commit
  8. 28 Mar, 2013 · 1 commit
  9. 21 Feb, 2013 · 1 commit
    • Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik committed
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the caller's
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  10. 20 Feb, 2013 · 2 commits
  11. 06 Feb, 2013 · 2 commits
    • Btrfs: fix possible stale data exposure · 59fe4f41
      Josef Bacik committed
      We specifically do not update the disk i_size if there are ordered extents
      outstanding for any area between the current disk_i_size and our ordered
      extent so that we do not expose stale data.  The problem is the check we
      have only checks if the ordered extent starts at or after the current
      disk_i_size, which doesn't take into account an ordered extent that starts
      before the current disk_i_size and ends past the disk_i_size.  Fix this by
      checking if the extent ends past the disk_i_size.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
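The difference between the old and the new check in this commit message comes down to comparing the extent's start versus its end against disk_i_size. A minimal sketch, with hypothetical function names and plain integers standing in for the kernel's structures:

```python
def holds_back_update_old(oe_start: int, oe_len: int, disk_i_size: int) -> bool:
    """Old check as described: only extents starting at or after disk_i_size count."""
    return oe_start >= disk_i_size


def holds_back_update_new(oe_start: int, oe_len: int, disk_i_size: int) -> bool:
    """Fixed check as described: any extent that ends past disk_i_size counts."""
    return oe_start + oe_len > disk_i_size
```

With `disk_i_size = 4096`, an outstanding ordered extent covering bytes 0 to 8192 starts before disk_i_size and ends past it. The old check misses it, while the new check treats it as outstanding.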
    • Btrfs: fix missing i_size update · 5d1f4020
      Josef Bacik committed
      If we have an ordered extent before the ordered extent we are currently
      completing that is after the current disk_i_size we will put our i_size
      update into that ordered extent so that we do not expose stale data.  The
      problem is that if our disk i_size is updated past the previous ordered
      extent we won't update the i_size with the pending i_size update.  So check
      the pending i_size update and if it's above the current disk i_size we need
      to go ahead and try to update.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  12. 13 Dec, 2012 · 1 commit
  13. 12 Dec, 2012 · 2 commits
  14. 04 Oct, 2012 · 1 commit
  15. 02 Oct, 2012 · 2 commits
    • Btrfs: use a slab for ordered extents allocation · 6352b91d
      Miao Xie committed
      The ordered extent allocation is in the fast path of the IO, so use a slab
      to improve the speed of the allocation.
      
       "Size of the struct is 280, so this will fall into the size-512 bucket,
        giving 8 objects per page, while own slab will pack 14 objects into a page.
      
        Another benefit I see is to check for leaked objects when the module is
        removed (and the cache destroy takes place)."
      						-- David Sterba
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
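The packing figures quoted from David Sterba can be reproduced with simple integer division. The sketch assumes a 4096-byte page and ignores any per-slab metadata or alignment padding, which the commit message does not discuss.

```python
PAGE_SIZE = 4096    # assumed page size
STRUCT_SIZE = 280   # struct size quoted in the commit message
GENERIC_BUCKET = 512  # the size-512 bucket a 280-byte object falls into


def objects_per_page(object_size: int, page_size: int = PAGE_SIZE) -> int:
    """Whole objects of the given size that fit in one page."""
    return page_size // object_size


generic = objects_per_page(GENERIC_BUCKET)   # generic size-512 bucket
dedicated = objects_per_page(STRUCT_SIZE)    # dedicated slab sized to the struct
print(generic, dedicated)
```

Under these assumptions the generic bucket holds 8 objects per page and a dedicated slab holds 14, matching the numbers in the quote.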
    • Btrfs: fix file extent discount problem in the snapshot · b9a8cc5b
      Miao Xie committed
      If a snapshot is created while we are writing some data into the file,
      the i_size of the corresponding file in the snapshot will be wrong: it will
      be beyond the end of the last file extent. And btrfsck will report:
        root 256 inode 257 errors 100
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
       # for ((i=0; i<4; i++))
       > do
       > btrfs sub snap . $i
       > done
      
      This is because the algorithm for updating disk_i_size is wrong. Though there
      are some ordered extents behind the current one which we use to update
      disk_i_size, it doesn't mean those extents will be dealt with in the same
      transaction. So we shouldn't use the offset of those extents to update
      disk_i_size, or we will get the wrong i_size in the snapshot.
      
      We fix this problem by recording the max real i_size. If we find there is an
      ordered extent which is in front of the current one and hasn't completed, we
      will record the end of the current one into that ordered extent. If the
      current extent already holds the end of another extent (it must be greater
      than the end of the current one, because that extent is behind the current
      one), we will record the number that the current extent holds instead. In
      this way, we can exclude the ordered extents that may not be dealt with in
      the same transaction, and it is easy to know the real disk_i_size.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
  16. 04 Aug, 2012 · 1 commit
  17. 15 Jun, 2012 · 1 commit
    • Btrfs: call filemap_fdatawrite twice for compression · 7ddf5a42
      Josef Bacik committed
      I removed this in an earlier commit and I was wrong.  Because compression
      can return from filemap_fdatawrite() without having actually set any of its
      pages as writeback() it can make filemap_fdatawait() do essentially nothing,
      and then we won't find any ordered extents because they may not have been
      created yet.  So not only does this make fsync() completely useless, but it
      will also screw up if you truncate on a non-page aligned offset since we
      zero out the end and then wait on ordered extents and then call drop caches.
      We can drop the cache before the io completes, and then when we try to unpin
      the extent we just wrote we won't find it and everything goes sideways.  So fix
      this by putting it back and put a giant comment there to keep me from trying
      to remove it in the future.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  18. 30 May, 2012 · 3 commits
    • Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik committed
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independently of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: do not check delalloc when updating disk_i_size · 4e899152
      Josef Bacik committed
      We are checking delalloc to see if it is ok to update the i_size.  There are
      2 cases in which it stops us from updating:
      
      1) If there is delalloc between our current disk_i_size and this ordered
      extent
      
      2) If there is delalloc between our current ordered extent and the next
      ordered extent
      
      These tests are racy however since we can set delalloc for these ranges at
      any time.  Also for the first case if we notice there is delalloc between
      disk_i_size and our ordered extent we will not update disk_i_size and assume
      that when that delalloc bit gets written out it will update everything
      properly.  However if we crash before that we will have file extents outside
      of our i_size, which is not good, so this test is dangerous as well as racy.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: remove useless waiting and extra filemap work · 551ebb2d
      Josef Bacik committed
      In btrfs_wait_ordered_range we have been calling filemap_fdata_write() twice
      because compression does strange things and then waiting.  Then we look up
      ordered extents and if we find any we will always schedule_timeout(); once
      and then loop back around and do it all again.  We will even check to see if
      there are delalloc pages on this range and loop again.  So this patch gets
      rid of the multiple fdata_write() calls and just does
      filemap_write_and_wait().  In the case of compression we will still find the
      ordered extents and start those individually if we need to so that is ok,
      but in the normal buffered case we avoid all this weird overhead.
      
      Then in the case of the schedule_timeout(1), we don't need it.  All callers
      either 1) don't care, they just want to make sure what they just wrote makes
      it to disk or 2) are doing the lock()->lookup ordered->unlock->flush thing
      in which case it will lock and check for ordered extents _anyway_ so get
      back to them as quickly as possible.  The delalloc check is simply not
      needed, this only catches the case where we write to the file again since
      doing the filemap_write_and_wait() and if the caller truly cares about that
      it will take care of everything itself.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  19. 22 Mar, 2012 · 2 commits
  20. 28 Mar, 2011 · 1 commit
    • Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      liubo committed
      Tracepoints can provide insight into why btrfs hits bugs and be greatly
      helpful for debugging, e.g.:
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordered_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both read and write cases, and it is useful for tracking
      how btrfs specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are critical resources and produce a lot of corner cases during writeback,
      so it is valuable to know how a page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These can show where and when an inode is created and when an inode is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In a transaction-based filesystem, it is useful to know the generation and
      who does the commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references, and these tracepoints are helpful for
      understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      A chunk is a link between a physical offset and a logical offset, and stands for
      space information in btrfs; these are helpful for tracing space usage.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
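Trace lines in the format shown in this commit message are regular enough to be split into fields for ad hoc analysis. The parser below is a sketch written against the sample lines only; the function name and the returned dictionary layout are assumptions, and nested values such as `29368320 (orig_level = 0)` are kept as raw strings rather than parsed further.

```python
import re
from typing import Dict, Optional

# task-pid  [cpu]  timestamp: event: key = value, key = value, ...
TRACE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s+(?P<fields>.*)$"
)


def parse_trace_line(line: str) -> Optional[Dict[str, object]]:
    """Split one trace line into task, pid, cpu, timestamp, event, and fields."""
    match = TRACE_RE.match(line)
    if match is None:
        return None
    fields: Dict[str, str] = {}
    for part in match.group("fields").split(", "):
        key, sep, value = part.partition(" = ")
        if sep:  # skip fragments that are not "key = value"
            fields[key.strip()] = value.strip()
    return {
        "task": match.group("task"),
        "pid": int(match.group("pid")),
        "cpu": int(match.group("cpu")),
        "timestamp": float(match.group("ts")),
        "event": match.group("event"),
        "fields": fields,
    }
```

Filtering parsed events by `event == "btrfs_transaction_commit"` and reading `fields["gen"]`, for instance, gives the sequence of committed generations in a captured trace.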
  21. 01 Feb, 2011 · 1 commit
  22. 22 Dec, 2010 · 1 commit
  23. 29 Nov, 2010 · 1 commit
  24. 30 Oct, 2010 · 1 commit
  25. 25 May, 2010 · 2 commits
    • Btrfs: add basic DIO read/write support · 4b46fce2
      Josef Bacik committed
      This provides basic DIO support for reading and writing.  It does not do the
      work to recover from mismatching checksums; that will come later.  A few design
      changes have been made from Jim's code (sorry Jim!):
      
      1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
      code in order to account for all of BTRFS's oddities, but thanks to that work it
      seems like the best bet is to just ignore compression and such and just opt to
      fall back on buffered IO.
      
      2) Fall back on buffered IO for compressed or inline extents.  Jim's code did
      its own buffering to make dio with compressed extents work.  Now we just
      fall back onto normal buffered IO.
      
      3) Use ordered extents for the writes so that all of the
      
      lock_extent()
      lookup_ordered()
      
      type checks continue to work.
      
      4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
      DIO writes.
      
      I've tested this with fsx and everything works great.  This patch depends on my
      dio and filemap.c patches to work.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
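The lock_extent() / lookup_ordered() loop mentioned in points 3 and 4 follows a lock, check, unlock, wait, retry shape. The toy model below only illustrates that shape: `ToyInode`, its methods, and `lock_and_wait_for_ordered` are hypothetical, a single `threading.Lock` stands in for the extent range lock, and "waiting" simply marks the in-flight write as finished.

```python
import threading
from typing import List, Optional, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


class ToyInode:
    """Hypothetical inode holding an extent lock and in-flight ordered ranges."""

    def __init__(self, ordered: List[Range]):
        self.extent_lock = threading.Lock()  # stands in for lock_extent()
        self.ordered = list(ordered)         # writes that have not completed yet

    def lookup_ordered(self, start: int, end: int) -> Optional[Range]:
        """Return the first in-flight range overlapping [start, end), if any."""
        for s, e in self.ordered:
            if s < end and start < e:
                return (s, e)
        return None

    def wait_ordered(self, rng: Range) -> None:
        """Toy wait: pretend the IO for this range has now completed."""
        self.ordered.remove(rng)


def lock_and_wait_for_ordered(inode: ToyInode, start: int, end: int) -> int:
    """Loop until the range is locked with no ordered extent inside it.

    Returns the number of waits; the extent lock is held on return.
    """
    waits = 0
    while True:
        inode.extent_lock.acquire()
        rng = inode.lookup_ordered(start, end)
        if rng is None:
            return waits
        # Never sleep with the range locked: drop it, wait, then retry.
        inode.extent_lock.release()
        inode.wait_ordered(rng)
        waits += 1
```

The caller proceeds with its read or write once the function returns and releases `inode.extent_lock` afterwards. Rechecking under the lock on every pass is what keeps a reader from racing with a write that started between the lookup and the lock.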
    • Btrfs: Update metadata reservation for delayed allocation · 0ca1f7ce
      Yan, Zheng committed
      Introduce metadata reservation context for delayed allocation
      and update various related functions.
      
      This patch also introduces EXTENT_FIRST_DELALLOC control bit for
      set/clear_extent_bit. It tells set/clear_bit_hook whether they
      are processing the first extent_state with EXTENT_DELALLOC bit
      set. This change is important if set/clear_extent_bit involves
      multiple extent_state.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  26. 31 Mar, 2010 · 1 commit