1. 09 Oct, 2012 3 commits
    • Btrfs: make filesystem read-only when submitting barrier fails · 5af3e8cc
      Committed by Stefan Behrens
      So far the return code of barrier_all_devices() is ignored, which
      means that errors are ignored. The result can be a corrupt
      filesystem which is not consistent.
      This commit adds code to evaluate the return code of
      barrier_all_devices(). The normal btrfs_error() mechanism is used to
      switch the filesystem into read-only mode when errors are detected.
      
      In order to decide whether barrier_all_devices() should return
      error or success, the number of disks that are allowed to fail the
      barrier submission is calculated. This calculation accounts for the
      worst RAID level of metadata, system and data. If single, dup or
      RAID0 is in use, a single disk error is already considered to be
      fatal. Otherwise a single disk error is tolerated.
      
      The calculation of the number of disks that are tolerated to fail
      the barrier operation is performed when the filesystem gets mounted,
      when a balance operation is started and finished, and when devices
      are added or removed.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
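The tolerance rule described in the commit above can be sketched in a few lines. This is a minimal illustration, not the kernel code: the lowercase profile names and both function names are assumptions made for the example.

```python
# Profiles for which the commit message treats a single disk error as fatal.
NO_REDUNDANCY = {"single", "dup", "raid0"}


def tolerated_barrier_failures(metadata, system, data):
    """Return how many devices may fail barrier submission (0 or 1)."""
    profiles = (metadata, system, data)
    # The worst (least redundant) profile decides for the whole filesystem.
    if any(profile in NO_REDUNDANCY for profile in profiles):
        return 0
    return 1


def barrier_result_ok(failed_devices, tolerated):
    """Hypothetical check: succeed only while failures stay within tolerance."""
    return failed_devices <= tolerated
```

With metadata and system on raid1 but data on raid0, the sketch returns 0, so a single failed barrier would already be treated as an error.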
    • btrfs: extended inode refs · f186373f
      Committed by Mark Fasheh
      This patch adds basic support for extended inode refs. This includes support
      for link and unlink of the refs, which basically gets us support for rename
      as well.
      
      Inode creation does not need changing - extended refs are only added after
      the ref array is full.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
    • btrfs: move transaction aborts to the point of failure · 005d6427
      Committed by David Sterba
      Call btrfs_abort_transaction as early as possible when an error
      condition is detected, that way the line number reported is useful
      and we're not clueless anymore which error path led to the abort.
      Signed-off-by: David Sterba <dsterba@suse.cz>
  2. 04 Oct, 2012 1 commit
  3. 02 Oct, 2012 9 commits
    • Btrfs: delay block group item insertion · ea658bad
      Committed by Josef Bacik
      So we have lots of places where we try to preallocate chunks in order to
      make sure we have enough space as we make our allocations.  This has
      historically meant that we're constantly tweaking when we should allocate a
      new chunk, and historically we have gotten this horribly wrong so we way
      over allocate either metadata or data.  To try and keep this from happening
      we are going to make it so that the block group item insertion is done out
      of band at the end of a transaction.  This will allow us to create chunks
      even if we are trying to make an allocation for the extent tree.  With this
      patch my enospc tests run faster (didn't expect this) and more efficiently
      use the disk space (this is what I wanted).  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup for unused ref cache stuff · 0647d6bd
      Committed by liubo
      Since the ref cache has been removed from btrfs, nothing uses its
      lock or its check any more.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix unprotected ->log_batch · 2ecb7923
      Committed by Miao Xie
      We forgot to protect ->log_batch when syncing a file; this patch fixes
      the problem by using atomic operations. ->log_batch is only used to check
      whether there are parallel sync operations, so it is unnecessary to
      reset it to 0 after the sync of the current log tree completes.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Committed by Miao Xie
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, for example the reservation for the delayed inode
      update. Currently we identify the type only by comparing the address of the
      reservation with those of the common reservations, which is very ugly for a
      temporary reservation because it has to be compared against all of them.
      So we add a new "type" field that records the type of the reservation.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: btrfs_drop_extent_cache should never fail · 7014cdb4
      Committed by Josef Bacik
      I noticed this when I was doing the fsync stuff, we allocate split extents if we
      drop an extent range that is in the middle of an existing extent.  This BUG()'s
      if we fail to allocate memory, but the fact is this is just a cache, we will
      just regenerate the cache if we need it, the important part is that we free the
      range we are given.  This can be done without allocations, so if we fail to
      allocate splits just skip the splitting stage and free our em and look for more
      extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
      was checking the return value anyway.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: add hole punching · 2aaa6655
      Committed by Josef Bacik
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Committed by Josef Bacik
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so lets just remove it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: do not needlessly restart the transaction for enospc · ca7e70f5
      Committed by Josef Bacik
      We will stop and restart a transaction every time we move to a different leaf
      when truncating a file.  This is for enospc reasons, but really we could
      probably get away with doing this a little better by actually working until we
      hit an ENOSPC.  So add a ->failfast flag to the block_rsv and set it when we do
      truncates which will fail as soon as the block rsv runs out of space, and then
      at that point we can stop and restart the transaction and refill the block rsv
      and carry on.  This will make rm'ing of a file with lots of extents a bit
      faster.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
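The failfast idea in the commit above can be illustrated with a toy model: keep working until the reservation reports ENOSPC, and only then restart the transaction and refill. This is a rough sketch under stated assumptions, not the kernel logic. `BlockRsv`, `truncate_leaves`, the `borrowed` fallback and the one-unit-per-leaf cost are all hypothetical.

```python
import errno
import os


class BlockRsv:
    """Toy block reservation; 'failfast' mirrors the flag named in the commit."""

    def __init__(self, reserved, failfast=False):
        self.reserved = reserved
        self.failfast = failfast
        self.borrowed = 0  # hypothetical stand-in for falling back elsewhere

    def use(self, amount):
        if self.reserved >= amount:
            self.reserved -= amount
        elif self.failfast:
            # Fail as soon as the reservation runs out of space.
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        else:
            self.borrowed += amount


def truncate_leaves(num_leaves, refill):
    """Process num_leaves leaves; return how often the transaction restarted."""
    if refill < 1:
        raise ValueError("refill must be positive")
    rsv = BlockRsv(refill, failfast=True)
    restarts = 0
    remaining = num_leaves
    while remaining > 0:
        try:
            rsv.use(1)  # assumed cost: one unit per leaf
        except OSError as err:
            if err.errno != errno.ENOSPC:
                raise
            # Stop and restart the transaction, refill the rsv, carry on.
            restarts += 1
            rsv.reserved = refill
            continue
        remaining -= 1
    return restarts
```

In this model, ten leaves with a refill of four units need two restarts, whereas restarting on every move to a different leaf would need nine.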
    • Btrfs: turbo charge fsync · 5dc562c5
      Committed by Josef Bacik
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
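The core of the fsync change above is to log only the extents touched in the current transaction instead of every item of the inode. A minimal sketch of that selection step follows. The tuple layout and the choice to sort by file offset are assumptions; the commit message only says the extents are sorted. The last two lines recompute the speedups from the quoted fio numbers.

```python
def extents_to_log(extents, current_transid):
    """Pick the extents modified in the current transaction, sorted by offset.

    extents: iterable of (file_offset, length, generation) tuples, where
    generation is the transid that last modified the extent (assumed layout).
    """
    modified = [ext for ext in extents if ext[2] == current_transid]
    return sorted(modified, key=lambda ext: ext[0])


# Speedups implied by the benchmark table in the commit message.
sata_speedup = 140 / 82      # SATA drive: 82KB/s -> 140KB/s
fusion_speedup = 2532 / 431  # Fusion drive: 431KB/s -> 2532KB/s
```

The two ratios come out at roughly 1.7 and 5.9, which is what the message rounds to "around 2-6 times faster".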
  4. 29 Aug, 2012 2 commits
    • Btrfs: fix deadlock in wait_for_more_refs · 1fa11e26
      Committed by Arne Jansen
      Commit a168650c introduced a waiting mechanism to prevent busy waiting in
      btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
      where a tree_mod_seq is held while waiting for the io to complete, while
      the end_io calls btrfs_run_delayed_refs.
      This whole mechanism is unnecessary. If not enough runnable refs are
      available to satisfy count, just return as count is more like a guideline
      than a strict requirement.
      In case we have to run all refs, commit transaction makes sure that no
      other threads are working in the transaction anymore, so we just assert
      here that no refs are blocked.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: don't allocate a seperate csums array for direct reads · c329861d
      Committed by Josef Bacik
      We've been allocating a big array for csums instead of storing them in the
      io_tree like we do for buffered reads because previously we were locking the
      entire range, so we didn't have an extent state for each sector of the
      range.  But now that we do the range locking as we map the buffers we can
      limit the mapping length to sectorsize and use the private part of the
      io_tree for our csums.  This allows us to avoid an extra memory allocation
      for direct reads which could incur latency.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  5. 31 Jul, 2012 1 commit
  6. 26 Jul, 2012 3 commits
  7. 25 Jul, 2012 1 commit
  8. 24 Jul, 2012 2 commits
    • Btrfs: rewrite BTRFS_SETGET_FUNCS · 18077bb4
      Committed by Li Zefan
      BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
      btrfs_foo() functions, which read and write specific fields in the
      extent buffer.
      
      The total number of set/get functions is ~200, but in fact we only
      need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.
      
      This results in a reduction of ~37K bytes.
      
         text    data     bss     dec     hex filename
       629661   12489     216  642366   9cd3e fs/btrfs/btrfs.o.orig
       592637   12489     216  605342   93c9e fs/btrfs/btrfs.o
      Signed-off-by: Li Zefan <lizefan@huawei.com>
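The idea behind the rewrite above is that roughly 200 per-field accessors only differ in offset and width, so a handful of width-specific readers and writers can serve them all. The sketch below models this with one getter and one setter parameterised by width. `FIELDS` and its offsets are made up for illustration and do not reflect the real on-disk layout; the little-endian byte order matches the btrfs on-disk format.

```python
import struct

# Little-endian struct codes for the four field widths (u8, u16, u32, u64).
_FORMATS = {8: "<B", 16: "<H", 32: "<I", 64: "<Q"}

# Hypothetical field table: name -> (byte offset, width in bits).
FIELDS = {"generation": (0, 64), "nritems": (8, 32), "level": (12, 8)}


def get_field(buf, name):
    """Read a named field from the buffer using the shared per-width reader."""
    offset, bits = FIELDS[name]
    return struct.unpack_from(_FORMATS[bits], buf, offset)[0]


def set_field(buf, name, value):
    """Write a named field into the buffer using the shared per-width writer."""
    offset, bits = FIELDS[name]
    struct.pack_into(_FORMATS[bits], buf, offset, value)


# Text-size saving implied by the size output in the commit message.
saved_bytes = 629661 - 592637
```

The subtraction gives 37024 bytes, in line with the "~37K bytes" quoted in the message.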
    • Btrfs: kill free_space pointer from inode structure · b4d7c3c9
      Committed by Li Zefan
      Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
      which means every inode has the same BTRFS_I(inode)->free_space pointer.
      
      This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
      Signed-off-by: Li Zefan <lizefan@huawei.com>
  9. 12 Jul, 2012 2 commits
  10. 10 Jul, 2012 4 commits
  11. 21 Jun, 2012 1 commit
  12. 15 Jun, 2012 1 commit
    • Btrfs: add btrfs_next_old_leaf · 3d7806ec
      Committed by Jan Schmidt
      To make sense of the tree mod log, the backref walker needs more than
      btrfs_search_old_slot: it also called btrfs_next_leaf, which in turn
      called btrfs_search_slot. This obviously didn't give the correct result.
      
      This commit adds btrfs_next_old_leaf, a drop-in replacement for
      btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
      like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
      with this time_seq parameter.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
  13. 02 Jun, 2012 1 commit
  14. 31 May, 2012 1 commit
    • Btrfs: use delayed ref sequence numbers for all fs-tree updates · 95a06077
      Committed by Jan Schmidt
      The sequence number for delayed refs is needed to postpone certain delayed
      refs for a very short period while walking backrefs. Before the tree
      modification log, we thought we'd only have to hold back those references
      that don't have a counter operation.
      
      Now that we have the tree mod log, we're rewinding fs tree blocks to a
      defined consistent state. We cannot know in advance for which tree block
      we'll be doing rewind operations later. Therefore, we must postpone all the
      delayed refs for fs-tree blocks, even those having a counter operation.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
  15. 30 May, 2012 5 commits
    • Btrfs: set ioprio of scrub readahead to idle · 3d136a11
      Committed by Stefan Behrens
      Reduce ioprio class of scrub readahead threads to idle priority.
      This setting is fixed. This priority has shown the best performance
      during all measurements.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
    • Btrfs: read device stats on mount, write modified ones during commit · 733f4fbb
      Committed by Stefan Behrens
      The device statistics are written into the device tree with each
      transaction commit. Only modified statistics are written.
      When a filesystem is mounted, the device statistics for each involved
      device are read from the device tree and used to initialize the
      counters.
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
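The read-on-mount, write-if-modified behaviour described above can be modelled with a dirty flag per device. This is only a sketch: `DeviceStats`, `commit`, the dict standing in for the device tree and the counter names are hypothetical.

```python
class DeviceStats:
    """Per-device error counters with a flag marking unsaved changes."""

    def __init__(self, stored=None):
        # "Mount": initialise the counters from what the device tree holds.
        self.counters = dict(stored or {})
        self.dirty = False

    def inc(self, name):
        self.counters[name] = self.counters.get(name, 0) + 1
        self.dirty = True


def commit(devices, device_tree):
    """Write back only modified stats; return the ids of devices written."""
    written = []
    for dev_id, stats in devices.items():
        if stats.dirty:
            device_tree[dev_id] = dict(stats.counters)
            stats.dirty = False
            written.append(dev_id)
    return written
```

A second commit with no new errors in between writes nothing, which is the point of tracking modification.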
    • Btrfs: fix how we deal with the orphan block rsv · 8a35d95f
      Committed by Josef Bacik
      Ceph was hitting this race where we would remove an inode from the per-root
      orphan list before we would release the space we had reserved for the inode.
      We actually don't need a list or anything, we just need to make sure the
      root doesn't try to free up the orphan reserve until after the inodes have
      released their reservations.  So use an atomic counter instead of a list on
      the root and only decrement the counter after we've released our
      reservation.  I've tested this as well as several others and we no longer
      see the warnings that you would see while running ceph.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: add btrfs_search_old_slot · 5d9e75c4
      Committed by Jan Schmidt
      The tree modification log together with the current state of the tree gives
      a consistent, old version of the tree. btrfs_search_old_slot is used to
      search through this old version and return old (dummy!) extent buffers.
      Naturally, this function cannot do any tree modifications.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
    • Btrfs: add tree modification log functions · bd989ba3
      Committed by Jan Schmidt
      The tree mod log will log modifications made to fs-tree nodes. Most
      modifications are done by autobalance of the tree. Such changes are recorded
      as long as a block entry exists. When released, the log is cleaned.
      
      With the tree modification log, it's possible to reconstruct a consistent
      old state of the tree. This is required to do backref walking on a busy
      file system.
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
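How a modification log lets a reader reconstruct an old, consistent view can be shown with a toy node modelled as a list of slots. This is a simplified sketch, not the btrfs data structure: `TreeModLog`, `record` and `rewind` are hypothetical names, and real entries cover more operation types than a slot overwrite.

```python
class TreeModLog:
    """Toy modification log: remembers the old value of each changed slot."""

    def __init__(self):
        self.seq = 0
        self.entries = []  # (sequence number, slot index, old value)

    def record(self, node, slot, new_value):
        """Apply a change to the live node and log what was there before."""
        self.seq += 1
        self.entries.append((self.seq, slot, node[slot]))
        node[slot] = new_value
        return self.seq

    def rewind(self, node, time_seq):
        """Return a copy of node as it looked right after change time_seq."""
        old = list(node)  # a dummy copy; the live node is never modified
        for seq, slot, old_value in reversed(self.entries):
            if seq <= time_seq:
                break
            old[slot] = old_value
        return old
```

Because `rewind` works on a copy, the live node keeps changing while a walker inspects the old state, which mirrors the "old (dummy!) extent buffers" mentioned for btrfs_search_old_slot.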
  16. 26 May, 2012 3 commits