1. 18 9月, 2014 2 次提交
    • M
      Btrfs: implement repair function when direct read fails · 8b110e39
      Miao Xie 提交于
      This patch implement data repair function when direct read fails.
      
      The detail of the implementation is:
      - When we find the data is not right, we try to read the data from the other
        mirror.
      - When the io on the mirror ends, we will insert the endio work into the
        dedicated btrfs workqueue, not common read endio workqueue, because the
        original endio work is still blocked in the btrfs endio workqueue, if we
        insert the endio work of the io on the mirror into that workqueue, deadlock
        would happen.
      - After we get right data, we write it back to the corrupted mirror.
      - And if the data on the new mirror is still corrupted, we will try next
        mirror until we read right data or all the mirrors are traversed.
      - After the above work, we set the uptodate flag according to the result.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8b110e39
    • D
      btrfs: make close_ctree return void · 3abdbd78
      David Sterba 提交于
      There's no user of the return value and we can get rid of the comment in
      put_super.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      3abdbd78
  2. 10 6月, 2014 1 次提交
    • J
      Btrfs: add sanity tests for new qgroup accounting code · faa2dbf0
      Josef Bacik 提交于
      This exercises the various parts of the new qgroup accounting code.  We do some
      basic stuff and do some things with the shared refs to make sure all that code
      works.  I had to add a bunch of infrastructure because I needed to be able to
      insert items into a fake tree without having to do all the hard work myself,
      hopefully this will be usefull in the future.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      faa2dbf0
  3. 12 11月, 2013 1 次提交
  4. 11 10月, 2013 1 次提交
    • M
      Btrfs: fix oops caused by the space balance and dead roots · c00869f1
      Miao Xie 提交于
      When doing space balance and subvolume destroy at the same time, we met
      the following oops:
      
      kernel BUG at fs/btrfs/relocation.c:2247!
      RIP: 0010: [<ffffffffa04cec16>] prepare_to_merge+0x154/0x1f0 [btrfs]
      Call Trace:
       [<ffffffffa04b5ab7>] relocate_block_group+0x466/0x4e6 [btrfs]
       [<ffffffffa04b5c7a>] btrfs_relocate_block_group+0x143/0x275 [btrfs]
       [<ffffffffa0495c56>] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs]
       [<ffffffffa0459871>] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs]
       [<ffffffffa048b46a>] ? btrfs_get_token_64+0x7e/0xcd [btrfs]
       [<ffffffffa04a3467>] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs]
       [<ffffffffa049907d>] btrfs_balance+0x9c7/0xb6f [btrfs]
       [<ffffffffa049ef84>] btrfs_ioctl_balance+0x234/0x2ac [btrfs]
       [<ffffffffa04a1e8e>] btrfs_ioctl+0xd87/0x1ef9 [btrfs]
       [<ffffffff81122f53>] ? path_openat+0x234/0x4db
       [<ffffffff813c3b78>] ? __do_page_fault+0x31d/0x391
       [<ffffffff810f8ab6>] ? vma_link+0x74/0x94
       [<ffffffff811250f5>] vfs_ioctl+0x1d/0x39
       [<ffffffff811258c8>] do_vfs_ioctl+0x32d/0x3e2
       [<ffffffff811259d4>] SyS_ioctl+0x57/0x83
       [<ffffffff813c3bfa>] ? do_page_fault+0xe/0x10
       [<ffffffff813c73c2>] system_call_fastpath+0x16/0x1b
      
      It is because we returned the error number if the reference of the root was 0
      when doing space relocation. It was not right here, because though the root
      was dead(refs == 0), but the space it held still need be relocated, or we
      could not remove the block group. So in this case, we should return the root
      no matter it is dead or not.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      c00869f1
  5. 14 6月, 2013 2 次提交
  6. 07 5月, 2013 2 次提交
  7. 30 4月, 2013 1 次提交
  8. 02 2月, 2013 1 次提交
    • D
      Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse 提交于
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet)
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      53b381b3
  9. 13 12月, 2012 1 次提交
  10. 09 10月, 2012 1 次提交
    • S
      Btrfs: make filesystem read-only when submitting barrier fails · 5af3e8cc
      Stefan Behrens 提交于
      So far the return code of barrier_all_devices() is ignored, which
      means that errors are ignored. The result can be a corrupt
      filesystem which is not consistent.
      This commit adds code to evaluate the return code of
      barrier_all_devices(). The normal btrfs_error() mechanism is used to
      switch the filesystem into read-only mode when errors are detected.
      
      In order to decide whether barrier_all_devices() should return
      error or success, the number of disks that are allowed to fail the
      barrier submission is calculated. This calculation accounts for the
      worst RAID level of metadata, system and data. If single, dup or
      RAID0 is in use, a single disk error is already considered to be
      fatal. Otherwise a single disk error is tolerated.
      
      The calculation of the number of disks that are tolerated to fail
      the barrier operation is performed when the filesystem gets mounted,
      when a balance operation is started and finished, and when devices
      are added or removed.
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      5af3e8cc
  11. 29 8月, 2012 1 次提交
    • S
      Btrfs: remove superblock writing after fatal error · 68ce9682
      Stefan Behrens 提交于
      With commit acce952b, btrfs was changed to flag the filesystem with
      BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
      error happened like a write I/O errors of all mirrors.
      In such situations, on unmount, the superblock is written in
      btrfs_error_commit_super(). This is done with the intention to be able
      to evaluate the error flag on the next mount. A warning is printed
      in this case during the next mount and the log tree is ignored.
      
      The issue is that it is possible that the superblock points to a root
      that was not written (due to write I/O errors).
      The result is that the filesystem cannot be mounted. btrfsck also does
      not start and all the other btrfs-progs tools fail to start as well.
      However, mount -o recovery is working well and does the right things
      to recover the filesystem (i.e., don't use the log root, clear the
      free space cache and use the next mountable root that is stored in the
      root backup array).
      
      This patch removes the writing of the superblock when
      BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
      flag in the mount function.
      
      These lines can be used to reproduce the issue (using /dev/sdm):
      SCRATCH_DEV=/dev/sdm
      SCRATCH_MNT=/mnt
      echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
      ls -alLF /dev/mapper/foo
      mkfs.btrfs /dev/mapper/foo
      mount /dev/mapper/foo $SCRATCH_MNT
      echo bar > $SCRATCH_MNT/foo
      sync
      echo 0 25165824 error | dmsetup reload foo
      dmsetup resume foo
      ls -alF $SCRATCH_MNT
      touch $SCRATCH_MNT/1
      ls -alF $SCRATCH_MNT
      sleep 35
      echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
      dmsetup resume foo
      sleep 1
      umount $SCRATCH_MNT
      btrfsck /dev/mapper/foo
      dmsetup remove foo
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      68ce9682
  12. 10 7月, 2012 1 次提交
  13. 30 5月, 2012 1 次提交
    • A
      btrfs: Drop unused function btrfs_abort_devices() · d07eb911
      Asias He 提交于
      1) This function is not used anywhere.
      
      2) Using the blk_abort_queue() to abort the queue seems not correct.
      blk_abort_queue() is used for timeout handling (block/blk-timeout.c).
      
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NAsias He <asias@redhat.com>
      d07eb911
  14. 06 5月, 2012 1 次提交
    • C
      Btrfs: avoid sleeping in verify_parent_transid while atomic · b9fab919
      Chris Mason 提交于
      verify_parent_transid needs to lock the extent range to make
      sure no IO is underway, and so it can safely clear the
      uptodate bits if our checks fail.
      
      But, a few callers are using it with spinlocks held.  Most
      of the time, the generation numbers are going to match, and
      we don't want to switch to a blocking lock just for the error
      case.  This adds an atomic flag to verify_parent_transid,
      and changes it to return EAGAIN if it needs to block to
      properly verifiy things.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9fab919
  15. 22 3月, 2012 3 次提交
  16. 09 1月, 2012 3 次提交
  17. 06 11月, 2011 1 次提交
    • C
      Btrfs: make sure to flush queued bios if write_cache_pages waits · 01d658f2
      Chris Mason 提交于
      write_cache_pages tries to build up a large bio to stuff down the pipe.
      But if it needs to wait for a page lock, it needs to make sure and send
      down any pending writes so we don't deadlock with anyone who has the
      page lock and is waiting for writeback of things inside the bio.
      
      Dave Sterba triggered this as a deadlock between the autodefrag code and
      the extent write_cache_pages
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      01d658f2
  18. 02 10月, 2011 1 次提交
    • A
      btrfs: add READAHEAD extent buffer flag · ab0fff03
      Arne Jansen 提交于
      Add a READAHEAD extent buffer flag.
      Add a function to trigger a read with this flag set.
      
      Changes v2:
       - use extent buffer flags instead of extent state flags
      
      Changes v5:
       - adapt to changed read_extent_buffer_pages interface
       - don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      ab0fff03
  19. 28 7月, 2011 1 次提交
    • C
      Btrfs: make a lockdep class for each root · 85d4e461
      Chris Mason 提交于
      This patch was originally from Tejun Heo.  lockdep complains about the btrfs
      locking because we sometimes take btree locks from two different trees at the
      same time.  The current classes are based only on level in the btree, which
      isn't enough information for lockdep to figure out if the lock is safe.
      
      This patch makes a class for each type of tree, and lumps all the FS trees that
      actually have files and directories into the same class.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      85d4e461
  20. 21 5月, 2011 1 次提交
    • M
      btrfs: implement delayed inode items operation · 16cdcec7
      Miao Xie 提交于
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which is spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix deadlock between readdir() and memory fault, which is reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
        inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task which does delayed items
        balance, which is reported by Tsutomu Itoh.
      - Modify the patch address David Sterba's comment.
      - Fix the bug of the cpu recursion spinlock, reported by Chris Mason
      
      Changelog V1 -> V2:
      - break up the global rb-tree, use a list to manage the delayed nodes,
        which is created for every directory and file, and used to manage the
        delayed directory name index items and the delayed inode item.
      - introduce a worker to deal with the delayed nodes.
      
      Compare with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
      such as inode item, directory name item, directory name index and so on.
      
      If we can do some delayed b+ tree insertion or deletion, we can improve the
      performance, so we made this patch which implemented delayed directory name
      index insertion/deletion and delayed inode update.
      
      Implementation:
      - introduce a delayed root object into the filesystem, that use two lists to
        manage the delayed nodes which are created for every file/directory.
        One is used to manage all the delayed nodes that have delayed items. And the
        other is used to manage the delayed nodes which is waiting to be dealt with
        by the work thread.
      - Every delayed node has two rb-tree, one is used to manage the directory name
        index which is going to be inserted into b+ tree, and the other is used to
        manage the directory name index which is going to be deleted from b+ tree.
      - introduce a worker to deal with the delayed operation. This worker is used
        to deal with the works of the delayed directory name index items insertion
        and deletion and the delayed inode update.
        When the delayed items is beyond the lower limit, we create works for some
        delayed nodes and insert them into the work queue of the worker, and then
        go back.
        When the delayed items is beyond the upper bound, we create works for all
        the delayed nodes that haven't been dealt with, and insert them into the work
        queue of the worker, and then wait for that the untreated items is below some
        threshold value.
      - When we want to insert a directory name index into b+ tree, we just add the
        information into the delayed inserting rb-tree.
        And then we check the number of the delayed items and do delayed items
        balance. (The balance policy is above.)
      - When we want to delete a directory name index from the b+ tree, we search it
        in the inserting rb-tree at first. If we look it up, just drop it. If not,
        add the key of it into the delayed deleting rb-tree.
        Similar to the delayed inserting rb-tree, we also check the number of the
        delayed items and do delayed items balance.
        (The same to inserting manipulation)
      - When we want to update the metadata of some inode, we cached the data of the
        inode into the delayed node. the worker will flush it into the b+ tree after
        dealing with the delayed insertion and deletion.
      - We will move the delayed node to the tail of the list after we access the
        delayed node, By this way, we can cache more delayed items and merge more
        inode updates.
      - If we want to commit transaction, we will deal with all the delayed node.
      - the delayed node will be freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
      
      I did a quick test by the benchmark tool[1] and found we can improve the
      performance of file creation by ~15%, and file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dave@jikos.cz>
      Tested-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: NItaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      16cdcec7
  21. 06 5月, 2011 1 次提交
  22. 04 5月, 2011 1 次提交
  23. 18 1月, 2011 1 次提交
    • L
      Btrfs: forced readonly mounts on errors · acce952b
      liubo 提交于
      This patch comes from "Forced readonly mounts on errors" ideas.
      
      As we know, this is the first step in being more fault tolerant of disk
      corruptions instead of just using BUG() statements.
      
      The major content:
      - add a framework for generating errors that should result in filesystems
        going readonly.
      - keep FS state in disk super block.
      - make sure that all of resource will be freed and released at umount time.
      - make sure that fter FS is forced readonly on error, there will be no more
        disk change before FS is corrected. For this, we should stop write operation.
      
      After this patch is applied, the conversion from BUG() to such a framework can
      happen incrementally.
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      acce952b
  24. 25 5月, 2010 2 次提交
  25. 25 3月, 2009 1 次提交
    • C
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason 提交于
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9473439
  26. 13 2月, 2009 1 次提交
    • C
      Btrfs: make a lockdep class for the extent buffer locks · 4008c04a
      Chris Mason 提交于
      Btrfs is currently using spin_lock_nested with a nested value based
      on the tree depth of the block.  But, this doesn't quite work because
      the max tree depth is bigger than what spin_lock_nested can deal with,
      and because locks are sometimes taken before the level field is filled in.
      
      The solution here is to use lockdep_set_class_and_name instead, and to
      set the class before unlocking the pages when the block is read from the
      disk and just after init of a freshly allocated tree block.
      
      btrfs_clear_path_blocking is also changed to take the locks in the proper
      order, and it also makes sure all the locks currently held are properly
      set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
      gets upset about bad lock orderin.
      
      The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4008c04a
  27. 22 1月, 2009 1 次提交
    • Y
      Btrfs: fix tree logs parallel sync · 7237f183
      Yan Zheng 提交于
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch has following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert root item into the log root tree immediately
      after log tree is allocated. Root item for log tree is
      inserted when log tree get synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first sync the log tree;
      then updates corresponding root item in the log root tree;
      sync the log root tree; then update the super block.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      7237f183
  28. 09 12月, 2008 1 次提交
    • Y
      Btrfs: superblock duplication · a512bbf8
      Yan Zheng 提交于
      This patch implements superblock duplication. Superblocks
      are stored at offset 16K, 64M and 256G on every devices.
      Spaces used by superblocks are preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      a512bbf8
  29. 13 11月, 2008 1 次提交
    • Y
      Btrfs: mount ro and remount support · c146afad
      Yan Zheng 提交于
      This patch adds mount ro and remount support. The main
      changes in patch are: adding btrfs_remount and related
      helper function; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating
      allocator to properly handle read only block group.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      c146afad
  30. 07 11月, 2008 1 次提交
    • C
      Btrfs: Add ordered async work queues · 4a69a410
      Chris Mason 提交于
      Btrfs uses kernel threads to create async work queues for cpu intensive
      operations such as checksumming and decompression.  These work well,
      but they make it difficult to keep IO order intact.
      
      A single writepages call from pdflush or fsync will turn into a number
      of bios, and each bio is checksummed in parallel.  Once the checksum is
      computed, the bio is sent down to the disk, and since we don't control
      the order in which the parallel operations happen, they might go down to
      the disk in almost any order.
      
      The code deals with this somewhat by having deep work queues for a single
      kernel thread, making it very likely that a single thread will process all
      the bios for a single inode.
      
      This patch introduces an explicitly ordered work queue.  As work structs
      are placed into the queue they are put onto the tail of a list.  They have
      three callbacks:
      
      ->func (cpu intensive processing here)
      ->ordered_func (order sensitive processing here)
      ->ordered_free (free the work struct, all processing is done)
      
      The work struct has three callbacks.  The func callback does the cpu intensive
      work, and when it completes the work struct is marked as done.
      
      Every time a work struct completes, the list is checked to see if the head
      is marked as done.  If so the ordered_func callback is used to do the
      order sensitive processing and the ordered_free callback is used to do
      any cleanup.  Then we loop back and check the head of the list again.
      
      This patch also changes the checksumming code to use the ordered workqueues.
      One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4a69a410
  31. 30 10月, 2008 1 次提交
    • C
      Btrfs: Add zlib compression support · c8b97818
      Chris Mason 提交于
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption or the
      'other' field are currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically singled threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c8b97818
  32. 25 9月, 2008 1 次提交
    • C
      Btrfs: Tree logging fixes · 4bef0848
      Chris Mason 提交于
      * Pin down data blocks to prevent them from being reallocated like so:
      
      trans 1: allocate file extent
      trans 2: free file extent
      trans 3: free file extent during old snapshot deletion
      trans 3: allocate file extent to new file
      trans 3: fsync new file
      
      Before the tree logging code, this was legal because the fsync
      would commit the transation that did the final data extent free
      and the transaction that allocated the extent to the new file
      at the same time.
      
      With the tree logging code, the tree log subtransaction can commit
      before the transaction that freed the extent.  If we crash,
      we're left with two different files using the extent.
      
      * Don't wait in start_transaction if log replay is going on.  This
      avoids deadlocks from iput while we're cleaning up link counts in the
      replay code.
      
      * Don't deadlock in replay_one_name by trying to read an inode off
      the disk while holding paths for the directory
      
      * Hold the buffer lock while we mark a buffer as written.  This
      closes a race where someone is changing a buffer while we write it.
      They are supposed to mark it dirty again after they change it, but
      this violates the cow rules.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4bef0848