1. 22 Jan, 2009 1 commit
    • Btrfs: fix tree logs parallel sync · 7237f183
      Committed by Yan Zheng
      To improve performance, btrfs_sync_log merges tree log sync
      requests. But it wrongly merges sync requests for different
      tree logs. If multiple tree logs are synced at the same time,
      only one of them actually gets synced.
      
      This patch makes the following changes to fix the bug:
      
      Move most tree log related fields in btrfs_fs_info to
      btrfs_root. This allows merging sync requests separately
      for each tree log.
      
      Don't insert the root item into the log root tree immediately
      after the log tree is allocated. The root item for a log tree is
      inserted when the log tree gets synced for the first time. This
      allows syncing the log root tree without first syncing all
      log trees.
      
      At tree-log sync, btrfs_sync_log first syncs the log tree,
      then updates the corresponding root item in the log root tree,
      syncs the log root tree, and finally updates the super block.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  2. 21 Jan, 2009 4 commits
  3. 06 Jan, 2009 2 commits
  4. 20 Dec, 2008 1 commit
  5. 18 Dec, 2008 1 commit
    • Btrfs: shift all end_io work to thread pools · cad321ad
      Committed by Chris Mason
      bio_end_io for reads without checksumming on and btree writes were
      happening without using async thread pools.  This means the extent_io.c
      code had to use spin_lock_irq and friends on the rb tree locks for
      extent state.
      
      There were some irq safe vs unsafe lock inversions between the delalloc
      lock and the extent state locks.  This patch gets rid of them by moving
      all end_io code into the thread pools.
      
      To avoid contention and deadlocks between the data end_io processing and the
      metadata end_io processing yet another thread pool is added to finish
      off metadata writes.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 12 Dec, 2008 1 commit
  7. 11 Dec, 2008 1 commit
  8. 09 Dec, 2008 2 commits
    • Btrfs: superblock duplication · a512bbf8
      Committed by Yan Zheng
      This patch implements superblock duplication. Superblocks
      are stored at offsets 16K, 64M and 256G on every device.
      Space used by superblocks is preserved by the allocator,
      which uses a reverse mapping function to find the logical
      addresses that correspond to superblocks. Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
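The three offsets named in a512bbf8 follow a simple pattern: each copy sits 4096 times further into the device than the previous one (16K, 64M, 256G). The sketch below is a minimal user-space illustration under that reading, not the kernel code. Every name in it (`sb_mirror_offset`, `SB_MIRROR_SHIFT`, `range_holds_superblock`) is hypothetical, and the overlap check works on raw device offsets rather than the reverse-mapped logical addresses the commit describes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values mirror the offsets in the commit message. */
#define SB_PRIMARY_OFFSET (16ULL * 1024) /* 16K */
#define SB_MIRROR_SHIFT   12             /* each copy is 4096x further out */
#define SB_MIRROR_COUNT   3

/* Byte offset of superblock copy `mirror` on a device (0 = primary). */
static uint64_t sb_mirror_offset(int mirror)
{
    assert(mirror >= 0 && mirror < SB_MIRROR_COUNT);
    return SB_PRIMARY_OFFSET << (SB_MIRROR_SHIFT * mirror);
}

/*
 * Returns 1 if the device byte range [start, start + len) overlaps any
 * superblock copy of size sb_size, 0 otherwise. An allocator could use a
 * check like this to keep data away from the reserved areas.
 */
static int range_holds_superblock(uint64_t start, uint64_t len,
                                  uint64_t sb_size)
{
    for (int i = 0; i < SB_MIRROR_COUNT; i++) {
        uint64_t sb = sb_mirror_offset(i);

        if (start < sb + sb_size && sb < start + len)
            return 1;
    }
    return 0;
}
```

A device smaller than 256G simply has fewer copies, so a caller would also compare each offset against the device size before using it.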
    • Btrfs: move data checksumming into a dedicated tree · d20f7043
      Committed by Chris Mason
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have to touch
      the subvolume and inodes as we cache extents.
      
      * There is potentially one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by physical extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
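The key layout described in d20f7043 can be sketched as follows. This is a simplified user-space model: the struct, the constant values, and the function names are assumptions chosen for illustration, not the definitions in ctree.h. It shows why a fixed objectid plus an offset equal to the extent's starting byte keeps all checksum items contiguous in the tree and sorted by disk position.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real magic objectid lives in ctree.h. */
#define CSUM_OBJECTID  ((uint64_t)-10)
#define CSUM_ITEM_TYPE 128

/* Simplified (objectid, type, offset) key used to order items in a tree. */
struct tree_key {
    uint64_t objectid;
    uint8_t  type;
    uint64_t offset;
};

/* Build the key for the checksum item covering the extent at extent_start. */
static struct tree_key csum_key_for_extent(uint64_t extent_start)
{
    struct tree_key key = { CSUM_OBJECTID, CSUM_ITEM_TYPE, extent_start };
    return key;
}

/* Compare objectid first, then type, then offset. */
static int key_cmp(const struct tree_key *a, const struct tree_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Usage: keys for two extents differ only in offset, so disk order wins. */
static void csum_key_demo(void)
{
    struct tree_key lo = csum_key_for_extent(4096);
    struct tree_key hi = csum_key_for_extent(8192);

    assert(key_cmp(&lo, &hi) < 0);
}
```

Because the key does not mention any inode or subvolume, the same item can be copied into another tree (such as the fsync log tree) without rewriting it, which is the property the commit message points out.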
  9. 02 Dec, 2008 5 commits
  10. 20 Nov, 2008 4 commits
    • Btrfs: Drop dirty roots created by log replay immediately when · e556ce2c
      Committed by Yan Zheng
      The log replay produces dirty roots. These dirty roots
      should be dropped immediately if the fs is mounted as
      ro. Otherwise they can be added to the dirty root list
      again when remounting the fs as rw. Thank you,
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: compat code fixes · 4b4e25f2
      Committed by Chris Mason
      The btrfs git kernel tree is used to build a standalone tree for
      compiling against older kernels.  This commit makes the standalone tree
      work with 2.6.27
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Do fsync log replay when mount -o ro, except when on readonly media · 7c2ca468
      Committed by Chris Mason
      fsync log replay can change the filesystem, so it cannot be delayed until
      mount -o rw,remount, and it can't be forgotten entirely.  So, this patch
      changes btrfs to do what reiserfs, ext3 and xfs do, which is to do the
      log replay even when mounted readonly.
      
      On a readonly device if log replay is required, the mount is aborted.
      
      Getting all of this right required fixing up some of the error
      handling in open_ctree.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Avoid writeback stalls · d2c3f4f6
      Committed by Chris Mason
      While building large bios in writepages, btrfs may end up waiting
      for other page writeback to finish if WB_SYNC_ALL is used.
      
      While it is waiting, the bio it is building has a number of pages with the
      writeback bit set and they aren't getting to the disk any time soon.  This
      lowers the latencies of writeback in general by sending down the bio being
      built before waiting for other pages.
      
      The bio submission code tries to limit the total number of async bios in
      flight by waiting when we're over a certain number of async bios.  But,
      the waits are happening while writepages is building bios, and this can easily
      lead to stalls and other problems for people calling wait_on_page_writeback.
      
      The current fix is to let the congestion tests take care of waiting.
      
      sync() and others make sure to drain the current async requests to make
      sure that everything that was pending when the sync was started really gets
      to disk.  The code would drain pending requests both before and after
      submitting a new request.
      
      But, if one of the requests is waiting for page writeback to finish,
      the draining waits might block that page writeback.  This changes the
      draining code to only wait after submitting the bio being processed.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  11. 18 Nov, 2008 5 commits
    • Btrfs: unplug all devices in the unplug call back · 9f0ba5bd
      Committed by Chris Mason
      For larger multi-device filesystems, there was logic to limit the
      number of devices unplugged to just the page that was sent to our sync_page
      function.
      
      But, the code wasn't always unplugging the right device.  Since this was
      just an optimization, disable it for now.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: prevent loops in the directory tree when creating snapshots · ea9e8b11
      Committed by Chris Mason
      For a directory tree:
      
      /mnt/subvolA/subvolB
      
      btrfsctl -s /mnt/subvolA/subvolB /mnt
      
      Will create a directory loop with subvolA under subvolB.  This
      commit uses the forward refs for each subvol and snapshot to error out
      before creating the loop.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Give each subvol and snapshot their own anonymous devid · 3394e160
      Committed by Chris Mason
      Each subvolume has its own private inode number space, and so we need
      to fill in different device numbers for each subvolume to avoid confusing
      applications.
      
      This commit puts a struct super_block into struct btrfs_root so it can
      call set_anon_super() and get a different device number generated for
      each root.
      
      btrfs_rename is changed to prevent renames across subvols.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Allow subvolumes and snapshots anywhere in the directory tree · 3de4586c
      Committed by Chris Mason
      Before, all snapshots and subvolumes lived in a single flat directory.  This
      was awkward and confusing because the single flat directory was only writable
      with the ioctls.
      
      This commit changes the ioctls to create subvols and snapshots at any
      point in the directory tree.  This requires making separate ioctls for
      snapshot and subvol creation instead of combining them into one.
      
      The subvol ioctl does:
      
      btrfsctl -S subvol_name parent_dir
      
      After the ioctl is done subvol_name lives inside parent_dir.
      
      The snapshot ioctl does:
      
      btrfsctl -s path_for_snapshot root_to_snapshot
      
      path_for_snapshot can be an absolute or relative path.  btrfsctl breaks it up
      into directory and basename components.
      
      root_to_snapshot can be any file or directory in the FS.  The snapshot
      is taken of the entire root where that file lives.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Seed device support · 2b82032c
      Committed by Yan Zheng
      A seed device is a special btrfs with the SEEDING super flag
      set that can only be mounted in read-only mode. Seed
      devices allow people to create a new btrfs on top of them.
      
      The new FS contains the same contents as the seed device,
      but it can be mounted in read-write mode.
      
      This patch does the following:
      
      1) Split code in btrfs_alloc_chunk into two parts. The first part makes
      the newly allocated chunk usable, but does not do any operation that modifies
      the chunk tree. The second part does the chunk tree modifications. This
      division is for the bootstrap step of adding storage to the seed device.
      
      2) Update device management code to handle seed device.
      The basic idea is: For an FS grown from seed devices, its
      seed devices are put into a list. Seed devices are
      opened on demand at mounting time. If any seed device is
      missing or has been changed, btrfs kernel module will
      refuse to mount the FS.
      
      3) make btrfs_find_block_group not return NULL when all
      block groups are read-only.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
  12. 13 Nov, 2008 2 commits
    • Btrfs: mount ro and remount support · c146afad
      Committed by Yan Zheng
      This patch adds mount ro and remount support. The main
      changes in this patch are: adding btrfs_remount and related
      helper functions; splitting the transaction related code
      out of close_ctree into btrfs_commit_super; updating the
      allocator to properly handle read-only block groups.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Improve metadata read latencies · 6f3577bd
      Committed by Chris Mason
      This fixes latency problems on metadata reads by making sure they
      don't go through the async submit queue, and by tuning down the amount
      of readahead done during btree searches.
      
      Also, the btrfs bdi congestion function is tuned to ignore the
      number of pending async bios and checksums pending.  There is additional
      code that throttles new async bios now and the congestion function
      doesn't need to worry about it anymore.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  13. 11 Nov, 2008 1 commit
  14. 07 Nov, 2008 2 commits
    • Btrfs: Optimize compressed writeback and reads · 771ed689
      Committed by Chris Mason
      When reading compressed extents, try to put pages into the page cache
      for any pages covered by the compressed extent that readpages didn't already
      preload.
      
      Add an async work queue to handle transformations at delayed allocation processing
      time.  Right now this is just compression.  The workflow is:
      
      1) Find offsets in the file marked for delayed allocation
      2) Lock the pages
      3) Lock the state bits
      4) Call the async delalloc code
      
      The async delalloc code clears the state lock bits and delalloc bits.  It is
      important this happens before the range goes into the work queue because
      otherwise it might deadlock with other work queue items that try to lock
      those extent bits.
      
      The file pages are compressed, and if the compression doesn't work the
      pages are written back directly.
      
      An ordered work queue is used to make sure the inodes are written in the same
      order that pdflush or writepages sent them down.
      
      This changes extent_write_cache_pages to let the writepage function
      update the wbc nr_written count.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: Add ordered async work queues · 4a69a410
      Committed by Chris Mason
      Btrfs uses kernel threads to create async work queues for cpu intensive
      operations such as checksumming and decompression.  These work well,
      but they make it difficult to keep IO order intact.
      
      A single writepages call from pdflush or fsync will turn into a number
      of bios, and each bio is checksummed in parallel.  Once the checksum is
      computed, the bio is sent down to the disk, and since we don't control
      the order in which the parallel operations happen, they might go down to
      the disk in almost any order.
      
      The code deals with this somewhat by having deep work queues for a single
      kernel thread, making it very likely that a single thread will process all
      the bios for a single inode.
      
      This patch introduces an explicitly ordered work queue.  As work structs
      are placed into the queue they are put onto the tail of a list.  They have
      three callbacks:
      
      ->func (cpu intensive processing here)
      ->ordered_func (order sensitive processing here)
      ->ordered_free (free the work struct, all processing is done)
      
      The work struct has three callbacks.  The func callback does the cpu intensive
      work, and when it completes the work struct is marked as done.
      
      Every time a work struct completes, the list is checked to see if the head
      is marked as done.  If so the ordered_func callback is used to do the
      order sensitive processing and the ordered_free callback is used to do
      any cleanup.  Then we loop back and check the head of the list again.
      
      This patch also changes the checksumming code to use the ordered workqueues.
      On a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
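The ordered work queue described in 4a69a410 can be modelled in a few lines. The following is a single-threaded sketch of the idea only: the struct layout and the helper names (`queue_work`, `run_work`, `run_ordered`, `numbered_work`) are assumptions for illustration, and the locking and kernel threads that the real code needs are left out. Work items may finish their `func` step in any order, but `ordered_func` and `ordered_free` only run once every earlier item in the list is also done.

```c
#include <assert.h>
#include <stddef.h>

/* One queued unit of work with the three callbacks named in the commit. */
struct work {
    void (*func)(struct work *);         /* cpu intensive, any order */
    void (*ordered_func)(struct work *); /* order sensitive, queue order */
    void (*ordered_free)(struct work *); /* cleanup, called last */
    int done;
    struct work *next;
};

struct ordered_queue {
    struct work *head;
    struct work *tail;
};

/* Append a work item to the tail of the ordered list. */
static void queue_work(struct ordered_queue *q, struct work *w)
{
    assert(w->func && w->ordered_func && w->ordered_free);
    w->done = 0;
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/* Pop and finish items from the head for as long as the head is done. */
static void run_ordered(struct ordered_queue *q)
{
    while (q->head && q->head->done) {
        struct work *w = q->head;

        q->head = w->next;
        if (!q->head)
            q->tail = NULL;
        w->ordered_func(w);
        w->ordered_free(w);
    }
}

/* What a worker does for one item: heavy step, mark done, check the head. */
static void run_work(struct ordered_queue *q, struct work *w)
{
    w->func(w);
    w->done = 1;
    run_ordered(q);
}

/* Demo helpers: record the order in which ordered_func runs. */
struct numbered_work {
    struct work base; /* must stay first so the cast below is valid */
    int id;
};

static int order_log[8];
static int order_len;

static void noop(struct work *w) { (void)w; }

static void record(struct work *w)
{
    order_log[order_len++] = ((struct numbered_work *)w)->id;
}
```

For example, with items 1 and 2 queued in that order, calling `run_work` on item 2 first records nothing, because the head (item 1) is not done yet. Calling `run_work` on item 1 afterwards records 1 and then 2, which is the ordering guarantee the checksumming code relies on.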
  15. 30 Oct, 2008 4 commits
    • Btrfs: Add root tree pointer transaction ids · 84234f3a
      Committed by Yan Zheng
      This patch adds transaction IDs to root tree pointers.
      Transaction IDs in tree pointers are compared with the
      generation numbers in block headers when reading root
      blocks of trees. This can detect some types of IO errors.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
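The check described in 84234f3a amounts to comparing two numbers when a root block is read. The sketch below assumes simplified, hypothetical structs (`root_pointer`, `block_header`) and a hypothetical `verify_root_block` helper; it is not the kernel's read path, only an illustration of how a stale or misdirected block can be detected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical pointer stored in the parent: where the root is, and when
 * (in which transaction) it was written. */
struct root_pointer {
    uint64_t bytenr;
    uint64_t generation;
};

/* Hypothetical header found at the start of the block that was read. */
struct block_header {
    uint64_t generation;
};

/*
 * Returns 0 if the block read from disk carries the generation the pointer
 * expects, or -EIO if it does not (for example, an old copy of the block
 * was returned or the write never reached the disk).
 */
static int verify_root_block(const struct root_pointer *ptr,
                             const struct block_header *hdr)
{
    assert(ptr && hdr);
    if (hdr->generation != ptr->generation)
        return -EIO;
    return 0;
}
```

This only catches errors that leave a block with a different generation number, which matches the commit's wording that it detects "some types" of IO errors.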
    • Btrfs: nuke fs wide allocation mutex V2 · 25179201
      Committed by Josef Bacik
      This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
      of little locks.
      
      There is now a pinned_mutex, which is used when messing with the pinned_extents
      extent io tree, and the extent_ins_mutex which is used with the pending_del and
      extent_ins extent io trees.
      
      The locking for the extent tree stuff was inspired by a patch that Yan Zheng
      wrote to fix a race condition, I cleaned it up some and changed the locking
      around a little bit, but the idea remains the same.  Basically instead of
      holding the extent_ins_mutex throughout the processing of an extent on the
      extent_ins or pending_del trees, we just hold it while we're searching and when
      we clear the bits on those trees, and lock the extent for the duration of the
      operations on the extent.
      
      Also to keep from getting hung up waiting to lock an extent, I've added a
      try_lock_extent so if we cannot lock the extent, move on to the next one in the
      tree and we'll come back to that one.  I have tested this heavily and it does
      not appear to break anything.  This has to be applied on top of my
      find_free_extent redo patch.
      
      I tested this patch on top of Yan's space rebalancing code and it worked fine.
      The only thing that has changed since the last version is I pulled out all my
      debugging stuff, apparently I forgot to run guilt refresh before I sent the
      last patch out.  Thank you,
      Signed-off-by: Josef Bacik <jbacik@redhat.com>
    • Btrfs: Improve space balancing code · f82d02d9
      Committed by Yan Zheng
      This patch improves the space balancing code to keep more sharing
      of tree blocks. The only case that breaks sharing of tree blocks is
      when data extents get fragmented during balancing. The main changes in
      this patch are:
      
      Add a 'drop sub-tree' function. This solves the problem in the old code
      where the BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree blocks.
      
      Remove relocation mapping tree. Relocation mappings are stored in
      struct btrfs_ref_path and updated dynamically during walking up/down
      the reference path. This reduces CPU usage and simplifies code.
      
      This patch also fixes a bug. Root items for reloc trees should be
      updated in btrfs_free_reloc_root.
      Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
    • Btrfs: Add zlib compression support · c8b97818
      Committed by Chris Mason
      This is a large change for adding compression on reading and writing,
      both for inline and regular extents.  It does some fairly large
      surgery to the writeback paths.
      
      Compression is off by default and enabled by mount -o compress.  Even
      when the -o compress mount option is not used, it is possible to read
      compressed extents off the disk.
      
      If compression for a given set of pages fails to make them smaller, the
      file is flagged to avoid future compression attempts later.
      
      * While finding delalloc extents, the pages are locked before being sent down
      to the delalloc handler.  This allows the delalloc handler to do complex things
      such as cleaning the pages, marking them writeback and starting IO on their
      behalf.
      
      * Inline extents are inserted at delalloc time now.  This allows us to compress
      the data before inserting the inline extent, and it allows us to insert
      an inline extent that spans multiple pages.
      
      * All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
      are changed to record both an in-memory size and an on disk size, as well
      as a flag for compression.
      
      From a disk format point of view, the extent pointers in the file are changed
      to record the on disk size of a given extent and some encoding flags.
      Space in the disk format is allocated for compression encoding, as well
      as encryption and a generic 'other' field.  Neither the encryption nor the
      'other' field is currently used.
      
      In order to limit the amount of data read for a single random read in the
      file, the size of a compressed extent is limited to 128k.  This is a
      software only limit, the disk format supports u64 sized compressed extents.
      
      In order to limit the ram consumed while processing extents, the uncompressed
      size of a compressed extent is limited to 256k.  This is a software only limit
      and will be subject to tuning later.
      
      Checksumming is still done on compressed extents, and it is done on the
      uncompressed version of the data.  This way additional encodings can be
      layered on without having to figure out which encoding to checksum.
      
      Compression happens at delalloc time, which is basically single threaded because
      it is usually done by a single pdflush thread.  This makes it tricky to
      spread the compression load across all the cpus on the box.  We'll have to
      look at parallel pdflush walks of dirty inodes at a later time.
      
      Decompression is hooked into readpages and it does spread across CPUs nicely.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  16. 02 Oct, 2008 2 commits
  17. 30 Sep, 2008 1 commit
    • Btrfs: add and improve comments · d352ac68
      Committed by Chris Mason
      This improves the comments at the top of many functions.  It didn't
      dive into the guts of functions because I was trying to
      avoid merging problems with the new allocator and back reference work.
      
      extent-tree.c and volumes.c were both skipped, and there is definitely
      more work to do in cleaning and commenting the code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  18. 29 Sep, 2008 1 commit
    • Btrfs: Wait for IO on the block device inodes of newly added devices · 8c8bee1d
      Committed by Chris Mason
      btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
      The kernel code needs to wait for this IO to finish before it adds
      the device.
      
      btrfs metadata IO does not happen through the block device inode.  A
      separate address space is used, allowing the zero filled buffer heads in
      the block device inode to be written to disk after FS metadata starts
      going down to the disk via the btrfs metadata inode.
      
      The end result is zero filled metadata blocks after adding new devices
      into the filesystem.
      
      The fix is a simple filemap_write_and_wait on the block device inode
      before actually inserting it into the pool of available devices.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>