1. 02 Feb, 2013 8 commits
    • Btrfs: reduce CPU contention while waiting for delayed extent operations · bb721703
      Chris Mason committed
      We batch up operations to the extent allocation tree, which allows
      us to deal with the recursive nature of using the extent allocation
      tree to allocate extents to the extent allocation tree.
      
      It also provides a mechanism to sort and collect extent
      operations, which makes it much more efficient to record extents
      that are close together.
      
      The delayed extent operations must all be finished before the
      running transaction commits, so we have code that makes sure to run a few
      of the batched operations when closing our transaction handles.
      
      This creates a great deal of contention for the locks in the
      delayed extent operation tree, and also contention for the lock on the
      extent allocation tree itself.  All the extra contention just slows
      down the operations and doesn't get things done any faster.
      
      This commit changes things to use a wait queue instead.  As processes
      want to run the delayed operations, one of them races in and gets
      permission to hit the tree, and the others step back and wait for
      progress to be made.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
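The wait-queue idea in this commit message can be illustrated with a small user-space sketch. It is not the kernel code: `DelayedRefRunner`, `run_delayed`, `drain`, and the batch size are hypothetical names and values chosen for illustration, and a `threading.Condition` stands in for the kernel wait queue. One caller wins the right to process a batch while the others wait for progress instead of contending on the tree locks.

```python
import threading


class DelayedRefRunner:
    """One caller at a time runs batched operations; the rest wait for progress."""

    def __init__(self, pending_ops):
        self.pending = list(pending_ops)   # batched operations still to run
        self.done = []                     # operations already processed
        self.running = False               # True while one thread owns the batch
        self.cond = threading.Condition()  # stand-in for the kernel wait queue

    def run_delayed(self, batch_size=4):
        with self.cond:
            if self.running:
                # Someone else is already working: step back and wait.
                self.cond.wait_for(lambda: not self.running)
                return 0
            self.running = True
            batch = self.pending[:batch_size]
            del self.pending[:batch_size]

        # Only the winner gets here, so no lock is needed for the work itself.
        for op in batch:
            self.done.append(op)

        with self.cond:
            self.running = False
            self.cond.notify_all()         # wake the waiters: progress was made
        return len(batch)


def drain(runner, workers=8):
    """Start several threads that all try to run the delayed operations."""
    def worker():
        while True:
            with runner.cond:
                if not runner.pending and not runner.running:
                    return
            runner.run_delayed()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `drain(DelayedRefRunner(range(100)))` processes every operation exactly once, with at most one thread touching the batch at any moment.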
    • Btrfs: reduce lock contention on extent buffer locks · 242e18c7
      Chris Mason committed
      The extent buffers have a refs_lock which we use to coordinate freeing
      the extent buffer with operations on the radix tree.  On tree roots and
      other extent buffers that are very cache hot, this can be highly contended.
      
      These are also the extent buffers that are basically pinned in memory.
      This commit adds code to cmpxchg our way through the ref modifications,
      and as long as the result of the reference change is still pinned in
      ram, we skip the expensive spinlock.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
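The cmpxchg fast path described here can be sketched in user space. Everything below is an illustration under stated assumptions: Python has no hardware compare-and-exchange, so `AtomicInt` emulates one with a private lock, and the `PINNED_REFS` threshold, the `ExtentBuffer` class, and the `put` method are hypothetical rather than taken from the commit. The point is the shape of the loop: while the count stays above the "still pinned" threshold, references are dropped without taking `refs_lock`.

```python
import threading


class AtomicInt:
    """Stand-in for an atomic counter; the inner lock only emulates hardware CAS."""

    def __init__(self, value):
        self._value = value
        self._cas_lock = threading.Lock()

    def read(self):
        return self._value

    def cmpxchg(self, expected, new):
        """Store `new` if the value equals `expected`; return the value seen."""
        with self._cas_lock:
            seen = self._value
            if seen == expected:
                self._value = new
            return seen


PINNED_REFS = 3  # assumed threshold above which the buffer stays pinned


class ExtentBuffer:
    def __init__(self, refs):
        self.refs = AtomicInt(refs)
        self.refs_lock = threading.Lock()  # the expensive lock to avoid
        self.slow_path_taken = 0

    def put(self):
        """Drop one reference, skipping refs_lock while the buffer stays pinned."""
        while True:
            refs = self.refs.read()
            if refs <= PINNED_REFS:
                break                      # result might unpin: use the lock
            if self.refs.cmpxchg(refs, refs - 1) == refs:
                return "fast"              # CAS won, no lock needed
            # CAS lost a race with another thread: reread and retry.

        with self.refs_lock:
            self.slow_path_taken += 1
            while True:                    # fast-path callers may still race
                refs = self.refs.read()
                if self.refs.cmpxchg(refs, refs - 1) == refs:
                    return "slow"
```

Starting from five references, the first two calls to `put()` take the fast path and the third falls back to the lock.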
    • Btrfs: fix cluster alignment for mount -o ssd · 8de972b4
      Chris Mason committed
      With the new raid56 code, we want to make sure we're
      properly aligning our allocation clusters with -o ssd
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: add a plugging callback to raid56 writes · 6ac0f488
      Chris Mason committed
      Buffered writes and DIRECT_IO writes will often break up
      big contiguous changes to the file into sub-stripe writes.
      
      This adds a plugging callback to gather those smaller writes into full
      stripe writes.
      
      Example on flash:
      
      fio job to do 64K writes in batches of 3 (which makes a full stripe):
      
      With plugging: 450MB/s
      Without plugging: 220MB/s
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
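A rough sketch of what gathering sub-stripe writes buys, using the numbers from the fio example (64K writes, three of which make a full stripe). The function name `unplug` and the assumptions that writes do not overlap and that there are three data stripes are illustrative only. The sketch classifies each full stripe touched by a plugged batch as completely covered (parity can be computed from the new data alone) or only partially covered (a read/modify/write would be needed).

```python
from collections import defaultdict

STRIPE_LEN = 64 * 1024                   # per-disk stripe size, as in the example
DATA_STRIPES = 3                         # assumed: three data disks
FULL_STRIPE = STRIPE_LEN * DATA_STRIPES  # 192K of data per full stripe


def unplug(writes):
    """Group plugged (offset, length) writes by the full stripe they land in.

    Returns (full, partial): sorted start offsets of stripes that the batch
    covers completely and of stripes it covers only in part.
    Assumes the writes do not overlap each other.
    """
    covered = defaultdict(int)
    for offset, length in writes:
        end = offset + length
        while offset < end:
            stripe_start = offset - offset % FULL_STRIPE
            chunk = min(end, stripe_start + FULL_STRIPE) - offset
            covered[stripe_start] += chunk
            offset += chunk

    full = sorted(s for s, n in covered.items() if n == FULL_STRIPE)
    partial = sorted(s for s, n in covered.items() if n < FULL_STRIPE)
    return full, partial
```

With four consecutive 64K writes starting at offset 0, the first three combine into one full stripe and the fourth is left as a partial stripe.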
    • Btrfs: Add a stripe cache to raid56 · 4ae10b3a
      Chris Mason committed
      The stripe cache allows us to avoid extra read/modify/write cycles
      by caching the pages we read off the disk.  Pages are cached when:
      
      * They are read in during a read/modify/write cycle
      
      * They are written during a read/modify/write cycle
      
      * They are involved in a parity rebuild
      
      Pages are not cached if we're doing a full stripe write.  We're
      assuming that a full stripe write won't be followed by another
      partial stripe write any time soon.
      
      This provides a substantial boost in performance for workloads that
      synchronously modify adjacent offsets in the file, and for the parity
      rebuild use case in general.
      
      The size of the stripe cache isn't tunable (yet) and is set at 1024
      entries.
      
      Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
      
      Without the stripe cache: 2.1MB/s
      With the stripe cache: 21MB/s
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
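The caching rules listed in this message can be modelled with a small bounded cache. This is only a sketch: the commit message gives the size (1024 entries) but not the eviction policy, so least-recently-used eviction is an assumption here, as are the names `StripeCache`, `partial_write`, and `full_write`, and the choice to drop a stale entry on a full stripe write.

```python
from collections import OrderedDict

STRIPE_CACHE_SIZE = 1024  # fixed size stated in the commit message


class StripeCache:
    def __init__(self, capacity=STRIPE_CACHE_SIZE):
        self.capacity = capacity
        self.entries = OrderedDict()  # stripe number -> cached pages
        self.disk_reads = 0

    def _remember(self, stripe, pages):
        self.entries[stripe] = pages
        self.entries.move_to_end(stripe)        # mark as most recently used
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # assumed policy: evict the LRU

    def partial_write(self, stripe, read_from_disk):
        """Read/modify/write cycle: reuse cached pages when we have them."""
        pages = self.entries.get(stripe)
        if pages is None:
            pages = read_from_disk(stripe)      # the extra read we want to avoid
            self.disk_reads += 1
        self._remember(stripe, pages)
        return pages

    def full_write(self, stripe):
        """Full stripe writes are not cached; any old entry is dropped (assumed)."""
        self.entries.pop(stripe, None)
```

Two back-to-back partial writes to the same stripe cost a single disk read, which is the pattern behind the synchronous 4K `dd` speedup quoted above.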
    • Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse committed
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet).
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
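To make the read/modify/write cycle and its crash exposure concrete, here is a textbook RAID5 parity sketch. It is not necessarily how this commit computes parity, and the helper names `xor_blocks`, `rmw_parity`, and `rebuild` are hypothetical. The new parity is derived from the old parity, the old data, and the new data. If the data block reaches disk but the matching parity does not (a crash in between), a later rebuild from parity returns wrong data.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def rmw_parity(old_parity, old_data, new_data):
    """Update parity for one rewritten block: P' = P xor D_old xor D_new."""
    return xor_blocks(old_parity, old_data, new_data)


def rebuild(surviving_blocks, parity):
    """Recover the one missing data block from parity and the survivors."""
    return xor_blocks(parity, *surviving_blocks)
```

A usage example: with three data blocks `d0`, `d1`, `d2` and `parity = xor_blocks(d0, d1, d2)`, rewriting `d1` requires writing both the new `d1` and `rmw_parity(parity, d1, new_d1)`; only when both land does `rebuild([d0, d2], new_parity)` return the new `d1`.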
    • Btrfs: add rw argument to merge_bio_hook() · 64a16701
      David Woodhouse committed
      We'll want to merge writes so they can fill a full RAID[56] stripe, but
      not necessarily reads.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • btrfs: don't try to notify udev about missing devices · 3c911608
      Eric Sandeen committed
      If we remove a missing device, bdev is null, and if we
      send that off to btrfs_kobject_uevent we'll panic.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
  2. 19 Dec, 2012 2 commits
  3. 18 Dec, 2012 2 commits
    • Btrfs: fix a bug of per-file nocow · 213490b3
      Liu Bo committed
      Users reported a bug. The reproducer is:
      $ mkfs.btrfs /dev/loop0
      $ mount /dev/loop0 /mnt/btrfs/
      $ mkdir /mnt/btrfs/dir
      $ chattr +C /mnt/btrfs/dir/
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
      $ lsattr /mnt/btrfs/dir/foo
      ---------------C- /mnt/btrfs/dir/foo
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 1 extent found    ---> an extent
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts
      
      The newly created file should not only inherit the NODATACOW flag, but also
      honor the NODATASUM flag, because we must do COW on a file extent that has
      a checksum.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
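The flag rule in the last paragraph can be written down as a tiny sketch. The bit values and the function names `inherit_flags` and `writes_in_place` are illustrative assumptions, not the kernel's definitions. The idea is that a file created in a `chattr +C` directory gets both flags, because an extent that carries a checksum must still be COWed and would therefore be split on overwrite, as in the reproducer.

```python
# Illustrative bit values; treat them as assumptions, not the kernel's constants.
NODATASUM = 1 << 0
NODATACOW = 1 << 1


def inherit_flags(parent_flags):
    """Flags a newly created file takes from its parent directory."""
    flags = 0
    if parent_flags & NODATACOW:
        # Inherit nocow and also turn checksums off, as the fix describes.
        flags |= NODATACOW | NODATASUM
    return flags


def writes_in_place(file_flags):
    """Overwrites avoid COW only if the file is nocow and has no checksums."""
    return bool(file_flags & NODATACOW) and bool(file_flags & NODATASUM)
```

Under this model, a file with only `NODATACOW` set still gets COWed on overwrite, which matches the three-extent result shown by `filefrag` before the fix.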
    • Btrfs: fix hash overflow handling · 9c52057c
      Chris Mason committed
      The handling for directory crc hash overflows was fairly obscure:
      split_leaf returns EOVERFLOW when we try to extend the item and that is
      supposed to bubble up to userland.  For a while it did so, but along the
      way we added better handling of errors and forced the FS readonly if we
      hit IO errors during the directory insertion.
      
      Along the way, we started testing only for EEXIST and the EOVERFLOW case
      was dropped.  The end result is that we may force the FS readonly if we
      catch a directory hash bucket overflow.
      
      This fixes a few problem spots.  First I add tests for EOVERFLOW in the
      places where we can safely just return the error up the chain.
      
      btrfs_rename is harder though, because it tries to insert the new
      directory item only after it has already unlinked anything the rename
      was going to overwrite.  Rather than adding very complex logic, I added
      a helper to test for the hash overflow case early while it is still safe
      to bail out.
      
      Snapshot and subvolume creation had a similar problem, so they are using
      the new helper now too.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Pascal Junod <pascal@junod.info>
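The "test early, while it is still safe to bail out" approach for rename can be sketched with a toy directory. Nothing here mirrors the real on-disk format: the `Directory` class, `check_insert`, the `MAX_ITEM_BYTES` limit, and the use of `zlib.crc32` (plain CRC32, a stand-in for the hash Btrfs uses) are all assumptions for illustration. Names that hash to the same value share one bucket with a size limit, and `rename` checks for overflow before it unlinks anything.

```python
import errno
import zlib

MAX_ITEM_BYTES = 64  # hypothetical per-bucket limit, for illustration only


class Directory:
    def __init__(self, hash_fn=None, max_item_bytes=MAX_ITEM_BYTES):
        # zlib.crc32 is only a stand-in hash for this sketch.
        self.hash_fn = hash_fn or (lambda name: zlib.crc32(name.encode()))
        self.max_item_bytes = max_item_bytes
        self.buckets = {}  # hash value -> list of names sharing that hash

    def check_insert(self, name):
        """Return 0 if `name` can be added, else -EEXIST or -EOVERFLOW."""
        bucket = self.buckets.get(self.hash_fn(name), [])
        if name in bucket:
            return -errno.EEXIST
        if sum(len(n) for n in bucket) + len(name) > self.max_item_bytes:
            return -errno.EOVERFLOW
        return 0

    def insert(self, name):
        err = self.check_insert(name)
        if err:
            return err
        self.buckets.setdefault(self.hash_fn(name), []).append(name)
        return 0

    def unlink(self, name):
        bucket = self.buckets.get(self.hash_fn(name), [])
        if name in bucket:
            bucket.remove(name)


def rename(src_dir, old_name, dst_dir, new_name):
    # Check for a hash bucket overflow before anything destructive happens.
    if dst_dir.check_insert(new_name) == -errno.EOVERFLOW:
        return -errno.EOVERFLOW
    dst_dir.unlink(new_name)  # drop whatever the rename overwrites
    src_dir.unlink(old_name)
    return dst_dir.insert(new_name)
```

When the destination bucket is full, `rename` returns `-EOVERFLOW` and the source entry is left untouched, so the error can be passed up to the caller without forcing the filesystem read-only.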
  4. 17 Dec, 2012 28 commits