1. 28 Mar, 2013 1 commit
  2. 27 Mar, 2013 1 commit
    • C
      Btrfs: fix race between mmap writes and compression · 4adaa611
      Chris Mason committed
      Btrfs uses page_mkwrite to ensure stable pages during
      crc calculations and mmap workloads.  We call clear_page_dirty_for_io
      before we do any crcs, and this forces any application with the file
      mapped to wait for the crc to finish before it is allowed to change
      the file.
      
      With compression on, the clear_page_dirty_for_io step is happening after
      we've compressed the pages.  This means the applications might be
      changing the pages while we are compressing them, and some of those
      modifications might not hit the disk.
      
      This commit adds the clear_page_dirty_for_io before compression starts
      and makes sure to redirty the page if we have to fall back to
      uncompressed IO as well.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Alexandre Oliva <oliva@gnu.org>
      cc: stable@vger.kernel.org
      4adaa611
  3. 15 Mar, 2013 1 commit
  4. 06 Mar, 2013 1 commit
  5. 01 Mar, 2013 1 commit
    • J
      Btrfs: copy everything if we've created an inline extent · bdc20e67
      Josef Bacik committed
      I noticed while looking into a tree logging bug that we aren't logging inline
      extents properly.  Since this requires copying and it shouldn't happen too often
      just force us to copy everything for the inode into the tree log when we have an
      inline extent.  With this patch we have valid data after a crash when we write
      an inline extent.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      bdc20e67
  6. 27 Feb, 2013 2 commits
    • Q
      btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo committed
      Though most of the btrfs code uses the ALIGN macro for page alignment,
      some code still uses open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or an even more hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes these open-coded alignments are not so easy to understand for
      a newbie like me.

      This commit changes the open-coded alignment to the ALIGN macro for
      better readability.

      Also there is a previous patch from David Sterba with similar changes,
      but that patch is for the 3.2 kernel and seems not to have been merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fda2832f
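The equivalence this cleanup relies on can be checked in userspace. The sketch below is illustrative only: `ALIGN_UP` is a stand-in for the kernel's `ALIGN` macro (assumed here to be valid only for power-of-two alignments), and the helper function names are hypothetical.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's ALIGN(); `a` must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((uint64_t)(a) - 1)) & ~((uint64_t)(a) - 1))

/* Open-coded form, as in the first snippet of the commit message. */
static uint64_t align_open_coded(uint64_t val, uint32_t stripesize)
{
    uint64_t mask = (uint64_t)stripesize - 1;
    return (val + mask) & ~mask;
}

/* "Hidden" form: byte length of the inclusive range [start, end],
 * rounded up to a multiple of blocksize. */
static uint64_t range_len_open_coded(uint64_t start, uint64_t end,
                                     uint64_t blocksize)
{
    return (end - start + blocksize) & ~(blocksize - 1);
}

/* Same value via the macro: the range holds (end - start + 1) bytes. */
static uint64_t range_len_aligned(uint64_t start, uint64_t end,
                                  uint64_t blocksize)
{
    return ALIGN_UP(end - start + 1, blocksize);
}
```

The hidden form is the macro with the `+ 1` of the inclusive range and the `- 1` of the mask cancelled out, which is why it is harder to recognise at a glance.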
    • L
      Btrfs: do not change inode flags in rename · 8c4ce81e
      Liu Bo committed
      Previously, rename forced a file's NOCOW and COMPRESS flags to match
      the parent directory's, but this turned out to be a bad idea because it
      confuses end users about a file's NOCOW status. E.g. if someone
      sets a file to NOCOW via 'chattr' and then renames it within the current
      directory, which lacks the NOCOW attribute, the file silently loses the
      NOCOW flag.

      This disables 'change flags in rename', so from now on we only
      inherit flags from the parent directory at creation time, while
      elsewhere 'chattr' can be used to set the NOCOW or COMPRESS flags.
      Reported-by: Marios Titas <redneb8888@gmail.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      8c4ce81e
  7. 26 Feb, 2013 1 commit
    • J
      Btrfs: make sure NODATACOW also gets NODATASUM set · f2bdf9a8
      Josef Bacik committed
      A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
      csums on a NOCOW extent.  This can happen if we have NODATACOW set but not
      NODATASUM set, which can happen in two cases: either we mount with -o nodatacow
      and then write into preallocated space, or we chattr +C a directory and move a file
      into that directory.  Liu has fixed the move case in a different place, but this
      fixes the mount -o nodatacow case.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      f2bdf9a8
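The invariant described above (NODATACOW must never be set without NODATASUM) can be sketched as a small flag-normalising helper. This is an illustration under assumptions, not the patch itself: the flag values and the name `normalize_inode_flags` are made up for the example.

```c
#include <stdint.h>

/* Illustrative flag bits; the names echo the btrfs inode flags, but the
 * values are assumptions made for this sketch. */
#define INODE_NODATASUM (1u << 0)
#define INODE_NODATACOW (1u << 1)

/* Hypothetical helper: whenever NODATACOW is requested, also set NODATASUM,
 * so that a NOCOW extent never ends up carrying checksums. */
static uint32_t normalize_inode_flags(uint32_t flags)
{
    if (flags & INODE_NODATACOW)
        flags |= INODE_NODATASUM;
    return flags;
}
```

The implication runs in one direction only: NODATASUM on its own remains a valid state.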
  8. 21 Feb, 2013 14 commits
    • M
      Btrfs: fix wrong outstanding_extents when doing DIO write · 172a5049
      Miao Xie committed
      When running test case 083 of xfstests on a filesystem with
      "compress-force=lzo", the following WARNINGs were triggered.
        WARNING: at fs/btrfs/inode.c:7908
        WARNING: at fs/btrfs/inode.c:7909
        WARNING: at fs/btrfs/inode.c:7911
        WARNING: at fs/btrfs/extent-tree.c:4510
        WARNING: at fs/btrfs/extent-tree.c:4511
      
      This problem was introduced by the patch "Btrfs: fix deadlock due
      to unsubmitted". That patch contains two bugs which caused
      the above problem.

      The 1st one is an off-by-one bug: if the DIO write returns 0, it is
      also a short write, and we need to release the reserved space for it.
      That patch didn't do so. Fix it by changing "ret > 0" to
      "ret >= 0".

      The 2nd one is that ->outstanding_extents was increased twice when
      a short write happened. As we know, ->outstanding_extents is
      a counter that keeps track of the number of extent items we may
      use due to delalloc. When we reserve the free space for a
      delalloc write, we assume that the write will introduce just
      one extent item, so we increase ->outstanding_extents by 1 at
      that time. We then increase it every time we split the
      write, which is done at the beginning of btrfs_get_blocks_direct().
      So when a short write happens, we needn't increase
      ->outstanding_extents again, but that patch did.

      In order to fix the 2nd problem, I rewrote the logic for the
      ->outstanding_extents operation. We no longer increase it at the
      beginning of btrfs_get_blocks_direct(); instead, we
      increase it only when the split actually happens.
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      172a5049
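The first bug (a write that returns 0 is still a short write) comes down to a single comparison. The sketch below is a hedged illustration: both helper names are hypothetical, and error returns are assumed to be handled by a separate path that is not shown.

```c
#include <stddef.h>

/* Hypothetical helper: how many reserved bytes to give back after a DIO
 * write of `count` bytes that returned `ret`. Negative (error) returns are
 * assumed to be handled elsewhere and yield 0 here. */
static size_t bytes_to_release(long ret, size_t count)
{
    if (ret >= 0 && (size_t)ret < count)
        return count - (size_t)ret;
    return 0;
}

/* The pre-fix condition: a write that returned 0 released nothing, so the
 * whole reservation leaked. */
static size_t bytes_to_release_buggy(long ret, size_t count)
{
    if (ret > 0 && (size_t)ret < count)
        return count - (size_t)ret;
    return 0;
}
```

With `ret == 0` and `count == 4096`, the fixed check releases all 4096 bytes while the buggy one releases none.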
    • L
      Btrfs: snapshot-aware defrag · 38c227d8
      Liu Bo committed
      This comes from one of btrfs's project ideas:
      As we defragment files, we break any sharing from other snapshots.
      The balancing code will preserve the sharing, and defrag needs to grow this
      as well.

      This patch fills that gap by making full use of the backref walking code.

      Here is the basic idea:
      o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
      o  at endio, after we finish updating fs tree, we use backref walking to find
         all parents of the ranges and re-link them with the new COWed file layout by
         adding corresponding backrefs.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      38c227d8
    • Z
      btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Zach Brown committed
      Very large fallocate requests are cpu bound and result in extents with a
      repeating pattern of ever decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
      allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half until
      it gets down to the size that the allocators can return.  We limit the
      problem for now by capping each reservation at 256 meg.
      
      The small extents come from a masking bug when decreasing the requested
      reservation size.  The high 32bits are cleared and the remaining low
      bits might happen to reserve a small size.   Fix this by using
      round_down() which properly casts the mask.
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      24542bf7
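The masking bug and the cap can be reproduced in a few lines of userspace C. This is a sketch under assumptions: it takes the sector size to be a 32-bit value and `int` to be 32 bits wide, and the function names are hypothetical rather than taken from the patch.

```c
#include <stdint.h>

/* 256MB cap per reservation, as stated in the commit message. */
#define MAX_PREALLOC_CHUNK (256ULL * 1024 * 1024)

/* Buggy rounding: ~(sectorsize - 1) is evaluated in 32 bits, then
 * zero-extended to 64 bits, so the AND clears the high 32 bits. */
static uint64_t shrink_buggy(uint64_t num_bytes, uint32_t sectorsize)
{
    return num_bytes & ~(sectorsize - 1);
}

/* Fixed rounding: widen the mask before inverting it, which is the effect
 * of a round_down() that casts the mask to the type of its first operand. */
static uint64_t shrink_fixed(uint64_t num_bytes, uint32_t sectorsize)
{
    return num_bytes & ~((uint64_t)sectorsize - 1);
}

/* Size of the next reservation request for a large preallocation. */
static uint64_t next_chunk(uint64_t remaining)
{
    return remaining < MAX_PREALLOC_CHUNK ? remaining : MAX_PREALLOC_CHUNK;
}
```

For a request slightly over 1TiB, the buggy mask keeps only the low 32 bits, which is how a huge request can collapse into an 8192-byte extent. The cap value 268435456 matches the `nr` field in the post-fix extents listed above.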
    • L
      Btrfs: fix cleaner thread not working with inode cache option · fa6ac876
      Liu Bo committed
      Right now the inode cache inode is treated the same as the space cache
      inode, i.e. the inode is kept in memory until the super block is put.

      But this leads to an awkward situation.

      If we're going to delete a snapshot/subvolume, btrfs will not
      actually delete it and return free space, but will only add it to the dead
      roots list when the last inode on this snap/subvol is destroyed.
      Then we'll fetch deleted roots and clean them up via the cleaner thread.

      So here is the problem: if we enable the inode cache option, each
      snap/subvol has a cache inode which is used to store inode allocation
      information.  This cache inode is kept in memory, as described
      above.  So with inode cache, a snap/subvol can only be added to the
      dead roots list during the freeing roots stage in umount, so we can
      ONLY get space back after another remount (we clean up dead roots on mount).

      But the real thing is that we no longer use the snap/subvol once we mark it
      deleted, so we can safely iput its cache inode when we delete the snap/subvol.

      Another thing is that we need to change the rules for dropping inodes: we
      don't keep a snap/subvol's cache inode in memory until the end, so that we can
      add the snap/subvol to the dead roots list in time.
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      fa6ac876
    • M
      Btrfs: implement unlocked dio write · 38851cc1
      Miao Xie committed
      This idea is from ext4. With this patch, we can make dio writes parallel
      and improve performance. But because we cannot update isize without
      i_mutex, the unlocked dio write can only be done in front of EOF.

      We needn't worry about the race between dio write and truncate, because
      truncate needs to wait until all the dio writes end.

      And we also needn't worry about the race between dio write and punch hole,
      because we have the extent lock to protect our operation.
      
      I ran fio to test the performance of this feature.
      
      == Hardware ==
      CPU: Intel(R) Core(TM)2 Duo CPU     E7500  @ 2.93GHz
      Mem: 2GB
      SSD: Intel X25-M 120GB (Test Partition: 60GB)
      
      == config file ==
      [global]
      ioengine=psync
      direct=1
      bs=4k
      size=32G
      runtime=60
      directory=/mnt/btrfs/
      filename=testfile
      group_reporting
      thread
      
      [file1]
      numjobs=1 # 2 4
      rw=randwrite
      
      == result (KBps) ==
      write	1	2	4
      lock	24936	24738	24726
      nolock	24962	30866	32101
      
      == result (iops) ==
      write	1	2	4
      lock	6234	6184	6181
      nolock	6240	7716	8025
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      38851cc1
    • M
      Btrfs: serialize unlocked dio reads with truncate · 2e60a51e
      Miao Xie committed
      Currently, we can do unlocked dio reads, but the following race
      is possible:
      
      dio_read_task			truncate_task
      				->btrfs_setattr()
      ->btrfs_direct_IO
          ->__blockdev_direct_IO
            ->btrfs_get_block
      				  ->btrfs_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to avoid this problem, we must serialize unlocked dio reads with
      truncate. There are two approaches:
      - use extent lock to protect the extent that we truncate
      - use inode_dio_wait() to make sure the truncating task will wait for
        the read DIO.
      
      If we use the 1st one, we will hit the endless truncation problem due to
      the nonlocked read DIO after we implement the nonlocked write DIO. This is
      because we still need to invoke inode_dio_wait() to avoid the race between write
      DIO and truncation. At that point, we would have to introduce

        btrfs_inode_{block, resume}_nolock_dio()

      again. That is, we would have to implement this patch again, so I chose the 2nd
      way to fix the problem.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      2e60a51e
    • M
      Btrfs: fix deadlock due to unsubmitted · 0934856d
      Miao Xie committed
      The deadlock problem happened when running fsstress (a test program in LTP).
      
      Steps to reproduce:
       # mkfs.btrfs -b 100M <partition>
       # mount <partition> <mnt>
       # <Path>/fsstress -p 3 -n 10000000 -d <mnt>
      
      The reason is:
      btrfs_direct_IO()
       |->do_direct_IO()
           |->get_page()
           |->get_blocks()
           |	 |->btrfs_delalloc_reserve_space()
           |	 |->btrfs_add_ordered_extent() -------	Add a new ordered extent
           |->dio_send_cur_page(page0) --------------	We didn't submit bio here
           |->get_page()
           |->get_blocks()
      	 |->btrfs_delalloc_reserve_space()
      	     |->flush_space()
      		 |->btrfs_start_ordered_extent()
      		     |->wait_event() ----------	Wait the completion of
      						the ordered extent that is
      						mentioned above
      
      But because we didn't submit the bio that is mentioned above, the ordered
      extent can not complete, we would wait for its completion forever.
      
      There are two methods which can fix this deadlock problem:
      1. submit the bio before we invoke get_blocks()
      2. reserve the space before we do dio
      
      Though the 1st is the simplest way, we would need to modify the VFS code, and it
      is likely to break contiguous requests and introduce a performance regression
      for the other filesystems.

      So we have to choose the 2nd way.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      0934856d
    • J
      Btrfs: cleanup orphan reservation if truncate fails · 4a7d0f68
      Josef Bacik committed
      I noticed we were getting lots of warnings with xfstest 83 because we have
      reservations outstanding.  This is because we moved the orphan add outside
      of the truncate, but we don't actually clean up our reservation if something
      fails.  This fixes the problem and I no longer see warnings.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      4a7d0f68
    • J
      Btrfs: steal from global reserve if we are cleaning up orphans · 5d80366e
      Josef Bacik committed
      Sometimes xfstest 83 will fail to remount the scratch device because we've
      gotten ourselves so full that we cannot clean up the orphan items.  In this
      case, check whether we're doing the orphan cleanup and, if we are, allow us
      to steal our reservation from the global block rsv.  With this patch I've
      not been able to reproduce the failed mount problem.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      5d80366e
    • J
      Btrfs: handle errors in compression submission path · 3e04e7f1
      Josef Bacik committed
      I noticed we would deadlock if we aborted a transaction while doing
      compressed io.  This is because we don't unlock our pages if something goes
      horribly wrong.  To fix this we need to make sure that we call
      extent_clear_unlock_delalloc in order to unlock all the pages.  If we have
      to cow in the async submission thread, we need to make sure to unlock our
      locked_page, because the cow error path will not unlock it and instead
      depends on the caller to do so.  With this patch we no longer
      deadlock on the page lock when we have an aborted transaction.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      3e04e7f1
    • J
      Btrfs: account for orphan inodes properly during cleanup · 925396ec
      Josef Bacik committed
      Dave sent me a panic where we were doing the orphan cleanup and panicked
      trying to release our reservation from the orphan block rsv.  The reason is
      that our orphan block rsv had been freed out from underneath us
      because the transaction commit found that there were no orphan inodes
      according to its count and decided to free it.  This is incorrect, so make
      sure we inc the orphan inodes count so the accounting is all done properly.
      This would also normally cause the warning in the orphan commit code if you
      had any orphans to clean up, as they would only decrement the orphan count, so
      you'd get a negative orphan count which could cause problems during runtime.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      925396ec
    • J
      Btrfs: unreserve space if our ordered extent fails to work · 0bec9ef5
      Josef Bacik committed
      When a transaction aborts, or there's an EIO on an ordered extent, or any
      error really, we will not free up the space we reserved for this ordered
      extent.  This results in warnings from the block group cache cleanup in the
      case of a transaction abort, or leaking space in the case of EIO on an
      ordered extent.  Fix this up by freeing the reserved space if we have an
      error at all trying to complete an ordered extent.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      0bec9ef5
    • M
      Btrfs: use the inode own lock to protect its delalloc_bytes · df0af1a5
      Miao Xie committed
      We need not use a global lock to protect the delalloc_bytes of the
      inode; its own lock is enough. This reduces lock
      contention, and ->delalloc_lock then protects only the delalloc inode
      list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      df0af1a5
    • M
      Btrfs: use percpu counter for fs_info->delalloc_bytes · 963d678b
      Miao Xie committed
      fs_info->delalloc_bytes is accessed very frequently, so use a percpu
      counter instead of the u64 variable for it to reduce lock
      contention.

      This patch also fixes the problem that we accessed the variable
      without lock protection. At worst, we would not flush the
      delalloc inodes, and would just return an ENOSPC error when we still had
      some free space in the fs.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      963d678b
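The general idea behind replacing one hot counter with a per-CPU one can be sketched in userspace with C11 atomics. This is not the kernel's percpu_counter implementation: the shard count, the struct, and the function names are assumptions made for illustration, and a real version would pad each shard to its own cache line.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NR_SHARDS 8 /* stand-in for the number of CPUs (assumption) */

/* One slot per "CPU": writers touch only their own slot, so they do not
 * contend on a single shared variable or lock. */
struct sharded_counter {
    _Atomic int64_t shard[NR_SHARDS];
};

static void sharded_init(struct sharded_counter *c)
{
    for (int i = 0; i < NR_SHARDS; i++)
        atomic_init(&c->shard[i], 0);
}

/* Add a (possibly negative) delta to the slot chosen by the caller's CPU id. */
static void sharded_add(struct sharded_counter *c, unsigned int cpu,
                        int64_t delta)
{
    atomic_fetch_add_explicit(&c->shard[cpu % NR_SHARDS], delta,
                              memory_order_relaxed);
}

/* Reading is the slow path: it walks every slot. Under concurrent updates
 * the result is a point-in-time approximation. */
static int64_t sharded_sum(struct sharded_counter *c)
{
    int64_t total = 0;
    for (int i = 0; i < NR_SHARDS; i++)
        total += atomic_load_explicit(&c->shard[i], memory_order_relaxed);
    return total;
}
```

The trade-off is cheap updates in exchange for a more expensive read, which fits a value that is updated far more often than it is summed.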
  9. 20 Feb, 2013 7 commits
  10. 02 Feb, 2013 2 commits
    • D
      Btrfs: RAID5 and RAID6 · 53b381b3
      David Woodhouse committed
      This builds on David Woodhouse's original Btrfs raid5/6 implementation.
      The code has changed quite a bit, blame Chris Mason for any bugs.
      
      Read/modify/write is done after the higher levels of the filesystem have
      prepared a given bio.  This means the higher layers are not responsible
      for building full stripes, and they don't need to query for the topology
      of the extents that may get allocated during delayed allocation runs.
      It also means different files can easily share the same stripe.
      
      But, it does expose us to incorrect parity if we crash or lose power
      while doing a read/modify/write cycle.  This will be addressed in a
      later commit.
      
      Scrub is unable to repair crc errors on raid5/6 chunks.
      
      Discard does not work on raid5/6 (yet)
      
      The stripe size is fixed at 64KiB per disk.  This will be tunable
      in a later commit.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      53b381b3
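For the single-parity (RAID5) case, the read/modify/write cycle mentioned above rests on a simple XOR identity: the new parity equals the old parity XOR the old data XOR the new data. The sketch below illustrates only that identity; it is not the btrfs code, the function names are hypothetical, and the RAID6 second syndrome is not covered.

```c
#include <stddef.h>
#include <stdint.h>

/* Full-stripe parity: byte-wise XOR across all data blocks. */
static void parity_full(uint8_t *parity, const uint8_t *const *data,
                        size_t ndata, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndata; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* Read/modify/write: fold the change to one data block into the existing
 * parity without reading the other data blocks. */
static void parity_rmw(uint8_t *parity, const uint8_t *old_data,
                       const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Because the data block and the parity block are written separately, a crash between the two writes leaves them inconsistent, which is the exposure the commit message describes.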
    • D
      Btrfs: add rw argument to merge_bio_hook() · 64a16701
      David Woodhouse committed
      We'll want to merge writes so they can fill a full RAID[56] stripe, but
      not necessarily reads.
      Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      64a16701
  11. 25 Jan, 2013 1 commit
  12. 15 Jan, 2013 4 commits
  13. 21 Dec, 2012 1 commit
  14. 18 Dec, 2012 2 commits
    • L
      Btrfs: fix a bug of per-file nocow · 213490b3
      Liu Bo committed
      Users reported a bug; the reproducer is:
      $ mkfs.btrfs /dev/loop0
      $ mount /dev/loop0 /mnt/btrfs/
      $ mkdir /mnt/btrfs/dir
      $ chattr +C /mnt/btrfs/dir/
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
      $ lsattr /mnt/btrfs/dir/foo
      ---------------C- /mnt/btrfs/dir/foo
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 1 extent found    ---> an extent
      $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
      $ filefrag /mnt/btrfs/dir/foo
      /mnt/btrfs/dir/foo: 3 extents found   ---> with nocow, btrfs breaks the extent into three parts
      
      The newly created file should not only inherit the NODATACOW flag, but also
      honor the NODATASUM flag, because we must do COW on a file extent with a checksum.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      213490b3
    • C
      Btrfs: fix hash overflow handling · 9c52057c
      Chris Mason committed
      The handling for directory crc hash overflows was fairly obscure,
      split_leaf returns EOVERFLOW when we try to extend the item and that is
      supposed to bubble up to userland.  For a while it did so, but along the
      way we added better handling of errors and forced the FS readonly if we
      hit IO errors during the directory insertion.
      
      Along the way, we started testing only for EEXIST and the EOVERFLOW case
      was dropped.  The end result is that we may force the FS readonly if we
      catch a directory hash bucket overflow.
      
      This fixes a few problem spots.  First I add tests for EOVERFLOW in the
      places where we can safely just return the error up the chain.
      
      btrfs_rename is harder though, because it tries to insert the new
      directory item only after it has already unlinked anything the rename
      was going to overwrite.  Rather than adding very complex logic, I added
      a helper to test for the hash overflow case early while it is still safe
      to bail out.
      
      Snapshot and subvolume creation had a similar problem, so they are using
      the new helper now too.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      Reported-by: Pascal Junod <pascal@junod.info>
      9c52057c
  15. 17 Dec, 2012 1 commit