1. 27 2月, 2013 1 次提交
    • Q
      btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo 提交于
      Though most of the btrfs codes are using ALIGN macro for page alignment,
      there are still some codes using open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or even hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes these open-coded alignment is not so easy to understand for
      newbie like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for a
      better readability.
      
      Also there is a previous patch from David Sterba with similar changes,
      but the patch is for 3.2 kernel and seems not merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fda2832f
  2. 21 2月, 2013 3 次提交
    • M
      Btrfs: fix remount vs autodefrag · dc81cdc5
      Miao Xie 提交于
      If we remount the fs to close the auto defragment or make the fs R/O,
      we should stop the auto defragment.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      dc81cdc5
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
    • M
      Btrfs: use bit operation for ->fs_state · 87533c47
      Miao Xie 提交于
      There is no lock to protect fs_info->fs_state, it will introduce
      some problems, such as the value may be covered by the other task
      when several tasks modify it. For example:
      	Task0 - CPU0		Task1 - CPU1
      	mov %fs_state rax
      	or $0x1 rax
      				mov %fs_state rax
      				or $0x2 rax
      	mov rax %fs_state
      				mov rax %fs_state
      The expected value is 3, but in fact, it is 2.
      
      Though this problem doesn't happen now (because there is only one
      flag currently), the code is error prone, if we add other flags,
      the above problem will happen to a certainty.
      
      Now we use bit operation for it to fix the above problem.
      In this way, we can make the code more robust and be easy to
      add new flags.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      87533c47
  3. 20 2月, 2013 2 次提交
  4. 06 2月, 2013 2 次提交
    • L
      Btrfs: fix race between snapshot deletion and getting inode · 6f1c3605
      Liu Bo 提交于
      While running snapshot testscript created by Mitch and David,
      the race between autodefrag and snapshot deletion can lead to
      corruption of dead_root list so that we can get crash on
      btrfs_clean_old_snapshots().
      
      And besides autodefrag, scrub also does the same thing, ie. read
      root first and get inode.
      
      Here is the story(take autodefrag as an example):
      (1) when we delete a snapshot or subvolume, it will set its root's
      refs to zero and do a iput() on its own inode, and if this inode happens
      to be the only active in-meory one in root's inode rbtree, it will add
      itself to the global dead_roots list for later cleanup.
      
      (2) after (1), the autodefrag thread may read another inode for defrag
      and the inode is just in the deleted snapshot/subvolume, but all of these
      are without checking if the root is still valid(refs > 0).  So the end up
      result is adding the deleted snapshot/subvolume's root to the global
      dead_roots list AGAIN.
      
      Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
      
      So all we need to do is to take the lock to protect 'read root and get inode',
      since we synchronize to wait for the rcu grace period before adding something
      to the global dead_roots list.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6f1c3605
    • M
      Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() · 0a3404dc
      Miao Xie 提交于
      If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
      decrease ->sync_writers, because we have not increased it. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0a3404dc
  5. 15 1月, 2013 2 次提交
  6. 18 12月, 2012 1 次提交
  7. 17 12月, 2012 13 次提交
  8. 13 12月, 2012 3 次提交
  9. 12 12月, 2012 1 次提交
  10. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  11. 04 10月, 2012 1 次提交
    • J
      Btrfs: fix punch hole when no extent exists · c3308f84
      Josef Bacik 提交于
      I saw the warning in btrfs_drop_extent_cache where our end is less than our
      start while running xfstests 68 in a loop.  This is because we
      unconditionally do drop_end = min(end, extent_end) in
      __btrfs_drop_extents(), even though we may not have found an extent in the
      range we were looking to drop.  So keep track of wether or not we found
      something, and if we didn't just use our end.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c3308f84
  12. 02 10月, 2012 10 次提交
    • M
      Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" · 90abccf2
      Miao Xie 提交于
      This reverts commit 0885ef5b
      
      After applying the above patch, the performance slowed down because the dirty
      page flush can only be done by one task, so revert it.
      
      The following is the test result of sysbench:
      	Before		After
      	24MB/s		39MB/s
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      90abccf2
    • L
      Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo 提交于
      We're going to use this flag EXTENT_DEFRAG to indicate which range
      belongs to defragment so that we can implement snapshow-aware defrag:
      
      We set the EXTENT_DEFRAG flag when dirtying the extents that need
      defragmented, so later on writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      9e8a4a8b
    • M
      Btrfs: fix wrong size for the reservation when doing, file pre-allocation. · 903889f4
      Miao Xie 提交于
      When we ran fsstress(a program in xfstests), the filesystem hung up when it
      is full. It was because the space reserved in btrfs_fallocate() was wrong,
      btrfs_fallocate() just used the size of the pre-allocation to reserve the
      space, didn't took the block size aligning into account, so the size of
      the reserved space was less than the allocated space, it caused the over
      reserve problem and made the filesystem hung up when invoking cow_file_range().
      Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      903889f4
    • M
      Btrfs: fix unprotected ->log_batch · 2ecb7923
      Miao Xie 提交于
      We forget to protect ->log_batch when syncing a file, this patch fix
      this problem by atomic operation. And ->log_batch is used to check
      if there are parallel sync operations or not, so it is unnecessary to
      reset it to 0 after the sync operation of the current log tree complete.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      2ecb7923
    • M
      Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie 提交于
      Sometimes we need choose the method of the reservation according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Now we identify the type just by comparing the address of the reservation
      variants, it is very ugly if it is a temporary one because we need compare it
      with all the common reservation variants. So we add a new "type" field to keep
      the type the reservation variants.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      66d8f3dd
    • J
      Btrfs: btrfs_drop_extent_cache should never fail · 7014cdb4
      Josef Bacik 提交于
      I noticed this when I was doing the fsync stuff, we allocate split extents if we
      drop an extent range that is in the middle of an existing extent.  This BUG()'s
      if we fail to allocate memory, but the fact is this is just a cache, we will
      just regenerate the cache if we need it, the important part is that we free the
      range we are given.  This can be done without allocations, so if we fail to
      allocate splits just skip the splitting stage and free our em and look for more
      extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
      was checking the return value anyway.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7014cdb4
    • J
      Btrfs: add hole punching · 2aaa6655
      Josef Bacik 提交于
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2aaa6655
    • J
      Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik 提交于
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so lets just remove it.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2671485d
    • J
      Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik 提交于
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worst yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      5dc562c5
    • J
      Btrfs: fix possible corruption when fsyncing written prealloced extents · 224ecce5
      Josef Bacik 提交于
      While working on my fsync patch my fsync tester kept hitting mismatching
      md5sums when I would randomly write to a prealloc'ed region, syncfs() and
      then write to the prealloced region some more and then fsync() and then
      immediately reboot.  This is because the tree logging code will skip writing
      csums for file extents who's generation is less than the current running
      transaction.  When we mark extents as written we haven't been updating their
      generation so they were always being skipped.  This wouldn't happen if you
      were to preallocate and then write in the same transaction, but if you for
      example prealloced a VM you could definitely run into this problem.  This
      patch makes my fsync tester happy again.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      224ecce5