1. 06 2月, 2013 2 次提交
    • L
      Btrfs: fix race between snapshot deletion and getting inode · 6f1c3605
      Liu Bo 提交于
      While running snapshot testscript created by Mitch and David,
      the race between autodefrag and snapshot deletion can lead to
      corruption of dead_root list so that we can get crash on
      btrfs_clean_old_snapshots().
      
      And besides autodefrag, scrub also does the same thing, ie. read
      root first and get inode.
      
      Here is the story(take autodefrag as an example):
      (1) when we delete a snapshot or subvolume, it will set its root's
      refs to zero and do a iput() on its own inode, and if this inode happens
      to be the only active in-meory one in root's inode rbtree, it will add
      itself to the global dead_roots list for later cleanup.
      
      (2) after (1), the autodefrag thread may read another inode for defrag
      and the inode is just in the deleted snapshot/subvolume, but all of these
      are without checking if the root is still valid(refs > 0).  So the end up
      result is adding the deleted snapshot/subvolume's root to the global
      dead_roots list AGAIN.
      
      Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
      
      So all we need to do is to take the lock to protect 'read root and get inode',
      since we synchronize to wait for the rcu grace period before adding something
      to the global dead_roots list.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      6f1c3605
    • M
      Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() · 0a3404dc
      Miao Xie 提交于
      If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
      decrease ->sync_writers, because we have not increased it. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0a3404dc
  2. 15 1月, 2013 2 次提交
  3. 17 12月, 2012 13 次提交
  4. 13 12月, 2012 3 次提交
  5. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  6. 04 10月, 2012 1 次提交
    • J
      Btrfs: fix punch hole when no extent exists · c3308f84
      Josef Bacik 提交于
      I saw the warning in btrfs_drop_extent_cache where our end is less than our
      start while running xfstests 68 in a loop.  This is because we
      unconditionally do drop_end = min(end, extent_end) in
      __btrfs_drop_extents(), even though we may not have found an extent in the
      range we were looking to drop.  So keep track of wether or not we found
      something, and if we didn't just use our end.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      c3308f84
  7. 02 10月, 2012 10 次提交
    • M
      Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" · 90abccf2
      Miao Xie 提交于
      This reverts commit 0885ef5b
      
      After applying the above patch, the performance slowed down because the dirty
      page flush can only be done by one task, so revert it.
      
      The following is the test result of sysbench:
      	Before		After
      	24MB/s		39MB/s
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      90abccf2
    • L
      Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Liu Bo 提交于
      We're going to use this flag EXTENT_DEFRAG to indicate which range
      belongs to defragment so that we can implement snapshow-aware defrag:
      
      We set the EXTENT_DEFRAG flag when dirtying the extents that need
      defragmented, so later on writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      9e8a4a8b
    • M
      Btrfs: fix wrong size for the reservation when doing, file pre-allocation. · 903889f4
      Miao Xie 提交于
      When we ran fsstress(a program in xfstests), the filesystem hung up when it
      is full. It was because the space reserved in btrfs_fallocate() was wrong,
      btrfs_fallocate() just used the size of the pre-allocation to reserve the
      space, didn't took the block size aligning into account, so the size of
      the reserved space was less than the allocated space, it caused the over
      reserve problem and made the filesystem hung up when invoking cow_file_range().
      Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      903889f4
    • M
      Btrfs: fix unprotected ->log_batch · 2ecb7923
      Miao Xie 提交于
      We forget to protect ->log_batch when syncing a file, this patch fix
      this problem by atomic operation. And ->log_batch is used to check
      if there are parallel sync operations or not, so it is unnecessary to
      reset it to 0 after the sync operation of the current log tree complete.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      2ecb7923
    • M
      Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Miao Xie 提交于
      Sometimes we need choose the method of the reservation according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Now we identify the type just by comparing the address of the reservation
      variants, it is very ugly if it is a temporary one because we need compare it
      with all the common reservation variants. So we add a new "type" field to keep
      the type the reservation variants.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      66d8f3dd
    • J
      Btrfs: btrfs_drop_extent_cache should never fail · 7014cdb4
      Josef Bacik 提交于
      I noticed this when I was doing the fsync stuff, we allocate split extents if we
      drop an extent range that is in the middle of an existing extent.  This BUG()'s
      if we fail to allocate memory, but the fact is this is just a cache, we will
      just regenerate the cache if we need it, the important part is that we free the
      range we are given.  This can be done without allocations, so if we fail to
      allocate splits just skip the splitting stage and free our em and look for more
      extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
      was checking the return value anyway.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7014cdb4
    • J
      Btrfs: add hole punching · 2aaa6655
      Josef Bacik 提交于
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2aaa6655
    • J
      Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik 提交于
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so lets just remove it.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2671485d
    • J
      Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik 提交于
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worst yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      5dc562c5
    • J
      Btrfs: fix possible corruption when fsyncing written prealloced extents · 224ecce5
      Josef Bacik 提交于
      While working on my fsync patch my fsync tester kept hitting mismatching
      md5sums when I would randomly write to a prealloc'ed region, syncfs() and
      then write to the prealloced region some more and then fsync() and then
      immediately reboot.  This is because the tree logging code will skip writing
      csums for file extents who's generation is less than the current running
      transaction.  When we mark extents as written we haven't been updating their
      generation so they were always being skipped.  This wouldn't happen if you
      were to preallocate and then write in the same transaction, but if you for
      example prealloced a VM you could definitely run into this problem.  This
      patch makes my fsync tester happy again.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      224ecce5
  8. 31 7月, 2012 1 次提交
    • J
      btrfs: Convert to new freezing mechanism · b2b5ef5c
      Jan Kara 提交于
      We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
      freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
      the transaction mechanism to avoid starting transactions on frozen filesystem.
      At minimum this is necessary to stop iput() of unlinked file to change frozen
      filesystem during truncation.
      
      Checks in cleaner_kthread() and transaction_kthread() can be safely removed
      since btrfs_freeze() will lock the mutexes and thus block the threads (and they
      shouldn't have anything to do anyway).
      
      CC: linux-btrfs@vger.kernel.org
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b2b5ef5c
  9. 03 7月, 2012 1 次提交
    • J
      Btrfs: fix dio write vs buffered read race · c3473e83
      Josef Bacik 提交于
      Miao pointed out there's a problem with mixing dio writes and buffered
      reads.  If the read happens between us invalidating the page range and
      actually locking the extent we can bring in pages into page cache.  Then
      once the write finishes if somebody tries to read again it will just find
      uptodate pages and we'll read stale data.  So we need to lock the extent and
      check for uptodate bits in the range.  If there are uptodate bits we need to
      unlock and invalidate again.  This will keep this race from happening since
      we will hold the extent locked until we create the ordered extent, and then
      teh read side always waits for ordered extents.  There was also a race in
      how we updated i_size, previously we were relying on the generic DIO stuff
      to adjust the i_size after the DIO had completed, but this happens outside
      of the extent lock which means reads could come in and not see the updated
      i_size.  So instead move this work into where we create the extents, and
      then this way the update ordered i_size stuff works properly in the endio
      handlers.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c3473e83
  10. 02 6月, 2012 1 次提交
  11. 30 5月, 2012 5 次提交
    • J
      Btrfs: check to see if the inode is in the log before fsyncing · 22ee6985
      Josef Bacik 提交于
      We have this check down in the actual logging code, but this is after we
      start a transaction and all that good stuff.  So move the helper
      inode_in_log() out so we can call it in fsync() and avoid starting a
      transaction altogether and just exit if we've already fsync()'ed this file
      recently.  You would notice this issue if you fsync()'ed a file over and
      over again until the transaction committed.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      22ee6985
    • M
      Btrfs: fix the same inode id problem when doing auto defragment · 762f2263
      Miao Xie 提交于
      Two files in the different subvolumes may have the same inode id, so
      The rb-tree which is used to manage the defragment object must take it
      into account. This patch fix this problem.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      762f2263
    • J
      Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Josef Bacik 提交于
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      it's own thing so it can be treated normally.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      72ac3c0d
    • J
      Btrfs: do not do filemap_write_and_wait_range in fsync · 0885ef5b
      Josef Bacik 提交于
      We already do the btrfs_wait_ordered_range which will do this for us, so
      just remove this call so we don't call it twice.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0885ef5b
    • J
      Btrfs: use i_version instead of our own sequence · 0c4d2d95
      Josef Bacik 提交于
      We've been keeping around the inode sequence number in hopes that somebody
      would use it, but nobody uses it and people actually use i_version which
      serves the same purpose, so use i_version where we used the incore inode's
      sequence number and that way the sequence is updated properly across the
      board, and not just in file write.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0c4d2d95