1. 09 Oct, 2012 · 1 commit
    • mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Committed by Konstantin Khlebnikov
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support;
      if a filesystem uses filemap_fault(), then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Oct, 2012 · 1 commit
    • Btrfs: fix punch hole when no extent exists · c3308f84
      Committed by Josef Bacik
      I saw the warning in btrfs_drop_extent_cache where our end is less than our
      start while running xfstests 68 in a loop.  This is because we
      unconditionally do drop_end = min(end, extent_end) in
      __btrfs_drop_extents(), even though we may not have found an extent in the
      range we were looking to drop.  So keep track of whether or not we found
      something, and if we didn't, just use our end.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
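The clamping problem described in this commit can be illustrated with a small user-space Python sketch. It is not the kernel code: `compute_drop_end` and the `found_extent` flag are hypothetical names chosen to mirror the commit message, and the offsets are made-up values.

```python
def compute_drop_end(end, extent_end, found_extent):
    """Pick the end offset of the range to drop from the extent cache.

    Hypothetical helper; the argument names follow the commit message.
    """
    if found_extent:
        # An extent was found, so extent_end is meaningful: clamp to it.
        return min(end, extent_end)
    # Nothing was found, so extent_end is stale: use the caller's end.
    return end


# Dropping [4096, 8192) when no extent exists and extent_end is still 0.
start, end, stale_extent_end = 4096, 8192, 0
buggy = min(end, stale_extent_end)  # the unconditional clamp
fixed = compute_drop_end(end, stale_extent_end, found_extent=False)
```

With the unconditional clamp, `buggy` ends up below `start`, which is the "end is less than start" condition the warning reports; the guarded version keeps the caller's `end`.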
  3. 02 Oct, 2012 · 10 commits
    • Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" · 90abccf2
      Committed by Miao Xie
      This reverts commit 0885ef5b
      
      After applying the above patch, the performance slowed down because the dirty
      page flush can only be done by one task, so revert it.
      
      The following is the test result of sysbench:
      	Before		After
      	24MB/s		39MB/s
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag · 9e8a4a8b
      Committed by Liu Bo
      We're going to use this flag EXTENT_DEFRAG to indicate which ranges
      belong to a defragment operation so that we can implement snapshot-aware defrag:
      
      We set the EXTENT_DEFRAG flag when dirtying the extents that need to be
      defragmented, so later on the writeback thread can differentiate between
      normal writeback and writeback started by defragmentation.
      Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
    • Btrfs: fix wrong size for the reservation when doing, file pre-allocation. · 903889f4
      Committed by Miao Xie
      When we ran fsstress (a program in xfstests), the filesystem hung when it
      was full. This was because the space reserved in btrfs_fallocate() was wrong:
      btrfs_fallocate() used only the size of the pre-allocation to reserve the
      space and didn't take block-size alignment into account, so the reserved
      space was smaller than the allocated space. That caused the over-reserve
      problem and made the filesystem hang when invoking cow_file_range().
      Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
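The alignment arithmetic behind this fix can be sketched as follows. This is a user-space illustration only: `reserve_bytes_for_prealloc` is a hypothetical name, and the 4096-byte block size is an assumed default rather than something stated in the commit.

```python
def reserve_bytes_for_prealloc(offset, length, blocksize=4096):
    """Bytes that must be reserved to preallocate [offset, offset + length).

    Hypothetical helper: the range is widened to block boundaries before
    its size is taken, so the reservation covers every block touched.
    """
    if blocksize <= 0 or blocksize & (blocksize - 1):
        raise ValueError("blocksize must be a positive power of two")
    mask = blocksize - 1
    aligned_start = offset & ~mask                  # round down
    aligned_end = (offset + length + mask) & ~mask  # round up
    return aligned_end - aligned_start


# A 200-byte preallocation that straddles a block boundary touches two blocks.
unaligned = 200
aligned = reserve_bytes_for_prealloc(4000, 200)
```

Reserving only `unaligned` bytes here would fall well short of the two full blocks that `aligned` accounts for, which is the kind of mismatch the commit message describes.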
    • Btrfs: fix unprotected ->log_batch · 2ecb7923
      Committed by Miao Xie
      We forgot to protect ->log_batch when syncing a file; this patch fixes
      the problem by using atomic operations. Also, ->log_batch is used to check
      whether there are parallel sync operations, so it is unnecessary to
      reset it to 0 after the sync operation of the current log tree completes.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: add a new "type" field into the block reservation structure · 66d8f3dd
      Committed by Miao Xie
      Sometimes we need to choose the reservation method according to the type
      of the block reservation, such as the reservation for the delayed inode update.
      Currently we identify the type just by comparing the address of the reservation
      variants, which is very ugly for a temporary one because we need to compare it
      with all the common reservation variants. So we add a new "type" field to keep
      the type of the reservation variants.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
    • Btrfs: btrfs_drop_extent_cache should never fail · 7014cdb4
      Committed by Josef Bacik
      I noticed this when I was doing the fsync stuff, we allocate split extents if we
      drop an extent range that is in the middle of an existing extent.  This BUG()'s
      if we fail to allocate memory, but the fact is this is just a cache, we will
      just regenerate the cache if we need it, the important part is that we free the
      range we are given.  This can be done without allocations, so if we fail to
      allocate splits just skip the splitting stage and free our em and look for more
      extents to drop.  This also makes btrfs_drop_extent_cache a void since nobody
      was checking the return value anyway.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: add hole punching · 2aaa6655
      Committed by Josef Bacik
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Committed by Josef Bacik
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so let's just remove it.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: turbo charge fsync · 5dc562c5
      Committed by Josef Bacik
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
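The core idea of the fast path (log only the extents modified in the current transaction, in sorted order) can be sketched in user-space Python. The `Extent` class, its `generation` field, and `extents_to_log` are hypothetical names for illustration; the real tree-logging code is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    start: int        # file offset of the extent
    length: int       # length in bytes
    generation: int   # transid that last modified this extent


def extents_to_log(extents, current_transid):
    """Return only the extents touched in the current transaction, sorted.

    Extents from older transactions are skipped: per the commit message,
    only what was modified in the running transaction needs to be logged.
    """
    modified = [e for e in extents if e.generation == current_transid]
    return sorted(modified, key=lambda e: e.start)


# A fragmented file where only two extents changed in transaction 12.
extents = [
    Extent(start=8192, length=4096, generation=12),
    Extent(start=0, length=4096, generation=9),
    Extent(start=4096, length=4096, generation=12),
    Extent(start=12288, length=4096, generation=10),
]
to_log = extents_to_log(extents, current_transid=12)
```

For a VM image with many extents and only a few recent writes, the list to log stays proportional to what changed rather than to the size of the file.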
    • Btrfs: fix possible corruption when fsyncing written prealloced extents · 224ecce5
      Committed by Josef Bacik
      While working on my fsync patch my fsync tester kept hitting mismatching
      md5sums when I would randomly write to a prealloc'ed region, syncfs() and
      then write to the prealloced region some more and then fsync() and then
      immediately reboot.  This is because the tree logging code will skip writing
      csums for file extents whose generation is less than the current running
      transaction.  When we mark extents as written we haven't been updating their
      generation so they were always being skipped.  This wouldn't happen if you
      were to preallocate and then write in the same transaction, but if you for
      example prealloced a VM you could definitely run into this problem.  This
      patch makes my fsync tester happy again.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
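The generation check that caused the skipped csums can be modeled with a short Python sketch. The dictionary layout and the names `needs_csum_logged` and `mark_written` are assumptions made for illustration and do not correspond to actual kernel functions.

```python
def needs_csum_logged(extent, running_transid):
    """Tree logging skips csums for extents older than the running transaction."""
    return extent["generation"] >= running_transid


def mark_written(extent, running_transid, update_generation=True):
    """Convert a preallocated extent to a written one.

    With update_generation=False this models the old behavior, where the
    extent kept the generation of the transaction that preallocated it.
    """
    extent["prealloc"] = False
    if update_generation:
        extent["generation"] = running_transid


# Preallocated in transaction 5, written in transaction 7.
old = {"generation": 5, "prealloc": True}
new = {"generation": 5, "prealloc": True}
mark_written(old, running_transid=7, update_generation=False)
mark_written(new, running_transid=7)
```

In this model the `old` extent is skipped by the csum check even though it was just written, while the `new` one is picked up, matching the behavior the commit message describes.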
  4. 31 Jul, 2012 · 1 commit
    • btrfs: Convert to new freezing mechanism · b2b5ef5c
      Committed by Jan Kara
      We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
      freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
      the transaction mechanism to avoid starting transactions on frozen filesystem.
      At minimum this is necessary to stop iput() of unlinked file to change frozen
      filesystem during truncation.
      
      Checks in cleaner_kthread() and transaction_kthread() can be safely removed
      since btrfs_freeze() will lock the mutexes and thus block the threads (and they
      shouldn't have anything to do anyway).
      
      CC: linux-btrfs@vger.kernel.org
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 03 Jul, 2012 · 1 commit
    • Btrfs: fix dio write vs buffered read race · c3473e83
      Committed by Josef Bacik
      Miao pointed out there's a problem with mixing dio writes and buffered
      reads.  If the read happens between us invalidating the page range and
      actually locking the extent we can bring in pages into page cache.  Then
      once the write finishes if somebody tries to read again it will just find
      uptodate pages and we'll read stale data.  So we need to lock the extent and
      check for uptodate bits in the range.  If there are uptodate bits we need to
      unlock and invalidate again.  This will keep this race from happening since
      we will hold the extent locked until we create the ordered extent, and then
      the read side always waits for ordered extents.  There was also a race in
      how we updated i_size, previously we were relying on the generic DIO stuff
      to adjust the i_size after the DIO had completed, but this happens outside
      of the extent lock which means reads could come in and not see the updated
      i_size.  So instead move this work into where we create the extents, and
      then this way the update ordered i_size stuff works properly in the endio
      handlers.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  6. 02 Jun, 2012 · 1 commit
  7. 30 May, 2012 · 5 commits
    • Btrfs: check to see if the inode is in the log before fsyncing · 22ee6985
      Committed by Josef Bacik
      We have this check down in the actual logging code, but this is after we
      start a transaction and all that good stuff.  So move the helper
      inode_in_log() out so we can call it in fsync() and avoid starting a
      transaction altogether and just exit if we've already fsync()'ed this file
      recently.  You would notice this issue if you fsync()'ed a file over and
      over again until the transaction committed.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
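The effect of checking the log before starting a transaction can be sketched with a toy model in Python. Everything here is an assumption made for illustration: the `Root` and `Inode` classes, their fields, and the bookkeeping in `fsync` only loosely imitate the idea and are not the kernel's data structures.

```python
class Root:
    def __init__(self, generation):
        self.generation = generation   # current transaction id
        self.log_transid = 1           # id of the log sub-transaction in progress
        self.last_log_commit = 0       # last log sub-transaction that was committed
        self.transactions_started = 0  # counter used only by this sketch


class Inode:
    def __init__(self):
        self.logged_trans = 0    # transaction in which the inode was last logged
        self.last_sub_trans = 0  # log sub-transaction of the last modification


def write(inode, root):
    """Record that the inode changed in the current log sub-transaction."""
    inode.last_sub_trans = root.log_transid


def inode_in_log(inode, root):
    """True if the inode was logged in this transaction and not changed since."""
    return (inode.logged_trans == root.generation
            and inode.last_sub_trans <= root.last_log_commit)


def fsync(inode, root):
    """Return True if the inode had to be logged, False if the call was a no-op."""
    if inode_in_log(inode, root):
        return False  # exit early, before any transaction is started
    root.transactions_started += 1
    inode.logged_trans = root.generation
    root.last_log_commit = root.log_transid
    root.log_transid += 1
    return True


root, inode = Root(generation=7), Inode()
write(inode, root)
results = [fsync(inode, root) for _ in range(3)]  # repeated fsync() of one file
```

Only the first of the three calls starts a transaction in this model; the repeated calls return before doing any transaction work.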
    • Btrfs: fix the same inode id problem when doing auto defragment · 762f2263
      Committed by Miao Xie
      Two files in different subvolumes may have the same inode id, so
      the rb-tree which is used to manage the defragment objects must take that
      into account. This patch fixes this problem.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
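Why the tree key needs the subvolume as well as the inode number can be shown with a small Python sketch. `InodeDefrag` and `compare_defrag` are hypothetical names, and a dictionary stands in for the rb-tree; the subvolume and inode numbers are made-up values.

```python
from collections import namedtuple

# Hypothetical record: the subvolume (root) id plus the inode number.
InodeDefrag = namedtuple("InodeDefrag", ["root", "ino"])


def compare_defrag(a, b):
    """Three-way comparison ordering entries by (root, ino).

    Returns -1, 0, or 1, the shape of comparator a tree insert would use.
    """
    ka, kb = (a.root, a.ino), (b.root, b.ino)
    return (ka > kb) - (ka < kb)


# Two different files that happen to share inode number 257.
a = InodeDefrag(root=5, ino=257)
b = InodeDefrag(root=256, ino=257)

by_ino_only = {entry.ino: entry for entry in (a, b)}           # entries collide
by_root_and_ino = {(e.root, e.ino): e for e in (a, b)}         # both survive
```

Keyed by inode number alone, the second entry silently replaces the first; keyed by the pair, both files are tracked.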
    • Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Committed by Josef Bacik
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      its own thing so it can be treated normally.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: do not do filemap_write_and_wait_range in fsync · 0885ef5b
      Committed by Josef Bacik
      We already do the btrfs_wait_ordered_range which will do this for us, so
      just remove this call so we don't call it twice.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: use i_version instead of our own sequence · 0c4d2d95
      Committed by Josef Bacik
      We've been keeping around the inode sequence number in hopes that somebody
      would use it, but nobody uses it and people actually use i_version which
      serves the same purpose, so use i_version where we used the incore inode's
      sequence number and that way the sequence is updated properly across the
      board, and not just in file write.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  8. 28 Apr, 2012 · 1 commit
  9. 22 Mar, 2012 · 2 commits
  10. 15 Feb, 2012 · 1 commit
  11. 01 Feb, 2012 · 1 commit
  12. 17 Jan, 2012 · 1 commit
  13. 11 Jan, 2012 · 1 commit
  14. 22 Dec, 2011 · 1 commit
    • Btrfs: mark delayed refs as for cow · 66d7e7f0
      Committed by Arne Jansen
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
  15. 18 Dec, 2011 · 1 commit
  16. 17 Dec, 2011 · 1 commit
  17. 16 Dec, 2011 · 1 commit
    • Btrfs: deal with enospc from dirtying inodes properly · 22c44fe6
      Committed by Josef Bacik
      Now that we're properly keeping track of delayed inode space we've been getting
      a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
      because a bunch of people call mark_inode_dirty, which is void so we can't
      return ENOSPC.  This needs to be fixed in a few areas
      
      1) file_update_time - this updates the mtime and such when writing to a file,
      which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
      call btrfs_dirty_inode directly and return an error if we get one appropriately.
      
      2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
      setting ->setattr for symlinks, even though we should have been.  This catches
      one of the cases where we were getting errors in mark_inode_dirty.
      
      3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
      instead of mark_inode_dirty.  This lets us return errors properly for truncate
      and chown/anything related to setattr.
      
      4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
      print an error if we have one.  The only remaining user we can't control for
      this is touch_atime(), but we don't really want to keep people from walking
      down the tree if we don't have space to save the atime update, so just complain
      but don't worry about it.
      
      With this patch xfstests 83 complains a handful of times instead of hundreds of
      times.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  18. 28 Oct, 2011 · 1 commit
    • vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2
      Committed by Andi Kleen
      The i_mutex lock use of generic_file_llseek hurts.  Independent processes
      accessing the same file synchronize over a single lock, even though
      they have no need for synchronization at all.
      
      Under high utilization this can cause llseek to scale very poorly on larger
      systems.
      
      This patch does some rethinking of the llseek locking model:
      
      First the 64bit f_pos is not necessarily atomic without locks
      on 32bit systems. This can already cause races with read() today.
      This was discussed on linux-kernel in the past and deemed acceptable.
      The patch does not change that.
      
      Let's look at the different seek variants:
      
      SEEK_SET: Doesn't really need any locking.
      If there's a race one writer wins, the other loses.
      
      For 32bit the non atomic update races against read()
      stay the same. Without a lock they can also happen
      against write() now.  The read() race was deemed
      acceptable in past discussions, and I think if it's
      ok for read it's ok for write too.
      
      => Don't need a lock.
      
      SEEK_END: This behaves like SEEK_SET plus it reads
      the maximum size too. Reading the maximum size would have the
      32bit atomic problem. But luckily we already have a way to read
      the maximum size without locking (i_size_read), so we
      can just use that instead.
      
      Without i_mutex there is no synchronization with write() anymore,
      however since the write() update is atomic on 64bit it just behaves
      like another racy SEEK_SET.  On non atomic 32bit it's the same
      as SEEK_SET.
      
      => Don't need a lock, but need to use i_size_read()
      
      SEEK_CUR: This has a read-modify-write race window
      on the same file. One could argue that any application
      doing unsynchronized seeks on the same file is already broken.
      But for the sake of not adding a regression here I'm
      using the file->f_lock to synchronize this. Using this
      lock is much better than the inode mutex because it doesn't
      synchronize between processes.
      
      => So still need a lock, but can use a f_lock.
      
      This patch implements this new scheme in generic_file_llseek.
      I dropped generic_file_llseek_unlocked and changed all callers.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
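The locking scheme laid out in this commit (no lock for SEEK_SET and SEEK_END, a per-file lock for SEEK_CUR) can be sketched in user-space Python. `OpenFile` and `llseek` are hypothetical stand-ins: the `size` attribute plays the role of `i_size_read()`, and `threading.Lock` stands in for `file->f_lock`. The sketch shows only which paths take the lock, not kernel memory-ordering behavior.

```python
import errno
import os
import threading


class OpenFile:
    """Hypothetical stand-in for one open file description."""

    def __init__(self, size):
        self.f_pos = 0
        self.f_lock = threading.Lock()  # per open file, not per inode
        self.size = size                # stands in for i_size_read(inode)


def llseek(f, offset, whence):
    if whence == os.SEEK_CUR:
        # Read-modify-write of f_pos: the only variant that takes a lock.
        with f.f_lock:
            new_pos = f.f_pos + offset
            if new_pos < 0:
                raise OSError(errno.EINVAL, "negative file position")
            f.f_pos = new_pos
            return new_pos
    if whence == os.SEEK_SET:
        new_pos = offset            # plain store; racing writers just overwrite
    elif whence == os.SEEK_END:
        new_pos = f.size + offset   # size is read without taking any lock
    else:
        raise OSError(errno.EINVAL, "unsupported whence")
    if new_pos < 0:
        raise OSError(errno.EINVAL, "negative file position")
    f.f_pos = new_pos
    return new_pos


f = OpenFile(size=100)
positions = [
    llseek(f, 10, os.SEEK_SET),
    llseek(f, 5, os.SEEK_CUR),
    llseek(f, -20, os.SEEK_END),
]
```

Because the lock lives on the open file rather than the inode, two processes with separate descriptors for the same file never contend on it in this model.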
  19. 20 Oct, 2011 · 2 commits
  20. 01 Oct, 2011 · 1 commit
    • Btrfs: force a page fault if we have a shorty copy on a page boundary · b6316429
      Committed by Josef Bacik
      A user reported a problem where ceph was getting into 100% cpu usage while doing
      some writing.  It turns out it's because we were doing a short write on a not
      uptodate page, which means we'd fall back at one page at a time and fault the
      page in.  The problem is our position is on the page boundary, so our fault in
      logic wasn't actually reading the page, so we'd just spin forever or until the
      page got read in by somebody else.  This will force a readpage if we end up
      doing a short copy.  Alexandre could reproduce this easily with ceph and reports
      it fixes his problem.  I also wrote a reproducer that no longer hangs my box
      with this patch.  Thanks,
      Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  21. 18 Sep, 2011 · 1 commit
  22. 11 Sep, 2011 · 1 commit
  23. 18 Aug, 2011 · 2 commits
  24. 17 Aug, 2011 · 1 commit