1. 10 Jun, 2014 5 commits
    • btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking · fc4adbff
      Alex Gartrell committed
      In these instances, we are trying to determine if a page has been accessed
      since we began the operation for the sake of retry.  This is easily
      accomplished by doing a gang lookup in the page mapping radix tree, and it
      saves us the dependency on the flag (so that we might eventually delete
      it).
      
      btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
      the radix tree look up with a gang lookup of 1, so that we can find the
      next highest page >= index and see if it falls into our lock range.
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Alex Gartrell <agartrell@fb.com>
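The lookup described in this commit can be modelled outside the kernel. The following is a minimal Python sketch, not the kernel code: it assumes the page cache is represented as a sorted list of page indices, and the function name and parameters are illustrative only.

```python
from bisect import bisect_left


def page_exists_in_range(cached_indices, start_index, end_index):
    """Return True if any cached page index lies in [start_index, end_index].

    cached_indices is assumed to be a sorted list of page indices; it stands
    in for the page mapping radix tree.
    """
    # "Gang lookup of 1": find the first cached index >= start_index.
    pos = bisect_left(cached_indices, start_index)
    if pos == len(cached_indices):
        return False  # no page at or after the start of the range
    # A next page exists; it matters only if it falls inside the range.
    return cached_indices[pos] <= end_index
```

A single "next page at or after the start" query is enough because any page inside the range would be found at or before the first page past the range.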
    • Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik committed
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if, say, you are defragging an extent with 5k snapshots pointing
      to it, we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent,
      free a reference for an extent, add new references, etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to look up roots for bytenrs; this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes:
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs, we simply add new ref operations whenever we notice that we need
      to, that is, when we've modified the refs themselves.

      2) tree mod seq: we no longer have this separation of major/minor counters.
      This makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.

      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have its own unique sequence
      number; rather, whenever we go to look up backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
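The effect of point 3 on merging can be illustrated with a toy model. This is a hedged Python sketch and not the kernel's data structures: the `(bytenr, seq, delta)` tuple layout and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def merge_delayed_refs(refs):
    """Merge ref updates that share (bytenr, seq) by summing their deltas.

    refs is a list of (bytenr, seq, delta) tuples; this layout is an
    assumption of the sketch, not the kernel representation.
    """
    merged = defaultdict(int)
    for bytenr, seq, delta in refs:
        merged[(bytenr, seq)] += delta
    return dict(merged)


# Old scheme: every ref carries its own sequence number, so nothing merges.
unique_seq = [(4096, 1, +1), (4096, 2, +1), (4096, 3, -1)]
# New scheme: refs read the current tree mod seq, which only advances on a
# backref lookup, so neighbouring updates usually share it and collapse.
shared_seq = [(4096, 7, +1), (4096, 7, +1), (4096, 7, -1)]

print(len(merge_delayed_refs(unique_seq)))
print(merge_delayed_refs(shared_seq))
```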
    • M
    • Btrfs: fix leaf corruption caused by ENOSPC while hole punching · fc19c5e7
      Filipe Manana committed
      While running a stress test with multiple threads writing to the same btrfs
      file system, I ended up with a situation where a leaf was corrupted in that
      it had 2 file extent item keys that had the same exact key. I was able to
      detect this quickly thanks to the following patch which triggers an assertion
      as soon as a leaf is marked dirty if there are duplicated keys or out of order
      keys:
      
          Btrfs: check if items are ordered when a leaf is marked dirty
          (https://patchwork.kernel.org/patch/3955431/)
      
      Basically while running the test, I got the following in dmesg:
      
          [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]()
          (...)
          [28877.415917] Call Trace:
          [28877.415922]  [<ffffffff816f1189>] dump_stack+0x4e/0x68
          [28877.415926]  [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0
          [28877.415929]  [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20
          [28877.415944]  [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs]
          [28877.415949]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [28877.415962]  [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs]
          [28877.415972]  [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs]
          [28877.415984]  [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs]
          (...)
          [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24
          [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778
          (...)
          [29854.132637] 	item 23 key (3486 108 667648) itemoff 2694 itemsize 53
          [29854.132638] 		extent data disk bytenr 14574411776 nr 286720
          [29854.132639] 		extent data offset 0 nr 286720 ram 286720
          [29854.132640] 	item 24 key (3486 108 954368) itemoff 2641 itemsize 53
          [29854.132641] 		extent data disk bytenr 0 nr 0
          [29854.132643] 		extent data offset 0 nr 0 ram 0
          [29854.132644] 	item 25 key (3486 108 954368) itemoff 2588 itemsize 53
          [29854.132645] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132646] 		extent data offset 0 nr 77824 ram 77824
          [29854.132647] 	item 26 key (3486 108 1146880) itemoff 2535 itemsize 53
          [29854.132648] 		extent data disk bytenr 8699670528 nr 77824
          [29854.132649] 		extent data offset 0 nr 77824 ram 77824
          (...)
          [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901!
          (...)
          [29854.132771] Call Trace:
          [29854.132779]  [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs]
          [29854.132791]  [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs]
          [29854.132794]  [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0
          [29854.132797]  [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10
          [29854.132800]  [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0
          [29854.132810]  [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs]
          [29854.132820]  [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs]
          [29854.132830]  [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs]
          (...)
      
      So this is caused by getting an -ENOSPC error while punching a file hole. More
      specifically, we get an -ENOSPC error from __btrfs_drop_extents in the while loop
      of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one
      or more file extent items due to lack of enough free space. When this happens,
      in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction
      block reservation object to root->fs_info->trans_block_rsv, ending our transaction and
      starting a new transaction. We also keep increasing our current offset
      (cur_offset) as long as it's smaller than the end of the target range (lockend).
      This makes us leave the loop with cur_offset == drop_end, which in turn makes us
      call fill_holes() to insert a file extent item that represents a 0 bytes range
      hole (and this insertion succeeds, as in the meantime more space became available).
      
      This 0 bytes file hole extent item is a problem because any subsequent caller of
      __btrfs_drop_extents (regular file writes or fallocate calls, for example), with a
      start file offset that is equal to the offset of the hole, will not remove this
      extent item due to the following conditional in the while loop of
      __btrfs_drop_extents:
      
          if (extent_end <= search_start) {
                  path->slots[0]++;
                  goto next_slot;
          }
      
      This later makes the call to setup_items_for_insert() (at the very end of
      __btrfs_drop_extents) insert a new file extent item with the same offset as
      the 0 bytes file hole extent item that follows it. Needless to say, this
      causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook),
      where we perform leaf sanity checks, or in subsequent operations that manipulate
      file extent items, as in the fallocate call shown by the dmesg trace above.
      
      Without my other patch to perform the leaf sanity checks once a leaf is marked
      as dirty (if the integrity checker is enabled), it would have been much harder
      to debug this issue.
      
      This change might fix a few similar issues reported by users in the mailing
      list regarding assertion failures in btrfs_set_item_key_safe calls performed
      by __btrfs_drop_extents, such as the following report:
      
          http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938
      
      Asking fill_holes() to create a 0 bytes wide file hole item also produced the
      first warning in the trace above, as we passed a range to btrfs_drop_extent_cache
      that has an end smaller (by -1) than its start.
      
      On 3.14 kernels this issue manifests itself through leaf corruption, as we get
      duplicated file extent item keys in a leaf when calling setup_items_for_insert(),
      but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(),
      instead we have callers of __btrfs_drop_extents(), namely the functions
      inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling
      btrfs_insert_empty_item() to insert the new file extent item, which would fail with
      error -EEXIST, instead of inserting a duplicated key - which is still a serious
      issue as it would make all similar file extent item replace operations keep
      failing if they target the same file range.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
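Why the 0 bytes hole item survives a later drop can be checked with a small model. The Python sketch below is a simplification and not the kernel loop: items are assumed to be `(key_offset, num_bytes)` pairs, partial overlaps (which the kernel trims or splits) are treated as fully dropped, and the function name is hypothetical. The offsets come from the dmesg output in this commit message.

```python
def surviving_items(items, search_start, drop_end):
    """Return the file extent items that a drop of [search_start, drop_end)
    leaves in place.

    items is a list of (key_offset, num_bytes) pairs. Partial overlaps are
    treated as fully dropped in this sketch.
    """
    kept = []
    for key_offset, num_bytes in items:
        extent_end = key_offset + num_bytes
        if extent_end <= search_start:
            # Same test as the conditional quoted in the commit message:
            # the item is skipped and therefore stays in the leaf.
            kept.append((key_offset, num_bytes))
        elif key_offset >= drop_end:
            kept.append((key_offset, num_bytes))
        # Anything else overlaps the range and is dropped.
    return kept


# A data extent ending at 954368, followed by the 0 bytes hole item that
# starts at 954368 (items 23 and 24 in the leaf dump).
leaf = [(667648, 286720), (954368, 0)]
new_offset, new_len = 954368, 77824
left = surviving_items(leaf, new_offset, new_offset + new_len)
# The new item would be inserted at new_offset while an item with the same
# key offset is still present, which is the duplicated key seen in the dump.
duplicate_key = any(off == new_offset for off, _ in left)
```

For a zero-length item, `extent_end` equals its own key offset, so the skip test is true exactly when a caller starts at that offset.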
    • Btrfs: read inode size after acquiring the mutex when punching a hole · a1a50f60
      Filipe Manana committed
      In a previous change, commit 12870f1c,
      I accidentally moved the roundup of inode->i_size to outside of the
      critical section delimited by the inode mutex, which is not atomic and
      not correct since the size can be changed by another task before we acquire
      the mutex. Therefore fix it.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 25 Apr, 2014 2 commits
  3. 08 Apr, 2014 2 commits
    • mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov committed
      filemap_map_pages() is a generic implementation of ->map_pages() for
      filesystems that use the page cache.

      It should be safe to use filemap_map_pages() for ->map_pages() if the
      filesystem uses filemap_fault() for ->fault().
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9
      Qu Wenruo committed
      When testing fsstress with snapshots being made in the background, some
      snapshots showed the following problem.
      
      Snapshot 270:
      inode 323: size 0
      
      Snapshot 271:
      inode 323: size 349145
      |-------Hole---|---------Empty gap-------|-------Hole-----|
      0	    122880			172032	      349145
      
      Snapshot 272:
      inode 323: size 349145
      |-------Hole---|------------Data---------|-------Hole-----|
      0	    122880			172032	      349145
      
      The fsstress operation on inode 323 is the following:
      write: 		offset 	126832 	len 43124
      truncate: 	size 	349145
      
      An expanding write at an offset consists of 2 operations:
      1. punch hole
      2. write data
      Hole punching is faster than the data write, so the hole punching for the write
      and the truncate is done first and the buffered write follows. Snapshot 271 was
      taken in between and got the empty gap, which does not pass btrfsck.

      To fix the bug, this patch changes the write sequence so that it
      first punches a hole covering the write end if a hole is needed.
      Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 04 Apr, 2014 1 commit
  5. 02 Apr, 2014 3 commits
  6. 22 Mar, 2014 1 commit
    • Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo committed
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 11 Mar, 2014 8 commits
    • M
      b88935bf
    • Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc
      Miao Xie committed
      If the snapshot creation happened after the nocow write but before the dirty
      data flush, we would fail to flush the dirty data because of no space.
      
      So we must keep track of when those nocow write operations start and when they
      end, if there are nocow writers, the snapshot creators must wait. In order
      to implement this function, I introduce btrfs_{start, end}_nocow_write(),
      which is similar to mnt_{want,drop}_write().
      
      These two functions are only used for nocow file write operations.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
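The start/end bracketing described in this commit can be pictured with a toy counter. This Python sketch is not the kernel implementation: it is single-threaded, has no locking, and the class, the `snapshot_pending` flag, and the rule that a pending snapshot refuses new nocow writers are all assumptions made for illustration.

```python
class Subvolume:
    """Toy model of per-subvolume nocow writer tracking (hypothetical)."""

    def __init__(self):
        self.nocow_writers = 0
        self.snapshot_pending = False

    def start_nocow_write(self):
        # Assumption: once a snapshot is pending, new nocow writers are
        # refused and the caller has to fall back to a regular COW write.
        if self.snapshot_pending:
            return False
        self.nocow_writers += 1
        return True

    def end_nocow_write(self):
        self.nocow_writers -= 1

    def can_create_snapshot(self):
        # Snapshot creators must wait while any nocow writer is active.
        return self.nocow_writers == 0


subvol = Subvolume()
started = subvol.start_nocow_write()
blocked = not subvol.can_create_snapshot()
subvol.end_nocow_write()
ready = subvol.can_create_snapshot()
```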
    • Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie committed
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by ending the nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie committed
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But the old code used the size of the write range to calculate
      the lock range directly, without considering the offset, so we would get a
      wrong lock range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      Besides that, the old code also had the same problem when calculating
      the real write size. Correct both.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
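The two diagrams in this commit correspond to the following arithmetic. This is a Python sketch under stated assumptions: the "old" formula is reconstructed from the description rather than copied from the kernel, the function names are illustrative, and a 4096 byte sector is assumed in the example values.

```python
def round_down(value, align):
    return value - (value % align)


def round_up(value, align):
    return round_down(value + align - 1, align)


def lock_range_old(pos, write_bytes, sectorsize):
    """Lock range computed from the write size alone (ignores the offset)."""
    lockstart = round_down(pos, sectorsize)
    lockend = lockstart + round_up(write_bytes, sectorsize) - 1
    return lockstart, lockend


def lock_range_fixed(pos, write_bytes, sectorsize):
    """Lock range that covers every sector the write touches."""
    lockstart = round_down(pos, sectorsize)
    lockend = round_up(pos + write_bytes, sectorsize) - 1
    return lockstart, lockend


# A 2 block write starting half a block into the first sector.
old = lock_range_old(2048, 8192, 4096)
fixed = lock_range_fixed(2048, 8192, 4096)
```

With these values the old range ends at byte 8191 (2 blocks) while the write itself ends at byte 10239, so the fixed range has to extend to byte 12287 (3 blocks).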
    • Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana committed
      While dropping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram of the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
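The "1 or 2 new extent map structures" mentioned in this commit are the parts of an existing extent map that lie outside the target range. A minimal Python sketch of that split follows; it models only the interval arithmetic, not the red black tree, and the function name and tuple layout are assumptions.

```python
def remaining_pieces(em_start, em_len, start, length):
    """Return the parts of [em_start, em_start + em_len) that fall outside
    the target range [start, start + length), as (start, len) pairs."""
    em_end = em_start + em_len
    end = start + length
    pieces = []
    if em_start < start:
        # Head of the extent map lies before the target range.
        pieces.append((em_start, min(em_end, start) - em_start))
    if em_end > end:
        # Tail of the extent map lies after the target range.
        tail_start = max(em_start, end)
        pieces.append((tail_start, em_end - tail_start))
    return pieces
```

When exactly one piece remains, an in-place replacement of the old extent map with that piece avoids a separate remove and insert on the tree, which is the saving this commit describes.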
    • Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Filipe Manana committed
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
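The numbers in the example above can be reproduced with a few lines of arithmetic. The Python sketch below assumes a 4096 byte sector size and uses hypothetical helper names; it illustrates the check described in this commit and is not the kernel code.

```python
SECTORSIZE = 4096  # assumed sector size for this example


def round_down(value, align):
    return value - (value % align)


def round_up(value, align):
    return round_down(value + align - 1, align)


def aligned_punch_range(offset, length):
    """Whole sectors covered by a punch of [offset, offset + length)."""
    start = round_up(offset, SECTORSIZE)
    end = round_down(offset + length, SECTORSIZE)
    return start, end - start


def needs_hole_item(i_size, hole_start):
    """A hole item is only useful if it starts within the inode's size."""
    return hole_start < round_up(i_size, SECTORSIZE)


# "truncate 118811" followed by "fpunch 582007 864596"
hole_start, hole_len = aligned_punch_range(582007, 864596)
useful = needs_hole_item(118811, hole_start)
```

The computed range matches item 8 in the btrfs-debug-tree output (key offset 585728, 860160 bytes), and it starts well past the rounded-up inode size of 122880, so the item serves no purpose.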
    • Btrfs: fix skipped error handle when log sync failed · 8b050d35
      Miao Xie committed
      It is possible that many tasks sync the log tree at the same time, but
      only one task can do the sync work; the others wait for it. But those
      waiting tasks didn't get the result of the log sync, and returned 0 when they
      finished waiting. This made those tasks skip the error handling, and the
      serious problem was that they told the users the file sync succeeded when in
      fact it had failed.

      This patch fixes this problem by introducing a log context structure,
      which we insert into a global list. When the sync fails, we set
      the error number of every log context in the list; the waiting tasks then
      get the error number from their log context and handle the error if needed.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
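The log context idea can be sketched in a few lines. This is a single-threaded Python model with hypothetical class and attribute names; the real code involves waiting and locking that are left out here.

```python
class LogContext:
    """Per-task record of the log sync result (hypothetical name)."""

    def __init__(self):
        self.log_ret = 0


class LogRoot:
    """Holds the list of contexts waiting on the current log sync."""

    def __init__(self):
        self.ctx_list = []

    def join_sync(self, ctx):
        self.ctx_list.append(ctx)

    def finish_sync(self, error):
        # On failure, publish the error to every waiting context so that
        # no waiter returns 0 for a sync that did not succeed.
        if error:
            for ctx in self.ctx_list:
                ctx.log_ret = error
        self.ctx_list.clear()


root = LogRoot()
syncer, waiter = LogContext(), LogContext()
root.join_sync(syncer)
root.join_sync(waiter)
root.finish_sync(-5)  # the task doing the sync work hit an I/O error
```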
    • Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana committed
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item if we deleted existing
      file extent items covering our target file range, also allow to insert
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0, since in this case our caller would do another
      tree search to insert the new file extent item anyway, therefore just
      combine the two tree searches into a single one, saving cpu time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
  8. 29 Jan, 2014 9 commits
    • Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason committed
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fix uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
    • Btrfs: fix the race between write back and nocow buffered write · f1de9683
      Miao Xie committed
      When we ran the 274th case of xfstests with the nodatacow mount option,
      we met the following warning message:
      WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
      
      It is caused by the race between the write back and nocow buffered
      write:
        Task1				Task2
        __btrfs_buffered_write()
          skip data reservation
          reserve the metadata space
          copy the data
          dirty the pages
          unlock the pages
      				write back the pages
      				release the data space
         				  because there is no
      				  noreserve flag
         set the noreserve flag
      
      This patch fixes this problem by unlocking the pages after
      the noreserve flag is set.
      Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik committed
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana committed
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides file writes, this is an operation that happens for file fsyncs
      as well. However log btrees are much less likely to be as big as regular
      fs btrees, therefore the impact of this change is smaller.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix use of uninitialized err variable · fc28b62d
      Filipe David Borba Manana committed
      fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
      fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix ordered extent check in btrfs_punch_hole · 6126e3ca
      Filipe David Borba Manana committed
      If the ordered extent's last byte was 1 less than our region's
      start byte, we would unnecessarily wait for the completion of
      that ordered extent, even though it doesn't intersect our target
      range.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
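The off-by-one described in this commit is a half-open versus inclusive interval comparison. The Python sketch below is an illustration only: the function names are hypothetical, and the "old" check is reconstructed from the description, not copied from the kernel.

```python
def intersects(file_offset, length, lockstart, lockend):
    """Ordered extent covers [file_offset, file_offset + length);
    the locked region is [lockstart, lockend], inclusive."""
    ends_before = file_offset + length <= lockstart
    starts_after = file_offset > lockend
    return not (ends_before or starts_after)


def intersects_old(file_offset, length, lockstart, lockend):
    """Reconstructed old check: uses '<' and so treats an extent whose last
    byte is lockstart - 1 as intersecting."""
    ends_before = file_offset + length < lockstart
    starts_after = file_offset > lockend
    return not (ends_before or starts_after)


# Ordered extent covers bytes 0..4095; our region is bytes 4096..8191.
adjacent_new = intersects(0, 4096, 4096, 8191)
adjacent_old = intersects_old(0, 4096, 4096, 8191)
```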
    • Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io · 376cc685
      Miao Xie committed
      When we ran sysbench on the fs with compression, the following WARN_ONs were
      triggered:
       fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
       fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
       fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-test-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-io-mode=sync prepare
       # cd -
       # umount <mnt>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-test-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-io-mode=sync run
       # cd -
       # umount <mnt>
      
      The cause of this problem is:
      Task0				Task1
      btrfs_direct_IO
        unlock(&inode->i_mutex)
      				lock(&inode->i_mutex)
      				reserve_space()
      				prepare_pages()
      				  lock_extent()
      				  clear_extent()
      				  unlock_extent()
        lock_extent()
        test_extent(uptodate)
          return false
      				copy_data()
      				set_delalloc_extent()
        extent need compress
          go back to buffered write
        clear_extent(DELALLOC | DIRTY)
        unlock_extent()
      
      Task 0 and task 1 wrote to the same place, and task 0 cleared the delalloc flag
      that was set by task 1. As a result the dirty pages in that extent couldn't be
      flushed to disk, so the reserved space for that extent was not released at
      the end.

      This patch fixes the above bug by unlocking the extent after the delalloc.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: cleanup unnecessary parameter and variant of prepare_pages() · b37392ea
      Miao Xie committed
      - the caller already has the inode object, so we needn't pass the file object.
        That way we also needn't define an inode pointer variable.
      - the position should be aligned to the page size, not the sector size, so
        we also needn't pass the root object into prepare_pages().
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: incompatible format change to remove hole extents · 16e7549f
      Josef Bacik committed
      Btrfs has always had these filler extent data items for holes in inodes.  This
      has made some things very easy, like logging hole punches and sending hole
      punches.  However for large holey files these extent data items are pure
      overhead.  So add an incompatible feature to no longer add hole extents to
      reduce the amount of metadata used by these sorts of files.  This has a few
      changes for logging and send obviously since they will need to detect holes and
      log/send the holes if there are any.  I've tested this thoroughly with xfstests
      and it doesn't cause any issues with and without the incompat format set.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <clm@fb.com>
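Without explicit hole items, a hole is simply a gap between consecutive file extent items, or between the last item and the inode size. The following Python sketch shows that detection under stated assumptions: extents are a sorted list of `(offset, num_bytes)` pairs and the function name is hypothetical; it is not the logging or send code.

```python
def implied_holes(extents, i_size):
    """Return the gaps not covered by any extent as (offset, num_bytes).

    extents must be sorted by offset; i_size is the inode size in bytes.
    """
    holes = []
    cursor = 0  # first byte not yet accounted for
    for offset, num_bytes in extents:
        if offset > cursor:
            holes.append((cursor, offset - cursor))
        cursor = max(cursor, offset + num_bytes)
    if cursor < i_size:
        # Trailing hole between the last extent and the end of the file.
        holes.append((cursor, i_size - cursor))
    return holes
```

This is the kind of scan that logging and send would need once filler items are gone, since the holes are no longer recorded explicitly.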
  9. 12 Nov, 2013 4 commits
  10. 21 Sep, 2013 1 commit
  11. 04 Sep, 2013 1 commit
  12. 01 Sep, 2013 3 commits