1. 07 5月, 2014 4 次提交
  2. 25 4月, 2014 2 次提交
  3. 08 4月, 2014 2 次提交
    • K
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov 提交于
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • Q
      btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9
      Qu Wenruo 提交于
      When testing fsstress with snapshot making background, some snapshot
      following problem.
      
      Snapshot 270:
      inode 323: size 0
      
      Snapshot 271:
      inode 323: size 349145
      |-------Hole---|---------Empty gap-------|-------Hole-----|
      0	    122880			172032	      349145
      
      Snapshot 272:
      inode 323: size 349145
      |-------Hole---|------------Data---------|-------Hole-----|
      0	    122880			172032	      349145
      
      The fsstress operation on inode 323 is the following:
      write: 		offset 	126832 	len 43124
      truncate: 	size 	349145
      
      Since the write with offset is consist of 2 operations:
      1. punch hole
      2. write data
      Hole punching is faster than data write, so hole punching in write
      and truncate is done first and then buffered write, so the snapshot 271 got
      empty gap, which will not pass btrfsck.
      
      To fix the bug, this patch will change the write sequence which will
      first punch a hole covering the write end if a hole is needed.
      Reported-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3ac0d7b9
  4. 04 4月, 2014 1 次提交
  5. 02 4月, 2014 3 次提交
  6. 22 3月, 2014 1 次提交
    • L
      Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo 提交于
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      00fdf13a
  7. 11 3月, 2014 8 次提交
    • M
      b88935bf
    • M
      Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc
      Miao Xie 提交于
      If the snapshot creation happened after the nocow write but before the dirty
      data flush, we would fail to flush the dirty data because of no space.
      
      So we must keep track of when those nocow write operations start and when they
      end, if there are nocow writers, the snapshot creators must wait. In order
      to implement this function, I introduce btrfs_{start, end}_nocow_write(),
      which is similar to mnt_{want,drop}_write().
      
      These two functions are only used for nocow file write operations.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8257b2dc
    • M
      Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie 提交于
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by end nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      7b2b7085
    • M
      Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie 提交于
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But according to the old code, we used the size of write range to calculate
      the lock range directly, not considered the offset, we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      c933956d
    • F
      Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana 提交于
      While droping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram capturing the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      176840b3
    • F
      Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Filipe Manana 提交于
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      12870f1c
    • M
      Btrfs: fix skipped error handle when log sync failed · 8b050d35
      Miao Xie 提交于
      It is possible that many tasks sync the log tree at the same time, but
      only one task can do the sync work, the others will wait for it. But those
      wait tasks didn't get the result of the log sync, and returned 0 when they
      ended the wait. It caused those tasks skipped the error handle, and the
      serious problem was they told the users the file sync succeeded but in
      fact they failed.
      
      This patch fixes this problem by introducing a log context structure,
      we insert it into the a global list. When the sync fails, we will set
      the error number of every log context in the list, then the waiting tasks
      get the error number of the log context and handle the error if need.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8b050d35
    • F
      Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana 提交于
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item if we deleted existing
      file extent items covering our target file range, also allow to insert
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0, since in this case our caller would do another
      tree search to insert the new file extent item anyway, therefore just
      combine the two tree searches into a single one, saving cpu time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      d5f37527
  8. 29 1月, 2014 9 次提交
    • C
      Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason 提交于
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fixe uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      514ac8ad
    • M
      Btrfs: fix the race between write back and nocow buffered write · f1de9683
      Miao Xie 提交于
      When we ran the 274th case of xfstests with nodatacow mount option,
      We met the following warning message:
      WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
      
      It is caused by the race between the write back and nocow buffered
      write:
        Task1				Task2
        __btrfs_buffered_write()
          skip data reservation
          reserve the metadata space
          copy the data
          dirty the pages
          unlock the pages
      				write back the pages
      				release the data space
         				  becasue there is no
      				  noreserve flag
         set the noreserve flag
      
      This patch fixes this problem by unlocking the pages after
      the noreserve flag is set.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f1de9683
    • J
      Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik 提交于
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5039eddc
    • F
      Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana 提交于
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides for file writes, this is an operation that happens for file fsync's
      as well. However log btrees are much less likely to big as big as regular
      fs btrees, therefore the impact of this change is smaller.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 \ --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1acae57b
    • F
      Btrfs: fix use of uninitialized err variable · fc28b62d
      Filipe David Borba Manana 提交于
      fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
      fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fc28b62d
    • F
      Btrfs: fix ordered extent check in btrfs_punch_hole · 6126e3ca
      Filipe David Borba Manana 提交于
      If the ordered extent's last byte was 1 less than our region's
      start byte, we would unnecessarily wait for the completion of
      that ordered extent, because it doesn't intersect our target
      range.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6126e3ca
    • M
      Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io · 376cc685
      Miao Xie 提交于
      When we ran sysbench on the fs with compression, the following WARN_ONs were
      triggered:
       fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
       fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
       fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync prepare
       # cd -
       # umount <mnt>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync run
       # cd -
       # umount <mnt>
      
      The reason of this problem is:
      Task0				Task1
      btrfs_direct_IO
        unlock(&inode->i_mutex)
      				lock(&inode->i_mutex)
      				reserve_space()
      				prepare_pages()
      				  lock_extent()
      				  clear_extent()
      				  unlock_extent()
        lock_extent()
        test_extent(uptodate)
          return false
      				copy_data()
      				set_delalloc_extent()
        extent need compress
          go back to buffered write
        clear_extent(DELALLOC | DIRTY)
        unlock_extent()
      
      Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
      was set by task1, it made the dirty pages in that extents couldn't be flushed
      into the disk, so the reserved space for that extent was not released at
      the end.
      
      This patch fixes the above bug by unlocking the extent after the delalloc.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      376cc685
    • M
      Btrfs: cleanup unnecessary parameter and variant of prepare_pages() · b37392ea
      Miao Xie 提交于
      - the caller has gotten the inode object, needn't pass the file object.
        And if so, we needn't define a inode pointer variant.
      - the position should be aligned by the page size not sector size, so
        we also needn't pass the root object into prepare_pages().
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b37392ea
    • J
      Btrfs: incompatible format change to remove hole extents · 16e7549f
      Josef Bacik 提交于
      Btrfs has always had these filler extent data items for holes in inodes.  This
      has made somethings very easy, like logging hole punches and sending hole
      punches.  However for large holey files these extent data items are pure
      overhead.  So add an incompatible feature to no longer add hole extents to
      reduce the amount of metadata used by these sort of files.  This has a few
      changes for logging and send obviously since they will need to detect holes and
      log/send the holes if there are any.  I've tested this thoroughly with xfstests
      and it doesn't cause any issues with and without the incompat format set.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      16e7549f
  9. 12 11月, 2013 4 次提交
  10. 21 9月, 2013 1 次提交
  11. 04 9月, 2013 1 次提交
  12. 01 9月, 2013 3 次提交
  13. 10 8月, 2013 1 次提交
    • J
      Btrfs: allow splitting of hole em's when dropping extent cache · ee20a983
      Josef Bacik 提交于
      I noticed while running multi-threaded fsync tests that sometimes fsck would
      complain about an improper gap.  This happens because we fail to add a hole
      extent to the file, which was happening when we'd split a hole EM because
      btrfs_drop_extent_cache was just discarding the whole em instead of splitting
      it.  So this patch fixes this by allowing us to split a hole em properly, which
      means that added holes actually get logged properly and we no longer see this
      fsck error.  Thankfully we're tolerant of these sort of problems so a user would
      not see any adverse effects of this bug, other than fsck complaining.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      ee20a983