1. 05 6月, 2014 1 次提交
    • M
      mm: non-atomically mark page accessed during page cache allocation where possible · 2457aec6
      Mel Gorman 提交于
      aops->write_begin may allocate a new page and make it visible only to have
      mark_page_accessed called almost immediately after.  Once the page is
      visible the atomic operations are necessary which is noticable overhead
      when writing to an in-memory filesystem like tmpfs but should also be
      noticable with fast storage.  The objective of the patch is to initialse
      the accessed information with non-atomic operations before the page is
      visible.
      
      The bulk of filesystems directly or indirectly use
      grab_cache_page_write_begin or find_or_create_page for the initial
      allocation of a page cache page.  This patch adds an init_page_accessed()
      helper which behaves like the first call to mark_page_accessed() but may
      called before the page is visible and can be done non-atomically.
      
      The primary APIs of concern in this care are the following and are used
      by most filesystems.
      
      	find_get_page
      	find_lock_page
      	find_or_create_page
      	grab_cache_page_nowait
      	grab_cache_page_write_begin
      
      All of them are very similar in detail to the patch creates a core helper
      pagecache_get_page() which takes a flags parameter that affects its
      behavior such as whether the page should be marked accessed or not.  Then
      old API is preserved but is basically a thin wrapper around this core
      function.
      
      Each of the filesystems are then updated to avoid calling
      mark_page_accessed when it is known that the VM interfaces have already
      done the job.  There is a slight snag in that the timing of the
      mark_page_accessed() has now changed so in rare cases it's possible a page
      gets to the end of the LRU as PageReferenced where as previously it might
      have been repromoted.  This is expected to be rare but it's worth the
      filesystem people thinking about it in case they see a problem with the
      timing change.  It is also the case that some filesystems may be marking
      pages accessed that previously did not but it makes sense that filesystems
      have consistent behaviour in this regard.
      
      The test case used to evaulate this is a simple dd of a large file done
      multiple times with the file deleted on each iterations.  The size of the
      file is 1/10th physical memory to avoid dirty page balancing.  In the
      async case it will be possible that the workload completes without even
      hitting the disk and will have variable results but highlight the impact
      of mark_page_accessed for async IO.  The sync results are expected to be
      more stable.  The exception is tmpfs where the normal case is for the "IO"
      to not hit the disk.
      
      The test machine was single socket and UMA to avoid any scheduling or NUMA
      artifacts.  Throughput and wall times are presented for sync IO, only wall
      times are shown for async as the granularity reported by dd and the
      variability is unsuitable for comparison.  As async results were variable
      do to writback timings, I'm only reporting the maximum figures.  The sync
      results were stable enough to make the mean and stddev uninteresting.
      
      The performance results are reported based on a run with no profiling.
      Profile data is based on a separate run with oprofile running.
      
      async dd
                                          3.15.0-rc3            3.15.0-rc3
                                             vanilla           accessed-v2
      ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
      tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
      btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
      ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
      xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)
      
      The XFS figure is a bit strange as it managed to avoid a worst case by
      sheer luck but the average figures looked reasonable.
      
              samples percentage
      ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
      tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
      tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
      
      [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Tested-by: NPrabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2457aec6
  2. 25 4月, 2014 2 次提交
  3. 08 4月, 2014 2 次提交
    • K
      mm: implement ->map_pages for page cache · f1820361
      Kirill A. Shutemov 提交于
      filemap_map_pages() is generic implementation of ->map_pages() for
      filesystems who uses page cache.
      
      It should be safe to use filemap_map_pages() for ->map_pages() if
      filesystem use filemap_fault() for ->fault().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1820361
    • Q
      btrfs: Change the expanding write sequence to fix snapshot related bug. · 3ac0d7b9
      Qu Wenruo 提交于
      When testing fsstress with snapshot making background, some snapshot
      following problem.
      
      Snapshot 270:
      inode 323: size 0
      
      Snapshot 271:
      inode 323: size 349145
      |-------Hole---|---------Empty gap-------|-------Hole-----|
      0	    122880			172032	      349145
      
      Snapshot 272:
      inode 323: size 349145
      |-------Hole---|------------Data---------|-------Hole-----|
      0	    122880			172032	      349145
      
      The fsstress operation on inode 323 is the following:
      write: 		offset 	126832 	len 43124
      truncate: 	size 	349145
      
      Since the write with offset is consist of 2 operations:
      1. punch hole
      2. write data
      Hole punching is faster than data write, so hole punching in write
      and truncate is done first and then buffered write, so the snapshot 271 got
      empty gap, which will not pass btrfsck.
      
      To fix the bug, this patch will change the write sequence which will
      first punch a hole covering the write end if a hole is needed.
      Reported-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3ac0d7b9
  4. 04 4月, 2014 1 次提交
  5. 02 4月, 2014 3 次提交
  6. 22 3月, 2014 1 次提交
    • L
      Btrfs: fix a crash of clone with inline extents's split · 00fdf13a
      Liu Bo 提交于
      xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split
      of inline extents in __btrfs_drop_extents().
      
      For inline extents, we cannot duplicate another EXTENT_DATA item, because
      it breaks the rule of inline extents, that is, 'start offset' needs to be 0.
      
      We have set limitations for the source inode's compressed inline extents,
      because it needs to decompress and recompress.  Now the destination inode's
      inline extents also need similar limitations.
      
      With this, xfstests btrfs/035 doesn't run into panic.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      00fdf13a
  7. 11 3月, 2014 8 次提交
    • M
      b88935bf
    • M
      Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume · 8257b2dc
      Miao Xie 提交于
      If the snapshot creation happened after the nocow write but before the dirty
      data flush, we would fail to flush the dirty data because of no space.
      
      So we must keep track of when those nocow write operations start and when they
      end, if there are nocow writers, the snapshot creators must wait. In order
      to implement this function, I introduce btrfs_{start, end}_nocow_write(),
      which is similar to mnt_{want,drop}_write().
      
      These two functions are only used for nocow file write operations.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8257b2dc
    • M
      Btrfs: fix preallocate vs double nocow write · 7b2b7085
      Miao Xie 提交于
      We can not release the reserved metadata space for the first write if we
      find the write position is pre-allocated. Because the kernel might write
      the data on the disk before we do the second write but after the can-nocow
      check, if we release the space for the first write, we might fail to update
      the metadata because of no space.
      
      Fix this problem by end nocow write if there is dirty data in the range whose
      space is pre-allocated.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      7b2b7085
    • M
      Btrfs: fix wrong lock range and write size in check_can_nocow() · c933956d
      Miao Xie 提交于
      The write range may not be sector-aligned, for example:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|--------|  <- correct lock range, size: 3blocks
      
      But according to the old code, we used the size of write range to calculate
      the lock range directly, not considered the offset, we would get a wrong lock
      range:
      
             |--------|--------|	<- write range, sector-unaligned, size: 2blocks
        |--------|--------|		<- wrong lock range, size: 2blocks
      
      And besides that, the old code also had the same problem when calculating
      the real write size. Correct them.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      c933956d
    • F
      Btrfs: more efficient btrfs_drop_extent_cache · 176840b3
      Filipe Manana 提交于
      While droping extent map structures from the extent cache that cover our
      target range, we would remove each extent map structure from the red black
      tree and then add either 1 or 2 new extent map structures if the former
      extent map covered sections outside our target range.
      
      This change simply attempts to replace the existing extent map structure
      with a new one that covers the subsection we're not interested in, instead
      of doing a red black remove operation followed by an insertion operation.
      
      The number of elements in an inode's extent map tree can get very high for large
      files under random writes. For example, while running the following test:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G \
              --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \
              --max-requests=500000 --file-rw-ratio=2 [prepare|run]
      
      I captured the following histogram capturing the number of extent_map items
      in the red black tree while that test was running:
      
          Count: 122462
          Range:  1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981
          Percentiles:  90th: 160120.000; 95th: 166335.000; 99th: 171070.000
             1.000 -    5.231:   452 |
             5.231 -  187.392:    87 |
           187.392 -  585.911:   206 |
           585.911 - 1827.438:   623 |
          1827.438 - 5695.245:  1962 #
          5695.245 - 17744.861:  6204 ####
         17744.861 - 55283.764: 21115 ############
         55283.764 - 172231.000: 91813 #####################################################
      
      Benchmark:
      
          sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \
              --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \
              --file-io-mode=sync --file-fsync-freq=0 [prepare|run]
      
      Before this change: 122.1Mb/sec
      After this change:  125.07Mb/sec
      (averages of 5 test runs)
      
      Test machine: quad core intel i5-3570K, 32Gb of ram, SSD
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      176840b3
    • F
      Btrfs: don't insert useless holes when punching beyond the inode's size · 12870f1c
      Filipe Manana 提交于
      If we punch beyond the size of an inode, we'll correctly remove any prealloc extents,
      but we'll also insert file extent items representing holes (disk bytenr == 0) that start
      with a key offset that lies beyond the inode's size and are not contiguous with the last
      file extent item.
      
      Example:
      
        $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo
        $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo
      
      btrfs-debug-tree output:
      
        item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160
      	inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1
        item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13
      	inode ref index 2 namelen 3 name: foo
        item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 90112 ram 122880
      	extent compression 0
        item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53
      	extent data disk byte 12845056 nr 4096 gen 6
      	extent data offset 0 nr 45056 ram 45056
      	extent compression 2
        item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53
      	extent data disk byte 0 nr 0 gen 6
      	extent data offset 0 nr 860160 ram 860160
      	extent compression 0
      
      The last extent item, which represents a hole, is useless as it lies beyond the inode's
      size.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      12870f1c
    • M
      Btrfs: fix skipped error handle when log sync failed · 8b050d35
      Miao Xie 提交于
      It is possible that many tasks sync the log tree at the same time, but
      only one task can do the sync work, the others will wait for it. But those
      wait tasks didn't get the result of the log sync, and returned 0 when they
      ended the wait. It caused those tasks skipped the error handle, and the
      serious problem was they told the users the file sync succeeded but in
      fact they failed.
      
      This patch fixes this problem by introducing a log context structure,
      we insert it into the a global list. When the sync fails, we will set
      the error number of every log context in the list, then the waiting tasks
      get the error number of the log context and handle the error if need.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      8b050d35
    • F
      Btrfs: faster/more efficient insertion of file extent items · d5f37527
      Filipe David Borba Manana 提交于
      This is an extension to my previous commit titled:
      
        "Btrfs: faster file extent item replace operations"
        (hash 1acae57b)
      
      Instead of inserting the new file extent item if we deleted existing
      file extent items covering our target file range, also allow to insert
      the new file extent item if we didn't find any existing items to delete
      and replace_extent != 0, since in this case our caller would do another
      tree search to insert the new file extent item anyway, therefore just
      combine the two tree searches into a single one, saving cpu time, reducing
      lock contention and reducing btree node/leaf COW operations.
      
      This covers the case where applications keep doing tail append writes to
      files, which for example is the case of Apache CouchDB (its database and
      view index files are always open with O_APPEND).
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      d5f37527
  8. 29 1月, 2014 9 次提交
    • C
      Btrfs: don't use ram_bytes for uncompressed inline items · 514ac8ad
      Chris Mason 提交于
      If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
      the new size.  The fixe uses the size directly from the item header when
      reading uncompressed inlines, and also fixes truncate to update the
      size as it goes.
      Reported-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      CC: stable@vger.kernel.org
      514ac8ad
    • M
      Btrfs: fix the race between write back and nocow buffered write · f1de9683
      Miao Xie 提交于
      When we ran the 274th case of xfstests with nodatacow mount option,
      We met the following warning message:
      WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
      
      It is caused by the race between the write back and nocow buffered
      write:
        Task1				Task2
        __btrfs_buffered_write()
          skip data reservation
          reserve the metadata space
          copy the data
          dirty the pages
          unlock the pages
      				write back the pages
      				release the data space
         				  becasue there is no
      				  noreserve flag
         set the noreserve flag
      
      This patch fixes this problem by unlocking the pages after
      the noreserve flag is set.
      Reported-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f1de9683
    • J
      Btrfs: make fsync latency less sucky · 5039eddc
      Josef Bacik 提交于
      Looking into some performance related issues with large amounts of metadata
      revealed that we can have some pretty huge swings in fsync() performance.  If we
      have a lot of delayed refs backed up (as you will tend to do with lots of
      metadata) fsync() will wander off and try to run some of those delayed refs
      which can result in reading from disk and such.  Since the actual act of fsync()
      doesn't create any delayed refs there is no need to make it throttle on delayed
      ref stuff, that will be handled by other people.  With this patch we get much
      smoother fsync performance with large amounts of metadata.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5039eddc
    • F
      Btrfs: faster file extent item replace operations · 1acae57b
      Filipe David Borba Manana 提交于
      When writing to a file we drop existing file extent items that cover the
      write range and then add a new file extent item that represents that write
      range.
      
      Before this change we were doing a tree lookup to remove the file extent
      items, and then after we did another tree lookup to insert the new file
      extent item.
      Most of the time all the file extent items we need to drop are located
      within a single leaf - this is the leaf where our new file extent item ends
      up at. Therefore, in this common case just combine these 2 operations into
      a single one.
      
      By avoiding the second btree navigation for insertion of the new file extent
      item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
      COW operations, CPU time on btree node/leaf key binary searches, etc.
      
      Besides for file writes, this is an operation that happens for file fsync's
      as well. However log btrees are much less likely to big as big as regular
      fs btrees, therefore the impact of this change is smaller.
      
      The following benchmark was performed against an SSD drive and a
      HDD drive, both for random and sequential writes:
      
        sysbench --test=fileio --file-num=4096 --file-total-size=8G \
           --file-test-mode=[rndwr|seqwr] --num-threads=512 \
           --file-block-size=8192 \ --max-requests=1000000 \
           --file-fsync-freq=0 --file-io-mode=sync [prepare|run]
      
      All results below are averages of 10 runs of the respective test.
      
      ** SSD sequential writes
      
      Before this change: 225.88 Mb/sec
      After this change:  277.26 Mb/sec
      
      ** SSD random writes
      
      Before this change: 49.91 Mb/sec
      After this change:  56.39 Mb/sec
      
      ** HDD sequential writes
      
      Before this change: 68.53 Mb/sec
      After this change:  69.87 Mb/sec
      
      ** HDD random writes
      
      Before this change: 13.04 Mb/sec
      After this change:  14.39 Mb/sec
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1acae57b
    • F
      Btrfs: fix use of uninitialized err variable · fc28b62d
      Filipe David Borba Manana 提交于
      fs/btrfs/file.c: In function ‘prepare_pages.isra.18’:
      fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized]
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fc28b62d
    • F
      Btrfs: fix ordered extent check in btrfs_punch_hole · 6126e3ca
      Filipe David Borba Manana 提交于
      If the ordered extent's last byte was 1 less than our region's
      start byte, we would unnecessarily wait for the completion of
      that ordered extent, because it doesn't intersect our target
      range.
      Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6126e3ca
    • M
      Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io · 376cc685
      Miao Xie 提交于
      When we ran sysbench on the fs with compression, the following WARN_ONs were
      triggered:
       fs/btrfs/inode.c:7829	WARN_ON(BTRFS_I(inode)->outstanding_extents);
       fs/btrfs/inode.c:7830	WARN_ON(BTRFS_I(inode)->reserved_extents);
       fs/btrfs/inode.c:7832	WARN_ON(BTRFS_I(inode)->csum_bytes);
      
      Steps to reproduce:
       # mkfs.btrfs -f <dev>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync prepare
       # cd -
       # umount <mnt>
       # mount -o compress <dev> <mnt>
       # cd <mnt>
       # sysbench --test=fileio --num-threads=8 --file-total-size=8G \
       > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \
       > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \
       > --file-test-mode=sync run
       # cd -
       # umount <mnt>
      
      The reason of this problem is:
      Task0				Task1
      btrfs_direct_IO
        unlock(&inode->i_mutex)
      				lock(&inode->i_mutex)
      				reserve_space()
      				prepare_pages()
      				  lock_extent()
      				  clear_extent()
      				  unlock_extent()
        lock_extent()
        test_extent(uptodate)
          return false
      				copy_data()
      				set_delalloc_extent()
        extent need compress
          go back to buffered write
        clear_extent(DELALLOC | DIRTY)
        unlock_extent()
      
      Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which
      was set by task1, it made the dirty pages in that extents couldn't be flushed
      into the disk, so the reserved space for that extent was not released at
      the end.
      
      This patch fixes the above bug by unlocking the extent after the delalloc.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      376cc685
    • M
      Btrfs: cleanup unnecessary parameter and variant of prepare_pages() · b37392ea
      Miao Xie 提交于
      - the caller has gotten the inode object, needn't pass the file object.
        And if so, we needn't define a inode pointer variant.
      - the position should be aligned by the page size not sector size, so
        we also needn't pass the root object into prepare_pages().
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b37392ea
    • J
      Btrfs: incompatible format change to remove hole extents · 16e7549f
      Josef Bacik 提交于
      Btrfs has always had these filler extent data items for holes in inodes.  This
      has made somethings very easy, like logging hole punches and sending hole
      punches.  However for large holey files these extent data items are pure
      overhead.  So add an incompatible feature to no longer add hole extents to
      reduce the amount of metadata used by these sort of files.  This has a few
      changes for logging and send obviously since they will need to detect holes and
      log/send the holes if there are any.  I've tested this thoroughly with xfstests
      and it doesn't cause any issues with and without the incompat format set.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      16e7549f
  9. 12 11月, 2013 4 次提交
  10. 21 9月, 2013 1 次提交
  11. 04 9月, 2013 1 次提交
  12. 01 9月, 2013 3 次提交
  13. 10 8月, 2013 1 次提交
    • J
      Btrfs: allow splitting of hole em's when dropping extent cache · ee20a983
      Josef Bacik 提交于
      I noticed while running multi-threaded fsync tests that sometimes fsck would
      complain about an improper gap.  This happens because we fail to add a hole
      extent to the file, which was happening when we'd split a hole EM because
      btrfs_drop_extent_cache was just discarding the whole em instead of splitting
      it.  So this patch fixes this by allowing us to split a hole em properly, which
      means that added holes actually get logged properly and we no longer see this
      fsck error.  Thankfully we're tolerant of these sort of problems so a user would
      not see any adverse effects of this bug, other than fsck complaining.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      ee20a983
  14. 03 7月, 2013 1 次提交
    • J
      vfs: export lseek_execute() to modules · 46a1c2c7
      Jie Liu 提交于
      For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
      SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
      matter in lseek_execute() to update the current file offset
      to the desired offset if it is valid, ceph also does the
      simliar things at ceph_llseek().
      
      To reduce the duplications, this patch make lseek_execute()
      public accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      46a1c2c7
  15. 02 7月, 2013 1 次提交
    • J
      Btrfs: check if we can nocow if we don't have data space · 7ee9e440
      Josef Bacik 提交于
      We always just try and reserve data space when we write, but if we are out of
      space but have prealloc'ed extents we should still successfully write.  This
      patch will try and see if we can write to prealloc'ed space and if we can go
      ahead and allow the write to continue.  With this patch we now pass xfstests
      generic/274.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      7ee9e440
  16. 01 7月, 2013 1 次提交
    • J
      Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate · a71754fc
      Josef Bacik 提交于
      This has plagued us forever and I'm so over working around it.  When we truncate
      down to a non-page aligned offset we will call btrfs_truncate_page to zero out
      the end of the page and write it back to disk, this will keep us from exposing
      stale data if we truncate back up from that point.  The problem with this is it
      requires data space to do this, and people don't really expect to get ENOSPC
      from truncate() for these sort of things.  This also tends to bite the orphan
      cleanup stuff too which keeps people from mounting.  To get around this we can
      just move this into btrfs_cont_expand() to make sure if we are truncating up
      from a non-page size aligned i_size we will zero out the rest of this page so
      that we don't expose stale data.  This will give ENOSPC if you try to truncate()
      up or if you try to write past the end of isize, which is much more reasonable.
      This fixes xfstests generic/083 failing to mount because of the orphan cleanup
      failing.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a71754fc