1. 21 8月, 2014 3 次提交
    • F
      Btrfs: don't monopolize a core when evicting inode · 7064dd5c
      Filipe Manana 提交于
      If an inode has a very large number of extent maps, we can spend
      a lot of time freeing them, which triggers a soft lockup warning.
      Therefore reschedule if we need to when freeing the extent maps
      while evicting the inode.
      
      I could trigger this all the time by running xfstests/generic/299 on
      a file system with the no-holes feature enabled. That test creates
      an inode with 11386677 extent maps.
      
          $ mkfs.btrfs -f -O no-holes $TEST_DEV
          $ MKFS_OPTIONS="-O no-holes" ./check generic/299
          generic/299 382s ...
          Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
           kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
           384s
          Ran: generic/299
          Passed all 1 tests
      
          $ dmesg
          (...)
          [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
          (...)
          [86304.300036] Call Trace:
          [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
          [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
          [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
          [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
          [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
          [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
          [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
          [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
          [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
          [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
          [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
          [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
          [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
          [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
          [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
          [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
          (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: NSatoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7064dd5c
    • F
      Btrfs: ensure tmpfile inode is always persisted with link count of 0 · 5762b5c9
      Filipe Manana 提交于
      If we open a file with O_TMPFILE, don't do any further operation on
      it (so that the inode item isn't updated) and then force a transaction
      commit, we get a persisted inode item with a link count of 1, and not 0
      as it should be.
      
      Steps to reproduce it (requires a modern xfs_io with -T support):
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o /dev/sdd /mnt
          $ xfs_io -T /mnt &
          $ sync
      
      Then btrfs-debug-tree shows the inode item with a link count of 1:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
          fs tree key (FS_TREE ROOT_ITEM 0)
          leaf 29556736 items 4 free space 15851 generation 6 owner 5
          fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
          chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
          	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
      		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
          	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
          		inode ref index 0 namelen 2 name: ..
          	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
          		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
          	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
      		orphan item
          checksum tree key (CSUM_TREE ROOT_ITEM 0)
          (...)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5762b5c9
    • F
      Btrfs: race free update of commit root for ro snapshots · 9c3b306e
      Filipe Manana 提交于
      This is a better solution for the problem addressed in the following
      commit:
      
          Btrfs: update commit root on snapshot creation after orphan cleanup
          (3821f348)
      
      The previous solution wasn't the best because of 2 reasons:
      
          1) It added another full transaction commit, which is more expensive
             than just swapping the commit root with the root;
      
          2) If a reboot happened after the first transaction commit (the one
             that creates the snapshot) and before the second transaction commit,
             then we would end up with the same problem if a send using that
             snapshot was requested before the first transaction commit after
             the reboot.
      
      This change addresses those 2 issues. The second issue is addressed by
      switching the commit root in the dentry lookup VFS callback, which is
      also called by the snapshot/subvol creation ioctl and performs orphan
      cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
      preventing race issues between a dentry lookup and snapshot creation.
      
      Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9c3b306e
  2. 19 8月, 2014 3 次提交
  3. 15 8月, 2014 2 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
    • L
      Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo 提交于
      When failing to allocate space for the whole compressed extent, we'll
      fallback to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, and these 'clean' pages will
      simply skip 'submit' part and go to endio directly, at last we got data
      corruption as we write nothing.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-By: NMartin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: NChris Mason <clm@fb.com>
      ce62003f
  4. 20 6月, 2014 1 次提交
    • M
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie 提交于
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in extent
      tree) until the file data was written into the disk. During this time, there was
      no information about the allocated spaces in either the extent tree nor the
      free space cache. when we wrote out the free space cache at this time (commit
      transaction), those spaces were lost. In fact, only the free space that is
      used to store the file data had this problem, the others didn't because
      the metadata of them is updated in the same transaction context.
      
      There are many methods which can fix the above problem
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may make the performance drop down.
      This patch chose the second method, we use a per-block-group variant to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write out.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e570fd27
  5. 10 6月, 2014 13 次提交
  6. 07 5月, 2014 4 次提交
  7. 18 4月, 2014 1 次提交
  8. 08 4月, 2014 3 次提交
    • W
      Btrfs: fix unlock in __start_delalloc_inodes() · a1ecaabb
      Wang Shilong 提交于
      This patch fix a regression caused by the following patch:
      Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock
      
      break while loop will make us call @spin_unlock() without
      calling @spin_lock() before, fix it.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      a1ecaabb
    • W
      Btrfs: don't compress for a small write · 68bb462d
      Wang Shilong 提交于
      To compress a small file range(<=blocksize) that is not
      an inline extent can not save disk space at all. skip it can
      save us some cpu time.
      
      This patch can also fix wrong setting nocompression flag for
      inode, say a case when @total_in is 4096, and then we get
      @total_compressed 52,because we do aligment to page cache size
      firstly, and then we get into conclusion @total_in=@total_compressed
      thus we will clear this inode's compression flag.
      
      An exception comes from inserting inline extent failure but we
      still have @total_compressed < @total_in,so we will still reset
      inode's flag, this is ok, because we don't have good compression
      effect.
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      68bb462d
    • W
      Btrfs: fix snapshot vs nocow writting · e9894fd3
      Wang Shilong 提交于
      While running fsstress and snapshots concurrently, we will hit something
      like followings:
      
      Thread 1			Thread 2
      
      |->fallocate
        |->write pages
          |->join transaction
             |->add ordered extent
          |->end transaction
      				|->flushing data
      				  |->creating pending snapshots
      |->write data into src root's
         fallocated space
      
      After above work flows finished, we will get a state that source and
      snapshot root share same space, but source root have written data into
      fallocated space, this will make fsck fail to verify checksums for
      snapshot root's preallocating file extent data.Nocow writting also
      has this same problem.
      
      Fix this problem by syncing snapshots with nocow writting:
      
       1.for nocow writting,if there are pending snapshots, we will
       fall into COW way.
      
       2.if there are pending nocow writes, snapshots for this root
       will be blocked until nocow writting finish.
      Reported-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e9894fd3
  9. 04 4月, 2014 1 次提交
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  10. 11 3月, 2014 9 次提交