1. 02 Oct, 2014 1 commit
  2. 23 Sep, 2014 1 commit
    • Btrfs: try not to ENOSPC on log replay · 1d52c78a
      Josef Bacik authored
      When doing log replay we may have to update inodes, which traditionally goes
      through our delayed inode stuff.  This will try to move space over from the
      trans handle, but we don't reserve space in our trans handle on replay since we
      don't know how much we will need, so instead we try to flush.  But because we
      have a trans handle open we won't flush anything, so if we are out of reserve
      space we will simply return ENOSPC.  Since we know that if an operation made it
      into the log then we definitely had space before the box bought the farm, we
      don't need to worry about doing this space reservation.  Use the
      fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
      item directly.  Thanks,
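      A minimal sketch of the change in btrfs_update_inode() (the shape of the
      logic, not the verbatim diff; the surrounding conditions are shown only
      for context):

          if (!btrfs_is_free_space_inode(inode) &&
              root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID &&
              !root->fs_info->log_root_recovering) {
                  btrfs_update_root_times(trans, root);

                  ret = btrfs_delayed_update_inode(trans, root, inode);
                  if (!ret)
                          btrfs_set_inode_last_trans(trans, inode);
                  return ret;
          }

          /* During log replay, bypass the delayed inode machinery and write
           * the inode item into the subvolume tree directly. */
          return btrfs_update_inode_item(trans, root, inode);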
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      1d52c78a
  3. 18 Sep, 2014 12 commits
  4. 09 Sep, 2014 1 commit
    • Btrfs: use insert_inode_locked4 for inode creation · b0d5d10f
      Chris Mason authored
      Btrfs was inserting inodes into the hash table before we had fully
      set the inode up on disk.  This leaves us open to rare races that allow
      two different inodes in memory for the same [root, inode] pair.
      
      This patch fixes things by using insert_inode_locked4 to insert an I_NEW
      inode and unlock_new_inode when we're ready for the rest of the kernel
      to use the inode.
      
      It also makes sure to init the operations pointers on the inode before
      going into the error handling paths.
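      The insertion helper is roughly the following sketch (not the verbatim
      patch; btrfs_inode_hash() and btrfs_find_actor() are the existing iget
      helpers in fs/btrfs/inode.c):

          static int btrfs_insert_inode_locked(struct inode *inode)
          {
                  struct btrfs_iget_args args;

                  args.location = &BTRFS_I(inode)->location;
                  args.root = BTRFS_I(inode)->root;

                  /* Hash the inode while it is still I_NEW; a concurrent
                   * lookup for the same [root, inode] pair blocks until
                   * unlock_new_inode() is called after the on-disk setup. */
                  return insert_inode_locked4(inode,
                             btrfs_inode_hash(inode->i_ino, BTRFS_I(inode)->root),
                             btrfs_find_actor, &args);
          }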
      Signed-off-by: Chris Mason <clm@fb.com>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      b0d5d10f
  5. 03 Sep, 2014 2 commits
    • Btrfs: fix crash while doing a ranged fsync · dac5705c
      Filipe Manana authored
      While doing a ranged fsync, that is, one whose range doesn't cover the
      whole possible file range (0 to LLONG_MAX), we can crash under certain
      circumstances with a trace like the following:
      
      [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      (...)
      [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
      (...)
      [41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
      (...)
      [41074.644919] Stack:
      (...)
      [41074.644919] Call Trace:
      [41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
      [41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
      [41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
      [41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
      [41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
      [41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
      [41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
      [41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
      [41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
      [41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
      [41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
      (...)
      
      The necessary conditions that lead to such crash are:
      
      * an incremental fsync (when the inode doesn't have the
        BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
        a file extent item ending at offset X;
      
      * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
        to a file truncate operation that reduces the file to a size smaller
        than X;
      
      * a ranged fsync call happens (via an msync for example), with a range that
        doesn't cover the whole file and the end of this range, let's call it Y, is
        smaller than X;
      
      * btrfs_log_inode sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
        calls btrfs_truncate_inode_items() to remove all items from the log
        tree that are associated with our file;
      
      * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
        file extent item it removed is the one ending at offset X, where X > 0 and
        X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
        parameter set to X;
      
      * btrfs_ordered_update_i_size() sees that X is greater than the current ordered
        size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
        ordered operation with a range covering the offset X, hitting a BUG_ON() if
        such an ordered operation exists. This assumption is made because the disk_i_size
        is only increased after the corresponding file extent item is added to the
        btree (btrfs_finish_ordered_io);
      
      * But because our fsync covers only a limited range, such an ordered extent might
        exist, and our fsync callback (btrfs_sync_file) doesn't wait for such an ordered
        extent to finish when calling btrfs_wait_ordered_range();
      
      And then by the time btrfs_ordered_update_i_size() is called, via:
      
         btrfs_sync_file() ->
             btrfs_log_dentry_safe() ->
                 btrfs_log_inode_parent() ->
                     btrfs_log_inode() ->
                         btrfs_truncate_inode_items() ->
                             btrfs_ordered_update_i_size()
      
      We hit the BUG_ON(), which could never happen if the fsync range covered the whole
      possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
      finish before calling btrfs_truncate_inode_items().
      
      So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
      from a log tree, which isn't supposed to change the in-memory inode's disk_i_size.
      
      Issue found while running xfstests/generic/127 (happens very rarely for me), more
      specifically via the fsx calls that use memory mapped IO (and issue msync calls).
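      The fix itself boils down to a guard at the end of
      btrfs_truncate_inode_items() (a sketch; "last_size" stands for the offset
      the truncate computed and is an assumed name):

          /* Never touch the in-memory disk_i_size when we are only pruning
           * items from a log tree. */
          if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
                  btrfs_ordered_update_i_size(inode, last_size, NULL);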
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      dac5705c
    • Btrfs: fix corruption after write/fsync failure + fsync + log recovery · d9f85963
      Filipe Manana authored
      While writing to a file, in inode.c:cow_file_range() (and same applies to
      submit_compressed_extents()), after reserving an extent for the file data,
      we create a new extent map for the written range and insert it into the
      extent map cache. After that, we create an ordered operation, but if it
      fails (due to a transient/temporary ENOMEM), we return without dropping
      that extent map, which points to a reserved extent that is freed when we
      return. A subsequent incremental fsync (when the btrfs inode doesn't have
      the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
      logs a file extent item based on that extent map, which points to a disk
      extent that doesn't contain valid data - it was freed by us earlier and at this
      point might contain random garbage data.
      
      Therefore, if we reach an error condition when cowing a file range after
      we added the new extent map to the cache, drop it from the cache before
      returning.
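      A sketch of that error path in cow_file_range() (the range variables are
      the ones the function already tracks; the names are assumptions):

          /* The extent map inserted above points at a reserved extent that is
           * about to be freed; drop it so a later incremental fsync cannot log
           * a file extent item based on it. */
          btrfs_drop_extent_cache(inode, start, start + ram_size - 1, 0);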
      
      A sequence of steps that leads to this:
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o commit=9999 /dev/sdd /mnt
          $ cd /mnt
      
          $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
          $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
          $ sync
      
          $ od -t x1 foo
          0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
          *
          0020000
      
          $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
      
          # Now this write + fsync fail with -ENOMEM, which was returned by
          # btrfs_add_ordered_extent() in inode.c:cow_file_range().
          $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
          $ xfs_io -c "fsync" foo
          fsync: Cannot allocate memory
      
          # Now do a new write + fsync, which will succeed. Our previous
          # -ENOMEM was a transient/temporary error.
          $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
          $ xfs_io -c "fsync" foo
      
          # Our file content (in page cache) is now:
          $ od -t x1 foo
          0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
          *
          0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # Now reboot the machine, and mount the fs, so that fsync log replay
          # takes place.
      
          # The file content is now weird; in particular the first 8Kb match
          # our data neither before nor after the sync command above.
          $ od -t x1 foo
          0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # In fact these first 4Kb are a duplicate of the last 4Kb block.
          # The last write got an extent map/file extent item that points to
          # the same disk extent that we got in the write+fsync that failed
          # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
          # verify that:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      		extent data disk byte 12582912 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
      		extent data disk byte 0 nr 0
      		extent data offset 0 nr 8192 ram 8192
      	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
      		extent data disk byte 12582912 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      
          $ umount /dev/sdd
          $ btrfsck /dev/sdd
          Checking filesystem on /dev/sdd
          UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
          checking extents
          extent item 12582912 has multiple extent items
          ref mismatch on [12582912 4096] extent item 1, found 2
          Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
          backpointer mismatch on [12582912 4096]
          Errors found in extent allocation tree or chunk allocation
          checking free space cache
          checking fs roots
          root 5 inode 257 errors 1000, some csum missing
          found 131074 bytes used err is 1
          total csum bytes: 4
          total tree bytes: 131072
          total fs tree bytes: 32768
          total extent tree bytes: 16384
          btree space waste bytes: 123404
          file data blocks allocated: 274432
           referenced 274432
          Btrfs v3.14.1-96-gcc7fd5a-dirty
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d9f85963
  6. 24 Aug, 2014 1 commit
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo authored
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has now migrated to the kernel workqueue, but this introduced a hang problem.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that its
      ordered_func() must be processed in FIFO order, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we find free space, we get an uncached block
      group, then we go to read its free space cache inode for free space information,
      so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation:
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A sits at the head of wq->ordered_list and there are more ordered
      works queued after it, such as B, its memory can be freed before
      normal_work_helper() returns, which means that the kernel workqueue code in
      worker_thread() still has worker->current_work pointing to work A->normal_work,
      i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work C->normal_work
      and work A->normal_work are likely to share the same address (I confirmed this
      with ftrace output, so I'm not just guessing; it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds that our
      kthread is processing it (see find_worker_executing_work()), it treats
      work C as a collision and skips it, which ends up with nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueues share one work->func -- normal_work_helper.
      So this patch makes each workqueue have its own helper function, which is only a
      wrapper of normal_work_helper.
      
      With this patch, I no longer hit the above hang.
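      A sketch of the approach (the macro shape and names are illustrative, and
      it assumes normal_work_helper() is changed to take the btrfs_work
      directly; this is not the exact patch): every queue type gets its own
      trampoline, so worker->current_func now differs between queues even when
      two btrfs_work structures happen to reuse the same address.

          #define BTRFS_WORK_HELPER(name)                                    \
          void btrfs_##name(struct work_struct *arg)                         \
          {                                                                  \
                  struct btrfs_work *work;                                   \
                                                                             \
                  work = container_of(arg, struct btrfs_work, normal_work);  \
                  normal_work_helper(work);                                  \
          }

          /* one helper per workqueue, for example: */
          BTRFS_WORK_HELPER(endio_helper);
          BTRFS_WORK_HELPER(delalloc_helper);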
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9e0af237
  7. 21 Aug, 2014 4 commits
    • btrfs: Use right extent length when inserting overlap extent map. · 51f395ad
      Qu Wenruo authored
      When current btrfs finds that a new extent map is going to be inserted
      but the insert fails with -EEXIST, it will try again to insert the extent map,
      but with a length of sectorsize.
      This is OK if we don't enable the 'no-holes' feature, since all extent space
      is contiguous and we will not go into the not found->insert routine.
      
      But if we enable the 'no-holes' feature, things get out of control.
      e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent():
      btrfs_get_extent() args: start:  27874 len 4100
      28672		  27874		28672	27874+4100	32768
                          |-----------------------|
      |---------hole--------------------|---------data----------|
      
      1) not found and insert
      Since no extent map contains the range, btrfs_get_extent() will go
      into the not_found and insert routine, which will try to insert the
      extent map (27874, 27874 + 4100).
      
      2) first overlap
      But it overlaps with the (28672, 32768) extent, so -EEXIST will be returned
      by add_extent_mapping().
      
      3) retry but still overlap
      After catching the -EEXIST, btrfs_get_extent() will try to insert it
      again, but with a 4K length, which still overlaps, so -EEXIST will be
      returned.
      
      This makes the following patch fail to punch a hole.
      d7781546 btrfs: Avoid trucating page or punching hole in a already existed hole.
      
      This patch will use the right length, which is (existing->start -
      em->start), for the insert, making the above patch work in 'no-holes' mode.
      Also, some small code style problems in the above patch are fixed too.
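      The core of the fix, in the -EEXIST retry path of btrfs_get_extent(), is
      to size the retried map up to where the existing map begins (a sketch;
      "existing" is the overlapping map found by the lookup):

          /* For the example above this yields [27874, 28672), which no longer
           * overlaps the existing [28672, 32768) extent map. */
          em->len = existing->start - em->start;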
      Reported-by: Filipe David Manana <fdmanana@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe David Manana <fdmanana@suse.com>
      Tested-by: Filipe David Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      51f395ad
    • Btrfs: don't monopolize a core when evicting inode · 7064dd5c
      Filipe Manana authored
      If an inode has a very large number of extent maps, we can spend
      a lot of time freeing them, which triggers a soft lockup warning.
      Therefore reschedule if we need to when freeing the extent maps
      while evicting the inode.
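      The reschedule point looks roughly like this in the loop that tears down
      the inode's extent map tree (a sketch based on the description above;
      map_tree is &BTRFS_I(inode)->extent_tree):

          write_lock(&map_tree->lock);
          while (!RB_EMPTY_ROOT(&map_tree->map)) {
                  struct rb_node *node = rb_first(&map_tree->map);
                  struct extent_map *em;

                  em = rb_entry(node, struct extent_map, rb_node);
                  remove_extent_mapping(map_tree, em);
                  free_extent_map(em);

                  /* Freeing millions of maps can take a long time; give the
                   * scheduler a chance instead of tripping the soft lockup
                   * detector. */
                  if (need_resched()) {
                          write_unlock(&map_tree->lock);
                          cond_resched();
                          write_lock(&map_tree->lock);
                  }
          }
          write_unlock(&map_tree->lock);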
      
      I could trigger this all the time by running xfstests/generic/299 on
      a file system with the no-holes feature enabled. That test creates
      an inode with 11386677 extent maps.
      
          $ mkfs.btrfs -f -O no-holes $TEST_DEV
          $ MKFS_OPTIONS="-O no-holes" ./check generic/299
          generic/299 382s ...
          Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
           kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
           384s
          Ran: generic/299
          Passed all 1 tests
      
          $ dmesg
          (...)
          [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
          (...)
          [86304.300036] Call Trace:
          [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
          [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
          [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
          [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
          [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
          [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
          [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
          [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
          [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
          [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
          [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
          [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
          [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
          [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
          [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
          [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
          (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7064dd5c
    • Btrfs: ensure tmpfile inode is always persisted with link count of 0 · 5762b5c9
      Filipe Manana authored
      If we open a file with O_TMPFILE, don't do any further operation on
      it (so that the inode item isn't updated) and then force a transaction
      commit, we get a persisted inode item with a link count of 1, and not 0
      as it should be.
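      One plausible shape of the fix in btrfs_tmpfile() (an assumption based on
      the description, not the verified patch): persist the inode with a link
      count of 0 plus an orphan item, and only raise the in-memory count right
      before d_tmpfile(), whose decrement brings it back to 0.

          ret = btrfs_orphan_add(trans, inode);
          if (ret)
                  goto out;

          /* The inode item was written with nlink 0; bump the in-memory count
           * so d_tmpfile()'s drop_nlink() does not warn, and it ends at 0 again. */
          set_nlink(inode, 1);
          d_tmpfile(dentry, inode);
          mark_inode_dirty(inode);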
      
      Steps to reproduce it (requires a modern xfs_io with -T support):
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
          $ xfs_io -T /mnt &
          $ sync
      
      Then btrfs-debug-tree shows the inode item with a link count of 1:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
          fs tree key (FS_TREE ROOT_ITEM 0)
          leaf 29556736 items 4 free space 15851 generation 6 owner 5
          fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
          chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
          	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
      		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
          	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
          		inode ref index 0 namelen 2 name: ..
          	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
          		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
          	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
      		orphan item
          checksum tree key (CSUM_TREE ROOT_ITEM 0)
          (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      5762b5c9
    • Btrfs: race free update of commit root for ro snapshots · 9c3b306e
      Filipe Manana authored
      This is a better solution for the problem addressed in the following
      commit:
      
          Btrfs: update commit root on snapshot creation after orphan cleanup
          (3821f348)
      
      The previous solution wasn't the best because of 2 reasons:
      
          1) It added another full transaction commit, which is more expensive
             than just swapping the commit root with the root;
      
          2) If a reboot happened after the first transaction commit (the one
             that creates the snapshot) and before the second transaction commit,
             then we would end up with the same problem if a send using that
             snapshot was requested before the first transaction commit after
             the reboot.
      
      This change addresses those 2 issues. The second issue is addressed by
      switching the commit root in the dentry lookup VFS callback, which is
      also called by the snapshot/subvol creation ioctl and performs orphan
      cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
      preventing race issues between a dentry lookup and snapshot creation.
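      A sketch of the commit root switch described above (the lock choice and
      helper usage are assumptions, not the verbatim patch):

          /* After orphan cleanup on a read-only snapshot, point the commit
           * root at the current root node so send sees the cleaned-up tree
           * without another transaction commit. */
          down_write(&fs_info->commit_root_sem);
          free_extent_buffer(sub_root->commit_root);
          sub_root->commit_root = btrfs_root_node(sub_root);
          up_write(&fs_info->commit_root_sem);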
      
      Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9c3b306e
  8. 19 Aug, 2014 3 commits
  9. 15 Aug, 2014 2 commits
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason authored
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best-effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
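      The best-effort replacement is essentially a filemap_flush() of the old
      file's pages at rename time (a sketch; the exact call site and condition
      are assumptions):

          /* Best effort: push the new version's dirty pages toward disk, but
           * don't make the transaction commit wait on them. */
          if (S_ISREG(old_inode->i_mode))
                  filemap_flush(old_inode->i_mapping);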
      Signed-off-by: Chris Mason <clm@fb.com>
      8d875f95
    • Btrfs: fix compressed write corruption on enospc · ce62003f
      Liu Bo authored
      When failing to allocate space for the whole compressed extent, we'll
      fall back to uncompressed IO, but we've forgotten to redirty the pages
      which belong to this compressed extent, so these 'clean' pages simply
      skip the 'submit' part and go to endio directly; in the end we get data
      corruption as we write nothing.
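      The missing step amounts to walking the pages of the failed range and
      marking them dirty again before handing them to the uncompressed path.
      A sketch using 3.16-era page cache calls (the helper name is made up for
      illustration):

          static void redirty_range_for_io(struct inode *inode, u64 start, u64 end)
          {
                  unsigned long index = start >> PAGE_CACHE_SHIFT;
                  unsigned long end_index = end >> PAGE_CACHE_SHIFT;

                  while (index <= end_index) {
                          struct page *page = find_get_page(inode->i_mapping, index);

                          if (page) {
                                  /* make the normal (uncompressed) writeback
                                   * path pick this page up again */
                                  __set_page_dirty_nobuffers(page);
                                  page_cache_release(page);
                          }
                          index++;
                  }
          }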
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Tested-by: Martin Steigerwald <martin@lichtvoll.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      ce62003f
  10. 08 Aug, 2014 1 commit
  11. 20 Jun, 2014 1 commit
    • Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie authored
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in the extent
      tree) until the file data was written to disk. During this time, there was
      no information about the allocated space in either the extent tree or the
      free space cache. When we wrote out the free space cache at this time (commit
      transaction), that space was lost. In fact, only the free space that is
      used to store the file data had this problem; the others didn't, because
      their metadata is updated in the same transaction context.
      
      There are several methods that could fix the above problem:
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may hurt performance.
      This patch chooses the second method: we use a per-block-group variable to
      account for the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache writeout.
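      The two additions might look roughly like this in struct
      btrfs_block_group_cache (field names are assumptions drawn from the
      description):

          struct btrfs_block_group_cache {
                  /* ... existing fields ... */

                  /* Bytes handed out to data allocations whose extent items
                   * are not yet inserted into the extent tree; while this is
                   * non-zero the free space cache must not be written out. */
                  u64 delalloc_bytes;

                  /* Taken for read by data allocation and for write by the
                   * free space cache writeout, so the two cannot race. */
                  struct rw_semaphore data_rwsem;
          };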
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e570fd27
  12. 10 Jun, 2014 11 commits