1. 22 Jan, 2015 3 commits
  2. 15 Jan, 2015 1 commit
  3. 03 Jan, 2015 1 commit
  4. 25 Nov, 2014 1 commit
    • F
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana committed
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possible to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
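
      The uncovered range can be recomputed from the dump above. The following is a
      minimal Python sketch, not how btrfs' fsck is implemented: align_up and find_gaps
      are hypothetical helper names, and the 4096-byte sector size is an assumption
      taken from the ALIGN() call above.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def find_gaps(extents, i_size, sectorsize=4096):
    """Return [start, end) ranges below the aligned i_size that no extent covers.

    extents is a list of (file_offset, length) pairs, one per EXTENT_DATA item.
    """
    gaps = []
    expected = 0
    for offset, length in sorted(extents):
        if offset > expected:
            gaps.append((expected, offset))
        expected = max(expected, offset + length)
    end = align_up(i_size, sectorsize)
    if expected < end:
        gaps.append((expected, end))
    return gaps


# (key offset, "nr" bytes) of the two EXTENT_DATA items in the snapshot dump.
snapshot_extents = [(0, 32768), (53248, 40960)]
gaps = find_gaps(snapshot_extents, i_size=90123)

# End of the second write, rounded up as in ALIGN(16K + 32770, 4096).
write_end = align_up(16 * 1024 + 32770, 4096)

print(gaps, write_end)
```

      The single gap starts at 32K and ends exactly at the aligned end of the second
      write, which is the range the write would have filled had it been captured.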
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      reflects one of the following states:
      
      1) the file is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But it should never capture a state where only the first write and the truncation
      were captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 22 Nov, 2014 1 commit
    • F
      Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
      Filipe Manana committed
      When doing a fsync with a fast path we have a time window where we can miss
      the fact that writeback of some file data failed, and therefore we end up
      returning success (0) from fsync when we should return an error.
      The steps that lead to this are the following:
      
      1) We start all ordered extents by calling filemap_fdatawrite_range();
      
      2) We do some other work like locking the inode's i_mutex, start a transaction,
         start a log transaction, etc;
      
      3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
         ordered extents from inode's ordered tree into a list;
      
      4) But by the time we do ordered extent collection, some ordered extents we started
         at step 1) might have already completed with an error, and therefore we didn't
         find them in the ordered tree and had no idea they finished with an error. This
         makes our fsync return success (0) to userspace, but has no bad effects on the log
         like for example insertion of file extent items into the log that point to unwritten
         extents, because the invalid extent maps were removed before the ordered extent
         completed (in inode.c:btrfs_finish_ordered_io).
      
      So after collecting the ordered extents just check if the inode's i_mapping has any
      error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
      writeback fails for a page of an ordered extent, we call mapping_set_error (done in
      extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
      that sets one of those error flags in the inode's i_mapping flags.
      
      This change also has the side effect of fixing the issue where for fast fsyncs we
      never checked/cleared the error flags from the inode's i_mapping flags, which means
      that a full fsync performed after a fast fsync could get such errors that belonged
      to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
      calls filemap_fdatawait_range(), and the latter checks for and clears those flags,
      while for fast fsyncs we never call filemap_fdatawait_range() or anything else
      that checks for and clears the error flags from the inode's i_mapping.
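
      The window described in steps 1) to 4) can be modelled in a few lines of Python.
      This is a toy model under stated assumptions, not kernel code: ToyInode,
      check_and_clear_errors and fast_fsync are hypothetical names, and the flag set
      merely stands in for the inode's i_mapping flags.

```python
import errno


class ToyInode:
    def __init__(self):
        self.mapping_flags = set()   # stands in for the i_mapping flags
        self.ordered_tree = []       # ordered extents still in flight


def start_writeback(inode, name):
    inode.ordered_tree.append(name)


def complete_with_error(inode, name):
    # A failed ordered extent leaves the tree; the failure is only recorded
    # as an error flag on the mapping.
    inode.ordered_tree.remove(name)
    inode.mapping_flags.add("AS_EIO")


def check_and_clear_errors(inode):
    """Test-and-clear the error flags, returning 0 or a negative errno."""
    ret = 0
    if "AS_ENOSPC" in inode.mapping_flags:
        inode.mapping_flags.discard("AS_ENOSPC")
        ret = -errno.ENOSPC
    if "AS_EIO" in inode.mapping_flags:
        inode.mapping_flags.discard("AS_EIO")
        ret = -errno.EIO
    return ret


def fast_fsync(inode, check_mapping):
    # Step 3: collect ordered extents; none of them carries the error, because
    # the failed one already left the tree.
    collected = list(inode.ordered_tree)
    if check_mapping:
        ret = check_and_clear_errors(inode)
        if ret:
            return ret
    return 0


inode = ToyInode()
start_writeback(inode, "oe1")
complete_with_error(inode, "oe1")

before_fix = fast_fsync(inode, check_mapping=False)   # error is missed
after_fix = fast_fsync(inode, check_mapping=True)     # error is reported
print(before_fix, after_fix)
```

      Without the check the flag also stays set, which is how a later full fsync
      could pick up an error that belonged to the fast fsync.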
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  6. 21 Nov, 2014 10 commits
    • F
      Btrfs: report error after failure inlining extent in compressed write path · e6eb4314
      Filipe Manana committed
      If cow_file_range_inline() failed, when called from compress_file_range(),
      we were tagging the locked page for writeback, ending its writeback and unlocking
      it, but not marking it with an error nor setting AS_EIO in the inode's mapping flags.
      
      This made it impossible for a caller of filemap_fdatawrite_range (writepages)
      or filemap_fdatawait_range() to know that an error happened. And the return
      value of compress_file_range() is useless because it's returned to a workqueue
      task and not to the task calling filemap_fdatawrite_range (writepages).
      
      This change applies on top of the previous patchset starting at the patch
      titled:
      
          "[1/5] Btrfs: set page and mapping error on compressed write failure"
      
      Which changed extent_clear_unlock_delalloc() to use SetPageError and
      mapping_set_error().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Filipe Manana committed
      Add a helper to avoid duplicating, in so many places, the double
      filemap_fdatawrite_range() call that inodes with async extents (compressed
      writes) need.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: correctly flush compressed data before/after direct IO · 075bdbdb
      Filipe Manana committed
      For compressed writes, after doing the first filemap_fdatawrite_range() we
      don't get the pages tagged for writeback immediately. Instead we create
      a workqueue task, which is run by another kthread, and keep the pages locked.
      That other kthread compresses data, creates the respective ordered extent/s,
      tags the pages for writeback and unlocks them. Therefore we need a second
      call to filemap_fdatawrite_range() if we have compressed writes, as this
      second call will wait for the pages to become unlocked, then see they became
      tagged for writeback and finally wait for the writeback to finish.
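
      The control flow of such a double call can be sketched as follows. This is an
      illustrative Python model, not the kernel helper: fdatawrite_range_twice,
      toy_fdatawrite_range and the has_async_extent attribute are hypothetical names
      chosen for this sketch.

```python
class ToyInode:
    def __init__(self, compressed):
        self.compressed = compressed
        self.has_async_extent = False
        self.calls = []   # records every (start, end) flush request


def toy_fdatawrite_range(inode, start, end):
    """Stand-in for a single filemap_fdatawrite_range() call; returns 0."""
    inode.calls.append((start, end))
    if inode.compressed and not inode.has_async_extent:
        # First pass on a compressed inode: the work is handed to a worker
        # and the pages stay locked, so nothing is under writeback yet.
        inode.has_async_extent = True
    return 0


def fdatawrite_range_twice(inode, start, end):
    """Flush once, and a second time only if async extents are pending."""
    ret = toy_fdatawrite_range(inode, start, end)
    if ret == 0 and inode.has_async_extent:
        ret = toy_fdatawrite_range(inode, start, end)
    return ret


plain = ToyInode(compressed=False)
compressed = ToyInode(compressed=True)
fdatawrite_range_twice(plain, 0, 4095)
fdatawrite_range_twice(compressed, 0, 4095)
print(len(plain.calls), len(compressed.calls))
```

      Keeping the second call conditional means inodes without compressed writes pay
      for a single flush only.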
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: make inode.c:compress_file_range() return void · c44f649e
      Filipe Manana committed
      Its return value is useless, its single caller ignores it and can't do
      anything with it anyway, since it's a workqueue task and not the task
      calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
      Failure is communicated to such functions via start and end of writeback
      with the respective pages tagged with an error and AS_EIO flag set in the
      inode's i_mapping.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • S
      Btrfs: fix incorrect compression ratio detection · 4bcbb332
      Shilong Wang committed
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sdb
       # mount -t btrfs /dev/sdb /mnt -o compress=lzo
       # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
      
      After the previous steps, the inode is detected as having a bad compression
      ratio and the NOCOMPRESS flag is set for that inode.
      
      The reason is that compression handles at most 128K worth of pages per pass, so
      a 132K write is split into two writes (128K + 4K). This bug is a leftover from
      commit 68bb462d (Btrfs: don't compress for a small write).
      
      Fix this problem by checking before every compression pass: if it is a
      small write (<= blocksize), we bail out and fall back to no compression directly.
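
      The arithmetic behind the reproducer can be checked with a short Python sketch.
      It is only an illustration under stated assumptions: split_for_compression and
      should_try_compression are hypothetical names, the 128K per-pass limit comes
      from the text above, and a 4096-byte block size is assumed.

```python
MAX_COMPRESS_BYTES = 128 * 1024   # per-pass limit quoted in the commit message
BLOCKSIZE = 4096                  # assumed block size


def split_for_compression(total_bytes):
    """Split a write into (offset, length) passes of at most 128K each."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        length = min(MAX_COMPRESS_BYTES, total_bytes - offset)
        chunks.append((offset, length))
        offset += length
    return chunks


def should_try_compression(length):
    """The per-pass check: a pass of at most one block skips compression."""
    return length > BLOCKSIZE


# Same size as "dd bs=$((33*4096)) count=1" in the reproducer.
chunks = split_for_compression(33 * 4096)
decisions = [should_try_compression(length) for _, length in chunks]
print(chunks, decisions)
```

      The trailing 4K pass is the one that used to be judged on its compression ratio;
      with the per-pass check it is simply written uncompressed.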
      Signed-off-by: Wang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: make inode.c:submit_compressed_extents() return void · dec8f175
      Filipe Manana committed
      Its return value is completely ignored by its single caller and it's
      useless anyway, since errors are indicated through SetPageError and
      the bit AS_EIO set in the flags of the inode's mapping. The caller
      can't do anything with the value, as it's invoked from a workqueue
      task and not by the task calling filemap_fdatawrite_range (which calls
      the writepages address space callback, which in turn calls the inode's
      fill_delalloc callback).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: process all async extents on compressed write failure · 3d7a820f
      Filipe Manana committed
      If we had an error when processing one of the async extents from our list,
      we were not processing the remaining async extents, meaning we would leak
      those async_extent structs, never release the pages with the compressed
      data and never unlock and clear the dirty flag from the inode's pages (those
      that correspond to the uncompressed content).
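
      The difference between stopping at the first error and draining the whole list
      can be shown with a toy Python model. It is a sketch, not the kernel loop:
      AsyncExtent, submit_one and process_list are hypothetical names, and "cleaned_up"
      stands for freeing the struct, releasing the compressed pages and unlocking the
      inode's pages.

```python
import errno


class AsyncExtent:
    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails
        self.cleaned_up = False


def submit_one(extent):
    """Pretend to submit one async extent; return 0 or a negative errno."""
    return -errno.ENOMEM if extent.fails else 0


def process_list(extents, stop_on_error):
    """Walk the list and return the names of extents never cleaned up."""
    for extent in extents:
        ret = submit_one(extent)
        extent.cleaned_up = True
        if ret and stop_on_error:
            break   # old behaviour: the remaining extents are never visited
    return [e.name for e in extents if not e.cleaned_up]


def make_list():
    return [AsyncExtent("a"), AsyncExtent("b", fails=True), AsyncExtent("c")]


leaked_before = process_list(make_list(), stop_on_error=True)
leaked_after = process_list(make_list(), stop_on_error=False)
print(leaked_before, leaked_after)
```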
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: don't leak pages and memory on compressed write error · 40ae837b
      Filipe Manana committed
      In inode.c:submit_compressed_extents(), if we fail before calling
      btrfs_submit_compressed_write(), or when that function fails, we
      were freeing the async_extent structure without releasing its pages
      and freeing the pages array.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix hang on compressed write error · fce2a4e6
      Filipe Manana committed
      In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
      we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
      btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
      we never end the writeback on the pages, so any filemap_fdatawait_range() call will
      hang forever. We were also not calling the writepage end io hook, which means the
      corresponding ordered extent will never complete and all its waiters will block
      forever, such as a full fsync (via btrfs_wait_ordered_range()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: set page and mapping error on compressed write failure · 704de49d
      Filipe Manana committed
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync or transaction commit,
      for example) to know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  7. 20 Nov, 2014 1 commit
  8. 17 Oct, 2014 1 commit
    • C
      Revert "Btrfs: race free update of commit root for ro snapshots" · d3797308
      Chris Mason committed
      This reverts commit 9c3b306e.
      
      Switching only one commit root during a transaction is wrong because it
      leads the fs into an inconsistent state. All commit roots should be
      switched at once, at transaction commit time, otherwise backref walking
      can often miss important references that were only accessible through
      the old commit root.  Plus, the root item for the snapshot's root wasn't
      getting updated, and the next transaction commit was prevented from doing it.
      
      This made several users get into random corruption issues after creation
      of readonly snapshots.
      
      A regression test for xfstests will follow soon.
      
      Cc: stable@vger.kernel.org # 3.17
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 04 Oct, 2014 1 commit
  10. 02 Oct, 2014 3 commits
  11. 23 Sep, 2014 1 commit
    • J
      Btrfs: try not to ENOSPC on log replay · 1d52c78a
      Josef Bacik committed
      When doing log replay we may have to update inodes, which traditionally goes
      through our delayed inode stuff.  This will try to move space over from the
      trans handle, but we don't reserve space in our trans handle on replay since we
      don't know how much we will need, so instead we try to flush.  But because we
      have a trans handle open we won't flush anything, so if we are out of reserve
      space we will simply return ENOSPC.  Since we know that if an operation made it
      into the log then we definitely had space before the box bought the farm then we
      don't need to worry about doing this space reservation.  Use the
      fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
      item directly.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  12. 18 Sep, 2014 12 commits
  13. 09 Sep, 2014 1 commit
    • C
      Btrfs: use insert_inode_locked4 for inode creation · b0d5d10f
      Chris Mason committed
      Btrfs was inserting inodes into the hash table before we had fully
      set the inode up on disk.  This leaves us open to rare races that allow
      two different inodes in memory for the same [root, inode] pair.
      
      This patch fixes things by using insert_inode_locked4 to insert an I_NEW
      inode and unlock_new_inode when we're ready for the rest of the kernel
      to use the inode.
      
      It also makes sure to init the operations pointers on the inode before
      going into the error handling paths.
      Signed-off-by: Chris Mason <clm@fb.com>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  14. 03 Sep, 2014 2 commits
    • F
      Btrfs: fix crash while doing a ranged fsync · dac5705c
      Filipe Manana committed
      While doing a ranged fsync, that is, one whose range doesn't cover the
      whole possible file range (0 to LLONG_MAX), we can crash under certain
      circumstances with a trace like the following:
      
      [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      (...)
      [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1
      (...)
      [41074.643886] RIP: 0010:[<ffffffffa01ecc99>]  [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs]
      (...)
      [41074.644919] Stack:
      (...)
      [41074.644919] Call Trace:
      [41074.644919]  [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs]
      [41074.644919]  [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs]
      [41074.644919]  [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs]
      [41074.644919]  [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0
      [41074.644919]  [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10
      [41074.644919]  [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10
      [41074.644919]  [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs]
      [41074.644919]  [<ffffffff811d0c55>] ? dget_parent+0x5/0x180
      [41074.644919]  [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs]
      [41074.644919]  [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs]
      [41074.644919]  [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30
      (...)
      
      The necessary conditions that lead to such crash are:
      
      * an incremental fsync (when the inode doesn't have the
        BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged
        a file extent item ending at offset X;
      
      * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due
        to a file truncate operation that reduces the file to a size smaller
        than X;
      
      * a ranged fsync call happens (via an msync for example), with a range that
        doesn't cover the whole file and the end of this range, let's call it Y, is
        smaller than X;
      
      * btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and
        calls btrfs_truncate_inode_items() to remove all items from the log
        tree that are associated with our file;
      
      * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest
        file extent item it removed is the one ending at offset X, where X > 0 and
        X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset
        parameter set to X;
      
      * btrfs_ordered_update_i_size() sees that X is greater than the current ordered
        size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing
        ordered operation with a range covering the offset X, calling a BUG_ON() if
        such ordered operation exists. This assumption is made because the disk_i_size
        is only increased after the corresponding file extent item is added to the
        btree (btrfs_finish_ordered_io);
      
      * But because our fsync covers only a limited range, such an ordered extent might
        exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered
        extent to finish when calling btrfs_wait_ordered_range();
      
      And then by the time btrfs_ordered_update_i_size() is called, via:
      
         btrfs_sync_file() ->
             btrfs_log_dentry_safe() ->
                 btrfs_log_inode_parent() ->
                     btrfs_log_inode() ->
                         btrfs_truncate_inode_items() ->
                             btrfs_ordered_update_i_size()
      
      We hit the BUG_ON(), which could never happen if the fsync range covered the whole
      possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to
      finish before calling btrfs_truncate_inode_items().
      
      So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items
      from a log tree, which isn't supposed to change the in memory inode's disk_i_size.
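
      Why a ranged fsync can leave such an ordered extent in flight is easy to see in
      a toy Python model. This is a sketch under stated assumptions, not kernel code:
      waited_extents and would_hit_bug_on are hypothetical names, ranges are inclusive
      (start, end) byte offsets, and the concrete values of X, Y and disk_i_size are
      made up for illustration.

```python
LLONG_MAX = 2**63 - 1


def waited_extents(ordered_extents, start, end):
    """Ordered extents overlapping the inclusive range the fsync waits on."""
    return [e for e in ordered_extents if e[0] <= end and e[1] >= start]


def would_hit_bug_on(ordered_extents, waited, x, disk_i_size):
    """True if X is past disk_i_size while an unwaited ordered extent covers X."""
    in_flight = [e for e in ordered_extents if e not in waited]
    return x > disk_i_size and any(s <= x <= e for s, e in in_flight)


ordered = [(8192, 12287)]   # one ordered extent still in flight
x = 12000                   # offset covered by that ordered extent
y = 4095                    # end of the ranged fsync, smaller than X
disk_i_size = 4096          # assumed on-disk size after the truncate

ranged = waited_extents(ordered, 0, y)
full = waited_extents(ordered, 0, LLONG_MAX)

crash_ranged = would_hit_bug_on(ordered, ranged, x, disk_i_size)
crash_full = would_hit_bug_on(ordered, full, x, disk_i_size)
print(crash_ranged, crash_full)
```

      Only the ranged case leaves the ordered extent unwaited, matching the claim
      that a full-range fsync could never reach the BUG_ON().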
      
      Issue found while running xfstests/generic/127 (happens very rarely for me), more
      specifically via the fsx calls that use memory mapped IO (and issue msync calls).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix corruption after write/fsync failure + fsync + log recovery · d9f85963
      Filipe Manana committed
      While writing to a file, in inode.c:cow_file_range() (and same applies to
      submit_compressed_extents()), after reserving an extent for the file data,
      we create a new extent map for the written range and insert it into the
      extent map cache. After that, we create an ordered operation, but if it
      fails (due to a transient/temporary-ENOMEM), we return without dropping
      that extent map, which points to a reserved extent that is freed when we
      return. A subsequent incremental fsync (when the btrfs inode doesn't have
      the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and
      logs a file extent item based on that extent map, which points to a disk
      extent that doesn't contain valid data - it was freed by us earlier, at this
      point it might contain any random/garbage data.
      
      Therefore, if we reach an error condition when cowing a file range after
      we added the new extent map to the cache, drop it from the cache before
      returning.
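
      The effect of the stale extent map can be modelled in a few lines of Python.
      This is a toy sketch, not the kernel code: ToyAllocator, cow_file_range_sketch
      and run are hypothetical names, the extent map cache is reduced to a dict from
      file offset to disk byte number, and the allocator is assumed to hand out the
      most recently freed extent first.

```python
import errno


class ToyAllocator:
    """Hands out disk byte numbers, reusing the most recently freed one first."""

    def __init__(self, base):
        self.next_free = base
        self.freed = []

    def reserve(self, length):
        if self.freed:
            return self.freed.pop()
        bytenr = self.next_free
        self.next_free += length
        return bytenr

    def free(self, bytenr):
        self.freed.append(bytenr)


def cow_file_range_sketch(cache, alloc, file_offset, length,
                          ordered_fails, drop_on_error):
    bytenr = alloc.reserve(length)
    cache[file_offset] = bytenr        # extent map inserted into the cache
    if ordered_fails:
        alloc.free(bytenr)             # reserved extent is released on error
        if drop_on_error:
            del cache[file_offset]     # the fix: drop the stale extent map
        return -errno.ENOMEM
    return 0


def run(drop_on_error):
    cache = {}
    alloc = ToyAllocator(12582912)
    cow_file_range_sketch(cache, alloc, 0, 8192, True, drop_on_error)
    cow_file_range_sketch(cache, alloc, 16384, 4096, False, drop_on_error)
    return cache


stale = run(drop_on_error=False)
fixed = run(drop_on_error=True)
print(stale, fixed)
```

      Without the drop, two file offsets end up pointing at the same disk extent,
      which is the kind of state a later incremental fsync would log.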
      
      Some sequence of steps that lead to this:
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount -o commit=9999 /dev/sdd /mnt
          $ cd /mnt
      
          $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
          $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" foo
          $ sync
      
          $ od -t x1 foo
          0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
          *
          0020000
      
          $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo
      
          # Now this write + fsync fail with -ENOMEM, which was returned by
          # btrfs_add_ordered_extent() in inode.c:cow_file_range().
          $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
          $ xfs_io -c "fsync" foo
          fsync: Cannot allocate memory
      
          # Now do a new write + fsync, which will succeed. Our previous
          # -ENOMEM was a transient/temporary error.
          $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
          $ xfs_io -c "fsync" foo
      
          # Our file content (in page cache) is now:
          $ od -t x1 foo
          0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
          *
          0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # Now reboot the machine, and mount the fs, so that fsync log replay
          # takes place.
      
          # The file content is now weird, in particular the first 8Kb, which
          # do not match our data before nor after the sync command above.
          $ od -t x1 foo
          0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
          *
          0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
          *
          0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
          *
          0050000
      
          # In fact these first 4Kb are a duplicate of the last 4kb block.
          # The last write got an extent map/file extent item that points to
          # the same disk extent that we got in the write+fsync that failed
          # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
          # verify that:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
      	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
      		extent data disk byte 12582912 nr 8192
      		extent data offset 0 nr 8192 ram 8192
      	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
      		extent data disk byte 0 nr 0
      		extent data offset 0 nr 8192 ram 8192
      	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
      		extent data disk byte 12582912 nr 4096
      		extent data offset 0 nr 4096 ram 4096
      
          $ umount /dev/sdd
          $ btrfsck /dev/sdd
          Checking filesystem on /dev/sdd
          UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
          checking extents
          extent item 12582912 has multiple extent items
          ref mismatch on [12582912 4096] extent item 1, found 2
          Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
          backpointer mismatch on [12582912 4096]
          Errors found in extent allocation tree or chunk allocation
          checking free space cache
          checking fs roots
          root 5 inode 257 errors 1000, some csum missing
          found 131074 bytes used err is 1
          total csum bytes: 4
          total tree bytes: 131072
          total fs tree bytes: 32768
          total extent tree bytes: 16384
          btree space waste bytes: 123404
          file data blocks allocated: 274432
           referenced 274432
          Btrfs v3.14.1-96-gcc7fd5a-dirty
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  15. 24 Aug, 2014 1 commit
    • L
      Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo committed
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, which introduced this hang problem.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that its
      ordered_func() must be processed in FIFO order, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we look for free space, we get an uncached
      block group; then we go to read its free space cache inode for free space
      information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happen
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      If work A is at the head of wq->ordered_list and there are more ordered
      works queued after it, such as B->ordered_func(), its memory could have been
      freed before normal_work_helper() returns, which means that the kernel workqueue
      code worker_thread() still has worker->current_work pointing to work
      A->normal_work's address, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work C->normal_work
      and work A->normal_work are likely to share the same address (I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process and finds that our
      kthread is processing it (see find_worker_executing_work()), it treats
      work C as a collision and skips it, which ends up with nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real problem
      is that all btrfs workqueues share one work->func, normal_work_helper.
      So this patch makes each workqueue have its own helper function, which is only a
      wrapper of normal_work_helper.
      
      With this patch, I no longer hit the above hang.
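
      The collision check and the effect of per-workqueue helpers can be illustrated
      with a toy Python model. This is a sketch, not the kernel workqueue code: it
      assumes the lookup compares both the work address and the work function, and
      the helper names delalloc_helper and endio_helper are placeholders invented
      for the example.

```python
class Worker:
    def __init__(self):
        self.current_work = None   # address of the work_struct being run
        self.current_func = None   # function that work_struct was queued with


def find_worker_executing_work(workers, work_addr, func):
    """Return the worker that appears to be running this work, if any."""
    for worker in workers:
        if worker.current_work == work_addr and worker.current_func == func:
            return worker
    return None


def is_collision(func_of_a, func_of_c, addr=0x1000):
    # Our kthread still records A's address although A has been freed ...
    busy = Worker()
    busy.current_work = addr
    busy.current_func = func_of_a
    # ... and work C was allocated at the very same address.
    return find_worker_executing_work([busy], addr, func_of_c) is not None


# Before the patch every btrfs work used the same function.
before = is_collision("normal_work_helper", "normal_work_helper")
# After the patch each workqueue has its own wrapper, so the functions differ.
after = is_collision("delalloc_helper", "endio_helper")
print(before, after)
```

      With distinct functions the reused address alone no longer looks like the same
      work item, so work C is not skipped.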
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>