1. 03 12月, 2014 3 次提交
    • F
      Btrfs: fix freeing used extents after removing empty block group · ae0ab003
      Filipe Manana 提交于
      There's a race between adding a block group to the list of the unused
      block groups and removing an unused block group (cleaner kthread) that
      leads to freeing extents that are in use or a crash during transaction
      commmit. Basically the cleaner kthread, when executing
      btrfs_delete_unused_bgs(), might catch the newly added block group to
      the list fs_info->unused_bgs and clear the range representing the whole
      group from fs_info->freed_extents[] before the task that added the block
      group to the list (running update_block_group()) marked the last freed
      extent as dirty in fs_info->freed_extents (pinned_extents).
      
      That is:
      
           CPU 1                                CPU 2
      
                                        btrfs_delete_unused_bgs()
      update_block_group()
         add block group to
         fs_info->unused_bgs
                                          got block group from the list
                                          clear_extent_bits for the whole
                                          block group range in freed_extents[]
         set_extent_dirty for the
         range covering the freed
         extent in freed_extents[]
         (fs_info->pinned_extents)
      
                                        block group deleted, and a new block
                                        group with the same logical address is
                                        created
      
                                        reserve space from the new block group
                                        for new data or metadata - the reserved
                                        space overlaps the range specified by
                                        CPU 1 for set_extent_dirty()
      
                                        commit transaction
                                          find all ranges marked as dirty in
                                          fs_info->pinned_extents, clear them
                                          and add them to the free space cache
      
      Alternatively, if CPU 2 doesn't create a new block group with the same
      logical address, we get a crash/BUG_ON at transaction commit when unpining
      extent ranges because we can't find a block group for the range marked as
      dirty by CPU 1. Sample trace:
      
      [ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
      gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
       sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
      [ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
      [ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
      [ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
      [ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
      [ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
      [ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
      [ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
      [ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
      [ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
      [ 2163.430042] Stack:
      [ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
      [ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
      [ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
      [ 2163.430042] Call Trace:
      [ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
      [ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
      [ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      
      So fix this by making update_block_group() first set the range as dirty
      in pinned_extents before adding the block group to the unused_bgs list.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      ae0ab003
    • F
      Btrfs: fix crash caused by block group removal · 4f69cb98
      Filipe Manana 提交于
      If we remove a block group (because it became empty), we might have left
      a caching_ctl structure in fs_info->caching_block_groups that points to
      the block group and is accessed at transaction commit time. This results
      in accessing an invalid or incorrect block group. This issue became visible
      after Josef's patch "Btrfs: remove empty block groups automatically".
      
      So if the block group is removed make sure we don't leave a dangling
      caching_ctl in caching_block_groups.
      
      Sample crash trace:
      
      [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
      [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
      [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
      [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
      [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
      [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
      [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
      [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
      [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
      [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
      [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
      [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
      [58380.443238] Stack:
      [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
      [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
      [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
      [58380.443238] Call Trace:
      [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
      [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
      [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4f69cb98
    • F
      Btrfs: fix invalid block group rbtree access after bg is removed · 292cbd51
      Filipe Manana 提交于
      If we grab a block group, for example in btrfs_trim_fs(), we will be holding
      a reference on it but the block group can be removed after we got it (via
      btrfs_remove_block_group), which means it will no longer be part of the
      rbtree.
      
      However, btrfs_remove_block_group() was only calling rb_erase() which leaves
      the block group's rb_node left and right child pointers with the same content
      they had before calling rb_erase. This was dangerous because a call to
      next_block_group() would access the node's left and right child pointers (via
      rb_next), which can be no longer valid.
      
      Fix this by clearing a block group's node after removing it from the tree,
      and have next_block_group() do a tree search to get the next block group
      instead of using rb_next() if our block group was removed.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      292cbd51
  2. 25 11月, 2014 5 次提交
    • F
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana 提交于
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possibe to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      >From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But never capture a state where only the first write and the truncation were
      captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9ea24bbe
    • F
      Btrfs: ensure send always works on roots without orphans · e5fa8f86
      Filipe Manana 提交于
      Move the logic from the snapshot creation ioctl into send. This avoids
      doing the transaction commit if send isn't used, and ensures that if
      a crash/reboot happens after the transaction commit that created the
      snapshot and before the transaction commit that switched the commit
      root, send will not get a commit root that differs from the main root
      (that has orphan items).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e5fa8f86
    • F
      Btrfs: fix freeing used extent after removing empty block group · 758eb51e
      Filipe Manana 提交于
      Due to ignoring errors returned by clear_extent_bits (at the moment only
      -ENOMEM is possible), we can end up freeing an extent that is actually in
      use (i.e. return the extent to the free space cache).
      
      The sequence of steps that lead to this:
      
      1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
         the goal of freeing empty block groups;
      
      2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
         transaction (or starts a new one if none is running) and attempts to
         clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
         and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);
      
      3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
         -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
         to delete the block group and the respective chunk, while pinned_extents
         remains with that bit set for the whole (or a part of the) range covered
         by the block group;
      
      4) Later while the transaction is still running, the chunk ends up being reused
         for a new block group (maybe for different purpose, data or metadata), and
         extents belonging to the new block group are allocated for file data or btree
         nodes/leafs;
      
      5) The current transaction is committed, meaning that we unpinned one or more
         extents from the new block group (through btrfs_finish_extent_commit() and
         unpin_extent_range()) which are now being used for new file data or new
         metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
         And unpinning means we returned the extents to the free space cache of the
         new block group, which implies those extents can be used for future allocations
         while they're still in use.
      
      Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
      object in unpin_extent_range() if a new block group didn't end up being allocated for
      the same chunk (step 4 above).
      
      Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      758eb51e
    • C
      Btrfs: include vmalloc.h in check-integrity.c · 8f608de6
      Chris Mason 提交于
      Fengguang's build monster reported warnings on some arches because we
      don't have vmalloc.h included
      Signed-off-by: NChris Mason <clm@fb.com>
      Reported-by: fengguang.wu@intel.com
      8f608de6
    • Q
      btrfs: Fix a lockdep warning when running xfstest. · 084b6e7c
      Qu Wenruo 提交于
      The following lockdep warning is triggered during xfstests:
      
      [ 1702.980872] =========================================================
      [ 1702.981181] [ INFO: possible irq lock inversion dependency detected ]
      [ 1702.981482] 3.18.0-rc1 #27 Not tainted
      [ 1702.981781] ---------------------------------------------------------
      [ 1702.982095] kswapd0/77 just changed the state of lock:
      [ 1702.982415]  (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa03b0b51>] __btrfs_release_delayed_node+0x41/0x1f0 [btrfs]
      [ 1702.982794] but this lock took another, RECLAIM_FS-unsafe lock in the past:
      [ 1702.983160]  (&fs_info->dev_replace.lock){+.+.+.}
      
      and interrupts could create inverse lock ordering between them.
      
      [ 1702.984675]
      other info that might help us debug this:
      [ 1702.985524] Chain exists of:
        &delayed_node->mutex --> &found->groups_sem --> &fs_info->dev_replace.lock
      
      [ 1702.986799]  Possible interrupt unsafe locking scenario:
      
      [ 1702.987681]        CPU0                    CPU1
      [ 1702.988137]        ----                    ----
      [ 1702.988598]   lock(&fs_info->dev_replace.lock);
      [ 1702.989069]                                local_irq_disable();
      [ 1702.989534]                                lock(&delayed_node->mutex);
      [ 1702.990038]                                lock(&found->groups_sem);
      [ 1702.990494]   <Interrupt>
      [ 1702.990938]     lock(&delayed_node->mutex);
      [ 1702.991407]
       *** DEADLOCK ***
      
      It is because the btrfs_kobj_{add/rm}_device() will call memory
      allocation with GFP_KERNEL,
      which may flush fs page cache to free space, waiting for it self to do
      the commit, causing the deadlock.
      
      To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
      dev_replace lock range, also involing split the
      btrfs_rm_dev_replace_srcdev() function into remove and free parts.
      
      Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
      lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
      called out of the lock range.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      084b6e7c
  3. 22 11月, 2014 5 次提交
    • F
      Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
      Filipe Manana 提交于
      When doing a fsync with a fast path we have a time window where we can miss
      the fact that writeback of some file data failed, and therefore we endup
      returning success (0) from fsync when we should return an error.
      The steps that lead to this are the following:
      
      1) We start all ordered extents by calling filemap_fdatawrite_range();
      
      2) We do some other work like locking the inode's i_mutex, start a transaction,
         start a log transaction, etc;
      
      3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
         ordered extents from inode's ordered tree into a list;
      
      4) But by the time we do ordered extent collection, some ordered extents we started
         at step 1) might have already completed with an error, and therefore we didn't
         found them in the ordered tree and had no idea they finished with an error. This
         makes our fsync return success (0) to userspace, but has no bad effects on the log
         like for example insertion of file extent items into the log that point to unwritten
         extents, because the invalid extent maps were removed before the ordered extent
         completed (in inode.c:btrfs_finish_ordered_io).
      
      So after collecting the ordered extents just check if the inode's i_mapping has any
      error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
      writeback fails for a page of an ordered extent, we call mapping_set_error (done in
      extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
      that sets one of those error flags in the inode's i_mapping flags.
      
      This change also has the side effect of fixing the issue where for fast fsyncs we
      never checked/cleared the error flags from the inode's i_mapping flags, which means
      that a full fsync performed after a fast fsync could get such errors that belonged
      to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
      calls filemap_fdatawait_range(), and the later checks for and clears those flags,
      while for fast fsyncs we never call filemap_fdatawait_range() or anything else
      that checks for and clears the error flags from the inode's i_mapping.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b38ef71c
    • F
      Btrfs: collect only the necessary ordered extents on ranged fsync · 0870295b
      Filipe Manana 提交于
      Instead of collecting all ordered extents from the inode's ordered tree
      and then wait for all of them to complete, just collect the ones that
      overlap the fsync range.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0870295b
    • F
      Btrfs: don't ignore log btree writeback errors · 5ab5e44a
      Filipe Manana 提交于
      If an error happens during writeback of log btree extents, make sure the
      error is returned to the caller (fsync), so that it takes proper action
      (commit current transaction) instead of writing a superblock that points
      to log btrees with all or some nodes that weren't durably persisted.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5ab5e44a
    • J
      Btrfs: do not move em to modified list when unpinning · a2804695
      Josef Bacik 提交于
      We use the modified list to keep track of which extents have been modified so we
      know which ones are candidates for logging at fsync() time.  Newly modified
      extents are added to the list at modification time, around the same time the
      ordered extent is created.  We do this so that we don't have to wait for ordered
      extents to complete before we know what we need to log.  The problem is when
      something like this happens
      
      log extent 0-4k on inode 1
      copy csum for 0-4k from ordered extent into log
      sync log
      commit transaction
      log some other extent on inode 1
      ordered extent for 0-4k completes and adds itself onto modified list again
      log changed extents
      see ordered extent for 0-4k has already been logged
      	at this point we assume the csum has been copied
      sync log
      crash
      
      On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
      which is the same one that we are replaying which also drops the csum, and then
      we won't find the csum in the log for that bytenr.  This of course causes us to
      have errors about not having csums for certain ranges of our inode.  So remove
      the modified list manipulation in unpin_extent_cache, any modified extents
      should have been added well before now, and we don't want them re-logged.  This
      fixes my test that I could reliably reproduce this problem with.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      a2804695
    • J
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik 提交于
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50d9aa99
  4. 21 11月, 2014 26 次提交
    • J
      Btrfs: make sure we wait on logged extents when fsycning two subvols · 9dba8cf1
      Josef Bacik 提交于
      If we have two fsync()'s race on different subvols one will do all of its work
      to get into the log_tree, wait on it's outstanding IO, and then allow the
      log_tree to finish it's commit.  The problem is we were just free'ing that
      subvols logged extents instead of waiting on them, so whoever lost the race
      wouldn't really have their data on disk.  Fix this by waiting properly instead
      of freeing the logged extents.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9dba8cf1
    • D
      btrfs: fix wrong accounting of raid1 data profile in statfs · 0d95c1be
      David Sterba 提交于
      The sizes that are obtained from space infos are in raw units and have
      to be adjusted according to the raid factor. This was missing for
      f_bavail and df reported doubled size for raid1.
      Reported-by: NMartin Steigerwald <Martin@lichtvoll.de>
      Fixes: ba7b6e62 ("btrfs: adjust statfs calculations according to raid profiles")
      CC: stable@vger.kernel.org
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      0d95c1be
    • G
      btrfs: fix dead lock while running replace and defrag concurrently · 32159242
      Gui Hecheng 提交于
      This can be reproduced by fstests: btrfs/070
      
      The scenario is like the following:
      
      replace worker thread		defrag thread
      ---------------------		-------------
      copy_nocow_pages_worker		btrfs_defrag_file
        copy_nocow_pages_for_inode	    ...
      				  btrfs_writepages
        |A| lock_extent_bits		    extent_write_cache_pages
      				|B|   lock_page
      					__extent_writepage
      		...			  writepage_delalloc
      					    find_lock_delalloc_range
      				|B| 	      lock_extent_bits
        find_or_create_page
          pagecache_get_page
        |A| lock_page
      
      This leads to an ABBA pattern deadlock. To fix it,
      o we just change it to an AABB pattern which means to @unlock_extent_bits()
        before we @lock_page(), and in this way the @extent_read_full_page_nolock()
        is no longer in an locked context, so change it back to @extent_read_full_page()
        to regain protection.
      
      o Since we @unlock_extent_bits() earlier, then before @write_page_nocow(),
        the extent may not really point at the physical block we want, so we
        have to check it before write.
      Signed-off-by: NGui Hecheng <guihc.fnst@cn.fujitsu.com>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      32159242
    • F
      Btrfs: make xattr replace operations atomic · 5f5bc6b1
      Filipe Manana 提交于
      Replacing a xattr consists of doing a lookup for its existing value, delete
      the current value from the respective leaf, release the search path and then
      finally insert the new value. This leaves a time window where readers (getxattr,
      listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
      so this has security implications.
      
      This change also fixes 2 other existing issues which were:
      
      *) Deleting the old xattr value without verifying first if the new xattr will
         fit in the existing leaf item (in case multiple xattrs are packed in the
         same item due to name hash collision);
      
      *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
         exist but we have have an existing item that packs muliple xattrs with
         the same name hash as the input xattr. In this case we should return ENOSPC.
      
      A test case for xfstests follows soon.
      
      Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
      implementation.
      Reported-by: NAlexandre Oliva <oliva@gnu.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5f5bc6b1
    • F
      Btrfs: avoid premature -ENOMEM in clear_extent_bit() · c7bc6319
      Filipe Manana 提交于
      We try to allocate an extent state structure before acquiring the extent
      state tree's spinlock as we might need a new one later and therefore avoid
      doing later an atomic allocation while holding the tree's spinlock. However
      we returned -ENOMEM if that initial non-atomic allocation failed, which is
      a bit excessive since we might end up not needing the pre-allocated extent
      state at all - for the case where the tree doesn't have any extent states
      that cover the input range and cover too any other range. Therefore don't
      return -ENOMEM if that pre-allocation fails.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c7bc6319
    • J
      Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2 · 7e33fd99
      Josef Bacik 提交于
      Our gluster boxes get several thousand statfs() calls per second, which begins
      to suck hardcore with all of the lock contention on the chunk mutex and dev list
      mutex.  We don't really need to hold these things, if we have transient
      weirdness with statfs() because of the chunk allocator we don't care, so remove
      this locking.
      
      We still need the dev_list lock if you mount with -o alloc_start however, which
      is a good argument for nuking that thing from orbit, but that's a patch for
      another day.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7e33fd99
    • J
      Btrfs: move read only block groups onto their own list V2 · 633c0aad
      Josef Bacik 提交于
      Our gluster boxes were spending lots of time in statfs because our fs'es are
      huge.  The problem is statfs loops through all of the block groups looking for
      read only block groups, and when you have several terabytes worth of data that
      ends up being a lot of block groups.  Move the read only block groups onto a
      read only list and only proces that list in
      btrfs_account_ro_block_groups_free_space to reduce the amount of churn.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      633c0aad
    • D
      btrfs: fix typos in btrfs_check_super_valid · cd743fac
      David Sterba 提交于
      Copy&paste errors in some messages and add few more missing macro
      accessors.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      cd743fac
    • S
      Btrfs: check-int: don't complain about balanced blocks · cf90c59e
      Stefan Behrens 提交于
      The xfstest btrfs/014 which tests the balance operation caused that the
      check_int module complained that known blocks changed their physical
      location. Since this is not an error in this case, only print such
      message if the verbose mode was enabled.
      Reported-by: NWang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Tested-by: NWang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cf90c59e
    • S
      Btrfs: check_int: use the known block location · f382e465
      Stefan Behrens 提交于
      The xfstest btrfs/014 which tests the balance operation caused issues with
      the check_int module. The attempt was made to use btrfs_map_block() to
      find the physical location for a written block. However, this was not
      at all needed since the location of the written block was known since
      a hook to submit_bio() was the reason for entering the check_int module.
      Additionally, after a block relocation it happened that btrfs_map_block()
      failed causing misleading error messages afterwards.
      
      This patch changes the check_int module to use the known information of
      the physical location from the bio.
      Reported-by: NWang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: NStefan Behrens <sbehrens@giantdisaster.de>
      Tested-by: NWang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f382e465
    • F
      Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early · c8fd3de7
      Filipe Manana 提交于
      We try to allocate an extent state before acquiring the tree's spinlock
      just in case we end up needing to split an existing extent state into two.
      If that allocation failed, we would return -ENOMEM.
      However, our only single caller (transaction/log commit code), passes in
      an extent state that was cached from a call to find_first_extent_bit() and
      that has a very high chance to match exactly the input range (always true
      for a transaction commit and very often, but not always, true for a log
      commit) - in this case we end up not needing at all that initial extent
      state used for an eventual split. Therefore just don't return -ENOMEM if
      we can't allocate the temporary extent state, since we might not need it
      at all, and if we end up needing one, we'll do it later anyway.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c8fd3de7
    • F
      Btrfs: make find_first_extent_bit be able to cache any state · e38e2ed7
      Filipe Manana 提交于
      Right now the only caller of find_first_extent_bit() that is interested
      in caching extent states (transaction or log commit), never gets an extent
      state cached. This is because find_first_extent_bit() only caches states
      that have at least one of the flags EXTENT_IOBITS or EXTENT_BOUNDARY, and
      the transaction/log commit caller always passes a tree that doesn't have
      ever extent states with any of those flags (they can only have one of the
      following flags: EXTENT_DIRTY, EXTENT_NEW or EXTENT_NEED_WAIT).
      
      This change together with the following one in the patch series (titled
      "Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early") will
      help reduce significantly the chances of calls to convert_extent_bit()
      fail with -ENOMEM when called from the transaction/log commit code.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e38e2ed7
    • F
      Btrfs: deal with convert_extent_bit errors to avoid fs corruption · 663dfbb0
      Filipe Manana 提交于
      When committing a transaction or a log, we look for btree extents that
      need to be durably persisted by searching for ranges in a io tree that
      have some bits set (EXTENT_DIRTY or EXTENT_NEW). We then attempt to clear
      those bits and set the EXTENT_NEED_WAIT bit, with calls to the function
      convert_extent_bit, and then start writeback for the extents.
      
      That function however can return an error (at the moment only -ENOMEM
      is possible, specially when it does GFP_ATOMIC allocation requests
      through alloc_extent_state_atomic) - that means the ranges didn't got
      the EXTENT_NEED_WAIT bit set (or at least not for the whole range),
      which in turn means a call to btrfs_wait_marked_extents() won't find
      those ranges for which we started writeback, causing a transaction
      commit or a log commit to persist a new superblock without waiting
      for the writeback of extents in that range to finish first.
      
      Therefore if a crash happens after persisting the new superblock and
      before writeback finishes, we have a superblock pointing to roots that
      weren't fully persisted or roots that point to nodes or leafs that weren't
      fully persisted, causing all sorts of unexpected/bad behaviour as we endup
      reading garbage from disk or the content of some node/leaf from a past
      generation that got cowed or deleted and is no longer valid (for this later
      case we end up getting error messages like "parent transid verify failed on
      X wanted Y found Z" when reading btree nodes/leafs from disk).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      663dfbb0
    • E
      Btrfs: return failure if btrfs_dev_replace_finishing() failed · 2fc9f6ba
      Eryu Guan 提交于
      device replace could fail due to another running scrub process or any
      other errors btrfs_scrub_dev() may hit, but this failure doesn't get
      returned to userspace.
      
      The following steps could reproduce this issue
      
      	mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
      	mount /dev/sdb1 /mnt/btrfs
      	while true; do btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1; done &
      	btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
      	# if this replace succeeded, do the following and repeat until
      	# you see this log in dmesg
      	# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
      	#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs
      
      	# once you see the error log in dmesg, check return value of
      	# replace
      	echo $?
      
      Introduce a new dev replace result
      
      BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS
      
      to catch -EINPROGRESS explicitly and return other errors directly to
      userspace.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      2fc9f6ba
    • S
      Btrfs: fix allocationg memory failure for btrfsic_state structure · 6b3a4d60
      Shilong Wang 提交于
      size of @btrfsic_state needs more than 2M, it is very likely to
      fail allocating memory using kzalloc(). see following mesage:
      
      [91428.902148] Call Trace:
      [<ffffffff816f6e0f>] dump_stack+0x4d/0x66
      [<ffffffff811b1c7f>] warn_alloc_failed+0xff/0x170
      [<ffffffff811b66e1>] __alloc_pages_nodemask+0x951/0xc30
      [<ffffffff811fd9da>] alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b1e0b>] ? alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811b1e0b>] alloc_kmem_pages+0x3b/0xf0
      [<ffffffff811d1018>] kmalloc_order+0x18/0x50
      [<ffffffff811d1074>] kmalloc_order_trace+0x24/0x140
      [<ffffffffa06c097b>] btrfsic_mount+0x8b/0xae0 [btrfs]
      [<ffffffff810af555>] ? check_preempt_curr+0x85/0xa0
      [<ffffffff810b2de3>] ? try_to_wake_up+0x103/0x430
      [<ffffffffa063d200>] open_ctree+0x1bd0/0x2130 [btrfs]
      [<ffffffffa060fdde>] btrfs_mount+0x62e/0x8b0 [btrfs]
      [<ffffffff811fd9da>] ? alloc_pages_current+0x11a/0x1f0
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff81230429>] mount_fs+0x39/0x1b0
      [<ffffffff812509fb>] vfs_kern_mount+0x6b/0x150
      [<ffffffff812537fb>] do_mount+0x27b/0xc30
      [<ffffffff811b0a5e>] ? __get_free_pages+0xe/0x50
      [<ffffffff812544f6>] SyS_mount+0x96/0xf0
      [<ffffffff81701970>] system_call_fastpath+0x16/0x1b
      
      Since we are allocating memory for hash table array, so
      it will be good if we could allocate continuous pages here.
      
      Fix this problem by firstly trying kzalloc(), if we fail,
      use vzalloc() instead.
      Signed-off-by: NWang Shilong <wangshilong1991@gmail.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6b3a4d60
    • F
      Btrfs: report error after failure inlining extent in compressed write path · e6eb4314
      Filipe Manana 提交于
      If cow_file_range_inline() failed, when called from compress_file_range(),
      we were tagging the locked page for writeback, end its writeback and unlock it,
      but not marking it with an error nor setting AS_EIO in inode's mapping flags.
      
      This made it impossible for a caller of filemap_fdatawrite_range (writepages)
      or filemap_fdatawait_range() to know that an error happened. And the return
      value of compress_file_range() is useless because it's returned to a workqueue
      task and not to the task calling filemap_fdatawrite_range (writepages).
      
      This change applies on top of the previous patchset starting at the patch
      titled:
      
          "[1/5] Btrfs: set page and mapping error on compressed write failure"
      
      Which changed extent_clear_unlock_delalloc() to use SetPageError and
      mapping_set_error().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e6eb4314
    • F
      Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Filipe Manana 提交于
      To avoid duplicating this double filemap_fdatawrite_range() call for
      inodes with async extents (compressed writes) so often.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      728404da
    • F
      Btrfs: correctly flush compressed data before/after direct IO · 075bdbdb
      Filipe Manana 提交于
      For compressed writes, after doing the first filemap_fdatawrite_range() we
      don't get the pages tagged for writeback immediately. Instead we create
      a workqueue task, which is run by other kthread, and keep the pages locked.
      That other kthread compresses data, creates the respective ordered extent/s,
      tags the pages for writeback and unlocks them. Therefore we need a second
      call to filemap_fdatawrite_range() if we have compressed writes, as this
      second call will wait for the pages to become unlocked, then see they became
      tagged for writeback and finally wait for the writeback to finish.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      075bdbdb
    • F
      Btrfs: make inode.c:compress_file_range() return void · c44f649e
      Filipe Manana 提交于
      Its return value is useless, its single caller ignores it and can't do
      anything with it anyway, since it's a workqueue task and not the task
      calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
      Failure is communicated to such functions via start and end of writeback
      with the respective pages tagged with an error and AS_EIO flag set in the
      inode's imapping.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c44f649e
    • S
      Btrfs: fix incorrect compression ratio detection · 4bcbb332
      Shilong Wang 提交于
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sdb
       # mount -t btrfs /dev/sdb /mnt -o compress=lzo
       # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
      
      after previous steps, inode will be detected as bad compression ratio,
      and NOCOMPRESS flag will be set for that inode.
      
      Reason is that compress have a max limit pages every time(128K), if a
      132k write in, it will be splitted into two write(128k+4k), this bug
      is a leftover for commit 68bb462d(Btrfs: don't compress for a small write)
      
      Fix this problem by checking every time before compression, if it is a
      small write(<=blocksize), we bail out and fall into nocompression directly.
      Signed-off-by: NWang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4bcbb332
    • F
      Btrfs: don't ignore compressed bio write errors · 7bdcefc1
      Filipe Manana 提交于
      Our compressed bio write end callback was essentially ignoring the error
      parameter. When a write error happens, it must pass a value of 0 to the
      inode's write_page_end_io_hook callback, SetPageError on the respective
      pages and set AS_EIO in the inode's mapping flags, so that a call to
      filemap_fdatawait_range() / filemap_fdatawait() can find out that errors
      happened (we surely don't want silent failures on fsync for example).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      7bdcefc1
    • F
      Btrfs: make inode.c:submit_compressed_extents() return void · dec8f175
      Filipe Manana 提交于
      Its return value is completely ignored by its single caller and it's
      useless anyway, since errors are indicated through SetPageError and
      the bit AS_EIO set in the flags of the inode's mapping. The caller
      can't do anything with the value, as it's invoked from a workqueue
      task and not by the task calling filemap_fdatawrite_range (which calls
      the writepages address space callback, which in turn calls the inode's
      fill_delalloc callback).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dec8f175
    • F
      Btrfs: process all async extents on compressed write failure · 3d7a820f
      Filipe Manana 提交于
      If we had an error when processing one of the async extents from our list,
      we were not processing the remaining async extents, meaning we would leak
      those async_extent structs, never release the pages with the compressed
      data and never unlock and clear the dirty flag from the inode's pages (those
      that correspond to the uncompressed content).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3d7a820f
    • F
      Btrfs: don't leak pages and memory on compressed write error · 40ae837b
      Filipe Manana 提交于
      In inode.c:submit_compressed_extents(), if we fail before calling
      btrfs_submit_compressed_write(), or when that function fails, we
      were freeing the async_extent structure without releasing its pages
      and freeing the pages array.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      40ae837b
    • F
      Btrfs: fix hang on compressed write error · fce2a4e6
      Filipe Manana 提交于
      In inode.c:submit_compressed_extents(), before calling btrfs_submit_compressed_write()
      we start writeback for all pages, clear their dirty flag, unlock them, etc, but if
      btrfs_submit_compressed_write() fails (at the moment it can only fail with -ENOMEM),
      we never end the writeback on the pages, so any filemap_fdatawait_range() call will
      hang forever. We were also not calling the writepage end io hook, which means the
      corresponding ordered extent will never complete and all its waiters will block
      forever, such as a full fsync (via btrfs_wait_ordered_range()).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fce2a4e6
    • F
      Btrfs: set page and mapping error on compressed write failure · 704de49d
      Filipe Manana 提交于
      If we fail in submit_compressed_extents() before calling btrfs_submit_compressed_write(),
      we start and end the writeback for the pages (clear their dirty flag, unlock them, etc)
      but we don't tag the pages, nor the inode's mapping, with an error. This makes it
      impossible for a caller of filemap_fdatawait_range() (fsync, or transaction commit
      for e.g.) know that there was an error.
      
      Note that the return value of submit_compressed_extents() is useless, as that function
      is executed by a workqueue task and not directly by the fill_delalloc callback. This
      means the writepage/s callbacks of the inode's address space operations don't get that
      return value.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      704de49d
  5. 12 11月, 2014 1 次提交