1. 17 Mar, 2015 · 1 commit
    • Btrfs: prepare block group cache before writing · dcdf7f6d
      Josef Bacik committed
      Writing the block group cache will modify the extent tree quite a bit because it
      truncates the old space cache and pre-allocates new stuff.  To try and cut down
      on the churn let's do the setup dance first, then later on hopefully we can avoid
      looping with newly dirtied roots.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
  2. 15 Feb, 2015 · 1 commit
    • Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik committed
      On our gluster boxes we stream large tarballs of backups onto our filesystems.
      With 160GB of RAM this means we get really large contiguous ranges of dirty data,
      but the way our ENOSPC stuff works is that as long as the range is contiguous we
      only hold a metadata reservation for one extent.  The problem is that we limit our
      extents to 128MB, so we'll end up with at least 800 extents, and our ENOSPC
      accounting is quite a bit lower than what we need.  To keep track of this, make
      sure we increase outstanding_extents for every multiple of the max extent size so
      we can be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
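The accounting rule described in the commit above (one outstanding extent per multiple of the maximum extent size) can be sketched as a round-up division. This is only an illustration: `count_outstanding_extents` and `MAX_EXTENT_SIZE` are hypothetical names chosen here, and the 128 MiB limit is taken from the commit message, not from the kernel source.

```c
#include <stdint.h>

/* Assumption: a single extent is capped at 128 MiB, as the commit message states. */
#define MAX_EXTENT_SIZE (128ULL * 1024 * 1024)

/*
 * Hypothetical helper: how many extents a contiguous dirty range of
 * dirty_len bytes will be split into, rounding up. Metadata has to be
 * reserved for each of them, not for a single extent.
 */
uint64_t count_outstanding_extents(uint64_t dirty_len)
{
    return (dirty_len + MAX_EXTENT_SIZE - 1) / MAX_EXTENT_SIZE;
}
```

With these assumptions, a contiguous 100 GiB dirty range counts as 800 extents, whereas reserving for a single extent would under-account by a factor of 800.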
  3. 03 Feb, 2015 · 2 commits
    • Btrfs: fix race between transaction commit and empty block group removal · d4b450cd
      Filipe Manana committed
      Committing a transaction can race with automatic removal of empty block
      groups (cleaner kthread), leading to a BUG_ON() in the transaction
      commit code while running btrfs_finish_extent_commit(). The following
      sequence diagram shows how it can happen:
      
                 CPU 1                                       CPU 2
      
      btrfs_commit_transaction()
        fs_info->running_transaction = NULL
        btrfs_finish_extent_commit()
          find_first_extent_bit()
            -> found range for block group X
               in fs_info->freed_extents[]
      
                                                     btrfs_delete_unused_bgs()
                                                       -> found block group X
      
                                                       Removed block group X's range
                                                       from fs_info->freed_extents[]
      
                                                       btrfs_remove_chunk()
                                                          btrfs_remove_block_group(bg X)
      
          unpin_extent_range(bg X range)
             btrfs_lookup_block_group(bg X)
                -> returns NULL
                  -> BUG_ON()
      
      The trace that results from the BUG_ON() is:
      
      [48665.187808] ------------[ cut here ]------------
      [48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
      [48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
      [48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
      [48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
      [48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
      [48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
      [48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
      [48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
      [48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
      [48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
      [48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
      [48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
      [48665.197388] Stack:
      [48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
      [48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
      [48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
      [48665.197388] Call Trace:
      [48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
      [48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
      [48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
      [48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
      [48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
      [48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
      [48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
      [48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
      [48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
      [48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
      [48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
      [48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
      [48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
      [48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
      [48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
      [48665.197388]  RSP <ffff8801b56a7b88>
      [48665.272246] ---[ end trace b9c6ab9957521376 ]---
      
      Fix this by ensuring that unpinning the block group's range in
      btrfs_finish_extent_commit() is synchronized with removing the
      block group's range from freed_extents[] in
      btrfs_delete_unused_bgs().
      
      This race got introduced with the change:
      
          Btrfs: remove empty block groups automatically
          commit 47ab2a6c
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
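The synchronization described in the commit above can be modelled in user space with a single mutex shared by both paths. This is a minimal sketch, not the kernel change itself: `struct fs_model`, `unpin_lock` and both function names are hypothetical stand-ins for the commit path and the cleaner path.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space model of the state both paths touch. */
struct fs_model {
    pthread_mutex_t unpin_lock;     /* assumed name, serializes both paths */
    bool bg_exists;                 /* block group X still present */
    bool range_in_freed_extents;    /* X's range still listed as freed */
    int unpinned_ranges;
};

void fs_model_init(struct fs_model *fs)
{
    pthread_mutex_init(&fs->unpin_lock, NULL);
    fs->bg_exists = true;
    fs->range_in_freed_extents = true;
    fs->unpinned_ranges = 0;
}

/* Commit side: returns 0 on success, -1 where the old code would BUG_ON(). */
int finish_extent_commit(struct fs_model *fs)
{
    int ret = 0;

    pthread_mutex_lock(&fs->unpin_lock);
    if (fs->range_in_freed_extents) {
        if (!fs->bg_exists) {
            ret = -1;               /* lookup of the block group failed */
        } else {
            fs->unpinned_ranges++;
            fs->range_in_freed_extents = false;
        }
    }
    pthread_mutex_unlock(&fs->unpin_lock);
    return ret;
}

/* Cleaner side: drop the range under the lock, then remove the group. */
void delete_unused_bg(struct fs_model *fs)
{
    pthread_mutex_lock(&fs->unpin_lock);
    fs->range_in_freed_extents = false;
    pthread_mutex_unlock(&fs->unpin_lock);
    fs->bg_exists = false;
}
```

Because finding the range and unpinning it happen under the same lock that the cleaner takes to drop the range, the commit side either unpins while the block group still exists or never sees the range at all.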
    • btrfs: kill btrfs_inode_*time helpers · a937b979
      David Sterba committed
      They just open-code taking the address of the timespec member.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
  4. 22 Jan, 2015 · 4 commits
    • Btrfs: fix unused members in struct btrfs_root · 78f55e5e
      Anand Jain committed
      There isn't any real use of the following members of struct btrfs_root,
      so delete them:
      
      struct kobject root_kobj;
      struct completion kobj_unregister;
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: Introduce BTRFS_BLOCK_GROUP_RAID56_MASK to check raid56 simply · ffe2d203
      Zhao Lei committed
      So we can check raid56 with:
       (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK)
      instead of long:
       (map->type & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
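A small self-contained sketch of the mask check from the commit above. The bit positions below are illustrative values assumed for this example (the authoritative definitions live in the btrfs headers), and `is_raid56` is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions, assumed for this sketch only. */
#define BTRFS_BLOCK_GROUP_RAID5 (1ULL << 7)
#define BTRFS_BLOCK_GROUP_RAID6 (1ULL << 8)

/* The combined mask the commit introduces. */
#define BTRFS_BLOCK_GROUP_RAID56_MASK \
    (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6)

/* Hypothetical helper: true if the profile bits select RAID5 or RAID6. */
bool is_raid56(uint64_t type)
{
    return (type & BTRFS_BLOCK_GROUP_RAID56_MASK) != 0;
}
```

The check behaves exactly like testing both flags separately; the mask only shortens the call sites.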
    • Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik committed
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list that a block group gets added to
      when we dirty it for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
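The idea in the commit above, visiting only the block groups that were dirtied instead of walking all of them, can be sketched with a simple intrusive list. All type and function names here are hypothetical; the kernel uses its own list primitives and locking, which this sketch leaves out.

```c
#include <stddef.h>

/* Hypothetical, simplified block group with an intrusive dirty-list link. */
struct block_group {
    int id;
    int on_dirty_list;
    struct block_group *next_dirty;
};

struct dirty_list {
    struct block_group *head;
    size_t len;
};

/* Queue a block group the first time it is dirtied; later calls are no-ops. */
void mark_dirty(struct dirty_list *list, struct block_group *bg)
{
    if (bg->on_dirty_list)
        return;
    bg->on_dirty_list = 1;
    bg->next_dirty = list->head;
    list->head = bg;
    list->len++;
}

/*
 * Drain the list, visiting only groups dirtied since the last call.
 * Returns how many groups were visited (a real implementation would
 * write each one out here).
 */
size_t write_dirty_block_groups(struct dirty_list *list)
{
    size_t written = 0;

    while (list->head) {
        struct block_group *bg = list->head;

        list->head = bg->next_dirty;
        bg->next_dirty = NULL;
        bg->on_dirty_list = 0;
        written++;
    }
    list->len = 0;
    return written;
}
```

The cost of each call is proportional to the number of dirtied groups, not to the total number of block groups in the filesystem.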
    • Btrfs: change how we track dirty roots · e7070be1
      Josef Bacik committed
      I've been overloading root->dirty_list to keep track of dirty roots and which
      roots need to have their commit roots switched at transaction commit time.  This
      could cause us to lose an update to the root which could corrupt the file
      system.  To fix this use a state bit to know if the root is dirty, and if it
      isn't set we go ahead and move the root to the dirty list.  This way if we
      re-dirty the root after adding it to the switch_commit list we make sure to
      update it.  This also makes it so that the extent root is always the last root
      on the dirty list to try and keep the amount of churn down at this point in the
      commit.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 11 Dec, 2014 · 1 commit
  6. 03 Dec, 2014 · 3 commits
    • Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana committed
      Our fs trim operation, which is completely transactionless (doesn't start
      or join an existing transaction), consists of visiting all block groups
      and, for each one, iterating its free space entries and performing a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, and remove the extent map that maps
      logical addresses to physical addresses and the corresponding chunk metadata
      from the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
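The fix described in the commit above keeps the extent map alive until the last in-flight discard has finished. A minimal user-space sketch of that lifetime rule follows; `struct bg_model`, its fields and the three functions are hypothetical names, and a pthread mutex stands in for whatever locking the kernel code uses.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of one block group under trim. */
struct bg_model {
    pthread_mutex_t lock;
    int trimming;               /* discards currently in flight */
    bool removed;               /* block group removal was requested */
    bool extent_map_present;    /* logical -> physical mapping still held */
};

void bg_model_init(struct bg_model *bg)
{
    pthread_mutex_init(&bg->lock, NULL);
    bg->trimming = 0;
    bg->removed = false;
    bg->extent_map_present = true;
}

/* Start a discard; refuse if the block group is already gone. */
bool trim_begin(struct bg_model *bg)
{
    bool ok;

    pthread_mutex_lock(&bg->lock);
    ok = !bg->removed;
    if (ok)
        bg->trimming++;
    pthread_mutex_unlock(&bg->lock);
    return ok;
}

/* Finish a discard; the last one drops the map if removal is pending. */
void trim_end(struct bg_model *bg)
{
    pthread_mutex_lock(&bg->lock);
    bg->trimming--;
    if (bg->trimming == 0 && bg->removed)
        bg->extent_map_present = false;
    pthread_mutex_unlock(&bg->lock);
}

/* Remove the block group; keep the map while any discard is in flight. */
void remove_block_group(struct bg_model *bg)
{
    pthread_mutex_lock(&bg->lock);
    bg->removed = true;
    if (bg->trimming == 0)
        bg->extent_map_present = false;
    pthread_mutex_unlock(&bg->lock);
}
```

While the map is still present, the physical range cannot be handed out to a new block group, so a late discard cannot zero freshly written metadata.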
    • Btrfs: fix crash caused by block group removal · 4f69cb98
      Filipe Manana committed
      If we remove a block group (because it became empty), we might have left
      a caching_ctl structure in fs_info->caching_block_groups that points to
      the block group and is accessed at transaction commit time. This results
      in accessing an invalid or incorrect block group. This issue became visible
      after Josef's patch "Btrfs: remove empty block groups automatically".
      
      So if the block group is removed make sure we don't leave a dangling
      caching_ctl in caching_block_groups.
      
      Sample crash trace:
      
      [58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
      [58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
      [58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      [58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
      [58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
      [58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
      [58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
      [58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
      [58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
      [58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
      [58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
      [58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
      [58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
      [58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
      [58380.443238] Stack:
      [58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
      [58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
      [58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
      [58380.443238] Call Trace:
      [58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
      [58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
      [58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 · 4245215d
      Miao Xie committed
      The commit c404e0dc (Btrfs: fix use-after-free in the finishing
      procedure of the device replace) fixed a use-after-free problem
      which happened when removing the source device at the end of device
      replace, but at that time, btrfs didn't support device replace
      on raid56, so we didn't fix the problem on the raid56 profile.
      Now that we have implemented device replace for raid56, we need to
      fix that problem before we enable that function for raid56.

      The fix is simple: we increment the per-cpu bio counter before we
      submit a raid56 io, and decrement the counter when the raid56 io
      ends.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
  7. 25 Nov, 2014 · 1 commit
    • Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana committed
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was called just before the second write, we can
      often get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possible to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But never capture a state where only the first write and the truncation were
      captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
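The uncovered interval [32K; ALIGN(16K + 32770, 4096)[ quoted in the commit above can be checked with a few lines of arithmetic. `ALIGN_UP` and `write_end_aligned` are hypothetical names for this sketch; the 4096-byte sector size is the one the commit message uses.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((uint64_t)(x) + (a) - 1) & ~((uint64_t)(a) - 1))

/* Hypothetical helper: sector-aligned end offset of a write. */
uint64_t write_end_aligned(uint64_t offset, uint64_t len, uint64_t sectorsize)
{
    return ALIGN_UP(offset + len, sectorsize);
}
```

The second write ends at 16384 + 32770 = 49154, which rounds up to 53248, the key offset of item 123 in the dump. The first extent ends at 32768, so the missing range is [32768, 53248), 20480 bytes. The hole extent then runs from 53248 for 40960 bytes up to 94208, which is 90123 rounded up to 4096.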
  8. 22 Nov, 2014 · 1 commit
    • Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
      Filipe Manana committed
      When doing a fsync with a fast path we have a time window where we can miss
      the fact that writeback of some file data failed, and therefore we end up
      returning success (0) from fsync when we should return an error.
      The steps that lead to this are the following:
      
      1) We start all ordered extents by calling filemap_fdatawrite_range();
      
      2) We do some other work like locking the inode's i_mutex, start a transaction,
         start a log transaction, etc;
      
      3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
         ordered extents from inode's ordered tree into a list;
      
      4) But by the time we do ordered extent collection, some ordered extents we started
         at step 1) might have already completed with an error, and therefore we didn't
         find them in the ordered tree and had no idea they finished with an error. This
         makes our fsync return success (0) to userspace, but has no bad effects on the log
         like for example insertion of file extent items into the log that point to unwritten
         extents, because the invalid extent maps were removed before the ordered extent
         completed (in inode.c:btrfs_finish_ordered_io).
      
      So after collecting the ordered extents just check if the inode's i_mapping has any
      error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
      writeback fails for a page of an ordered extent, we call mapping_set_error (done in
      extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
      that sets one of those error flags in the inode's i_mapping flags.
      
      This change also has the side effect of fixing the issue where for fast fsyncs we
      never checked/cleared the error flags from the inode's i_mapping flags, which means
      that a full fsync performed after a fast fsync could get such errors that belonged
      to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
      calls filemap_fdatawait_range(), and the latter checks for and clears those flags,
      while for fast fsyncs we never call filemap_fdatawait_range() or anything else
      that checks for and clears the error flags from the inode's i_mapping.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
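The "check and clear" behaviour the commit above relies on can be modelled in user space as follows. This is a sketch under assumptions: the flag bits, `struct mapping_model` and both functions are hypothetical stand-ins, not the kernel's address_space API.

```c
#include <errno.h>

/* Hypothetical stand-ins for the AS_EIO / AS_ENOSPC mapping flags. */
enum {
    MODEL_AS_EIO    = 1u << 0,
    MODEL_AS_ENOSPC = 1u << 1
};

struct mapping_model {
    unsigned flags;
};

/* Writeback completion path: record a failure on the mapping. */
void set_writeback_error(struct mapping_model *m, int err)
{
    if (err == -ENOSPC)
        m->flags |= MODEL_AS_ENOSPC;
    else
        m->flags |= MODEL_AS_EIO;
}

/*
 * fsync path: report a recorded error once and clear it, so that a
 * later fsync does not inherit an error that belonged to this one.
 */
int check_and_clear_writeback_errors(struct mapping_model *m)
{
    int ret = 0;

    if (m->flags & MODEL_AS_ENOSPC) {
        m->flags &= ~(unsigned)MODEL_AS_ENOSPC;
        ret = -ENOSPC;
    }
    if (m->flags & MODEL_AS_EIO) {
        m->flags &= ~(unsigned)MODEL_AS_EIO;
        ret = -EIO;
    }
    return ret;
}
```

Calling the check after collecting the ordered extents covers the window from step 4: an extent that already failed is no longer in the ordered tree, but its error is still recorded on the mapping.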
  9. 21 Nov, 2014 · 3 commits
    • Btrfs: make xattr replace operations atomic · 5f5bc6b1
      Filipe Manana committed
      Replacing an xattr consists of looking up its existing value, deleting
      the current value from the respective leaf, releasing the search path and then
      finally inserting the new value. This leaves a time window where readers (getxattr,
      listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs,
      so this has security implications.
      
      This change also fixes 2 other existing issues which were:
      
      *) Deleting the old xattr value without verifying first if the new xattr will
         fit in the existing leaf item (in case multiple xattrs are packed in the
         same item due to name hash collision);
      
      *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't
         exist but we have an existing item that packs multiple xattrs with
         the same name hash as the input xattr. In this case we should return ENOSPC.
      
      A test case for xfstests follows soon.
      
      Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace
      implementation.
      Reported-by: Alexandre Oliva <oliva@gnu.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: move read only block groups onto their own list V2 · 633c0aad
      Josef Bacik committed
      Our gluster boxes were spending lots of time in statfs because our fs'es are
      huge.  The problem is statfs loops through all of the block groups looking for
      read only block groups, and when you have several terabytes worth of data that
      ends up being a lot of block groups.  Move the read only block groups onto a
      read only list and only process that list in
      btrfs_account_ro_block_groups_free_space to reduce the amount of churn.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Filipe Manana committed
      To avoid duplicating this double filemap_fdatawrite_range() call for
      inodes with async extents (compressed writes) so often.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 12 Nov, 2014 · 3 commits
    • btrfs: introduce pending action: commit · d51033d0
      David Sterba committed
      In some contexts, like in sysfs handlers, we don't want to trigger a
      transaction commit. It's a heavy operation, we don't know what external
      locks may be taken. Instead, make it possible to finish the operation
      through sync syscall or SYNC_FS ioctl.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • btrfs: switch inode_cache option handling to pending changes · 7e1876ac
      David Sterba committed
      The pending mount option(s) now share namespace and bits with the normal
      options, and the existing one for (inode_cache) is unset unconditionally
      at each transaction commit.
      
      Introduce a separate namespace for pending changes and enhance the
      descriptions of the intended change to use separate bits for each
      action.
      Signed-off-by: David Sterba <dsterba@suse.cz>
    • btrfs: add support for processing pending changes · 572d9ab7
      David Sterba committed
      There are some actions that modify global filesystem state but cannot be
      performed at the time of request, but later at the transaction commit
      time when the filesystem is in a known state.
      
      For example enabling new incompat features on-the-fly or issuing
      transaction commit from unsafe contexts (sysfs handlers).
      Signed-off-by: David Sterba <dsterba@suse.cz>
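The three "pending changes" commits above share one pattern: a request only sets a bit, and the bits are consumed later at transaction commit time, when the filesystem is in a known state. A minimal sketch with C11 atomics follows; every name in it (`PENDING_*`, `OPT_INODE_CACHE`, `struct fs_state` and the functions) is hypothetical and chosen only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical pending-action bits, one bit per action. */
enum {
    PENDING_COMMIT            = 1u << 0,
    PENDING_SET_INODE_CACHE   = 1u << 1,
    PENDING_CLEAR_INODE_CACHE = 1u << 2
};

/* Hypothetical mount option bit, kept in a separate namespace. */
enum { OPT_INODE_CACHE = 1u << 0 };

struct fs_state {
    atomic_uint pending_changes;
    unsigned mount_opts;
};

void fs_state_init(struct fs_state *fs)
{
    atomic_init(&fs->pending_changes, 0);
    fs->mount_opts = 0;
}

/* Safe from any context: only records the request. */
void request_pending(struct fs_state *fs, unsigned bit)
{
    atomic_fetch_or(&fs->pending_changes, bit);
}

/*
 * Called at transaction commit time: take all pending bits at once
 * and apply them. Returns the bits that were processed.
 */
unsigned apply_pending_changes(struct fs_state *fs)
{
    unsigned pending = atomic_exchange(&fs->pending_changes, 0);

    if (pending & PENDING_SET_INODE_CACHE)
        fs->mount_opts |= OPT_INODE_CACHE;
    if (pending & PENDING_CLEAR_INODE_CACHE)
        fs->mount_opts &= ~(unsigned)OPT_INODE_CACHE;
    return pending;
}
```

Using an atomic exchange means a request that arrives while the commit is applying changes is not lost; it simply stays set for the next commit.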
  11. 28 Oct, 2014 · 1 commit
    • Btrfs: fix invalid leaf slot access in btrfs_lookup_extent() · 1a4ed8fd
      Filipe Manana committed
      If we couldn't find our extent item, we accessed the current slot
      (path->slots[0]) to check if it corresponds to an equivalent skinny
      metadata item. However this slot could be beyond our last item in the
      leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
      we shouldn't process it.
      
      Since btrfs_lookup_extent() is only used to find extent items for data
      extents, fix this by removing completely the logic that looks up for an
      equivalent skinny metadata item, since it can not exist.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
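The out-of-range slot mentioned in the commit above is a general property of "find the insertion position" searches: when the key is larger than every item, the returned slot equals the item count. A small sketch with a hypothetical `struct leaf_model` and `search_leaf` (not the btrfs search API) shows why the slot has to be checked before it is read.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified leaf: sorted keys, nritems of them in use. */
struct leaf_model {
    size_t nritems;
    uint64_t keys[8];
};

/*
 * Binary search for key. Returns 1 if found, 0 otherwise. *slot is the
 * position where the key is, or where it would be inserted; in the
 * not-found case it can equal nritems, which is past the last item.
 */
int search_leaf(const struct leaf_model *leaf, uint64_t key, size_t *slot)
{
    size_t lo = 0, hi = leaf->nritems;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (leaf->keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    *slot = lo;
    return lo < leaf->nritems && leaf->keys[lo] == key;
}
```

Any caller that inspects `keys[*slot]` after a miss must first confirm `*slot < nritems`.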
  12. 08 Oct, 2014 · 1 commit
  13. 06 Oct, 2014 · 1 commit
    • btrfs: Make btrfs handle security mount options internally to avoid losing security label. · f667aef6
      Qu Wenruo committed
      [BUG]
      Originally, when mounting btrfs with the "-o subvol=" mount option, btrfs
      loses all security labels.
      And if the btrfs fs is mounted somewhere else, then due to the loss of the
      security labels, SELinux refuses to mount it, since the same super block
      is being mounted using different security labels.
      
      [REPRODUCER]
      With SELinux enabled:
       #mkfs -t btrfs /dev/sda5
       #mount -o context=system_u:object_r:nfs_t:s0 /dev/sda5 /mnt/btrfs
       #btrfs subvolume create /mnt/btrfs/subvol
       #mount -o subvol=subvol,context=system_u:object_r:nfs_t:s0 /dev/sda5
        /mnt/test
      
      kernel message:
      SELinux: mount invalid.  Same superblock, different security settings
      for (dev sda5, type btrfs)
      
      [REASON]
      This happens because btrfs will call vfs_kern_mount() and then
      mount_subtree() to handle subvolume name lookup.
      The first mount cuts off all the security labels, so by the time the
      second vfs_kern_mount() runs, no security label is left.
      
      [FIX]
      This patch makes btrfs behave much more like nfs,
      which has the type flag FS_BINARY_MOUNTDATA,
      so that btrfs handles the security label internally.
      The security label is then set at the real mount time and is not lost
      when the "subvol=" mount option is used.
      Reported-by: Eryu Guan <guaneryu@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  14. 04 Oct, 2014 · 1 commit
  15. 02 Oct, 2014 · 4 commits
  16. 23 Sep, 2014 · 1 commit
    • Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik committed
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting or on mount notice that a block
      group has a used count of 0 then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      47ab2a6c
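The queue-then-recheck flow described in this commit can be modelled in a few lines. This is only an illustrative sketch in Python, not the kernel code: `BlockGroup`, `FsModel`, and all method names are hypothetical, and the real implementation works on kernel structures under locks that are omitted here.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BlockGroup:
    start: int  # hypothetical identifier for the block group
    used: int   # bytes currently allocated from this block group


class FsModel:
    """Toy stand-in for filesystem state; not the kernel's data structures."""

    def __init__(self, block_groups):
        self.block_groups = {bg.start: bg for bg in block_groups}
        self.unused = deque()  # starts of block groups queued for deletion

    def _queue_if_empty(self, bg):
        if bg.used == 0 and bg.start not in self.unused:
            self.unused.append(bg.start)

    def mount(self):
        # On mount, queue every block group that is already empty.
        for bg in self.block_groups.values():
            self._queue_if_empty(bg)

    def free(self, start, nbytes):
        # Freeing space may drop the used count to 0 and queue the group.
        bg = self.block_groups[start]
        bg.used -= nbytes
        self._queue_if_empty(bg)

    def allocate(self, start, nbytes):
        self.block_groups[start].used += nbytes

    def run_cleaner(self):
        # Double check each queued group; delete it only if it is still empty.
        deleted = []
        while self.unused:
            start = self.unused.popleft()
            bg = self.block_groups.get(start)
            if bg is None or bg.used != 0:
                continue
            del self.block_groups[start]
            deleted.append(start)
        return deleted


fs = FsModel([BlockGroup(0, 10), BlockGroup(100, 0), BlockGroup(200, 5)])
fs.mount()           # queues the already-empty group at 100
fs.free(0, 10)       # group 0 becomes empty and is queued
fs.free(200, 5)      # group 200 becomes empty and is queued
fs.allocate(200, 3)  # ...but is reused before the cleaner runs
deleted = fs.run_cleaner()
```

The recheck in `run_cleaner` is the point of the model: a group that was queued while empty but has been reused since then is skipped rather than deleted.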
  17. 18 Sep, 2014 6 commits
  18. 15 Aug, 2014 1 commit
    • J
      Btrfs: __btrfs_mod_ref should always use no_quota · e339a6b0
      Josef Bacik committed
      Previously I extended the no_quota arg to btrfs_dec/inc_ref because I didn't
      understand how snapshot delete was using it and assumed that we needed the
      quota operations there.  With Mark's work this has turned out not to be the
      case: we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the
      argument and make __btrfs_mod_ref call its process function with no_quota
      always set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e339a6b0
  19. 20 Jun, 2014 1 commit
    • M
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie committed
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      This is because we didn't update the metadata of the allocated space (in the
      extent tree) until the file data was written to disk. During this time, there
      was no information about the allocated space in either the extent tree or the
      free space cache. When we wrote out the free space cache at this time (at
      transaction commit), that space was lost. In fact, only the free space that is
      used to store file data had this problem; the rest didn't, because its
      metadata is updated in the same transaction context.
      
      There are several ways to fix the above problem:
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data; if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may hurt performance.
      This patch chooses the second method: we use a per-block-group variable to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write-out.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      e570fd27
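The second method chosen by this commit can be illustrated with a small model. This is a hedged sketch only: `BlockGroupModel` and its method names are hypothetical, and a plain `threading.Lock` stands in for the per-block-group read-write semaphore, since the Python standard library has no reader-writer lock.

```python
import threading


class BlockGroupModel:
    """Toy model of one block group; not the kernel's structure."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes
        # Space handed out for file data whose extent tree metadata
        # has not been written yet.
        self.pending_data_bytes = 0
        # Stand-in for the per-block-group read-write semaphore.
        self.lock = threading.Lock()

    def allocate_data(self, nbytes):
        with self.lock:
            if nbytes > self.free_bytes:
                raise ValueError("not enough free space")
            self.free_bytes -= nbytes
            self.pending_data_bytes += nbytes

    def data_written(self, nbytes):
        # The file data reached disk and the extent tree was updated.
        with self.lock:
            self.pending_data_bytes -= nbytes

    def write_free_space_cache(self):
        # Return the cache contents, or None when the write-out is skipped
        # because some allocated data space is not yet accounted on disk.
        with self.lock:
            if self.pending_data_bytes != 0:
                return None
            return {"free_bytes": self.free_bytes}


bg = BlockGroupModel(1000)
bg.allocate_data(300)
skipped = bg.write_free_space_cache()  # allocation still pending
bg.data_written(300)
written = bg.write_free_space_cache()  # safe to write out now
```

Skipping the write-out costs a cache rebuild for that block group on the next mount, but it avoids persisting a cache that disagrees with the extent tree.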
  20. 10 Jun, 2014 3 commits
    • F
      Btrfs: make fsync work after cloning into a file · 7ffbb598
      Filipe Manana committed
      When cloning into a file, we were correctly replacing the extent
      items in the target range and removing the extent maps. However
      we weren't replacing the extent maps with new ones that point to
      the new extents - as a consequence, an incremental fsync (when the
      inode doesn't have the full sync flag) was a NOOP, since it relies
      on the existence of extent maps in the modified list of the inode's
      extent map tree, which was empty. Therefore add new extent maps to
      reflect the target clone range.
      
      A test case for xfstests follows.
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7ffbb598
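Why the incremental fsync became a no-op can be shown with a toy model. This is a sketch under simplifying assumptions: `InodeModel`, `clone_into`, and the `add_extent_map` switch are hypothetical names used only to contrast the old and new behavior, and the real fsync logs far more than a list of offsets.

```python
class InodeModel:
    """Toy inode: an extent map table plus a list of modified maps."""

    def __init__(self):
        self.extent_maps = {}  # file offset -> extent id
        self.modified = []     # offsets whose maps changed since last fsync

    def clone_into(self, offset, extent_id, add_extent_map=True):
        # Cloning always drops the stale mapping for the target range.
        self.extent_maps.pop(offset, None)
        if add_extent_map:
            # Fixed behavior: insert a map for the new extent and put it
            # on the modified list so an incremental fsync can find it.
            self.extent_maps[offset] = extent_id
            self.modified.append(offset)

    def fsync(self):
        # Incremental fsync: log only what is on the modified list.
        logged = list(self.modified)
        self.modified.clear()
        return logged


old = InodeModel()
old.clone_into(0, "extent-A", add_extent_map=False)
old_logged = old.fsync()  # nothing on the modified list, nothing logged

new = InodeModel()
new.clone_into(0, "extent-A")
new_logged = new.fsync()  # the cloned range is logged
```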
    • J
      btrfs: allocate raid type kobjects dynamically · c1895442
      Jeff Mahoney committed
      We are currently allocating the raid type kobjects in an array when we
      allocate space_info. When a user does something like:
      
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt
      # btrfs balance start -mconvert=single -dconvert=single /mnt -f
      # btrfs balance start -mconvert=raid1 -dconvert=raid1 /
      
      We can end up with memory corruption since the kobject hasn't
      been reinitialized properly and the name pointer was left set.
      
      The rationale behind allocating them statically was to avoid
      creating a separate kobject container that just contained the
      raid type. It used the index in the array to determine the index.
      
      Ultimately, though, this wastes more memory than it saves in all
      but the most complex scenarios and introduces kobject lifetime
      questions.
      
      This patch allocates the kobjects dynamically instead. Note that
      we also remove the kobject_get/put of the parent kobject since
      kobject_add and kobject_del do that internally.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      c1895442
    • C
      Btrfs: async delayed refs · a79b7d4b
      Chris Mason committed
      Delayed extent operations are triggered during transaction commits.
      The goal is to queue up a healthy batch of changes to the extent
      allocation tree and run through them in bulk.
      
      This farms them off to async helper threads.  The goal is to have the
      bulk of the delayed operations being done in the background, but this is
      also important to limit our stack footprint.
      Signed-off-by: Chris Mason <clm@fb.com>
      a79b7d4b
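The idea of queueing reference changes and running them in bulk on a helper thread can be sketched as follows. This is an illustrative model, not the kernel's implementation: `DelayedRefs` and its methods are hypothetical, a `ThreadPoolExecutor` stands in for the async helper threads, and the sketch assumes only one batch runs at a time.

```python
from concurrent.futures import ThreadPoolExecutor


class DelayedRefs:
    """Toy queue of delayed reference count changes."""

    def __init__(self):
        self.pending = []     # queued (bytenr, delta) reference changes
        self.ref_counts = {}  # toy extent tree: bytenr -> reference count

    def queue(self, bytenr, delta):
        # Cheap on the caller's side: just remember the change.
        self.pending.append((bytenr, delta))

    def _run_batch(self, batch):
        # Apply the whole batch to the toy extent tree in one pass.
        for bytenr, delta in batch:
            count = self.ref_counts.get(bytenr, 0) + delta
            if count == 0:
                self.ref_counts.pop(bytenr, None)
            else:
                self.ref_counts[bytenr] = count
        return len(batch)

    def run_async(self, executor):
        # Detach the current batch and hand it to a helper thread.
        batch, self.pending = self.pending, []
        return executor.submit(self._run_batch, batch)


refs = DelayedRefs()
refs.queue(4096, +1)
refs.queue(8192, +1)
refs.queue(4096, +1)
refs.queue(8192, -1)

with ThreadPoolExecutor(max_workers=1) as pool:
    processed = refs.run_async(pool).result()
```

Running the batch on a separate thread also means the work happens on that thread's stack rather than the caller's, which is the stack footprint concern the commit message mentions.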