1. 11 Apr, 2015 5 commits
    • C
      Btrfs: allow block group cache writeout outside critical section in commit · 1bbc621e
      Chris Mason authored
      We loop through all of the dirty block groups during commit and write
      the free space cache.  In order to make sure the cache is correct, we do
      this while no other writers are allowed in the commit.
      
      If a large number of block groups are dirty, this can introduce long
      stalls during the final stages of the commit, which can block new procs
      trying to change the filesystem.
      
      This commit changes the block group cache writeout to take appropriate
      locks and allow it to run earlier in the commit.  We'll still have to
      redo some of the block groups, but it means we can get most of the work
      out of the way without blocking the entire FS.
      Signed-off-by: Chris Mason <clm@fb.com>
    • C
      Btrfs: don't use highmem for free space cache pages · 2b108268
      Chris Mason authored
      In order to create the free space cache concurrently with FS modifications,
      we need to take a few block group locks.
      
      The cache code also does kmap, which would schedule with the locks held.
      Instead of going through kmap_atomic, let's just use lowmem for the cache
      pages.
      Signed-off-by: Chris Mason <clm@fb.com>
    • C
      Btrfs: two stage dirty block group writeout · c9dc4c65
      Chris Mason authored
      Block group cache writeout is currently waiting on the pages for each
      block group cache before moving on to writing the next one.  This commit
      switches things around to send down all the caches and then wait on them
      in batches.
      
      The end result is much faster, since we're keeping the disk pipeline
      full.
      Signed-off-by: Chris Mason <clm@fb.com>
    • C
      btrfs: move struct io_ctl into ctree.h and rename it · 4c6d1d85
      Chris Mason authored
      We'll need to put the io_ctl into the block_group cache struct, so
      name it struct btrfs_io_ctl and move it into ctree.h
      Signed-off-by: Chris Mason <clm@fb.com>
    • C
      btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason authored
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: Chris Mason <clm@fb.com>
  2. 04 Mar, 2015 7 commits
  3. 21 Feb, 2015 1 commit
  4. 03 Feb, 2015 1 commit
  5. 22 Jan, 2015 1 commit
    • J
      Btrfs: track dirty block groups on their own list · ce93ec54
      Josef Bacik authored
      Currently any time we try to update the block groups on disk we will walk _all_
      block groups and check for the ->dirty flag to see if it is set.  This function
      can get called several times during a commit.  So if you have several terabytes
      of data you will be a very sad panda as we will loop through _all_ of the block
      groups several times, which makes the commit take a while which slows down the
      rest of the file system operations.
      
      This patch introduces a dirty list for the block groups that we get added to
      when we dirty the block group for the first time.  Then we simply update any
      block groups that have been dirtied since the last time we called
      btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
      free space cache out so it is much cleaner.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
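The dirty-list idea described in this commit can be illustrated with a small user-space sketch. This is not the kernel code: `demo_bg`, `demo_trans`, `demo_mark_dirty` and `demo_write_dirty` are hypothetical names, and a plain singly linked list stands in for whatever list type the kernel uses. The point it shows is that the writeout cost follows the number of dirtied block groups rather than the total number of block groups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block group; not the kernel structure. */
struct demo_bg {
	int on_dirty_list;          /* set while queued for writeout */
	struct demo_bg *next_dirty; /* link in the dirty list */
};

/* Hypothetical stand-in for per-transaction state. */
struct demo_trans {
	struct demo_bg *dirty_head;
};

/* Queue a block group the first time it is dirtied; later calls are no-ops. */
static void demo_mark_dirty(struct demo_trans *t, struct demo_bg *bg)
{
	if (bg->on_dirty_list)
		return;
	bg->on_dirty_list = 1;
	bg->next_dirty = t->dirty_head;
	t->dirty_head = bg;
}

/*
 * Visit only the queued block groups and empty the list.
 * Returns how many block groups were "written".
 */
static unsigned demo_write_dirty(struct demo_trans *t)
{
	unsigned written = 0;

	while (t->dirty_head) {
		struct demo_bg *bg = t->dirty_head;

		t->dirty_head = bg->next_dirty;
		bg->next_dirty = NULL;
		bg->on_dirty_list = 0;
		written++;
	}
	return written;
}
```

With 1000 block groups of which 3 were dirtied, `demo_write_dirty` touches 3 entries, and a second call in the same commit touches none.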
  6. 11 Dec, 2014 2 commits
  7. 03 Dec, 2014 3 commits
    • F
      Btrfs: fix memory leak after block remove + trimming · 946ddbe8
      Filipe Manana authored
      There was a free space entry structure memory leak if a block
      group is removed while a free space entry is being trimmed, which
      the following diagram explains:
      
                 CPU 1                                          CPU 2
      
        btrfs_trim_block_group()
            trim_no_bitmap()
                remove free space entry from
                block group cache's rbtree
                do_trimming()
      
                                                      btrfs_remove_block_group()
                                                          btrfs_remove_free_space_cache()
      
                    add back free space entry to
                    block group's cache rbtree
        btrfs_put_block_group()
      
                                                          (...)
                                                          btrfs_put_block_group()
                                                              kfree(bg->free_space_ctl)
                                                              kfree(bg)
      
      The free space entry added after doing the discard of its respective
      range ends up never being freed.
      Detected after doing an "rmmod btrfs" after running the stress test
      recently submitted for fstests:
      
      [ 8234.642212] kmem_cache_destroy btrfs_free_space: Slab cache still has objects
      [ 8234.642657] CPU: 1 PID: 32276 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-2+ #1
      [ 8234.642660] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 8234.642664]  0000000000000000 ffff8801af1b3eb8 ffffffff8140c7b6 ffff8801dbedd0c0
      [ 8234.642670]  ffff8801af1b3ed0 ffffffff811149ce 0000000000000000 ffff8801af1b3ee0
      [ 8234.642676]  ffffffffa042dbe7 ffff8801af1b3ef0 ffffffffa0487422 ffff8801af1b3f78
      [ 8234.642682] Call Trace:
      [ 8234.642692]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [ 8234.642699]  [<ffffffff811149ce>] kmem_cache_destroy+0x4d/0x92
      [ 8234.642731]  [<ffffffffa042dbe7>] btrfs_destroy_cachep+0x63/0x76 [btrfs]
      [ 8234.642757]  [<ffffffffa0487422>] exit_btrfs_fs+0x9/0xbe7 [btrfs]
      [ 8234.642762]  [<ffffffff810a76a5>] SyS_delete_module+0x155/0x1c6
      [ 8234.642768]  [<ffffffff8122a7eb>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 8234.642773]  [<ffffffff814122d2>] system_call_fastpath+0x16/0x1b
      
      This applies on top of (depends on) my previous patch titled:
      "Btrfs: fix race between fs trimming and block group remove/allocation"
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix race between writing free space cache and trimming · 55507ce3
      Filipe Manana authored
      Trimming is completely transactionless, and the way it operates consists
      of hiding free space entries from a block group, performing the trim/discard
      and then making the free space entries visible again.
      Therefore while a free space entry is being trimmed, we can have free space
      cache writing running in parallel (as part of a transaction commit) which
      will miss the free space entry. This means that if we unmount (or crash/reboot)
      after that transaction commit and mount again before another transaction
      starts/commits after the discard finishes, we will have some free space
      that won't be used again unless the free space cache is rebuilt. After the
      unmount, fsck (btrfsck, btrfs check) reports the issue like the following
      example:
      
              *** fsck.btrfs output ***
              checking extents
              checking free space cache
              There is no free space entry for 521764864-521781248
              There is no free space entry for 521764864-1103101952
              cache appears valid but isnt 29360128
              Checking filesystem on /dev/sdc
              UUID: b4789e27-4774-4626-98e9-ae8dfbfb0fb5
              found 1235681286 bytes used err is -22
              (...)
      
      Another issue caused by this race is a crash while writing bitmap entries
      to the cache, because while the cache writeout task accesses the bitmaps,
      the trim task can be concurrently modifying the bitmap or worse might
      be freeing the bitmap. The latter case results in the following crash:
      
      [55650.804460] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
      [55650.804835] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop parport_pc parport i2c_piix4 psmouse evdev pcspkr microcode processor i2ccore serio_raw thermal_sys button ext4 crc16 jbd2 mbcache sg sd_mod crc_t10dif sr_mod cdrom crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 [last unloaded: btrfs]
      [55650.806169] CPU: 1 PID: 31002 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
      [55650.806493] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [55650.806867] task: ffff8800b12f6410 ti: ffff880071538000 task.ti: ffff880071538000
      [55650.807166] RIP: 0010:[<ffffffffa037cf45>]  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
      [55650.807514] RSP: 0018:ffff88007153bc30  EFLAGS: 00010246
      [55650.807687] RAX: 000000005d1ec000 RBX: ffff8800a665df08 RCX: 0000000000000400
      [55650.807885] RDX: ffff88005d1ec000 RSI: 6b6b6b6b6b6b6b6b RDI: ffff88005d1ec000
      [55650.808017] RBP: ffff88007153bc58 R08: 00000000ddd51536 R09: 00000000000001e0
      [55650.808017] R10: 0000000000000000 R11: 0000000000000037 R12: 6b6b6b6b6b6b6b6b
      [55650.808017] R13: ffff88007153bca8 R14: 6b6b6b6b6b6b6b6b R15: ffff88007153bc98
      [55650.808017] FS:  0000000000000000(0000) GS:ffff88023ec80000(0000) knlGS:0000000000000000
      [55650.808017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [55650.808017] CR2: 0000000002273b88 CR3: 00000000b18f6000 CR4: 00000000000006e0
      [55650.808017] Stack:
      [55650.808017]  ffff88020e834e00 ffff880172d68db0 0000000000000000 ffff88019257c800
      [55650.808017]  ffff8801d42ea720 ffff88007153bd10 ffffffffa037d2fa ffff880224e99180
      [55650.808017]  ffff8801469a6188 ffff880224e99140 ffff880172d68c50 00000003000000b7
      [55650.808017] Call Trace:
      [55650.808017]  [<ffffffffa037d2fa>] __btrfs_write_out_cache+0x1ea/0x37f [btrfs]
      [55650.808017]  [<ffffffffa037d959>] btrfs_write_out_cache+0xa1/0xd8 [btrfs]
      [55650.808017]  [<ffffffffa033936b>] btrfs_write_dirty_block_groups+0x4b5/0x505 [btrfs]
      [55650.808017]  [<ffffffffa03aa98e>] commit_cowonly_roots+0x15e/0x1f7 [btrfs]
      [55650.808017]  [<ffffffff813eb9c7>] ? _raw_spin_lock+0xe/0x10
      [55650.808017]  [<ffffffffa0346e46>] btrfs_commit_transaction+0x411/0x882 [btrfs]
      [55650.808017]  [<ffffffffa03432a4>] transaction_kthread+0xf2/0x1a4 [btrfs]
      [55650.808017]  [<ffffffffa03431b2>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
      [55650.808017]  [<ffffffff8105966b>] kthread+0xb7/0xbf
      [55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [55650.808017]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
      [55650.808017]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
      [55650.808017] Code: 4c 89 ef 8d 70 ff e8 d4 fc ff ff 41 8b 45 34 41 39 45 30 7d 5c 31 f6 4c 89 ef e8 80 f6 ff ff 49 8b 7d 00 4c 89 f6 b9 00 04 00 00 <f3> a5 4c 89 ef 41 8b 45 30 8d 70 ff e8 a3 fc ff ff 41 8b 45 34
      [55650.808017] RIP  [<ffffffffa037cf45>] write_bitmap_entries+0x65/0xbb [btrfs]
      [55650.808017]  RSP <ffff88007153bc30>
      [55650.815725] ---[ end trace 1c032e96b149ff86 ]---
      
      Fix this by serializing both tasks in such a way that cache writeout
      doesn't wait for the trim/discard of free space entries to finish and
      doesn't miss any free space entry.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • F
      Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana authored
      Our fs trim operation, which is completely transactionless (doesn't start
      or join an existing transaction) consists of visiting all block groups
      and then for each one to iterate its free space entries and perform a
      discard operation against the space range represented by the free space
      entries. However before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, remove the extent map that maps
      logical address to physical addresses and the corresponding chunk metadata
      from the chunk and device trees. After that and before the discard
      completes, the current running transaction can finish and a new one start,
      allowing for new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  8. 18 Sep, 2014 3 commits
    • F
      Btrfs: improve free space cache management and space allocation · 20005523
      Filipe Manana authored
      While under random IO, a block group's free space cache eventually reaches
      a state where it has a mix of extent entries and bitmap entries representing
      free space regions.
      
      As later free space regions are returned to the cache, some of them are merged
      with existing extent entries if they are contiguous with them. But others are
      not merged, because despite the existence of adjacent free space regions in
      the cache, the merging doesn't happen because the existing free space regions
      are represented in bitmap extents. Even when new free space regions are merged
      with existing extent entries (enlarging the free space range they represent),
      we increase the chances of ending up with an enlarged region that is contiguous
      with some other region represented in a bitmap entry.
      
      Both clustered and non-clustered space allocation work by iterating over our
      extent and bitmap entries and skipping any that represent a region smaller
      than the allocation request (and giving preference to extent entries before
      bitmap entries). By having a contiguous free space region that is represented
      by 2 (or more) entries (mix of extent and bitmap entries), we end up not
      satisfying an allocation request with a size larger than the size of any of
      the entries but no larger than the sum of their sizes. This makes the caller
      assume we're under an ENOSPC condition or forces it to allocate multiple smaller
      space regions (as we do for file data writes), which adds extra overhead and more
      chances of causing fragmentation due to the smaller regions being all spread
      apart from each other (more likely when under concurrency).
      
      For example, if we have the following in the cache:
      
      * extent entry representing free space range: [128Mb - 256Kb, 128Mb[
      
      * bitmap entry covering the range [128Mb, 256Mb[, but only with the bits
        representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that
        space in this 128Mb area is marked as free
      
      An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb,
      would fail before, despite the existence of such contiguous free space area in the
      cache. The caller could only allocate up to 768Kb of space at once and later another
      256Kb (or vice-versa). In between each smaller allocation request, another task
      working on a different file/inode might come in and take that space, preventing the
      former task from getting a contiguous 1Mb region of free space.
      
      Therefore this change implements the ability to move free space from bitmap
      entries into existing and new free space regions represented with extent
      entries. This is done when a space region is added to the cache.
      
      A test was added to the sanity tests that explains in detail the issue too.
      
      Some performance test results with compilebench on a 4-core machine, with
      32Gb of RAM and using an HDD follow.
      
      Test: compilebench -D /mnt -i 30 -r 1000 --makej
      
      Before this change:
      
         intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
         compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
         read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
         delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)
      
      After this change:
      
         intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
         compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
         read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
         delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
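The example in this commit message (a 256Kb extent entry ending at 128Mb, followed by 768Kb of set bits at the start of the adjacent bitmap entry) can be traced with a simplified user-space sketch. Everything here is an assumption made for illustration: the structures, the 4096-byte bitmap unit, the one-byte-per-bit layout and the function name `demo_steal_from_bitmap` are hypothetical and do not reproduce the kernel implementation. The sketch only shows the effect of moving free space from a bitmap entry into the extent entry that ends where the bitmap's free run begins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_UNIT         4096ULL  /* assumed bytes per bitmap bit */
#define DEMO_BITMAP_UNITS 32768    /* 128 MiB / 4 KiB */

/* Hypothetical extent entry: one contiguous free range. */
struct demo_extent {
	uint64_t start;
	uint64_t bytes;
};

/* Hypothetical bitmap entry: one byte per unit, non-zero means free. */
struct demo_bitmap {
	uint64_t start;
	uint8_t free_unit[DEMO_BITMAP_UNITS];
};

/*
 * If the extent ends inside the bitmap's range, move the run of free
 * units that starts at the extent's end from the bitmap into the extent.
 * Assumes the extent's end is aligned to DEMO_UNIT.
 * Returns the number of bytes moved.
 */
static uint64_t demo_steal_from_bitmap(struct demo_extent *e,
				       struct demo_bitmap *b)
{
	uint64_t end = e->start + e->bytes;
	uint64_t moved = 0;
	size_t i;

	if (end < b->start || end >= b->start + DEMO_BITMAP_UNITS * DEMO_UNIT)
		return 0;

	i = (size_t)((end - b->start) / DEMO_UNIT);
	while (i < DEMO_BITMAP_UNITS && b->free_unit[i]) {
		b->free_unit[i] = 0; /* space now lives in the extent entry */
		moved += DEMO_UNIT;
		i++;
	}
	e->bytes += moved;
	return moved;
}
```

After the call, the extent entry covers a single contiguous 1Mb range, so a 1Mb allocation request can be satisfied from one entry.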
    • D
      btrfs: use DIV_ROUND_UP instead of open-coded variants · ed6078f7
      David Sterba authored
      The form
      
        (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT
      
      is equivalent to
      
        (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE
      
      The rest is a simple substitution, no difference in the generated
      assembly code.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
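The equivalence stated in this commit message can be checked with a short user-space sketch. The page shift of 12 and the `DEMO_` names are assumptions for illustration (the kernel defines its own page constants per architecture); the `DIV_ROUND_UP` macro below has the same shape as the kernel's.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1ULL << DEMO_PAGE_SHIFT)

/* Round-up division, same shape as the kernel macro of this name. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Open-coded variant: add (page size - 1), then shift. */
static uint64_t demo_pages_open_coded(uint64_t value)
{
	return (value + DEMO_PAGE_SIZE - 1) >> DEMO_PAGE_SHIFT;
}

/* Macro variant: same arithmetic expressed as a division. */
static uint64_t demo_pages_div_round_up(uint64_t value)
{
	return DIV_ROUND_UP(value, DEMO_PAGE_SIZE);
}
```

Because the divisor is a power of two, the shift and the division give the same result for every unsigned input that does not overflow the addition.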
    • D
      btrfs: cleanup ino cache members of btrfs_root · 57cdc8db
      David Sterba authored
      The naming is confusing, generic yet used for a specific cache. Add a
      prefix 'ino_' or rename appropriately.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 20 Jun, 2014 2 commits
    • M
      Btrfs: fix broken free space cache after the system crashed · e570fd27
      Miao Xie authored
      When we mounted the filesystem after the crash, we got the following
      message:
        BTRFS error (device xxx): block group xxxx has wrong amount of free space
        BTRFS error (device xxx): failed to load free space cache for block group xxx
      
      It is because we didn't update the metadata of the allocated space (in the extent
      tree) until the file data was written to disk. During this time, there was
      no information about the allocated space in either the extent tree or the
      free space cache. When we wrote out the free space cache at this time (commit
      transaction), that space was lost. In fact, only the free space that is
      used to store file data had this problem; the rest didn't because
      its metadata is updated in the same transaction context.
      
      There are several ways to fix the above problem:
      - track the allocated space, and write it out when we write out the free
        space cache
      - account the size of the allocated space that is used to store the file
        data, if the size is not zero, don't write out the free space cache.
      
      The first one is complex and may hurt performance.
      This patch chose the second method: we use a per-block-group variable to
      account the size of that allocated space. Besides that, we also introduce
      a per-block-group read-write semaphore to avoid the race between
      the allocation and the free space cache write out.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • M
      Btrfs: make free space cache write out functions more readable · 5349d6c3
      Miao Xie authored
      This patch makes the free space cache write out functions more readable,
      and besides that, it also reduces the stack space that the function
      __btrfs_write_out_cache uses from 194 bytes to 144 bytes.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 10 Jun, 2014 2 commits
  11. 29 Jan, 2014 2 commits
  12. 12 Nov, 2013 4 commits
  13. 21 Sep, 2013 1 commit
    • M
      Btrfs: allocate the free space by the existed max extent size when ENOSPC · a4820398
      Miao Xie authored
      With the current code, if the requested size is very large and all the extents
      in the free space cache are small, we waste lots of CPU time cutting
      the requested size in half and searching the cache again and again until it gets
      down to the size the allocator can return. In fact, we know the max extent
      size in the cache after the first search, so we needn't cut the size in half
      repeatedly and can just use the max extent size directly. This saves
      lots of CPU time and improves performance when there are only fragments
      in the free space cache.
      
      According to my test, if there are only 4KB free space extents in the fs,
      and the total size of those extents is 256MB, we can reduce the execution
      time of the following test from 5.4s to 1.4s.
        dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync
      
      Changelog v2 -> v3:
      - fix the problem that we skip the block group with the space which is
        less than we need.
      
      Changelog v1 -> v2:
      - address the problem that we return a wrong start position when searching
        the free space in a bitmap.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
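The difference between repeated halving and reusing the largest extent seen on the first search can be sketched in user space. This is a toy model under stated assumptions: the free space cache is a flat array of extent sizes, and `demo_find_extent`, `demo_alloc_by_halving` and `demo_alloc_by_max_extent` are hypothetical names, not kernel functions. It counts cache searches rather than measuring time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical cache lookup: return the index of the first extent that
 * can hold `bytes`, or -1. Also report the largest extent seen so far.
 */
static int demo_find_extent(const uint64_t *extents, size_t n,
			    uint64_t bytes, uint64_t *max_extent)
{
	*max_extent = 0;
	for (size_t i = 0; i < n; i++) {
		if (extents[i] > *max_extent)
			*max_extent = extents[i];
		if (extents[i] >= bytes)
			return (int)i;
	}
	return -1;
}

/* Old behaviour as described: halve the request until a search succeeds. */
static unsigned demo_alloc_by_halving(const uint64_t *extents, size_t n,
				      uint64_t bytes, uint64_t *got)
{
	unsigned searches = 0;
	uint64_t max_extent;

	*got = 0;
	while (bytes > 0) {
		searches++;
		if (demo_find_extent(extents, n, bytes, &max_extent) >= 0) {
			*got = bytes;
			break;
		}
		bytes /= 2;
	}
	return searches;
}

/* New behaviour as described: retry once with the max extent size. */
static unsigned demo_alloc_by_max_extent(const uint64_t *extents, size_t n,
					 uint64_t bytes, uint64_t *got)
{
	uint64_t max_extent, unused;

	*got = 0;
	if (demo_find_extent(extents, n, bytes, &max_extent) >= 0) {
		*got = bytes;
		return 1;
	}
	if (max_extent == 0)
		return 1; /* cache is empty, nothing to retry with */
	if (demo_find_extent(extents, n, max_extent, &unused) >= 0)
		*got = max_extent;
	return 2;
}
```

For a 1 MiB request against a cache holding only 4 KiB extents, the halving loop searches once per power of two from 1 MiB down to 4 KiB, while the max-extent variant searches twice.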
  14. 13 Sep, 2013 1 commit
  15. 01 Sep, 2013 4 commits
  16. 14 Jun, 2013 1 commit