1. 07 Mar, 2013 1 commit
    • Btrfs: improve the delayed inode throttling · de3cb945
      Committed by Chris Mason
      The delayed inode code batches up changes to the btree in hopes of doing
      them in bulk.  As the changes build up, processes kick off worker
      threads and wait for them to make progress.
      
      The current code kicks off an async work queue item for each delayed
      node, which creates a lot of churn.  It also uses a fixed 1 HZ waiting
      period for the throttle, which allows us to build a lot of pending
      work and can slow down the commit.
      
      This changes us to watch a sequence counter as it is bumped during the
      operations.  We kick off fewer work items and have each work item do
      more work.
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
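The sequence-counter idea in this commit message can be sketched as follows. This is a minimal single-threaded model, not the kernel code: the struct, the function names and both tunables (`BACKLOG_LIMIT`, `SEQ_BATCH`) are assumptions made for illustration, and the real code would use atomics and a wait queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tunables; the commit message gives no concrete values. */
#define BACKLOG_LIMIT 128UL /* below this many pending items, nobody waits */
#define SEQ_BATCH      32UL /* completed items that release a waiter */

/* Hypothetical stand-in for the per-filesystem delayed root. */
struct delayed_root_model {
    unsigned long items;     /* delayed items still pending */
    unsigned long items_seq; /* bumped once per completed item */
};

/* Worker side: finishing one item shrinks the backlog and bumps the counter. */
static void complete_one_item(struct delayed_root_model *root)
{
    assert(root->items > 0);
    root->items--;
    root->items_seq++;
}

/*
 * Waiter side: a throttled process records items_seq when it starts waiting
 * and is released once the backlog is small or enough progress was made,
 * instead of sleeping for a fixed period.
 */
static bool waiter_may_proceed(const struct delayed_root_model *root,
                               unsigned long start_seq)
{
    if (root->items < BACKLOG_LIMIT)
        return true;
    return root->items_seq - start_seq >= SEQ_BATCH;
}
```

Tying the wait to observed progress rather than to a timer is what keeps the backlog from growing during a fixed sleep.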
  2. 21 Feb, 2013 1 commit
  3. 20 Feb, 2013 2 commits
  4. 13 Dec, 2012 1 commit
  5. 12 Dec, 2012 1 commit
    • Btrfs: improve the noflush reservation · 08e007d2
      Committed by Miao Xie
      In some places (such as when evicting an inode), we cannot flush the reserved
      space of delalloc, but flushing the delayed directory index and delayed inode
      is OK. However, we don't try to flush those things and just go back when there
      is not enough space to be reserved. This patch fixes this problem.
      
      We define 3 types of flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
      If we are in a transaction, we should not flush anything, or a deadlock
      would happen, so use NO_FLUSH. If flushing the reserved space of delalloc
      would cause a deadlock, use FLUSH_LIMIT. In all other cases, FLUSH_ALL is used,
      and we flush everything.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
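The three flush types described above could be modelled roughly like this. The enum values reuse the names from the commit message; `struct reserve_ctx`, its fields and the helper functions are hypothetical and only illustrate the decision rule as the message states it.

```c
#include <stdbool.h>

/* Flush types named in the commit message. */
enum flush_mode { NO_FLUSH, FLUSH_LIMIT, FLUSH_ALL };

/* Hypothetical description of the caller's context. */
struct reserve_ctx {
    bool in_transaction;        /* flushing anything could deadlock */
    bool delalloc_flush_unsafe; /* flushing delalloc space could deadlock */
};

/* Pick the most aggressive flush type that is safe for this caller. */
static enum flush_mode pick_flush_mode(const struct reserve_ctx *ctx)
{
    if (ctx->in_transaction)
        return NO_FLUSH;
    if (ctx->delalloc_flush_unsafe)
        return FLUSH_LIMIT;
    return FLUSH_ALL;
}

/* Delayed directory index and delayed inode items: anything but NO_FLUSH. */
static bool may_flush_delayed_items(enum flush_mode mode)
{
    return mode != NO_FLUSH;
}

/* Reserved delalloc space: only when everything may be flushed. */
static bool may_flush_delalloc(enum flush_mode mode)
{
    return mode == FLUSH_ALL;
}
```

Under this model, inode eviction would set `delalloc_flush_unsafe` and so still get the delayed items flushed, which is the case the patch says was previously skipped.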
  6. 02 Oct, 2012 2 commits
  7. 21 Sep, 2012 1 commit
  8. 29 Aug, 2012 2 commits
  9. 24 Jul, 2012 2 commits
    • Btrfs: zero unused bytes in inode item · 293f7e07
      Committed by Li Zefan
      The otime field is not zeroed, so users will see random otime in an old
      filesystem with a new kernel which has otime support in the future.
      
      The reserved bytes are also not zeroed, and we'll have compatibility
      issues if we make use of those bytes.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
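A minimal sketch of the fix pattern, assuming a simplified item layout: the struct below is hypothetical and much smaller than the real on-disk inode item, but it shows why zeroing the whole item before filling the known fields keeps `otime` and the reserved bytes deterministic.

```c
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical layout; not the real on-disk inode item. */
struct timespec_model {
    uint64_t sec;
    uint32_t nsec;
};

struct inode_item_model {
    uint64_t size;
    uint64_t reserved[4];        /* kept for future use */
    struct timespec_model mtime;
    struct timespec_model otime; /* not filled in by this code yet */
};

/* Zero everything first so unused fields never carry stale memory. */
static void fill_inode_item(struct inode_item_model *item,
                            uint64_t size, uint64_t mtime_sec)
{
    memset(item, 0, sizeof(*item));
    item->size = size;
    item->mtime.sec = mtime_sec;
}
```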
    • Btrfs: flush delayed inodes if we're short on space · 96c3f433
      Committed by Josef Bacik
      Those crazy gentoo guys have been complaining about ENOSPC errors on their
      portage volumes.  This is because doing things like untar tends to create
      lots of new files which will soak up all the reservation space in the
      delayed inodes.  Usually this gets papered over by the fact that we will try
      and commit the transaction, however if this happens in the wrong spot or we
      choose not to commit the transaction you will be screwed.  So add the
      ability to explicitly flush delayed inodes to free up space.  Please test
      this out guys to make sure it works since as usual I cannot reproduce.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
  10. 15 Jun, 2012 1 commit
  11. 30 May, 2012 2 commits
    • Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Committed by Josef Bacik
      Miao pointed this out while I was working on an orphan problem that messing
      with a bitfield where different ranges are protected by different locks
      doesn't work out right.  Turns out we've been doing this forever where we
      have different parts of the bit field protected by either no lock at all or
      different locks which could cause all sorts of weird problems including the
      issue I was hitting.  So instead make a runtime_flags thing that we use the
      normal bit operations on that are all atomic so we can keep having our
      no/different locking for the different flags and then make force_compress
      its own thing so it can be treated normally.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
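The hazard and the fix can be illustrated in user space with C11 atomics. This is only an analogy for the kernel's atomic bit operations: the flag numbers, the struct and the helper names below are hypothetical. Each helper is a single atomic read-modify-write on one word, so flags guarded by different locks (or by none) cannot clobber each other the way adjacent C bitfield members can.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers; the real flag names are not given above. */
enum { FLAG_A = 0, FLAG_B = 1 };

struct inode_flags_model {
    atomic_ulong runtime_flags; /* one bit per flag, updated atomically */
    int force_compress;         /* split out into its own plain field */
};

/* Atomically set one bit without disturbing its neighbours. */
static void flag_set(struct inode_flags_model *ino, int nr)
{
    (void)atomic_fetch_or(&ino->runtime_flags, 1UL << nr);
}

/* Atomically clear one bit without disturbing its neighbours. */
static void flag_clear(struct inode_flags_model *ino, int nr)
{
    (void)atomic_fetch_and(&ino->runtime_flags, ~(1UL << nr));
}

static bool flag_test(struct inode_flags_model *ino, int nr)
{
    return (atomic_load(&ino->runtime_flags) & (1UL << nr)) != 0;
}

/* Set the bit and report whether it was already set. */
static bool flag_test_and_set(struct inode_flags_model *ino, int nr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or(&ino->runtime_flags, mask) & mask) != 0;
}
```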
    • Btrfs: use i_version instead of our own sequence · 0c4d2d95
      Committed by Josef Bacik
      We've been keeping around the inode sequence number in hopes that somebody
      would use it, but nobody uses it and people actually use i_version which
      serves the same purpose, so use i_version where we used the incore inode's
      sequence number and that way the sequence is updated properly across the
      board, and not just in file write.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  12. 22 Mar, 2012 2 commits
  13. 17 Jan, 2012 1 commit
  14. 16 Dec, 2011 1 commit
  15. 11 Nov, 2011 1 commit
    • Btrfs: tweak the delayed inode reservations again · 2115133f
      Committed by Chris Mason
      Josef sent along an incremental to the inode reservation
      code to make sure we try and fall back to directly updating
      the inode item if things go horribly wrong.
      
      This reworks that patch slightly, adding a fallback function
      that will always try to update the inode item directly without
      going through the delayed_inode code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  16. 09 Nov, 2011 1 commit
    • Btrfs: fix our reservations for updating an inode when completing io · 7fd2ae21
      Committed by Josef Bacik
      People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
      we try to steal from the delalloc block rsv to satisfy a reservation to update
      the inode.  The problem with this is we don't explicitly save space for updating
      the inode when doing delalloc.  This is kind of a problem and we've gotten away
      with this because way back when we just stole from the delalloc reserve without
      any questions, and this worked out fine because generally speaking the leaf had
      been modified either by the mtime update when we did the original write or
      because we just updated the leaf when we inserted the file extent item, only on
      rare occasions had the leaf not actually been modified, and that was still ok
      because we'd just use a block or two out of the over-reservation that is
      delalloc.
      
      Then came the delayed inode stuff.  This is amazing, except it wants a full
      reservation for updating the inode since it may do it at some point down the
      road after we've written the blocks and we have to recow everything again.  This
      worked out because the delayed inode stuff just stole from the global reserve,
      that is until recently when I changed that because it caused other problems.
      
      So here we are, we're doing everything right and being screwed for it.  So take
      an extra reservation for the inode at delalloc reservation time and carry it
      through the life of the delalloc reservation.  If we need it we can steal it in
      the delayed inode stuff.  If we have already stolen it try and do a normal
      metadata reservation.  If that fails try to steal from the delalloc reservation.
      If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
      solve this and in the meantime we'll steal from the global reserve.
      
      With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
      any problems.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
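The fallback order described in the third paragraph of this commit message can be written down as a small decision function. Everything here is a hypothetical model: the struct, its fields and the enum are illustrative, and the `warnings` counter merely stands in for the WARN_ON() the message mentions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Where the space for the inode update ended up coming from. */
enum rsv_source {
    FROM_EXTRA_INODE_RSV, /* extra reservation taken at delalloc time */
    FROM_METADATA_RSV,    /* normal metadata reservation */
    FROM_DELALLOC_RSV,    /* stolen from the delalloc reservation */
    FROM_GLOBAL_RSV       /* last resort, after a warning */
};

/* Hypothetical accounting state. */
struct rsv_state {
    bool extra_inode_rsv_available; /* not yet stolen */
    uint64_t metadata_free;         /* bytes a normal reservation can get */
    uint64_t delalloc_reserved;     /* bytes held by the delalloc reserve */
    unsigned int warnings;          /* stands in for WARN_ON() */
};

/* Try each source in the order the commit message lists them. */
static enum rsv_source reserve_for_inode_update(struct rsv_state *s,
                                                uint64_t bytes)
{
    if (s->extra_inode_rsv_available) {
        s->extra_inode_rsv_available = false;
        return FROM_EXTRA_INODE_RSV;
    }
    if (s->metadata_free >= bytes) {
        s->metadata_free -= bytes;
        return FROM_METADATA_RSV;
    }
    if (s->delalloc_reserved >= bytes) {
        s->delalloc_reserved -= bytes;
        return FROM_DELALLOC_RSV;
    }
    s->warnings++;
    return FROM_GLOBAL_RSV;
}
```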
  17. 06 Nov, 2011 2 commits
    • Btrfs: fix delayed insertion reservation · c06a0e12
      Committed by Josef Bacik
      We all keep getting those stupid warnings from use_block_rsv when running
      stress.sh, and it's because the delayed insertion stuff is being stupid.  It's
      not the delayed insertion stuff's fault, it's all just stupid.  When marking an
      inode dirty for oh say updating the time on it, we just do a
      btrfs_join_transaction, which doesn't reserve any space.  This is stupid because
      we're going to have to have space reserved to make this change, but we do it
      because it's fast because chances are we're going to call it over and over again
      and it doesn't matter.  Well thanks to the delayed insertion stuff this is
      mostly the case, so we do actually need to make this reservation.  So if
      trans->bytes_reserved is 0 then try to do a normal reservation.  If not return
      ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
      will let it do the whole ENOSPC dance and reserve enough space for the delayed
      insertion to steal the reservation from the transaction.
      
      The other stupid thing we do is not reserve space for the inode when writing to
      the thing.  Usually this is ok since we have to update the time so we'd have
      already done all this work before we get to the endio stuff, so it doesn't
      matter.  But this is stupid because we could write the data after the
      transaction commits where we changed the mtime of the inode so we have to cow
      all the way down to the inode anyway.  This used to be masked by the delalloc
      reservation stuff, but because we delay the update it doesn't get masked in this
      case.  So again the delayed insertion stuff bites us in the ass.  So if our
      trans->block_rsv is delalloc, just steal the reservation from the delalloc
      reserve.  Hopefully this won't bite us in the ass, but I've said that before.
      
      With this patch stress.sh no longer spits out those stupid warnings (famous last
      words).  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: make a delayed_block_rsv for the delayed item insertion · 6d668dda
      Committed by Josef Bacik
      I've been hitting warnings in use_block_rsv when running the delayed insertion
      stuff.  It's because we will readjust global block rsv based on what is in use,
      which means we could end up discarding reservations that are for the delayed
      insertion stuff.  So instead create a separate block rsv for the delayed
      insertion stuff.  This will also make it easier to debug problems with the
      delayed insertion reservations since we will know that only the delayed
      insertion code touches this block_rsv.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  18. 02 Nov, 2011 1 commit
  19. 28 Jul, 2011 1 commit
    • Btrfs: switch the btrfs tree locks to reader/writer · bd681513
      Committed by Chris Mason
      The btrfs metadata btree is the source of significant
      lock contention, especially in the root node.   This
      commit changes our locking to use a reader/writer
      lock.
      
      The lock is built on top of rw spinlocks, and it
      extends the lock tracking to remember if we have a
      read lock or a write lock when we go to blocking.  Atomics
      count the number of blocking readers or writers at any
      given time.
      
      It removes all of the adaptive spinning from the old code
      and uses only the spinning/blocking hints inside of btrfs
      to decide when it should continue spinning.
      
      In read heavy workloads this is dramatically faster.  In write
      heavy workloads we're still faster because of less contention
      on the root node lock.
      
      We suffer slightly in dbench because we schedule more often
      during write locks, but all other benchmarks so far are improved.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  20. 27 Jun, 2011 1 commit
    • btrfs: fix inconsonant inode information · 2f7e33d4
      Committed by Miao Xie
      When iputting the inode, we may leave its delayed node behind if it still has
      delayed items that have not been dealt with. So when the inode is read again,
      we must look up the corresponding delayed node and use the information in it
      to initialize the inode. Otherwise we get inconsistent inode information, which
      may cause the same directory index number to be allocated again and trigger
      the following oops:
      
      [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
      insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
      [ 5447.569766] ------------[ cut here ]------------
      [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
      [SNIP]
      [ 5447.790721] Call Trace:
      [ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
      [ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
      [ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
      [ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
      [ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
      [ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
      [ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
      [ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
      [ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
      [ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
      [ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
      [ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
      [ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
      [ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
      [ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
      [ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
      [ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
      [ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
      
      Fix it by reusing the old delayed node.
      Reported-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Tested-by: Jim Schutt <jaschut@sandia.gov>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
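A rough sketch of the fix idea, under stated assumptions: the per-root table is modelled as a small fixed array, and all names (`root_model`, `find_delayed_node`, `read_inode`, `index_cnt`) are hypothetical. The point is only that a surviving delayed node holds newer state than the tree, so the re-read inode must be initialised from it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 8 /* arbitrary size for this model */

/* Delayed node that may outlive the in-memory inode. */
struct delayed_node_model {
    bool in_use;
    uint64_t ino;
    uint64_t index_cnt; /* next directory index to hand out */
};

/* Hypothetical per-root table of delayed nodes. */
struct root_model {
    struct delayed_node_model nodes[MAX_NODES];
};

struct inode_read_model {
    uint64_t ino;
    uint64_t index_cnt;
};

static struct delayed_node_model *find_delayed_node(struct root_model *root,
                                                    uint64_t ino)
{
    for (size_t i = 0; i < MAX_NODES; i++) {
        if (root->nodes[i].in_use && root->nodes[i].ino == ino)
            return &root->nodes[i];
    }
    return NULL;
}

/*
 * tree_index_cnt is what the b+ tree alone would suggest; it is stale
 * while delayed insertions for this directory are still pending.
 */
static void read_inode(struct root_model *root, struct inode_read_model *inode,
                       uint64_t ino, uint64_t tree_index_cnt)
{
    struct delayed_node_model *node = find_delayed_node(root, ino);

    inode->ino = ino;
    inode->index_cnt = node ? node->index_cnt : tree_index_cnt;
}
```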
  21. 18 Jun, 2011 2 commits
    • Btrfs: avoid delayed metadata items during commits · e999376f
      Committed by Chris Mason
      Snapshot creation has two phases.  One is the initial snapshot setup,
      and the second is done during commit, while nobody is allowed to modify
      the root we are snapshotting.
      
      The delayed metadata insertion code can break that rule: it does a
      delayed inode update on the inode of the parent of the snapshot,
      and a delayed directory item insertion.
      
      This makes sure to run the pending delayed operations before we
      record the snapshot root, which avoids corruptions.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix wrong reservation when doing delayed inode operations · 19fd2949
      Committed by Miao Xie
      We have migrated the space for the delayed inode items from
      trans_block_rsv to global_block_rsv, but we forgot to set trans->block_rsv to
      global_block_rsv when doing delayed inode operations, and the following Oops
      happened:
      
      [ 9792.654889] ------------[ cut here ]------------
      [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
      btrfs_alloc_free_block+0xca/0x27c [btrfs]()
      [ 9792.654899] Hardware name: To Be Filled By O.E.M.
      [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
      ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
      arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
      snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
      pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
      i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
      mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
      ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
      drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
      [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
      [ 9792.654920] Call Trace:
      [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
      [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
      [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c [btrfs]
      [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
      [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58 [btrfs]
      [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368 [btrfs]
      [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
      [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
      [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
      [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
      [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
      [ 9792.655039]  [<ffffffffa03d41b1>] btrfs_update_delayed_inode+0x72/0x137 [btrfs]
      [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb [btrfs]
      [ 9792.655062]  [<ffffffffa039a69b>] btrfs_commit_transaction+0x228/0x654 [btrfs]
      [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
      [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
      [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
      [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
      [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
      [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
      [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
      [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
      [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
      [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
      [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
      
      This patch fixes it by setting the reservation of the transaction handle to the
      correct one.
      Reported-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  22. 04 Jun, 2011 2 commits
  23. 22 May, 2011 1 commit
  24. 21 May, 2011 1 commit
    • btrfs: implement delayed inode items operation · 16cdcec7
      Committed by Miao Xie
      Changelog V5 -> V6:
      - Fix oom when the memory load is high, by storing the delayed nodes into the
        root's radix tree, and letting btrfs inodes go.
      
      Changelog V4 -> V5:
      - Fix the race on adding the delayed node to the inode, which was spotted by
        Chris Mason.
      - Merge Chris Mason's incremental patch into this patch.
      - Fix the deadlock between readdir() and memory fault, which was reported by
        Itaru Kitayama.
      
      Changelog V3 -> V4:
      - Fix the nested lock, which was reported by Itaru Kitayama, by updating the
        space cache inode in time.
      
      Changelog V2 -> V3:
      - Fix the race between the delayed worker and the task that does the delayed
        items balance, which was reported by Tsutomu Itoh.
      - Modify the patch to address David Sterba's comments.
      - Fix the cpu recursion spinlock bug, reported by Chris Mason.
      
      Changelog V1 -> V2:
      - Break up the global rb-tree and use a list to manage the delayed nodes,
        which are created for every directory and file and are used to manage the
        delayed directory name index items and the delayed inode item.
      - Introduce a worker to deal with the delayed nodes.
      
      Compared with Ext3/4, the performance of file creation and deletion on btrfs
      is very poor. The reason is that btrfs must do a lot of b+ tree insertions,
      such as the inode item, directory name item, directory name index and so on.
      
      If we can delay some b+ tree insertions or deletions, we can improve the
      performance, so we made this patch, which implements delayed directory name
      index insertion/deletion and delayed inode update.
      
      Implementation:
      - Introduce a delayed root object into the filesystem, which uses two lists
        to manage the delayed nodes that are created for every file/directory.
        One list manages all the delayed nodes that have delayed items. The
        other manages the delayed nodes that are waiting to be dealt with
        by the worker thread.
      - Every delayed node has two rb-trees: one manages the directory name
        index items that are going to be inserted into the b+ tree, and the other
        manages the directory name index items that are going to be deleted from
        the b+ tree.
      - Introduce a worker to deal with the delayed operations: the insertion
        and deletion of delayed directory name index items and the delayed inode
        update.
        When the number of delayed items exceeds the lower limit, we create work
        items for some delayed nodes, insert them into the worker's work queue,
        and then return.
        When the number of delayed items exceeds the upper bound, we create work
        items for all the delayed nodes that haven't been dealt with, insert them
        into the worker's work queue, and then wait until the number of untreated
        items drops below some threshold value.
      - When we want to insert a directory name index into the b+ tree, we just add
        the information to the delayed insertion rb-tree.
        We then check the number of delayed items and balance them.
        (The balance policy is described above.)
      - When we want to delete a directory name index from the b+ tree, we first
        search for it in the insertion rb-tree. If we find it, we just drop it. If
        not, we add its key to the delayed deletion rb-tree.
        As with the delayed insertion rb-tree, we also check the number of
        delayed items and balance them.
        (The same as for insertion.)
      - When we want to update the metadata of an inode, we cache the inode's data
        in the delayed node. The worker flushes it into the b+ tree after
        dealing with the delayed insertions and deletions.
      - We move the delayed node to the tail of the list after we access it.
        This way, we can cache more delayed items and merge more inode updates.
      - When we commit a transaction, we deal with all the delayed nodes.
      - The delayed node is freed when we free the btrfs inode.
      - Before we log the inode items, we commit all the directory name index items
        and the delayed inode update.
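Two pieces of the implementation list above lend themselves to a short sketch: the two-threshold balance policy, and the way a deletion cancels a still-pending insertion. The code below is a simplified model, not the patch itself: plain arrays stand in for the rb-trees, and the threshold values and all names are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOWER_LIMIT 64  /* hypothetical: start queueing some work */
#define UPPER_BOUND 512 /* hypothetical: queue everything and wait */
#define MAX_ITEMS   16  /* arbitrary capacity for this model */

enum balance_action { BALANCE_NONE, QUEUE_SOME, QUEUE_ALL_AND_WAIT };

/* Balance policy: compare the pending item count with both thresholds. */
static enum balance_action balance_delayed_items(size_t pending)
{
    if (pending > UPPER_BOUND)
        return QUEUE_ALL_AND_WAIT;
    if (pending > LOWER_LIMIT)
        return QUEUE_SOME;
    return BALANCE_NONE;
}

/* Unordered array standing in for one rb-tree of directory indexes. */
struct index_set {
    uint64_t idx[MAX_ITEMS];
    size_t count;
};

static bool set_add(struct index_set *s, uint64_t index)
{
    if (s->count == MAX_ITEMS)
        return false;
    s->idx[s->count++] = index;
    return true;
}

static bool set_remove(struct index_set *s, uint64_t index)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->idx[i] == index) {
            s->count--;
            s->idx[i] = s->idx[s->count]; /* fill the hole with the last entry */
            return true;
        }
    }
    return false;
}

struct delayed_node_model {
    struct index_set ins; /* indexes waiting to be inserted */
    struct index_set del; /* indexes waiting to be deleted */
};

/*
 * Returns true when the deletion was absorbed by dropping a pending
 * insertion, so that neither operation ever reaches the b+ tree.
 */
static bool delayed_delete_index(struct delayed_node_model *node, uint64_t index)
{
    if (set_remove(&node->ins, index))
        return true;
    set_add(&node->del, index);
    return false;
}
```

Cancelling an insert/delete pair in memory is one reason batching helps workloads that create and remove short-lived files.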
      
      I did a quick test with the benchmark tool [1] and found that file creation
      performance improves by ~15% and file deletion by ~20%.
      
      Before applying this patch:
      Create files:
              Total files: 50000
              Total time: 1.096108
              Average time: 0.000022
      Delete files:
              Total files: 50000
              Total time: 1.510403
              Average time: 0.000030
      
      After applying this patch:
      Create files:
              Total files: 50000
              Total time: 0.932899
              Average time: 0.000019
      Delete files:
              Total files: 50000
              Total time: 1.215732
              Average time: 0.000024
      
      [1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
      
      Many thanks for Kitayama-san's help!
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: David Sterba <dave@jikos.cz>
      Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>