1. 28 7月, 2011 1 次提交
    • J
      Btrfs: use a worker thread to do caching · bab39bf9
      Josef Bacik 提交于
      A user reported a deadlock when copying a bunch of files.  This is because they
      were low on memory and kthreadd got hung up trying to migrate pages for an
      allocation when starting the caching kthread.  The page was locked by the person
      starting the caching kthread.  To fix this we just need to use the async thread
      stuff so that the threads are already created and we don't have to worry about
      deadlocks.  Thanks,
      Reported-by: NRoman Mamedov <rm@romanrm.ru>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      bab39bf9
  2. 11 7月, 2011 5 次提交
    • J
      Btrfs: fix how we merge extent states and deal with cached states · df98b6e2
      Josef Bacik 提交于
      First, we can sometimes free the state we're merging, which means anybody who
      calls merge_state() may have the state it passed in free'ed.  This is
      problematic because we could end up caching the state, which makes caching
      useless as the state will no longer be part of the tree.  So instead of free'ing
      the state we passed into merge_state(), set it's end to the other->end and free
      the other state.  This way we are sure to cache the correct state.  Also because
      we can merge states together, instead of only using the cache'd state if it's
      start == the start we are looking for, go ahead and use it if the start we are
      looking for is within the range of the cached state.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      df98b6e2
    • J
      Btrfs: use the normal checksumming infrastructure for free space cache · 2f356126
      Josef Bacik 提交于
      We used to store the checksums of the space cache directly in the space cache,
      however that doesn't work out too well if we have more space than we can fit the
      checksums into the first page.  So instead use the normal checksumming
      infrastructure.  There were problems with doing this originally but those
      problems don't exist now so this works out fine.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      2f356126
    • J
      Btrfs: serialize flushers in reserve_metadata_bytes · fdb5effd
      Josef Bacik 提交于
      We keep having problems with early enospc, and that's because our method of
      making space is inherently racy.  The problem is we can have one guy trying to
      make space for himself, and in the meantime people come in and steal his
      reservation.  In order to stop this we make a waitqueue and put anybody who
      comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
      make more space.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      fdb5effd
    • J
      Btrfs: do transaction space reservation before joining the transaction · b5009945
      Josef Bacik 提交于
      We have to do weird things when handling enospc in the transaction joining code.
      Because we've already joined the transaction we cannot commit the transaction
      within the reservation code since it will deadlock, so we have to return EAGAIN
      and then make sure we don't retry too many times.  Instead of doing this, just
      do the reservation the normal way before we join the transaction, that way we
      can do whatever we want to try and reclaim space, and then if it fails we know
      for sure we are out of space and we can return ENOSPC.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      b5009945
    • J
      Btrfs: try to only do one btrfs_search_slot in do_setxattr · fa09200b
      Josef Bacik 提交于
      I've been watching how many btrfs_search_slot()'s we do and I noticed that when
      we create a file with selinux enabled we were doing 2 each time we initialize
      the security context.  That's because we lookup the xattr first so we can delete
      it if we're setting a new value to an existing xattr.  But in the create case we
      don't have any xattrs, so it is completely useless to have the extra lookup.  So
      re-arrange things so that we only lookup first if we specifically have
      XATTR_REPLACE.  That way in the basic case we only do 1 search, and in the more
      complicated case we do the normal 2 lookups.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      fa09200b
  3. 07 7月, 2011 3 次提交
    • M
      btrfs: fix oops when doing space balance · 149e2d76
      Miao Xie 提交于
      We need to make sure the data relocation inode doesn't go through
      the delayed metadata updates, otherwise we get an oops during balance:
      
      kernel BUG at fs/btrfs/relocation.c:4303!
      [SNIP]
      Call Trace:
       [<ffffffffa03143fd>] ? update_ref_for_cow+0x22d/0x330 [btrfs]
       [<ffffffffa0314951>] __btrfs_cow_block+0x451/0x5e0 [btrfs]
       [<ffffffffa031355d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0314beb>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa031acae>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa032d8af>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
       [<ffffffff8147bf0e>] ? mutex_lock+0x1e/0x50
       [<ffffffffa0380cf1>] btrfs_update_delayed_inode+0x71/0x160 [btrfs]
       [<ffffffffa037ff27>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs]
       [<ffffffffa0381cf8>] btrfs_run_delayed_items+0xe8/0x120 [btrfs]
       [<ffffffffa03365e0>] btrfs_commit_transaction+0x250/0x850 [btrfs]
       [<ffffffff810f91d9>] ? find_get_pages+0x39/0x130
       [<ffffffffa0336cd5>] ? join_transaction+0x25/0x250 [btrfs]
       [<ffffffff81081de0>] ? wake_up_bit+0x40/0x40
       [<ffffffffa03785fa>] prepare_to_relocate+0xda/0xf0 [btrfs]
       [<ffffffffa037f2bb>] relocate_block_group+0x4b/0x620 [btrfs]
       [<ffffffffa0334cf5>] ? btrfs_clean_old_snapshots+0x35/0x150 [btrfs]
       [<ffffffffa037fa43>] btrfs_relocate_block_group+0x1b3/0x2e0 [btrfs]
       [<ffffffffa0368ec0>] ? btrfs_tree_unlock+0x50/0x50 [btrfs]
       [<ffffffffa035e39b>] btrfs_relocate_chunk+0x8b/0x670 [btrfs]
       [<ffffffffa031303d>] ? btrfs_set_path_blocking+0x3d/0x50 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa031bea1>] ? btrfs_previous_item+0xb1/0x150 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa035f5aa>] btrfs_balance+0x21a/0x2b0 [btrfs]
       [<ffffffffa0368898>] btrfs_ioctl+0x798/0xd20 [btrfs]
       [<ffffffff8111e358>] ? handle_mm_fault+0x148/0x270
       [<ffffffff814809e8>] ? do_page_fault+0x1d8/0x4b0
       [<ffffffff81160d6a>] do_vfs_ioctl+0x9a/0x540
       [<ffffffff811612b1>] sys_ioctl+0xa1/0xb0
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa037c1cc>] btrfs_reloc_cow_block+0x22c/0x270 [btrfs]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      149e2d76
    • J
      Btrfs: don't panic if we get an error while balancing V2 · 508794eb
      Josef Bacik 提交于
      A user reported an error where if we try to balance an fs after a device has
      been removed it will blow up.  This is because we get an EIO back and this is
      where BUG_ON(ret) bites us in the ass.  To fix we just exit.  Thanks,
      Reported-by: NAnand Jain <Anand.Jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      508794eb
    • D
      btrfs: add missing options displayed in mount output · 0942caa3
      David Sterba 提交于
      There are three missed mount options settable by user which are not
      currently displayed in mount output.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      0942caa3
  4. 27 6月, 2011 1 次提交
    • M
      btrfs: fix inconsonant inode information · 2f7e33d4
      Miao Xie 提交于
      When iputting the inode, We may leave the delayed nodes if they have some
      delayed items that have not been dealt with. So when the inode is read again,
      we must look up the relative delayed node, and use the information in it to
      initialize the inode. Or we will get inconsonant inode information, it may
      cause that the same directory index number is allocated again, and hit the
      following oops:
      
      [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
      insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
      [ 5447.569766] ------------[ cut here ]------------
      [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
      [SNIP]
      [ 5447.790721] Call Trace:
      [ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
      [ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
      [ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
      [ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
      [ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
      [ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
      [ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
      [ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
      [ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
      [ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
      [ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
      [ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
      [ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
      [ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
      [ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
      [ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
      [ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
      [ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
      
      Fix it by reusing the old delayed node.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2f7e33d4
  5. 25 6月, 2011 3 次提交
  6. 18 6月, 2011 6 次提交
    • C
      Btrfs: avoid delayed metadata items during commits · e999376f
      Chris Mason 提交于
      Snapshot creation has two phases.  One is the initial snapshot setup,
      and the second is done during commit, while nobody is allowed to modify
      the root we are snapshotting.
      
      The delayed metadata insertion code can break that rule, it does a
      delayed inode update on the inode of the parent of the snapshot,
      and delayed directory item insertion.
      
      This makes sure to run the pending delayed operations before we
      record the snapshot root, which avoids corruptions.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e999376f
    • D
      btrfs: fix uninitialized return value · 35a30d7c
      David Sterba 提交于
      When allocation fails in btrfs_read_fs_root_no_name, ret is not set
      although it is returned, holding a garbage value.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      35a30d7c
    • M
      btrfs: fix wrong reservation when doing delayed inode operations · 19fd2949
      Miao Xie 提交于
      We have migrated the space for the delayed inode items from
      trans_block_rsv to global_block_rsv, but we forgot to set trans->block_rsv to
      global_block_rsv when we doing delayed inode operations, and the following Oops
      happened:
      
      [ 9792.654889] ------------[ cut here ]------------
      [ 9792.654898] WARNING: at fs/btrfs/extent-tree.c:5681
      btrfs_alloc_free_block+0xca/0x27c [btrfs]()
      [ 9792.654899] Hardware name: To Be Filled By O.E.M.
      [ 9792.654900] Modules linked in: btrfs zlib_deflate libcrc32c
      ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables
      arc4 rt61pci rt2x00pci rt2x00lib snd_hda_codec_hdmi mac80211
      snd_hda_codec_realtek cfg80211 snd_hda_intel edac_core snd_seq rfkill
      pcspkr serio_raw snd_hda_codec eeprom_93cx6 edac_mce_amd sp5100_tco
      i2c_piix4 k10temp snd_hwdep snd_seq_device snd_pcm floppy r8169 xhci_hcd
      mii snd_timer snd soundcore snd_page_alloc ipv6 firewire_ohci pata_acpi
      ata_generic firewire_core pata_via crc_itu_t radeon ttm drm_kms_helper
      drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
      [ 9792.654919] Pid: 2762, comm: rm Tainted: G        W   2.6.39+ #1
      [ 9792.654920] Call Trace:
      [ 9792.654922]  [<ffffffff81053c4a>] warn_slowpath_common+0x83/0x9b
      [ 9792.654925]  [<ffffffff81053c7c>] warn_slowpath_null+0x1a/0x1c
      [ 9792.654933]  [<ffffffffa038e747>] btrfs_alloc_free_block+0xca/0x27c [btrfs]
      [ 9792.654945]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.654953]  [<ffffffffa038189b>] __btrfs_cow_block+0xfc/0x30c [btrfs]
      [ 9792.654963]  [<ffffffffa0396aa6>] ? btrfs_buffer_uptodate+0x47/0x58 [btrfs]
      [ 9792.654970]  [<ffffffffa0382e48>] ? read_block_for_search+0x94/0x368 [btrfs]
      [ 9792.654978]  [<ffffffffa0381ba9>] btrfs_cow_block+0xfe/0x146 [btrfs]
      [ 9792.654986]  [<ffffffffa03848b0>] btrfs_search_slot+0x14d/0x4b6 [btrfs]
      [ 9792.654997]  [<ffffffffa03b8562>] ? map_extent_buffer+0x6e/0xa8 [btrfs]
      [ 9792.655022]  [<ffffffffa03938e8>] btrfs_lookup_inode+0x2f/0x8f [btrfs]
      [ 9792.655025]  [<ffffffff8147afac>] ? _cond_resched+0xe/0x22
      [ 9792.655027]  [<ffffffff8147b892>] ? mutex_lock+0x29/0x50
      [ 9792.655039]  [<ffffffffa03d41b1>] btrfs_update_delayed_inode+0x72/0x137 [btrfs]
      [ 9792.655051]  [<ffffffffa03d4ea2>] btrfs_run_delayed_items+0x90/0xdb [btrfs]
      [ 9792.655062]  [<ffffffffa039a69b>] btrfs_commit_transaction+0x228/0x654 [btrfs]
      [ 9792.655064]  [<ffffffff8106e8da>] ? remove_wait_queue+0x3a/0x3a
      [ 9792.655075]  [<ffffffffa03a2fa5>] btrfs_evict_inode+0x14d/0x202 [btrfs]
      [ 9792.655077]  [<ffffffff81132bd6>] evict+0x71/0x111
      [ 9792.655079]  [<ffffffff81132de0>] iput+0x12a/0x132
      [ 9792.655081]  [<ffffffff8112aa3a>] do_unlinkat+0x106/0x155
      [ 9792.655083]  [<ffffffff81127b83>] ? path_put+0x1f/0x23
      [ 9792.655085]  [<ffffffff8109c53c>] ? audit_syscall_entry+0x145/0x171
      [ 9792.655087]  [<ffffffff81128410>] ? putname+0x34/0x36
      [ 9792.655090]  [<ffffffff8112b441>] sys_unlinkat+0x29/0x2b
      [ 9792.655092]  [<ffffffff81482c42>] system_call_fastpath+0x16/0x1b
      [ 9792.655093] ---[ end trace 02b696eb02b3f768 ]---
      
      This patch fix it by setting the reservation of the transaction handle to the
      correct one.
      Reported-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      19fd2949
    • M
      btrfs: Remove unused sysfs code · 9fe6a50f
      Maarten Lankhorst 提交于
      Removes code no longer used. The sysfs file itself is kept, because the
      btrfs developers expressed interest in putting new entries to sysfs.
      Signed-off-by: NMaarten Lankhorst <m.b.lankhorst@gmail.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9fe6a50f
    • D
      btrfs: fix dereference of ERR_PTR value · 3ed4498c
      David Sterba 提交于
      smatch reports:
      
      btrfs_recover_log_trees error: 'wc.replay_dest' dereferencing
      possible ERR_PTR()
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3ed4498c
    • C
      Btrfs: fix relocation races · 7585717f
      Chris Mason 提交于
      The recent commit to get rid of our trans_mutex introduced
      some races with block group relocation.  The problem is that relocation
      needs to do some record keeping about each root, and it was relying
      on the transaction mutex to coordinate things in subtle ways.
      
      This fix adds a mutex just for the relocation code and makes sure
      it doesn't have a big impact on normal operations.  The race is
      really fixed in btrfs_record_root_in_trans, which is where we
      step back and wait for the relocation code to finish accounting
      setup.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7585717f
  7. 16 6月, 2011 3 次提交
    • J
      Btrfs: set no_trans_join after trying to expand the transaction · ed0ca140
      Josef Bacik 提交于
      We can lockup if we try to allow new writers join the transaction and we have
      flushoncommit set or have a pending snapshot.  This is because we set
      no_trans_join and then loop around and try to wait for ordered extents again.
      The problem is the ordered endio stuff needs to join the transaction, which it
      can't do because no_trans_join is set.  So instead wait until after this loop to
      set no_trans_join and then make sure to wait for num_writers == 1 in case
      anybody got started in between us exiting the loop and setting no_trans_join.
      This could easily be reproduced by mounting -o flushoncommit and running xfstest
      13.  It cannot be reproduced with this patch.  Thanks,
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      ed0ca140
    • J
      Btrfs: protect the pending_snapshots list with trans_lock · 8351583e
      Josef Bacik 提交于
      Currently there is nothing protecting the pending_snapshots list on the
      transaction.  We only hold the directory mutex that we are snapshotting and a
      read lock on the subvol_sem, so we could race with somebody else creating a
      snapshot in a different directory and end up with list corruption.  So protect
      this list with the trans_lock.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8351583e
    • J
      Btrfs: fix path leakage on subvol deletion · 71d7aed0
      Josef Bacik 提交于
      The delayed ref patch accidently removed the btrfs_free_path in
      btrfs_unlink_subvol, this puts it back and means we don't leak a path.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      71d7aed0
  8. 13 6月, 2011 2 次提交
  9. 11 6月, 2011 8 次提交
  10. 10 6月, 2011 4 次提交
  11. 09 6月, 2011 4 次提交
    • J
      Btrfs: unlock the trans lock properly · 3473f3c0
      Josef Bacik 提交于
      In btrfs_wait_for_commit if we came upon a transaction that had committed we
      just exited, but that's bad since we are holding the trans_lock.  So break
      instead so that the lock is dropped.  Thanks,
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3473f3c0
    • J
      Btrfs: don't map extent buffer if path->skip_locking is set · 25b8b936
      Josef Bacik 提交于
      Arne's scrub stuff exposed a problem with mapping the extent buffer in
      reada_for_search.  He searches the commit root with multiple threads and with
      skip_locking set, so we can race and overwrite node->map_token since node isn't
      locked.  So fix this so that we only map the extent buffer if we don't already
      have a map_token and skip_locking isn't set.  Without this patch scrub would
      panic almost immediately, with the patch it doesn't panic anymore.  Thanks,
      Reported-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      25b8b936
    • J
      Btrfs: fix duplicate checking logic · f6a39829
      Josef Bacik 提交于
      When merging my code into the integration test the second check for duplicate
      entries got screwed up.  This patch fixes it by dropping ret2 and just using ret
      for the return value, and checking if we got an error before adding the bitmap
      to the local list.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      f6a39829
    • J
      Btrfs: fix the allocator loop logic · 723bda20
      Josef Bacik 提交于
      I was testing with empty_cluster = 0 to try and reproduce a problem and kept
      hitting early enospc panics.  This was because our loop logic was a little
      confused.  So this is what I did
      
      1) Make the loop variable the ultimate decider on wether we should loop again
      isntead of checking to see if we had an uncached bg, empty size or empty
      cluster.
      
      2) Increment loop before checking to see what we are on to make the loop
      definitions make more sense.
      
      3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
      unless we didn't actually allocate a chunk.  If we did allocate a chunk we
      should be able to easily setup a new cluster so clearing
      empty_size/empty_cluster makes us less efficient.
      
      This kept me from hitting panics while trying to reproduce the other problem.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      723bda20