1. 18 9月, 2011 2 次提交
    • J
      Btrfs: only clear the need lookup flag after the dentry is setup · a66e7cc6
      Josef Bacik 提交于
      We can race with readdir and the RCU path walking stuff.  This is because we
      clear the need lookup flag before actually instantiating the inode.  This will
      lead the RCU path walk stuff to find a dentry it thinks is valid without a
      d_inode attached.  So instead unhash the dentry when we first start the lookup,
      and then clear the flag after we've instantiated the dentry so we're garunteed
      to either try the slow lookup, or have the d_inode set properly.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a66e7cc6
    • H
      btrfs: fix d_off in the first dirent · 3765fefa
      Hidetoshi Seto 提交于
      Since the d_off in the first dirent for "." (that originates from
      the 4th argument "offset" of filldir() for the 2nd dirent for "..")
      is wrongly assigned in btrfs_real_readdir(), telldir returns same
      offset for different locations.
      
       | # mkfs.btrfs /dev/sdb1
       | # mount /dev/sdb1 fs0
       | # cd fs0
       | # touch file0 file1
       | # ../test
       | telldir: 0
       | readdir: d_off = 2, d_name = "."
       | telldir: 2
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       | telldir: 3
       | readdir: d_off = 2147483647, d_name = "file1"
       | telldir: 2147483647
      
      To fix this problem, pass filp->f_pos (which is loff_t) instead.
      
       | # ../test
       | telldir: 0
       | readdir: d_off = 1, d_name = "."
       | telldir: 1
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       :
      
      At the moment the "offset" for "." is unused because there is no
      preceding dirent, however it is better to pass filp->f_pos to follow
      grammatical usage.
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3765fefa
  2. 11 9月, 2011 3 次提交
    • M
      Btrfs: fix wrong nbytes information of the inode · a39f7521
      Miao Xie 提交于
      If we write some data into the data hole of the file(no preallocation for this
      hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
      the other element--disk_i_size needn't be updated. At this condition, we must
      update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
      return 1).
      
       # mkfs.btrfs /dev/sdb1
       # mount /dev/sdb1 /mnt
       # touch /mnt/a
       # truncate -s 856002 /mnt/a
       # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
       # umount /mnt
       # btrfsck /dev/sdb1
       root 5 inode 257 errors 400
       found 32768 bytes used err is 1
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a39f7521
    • M
      Btrfs: fix unclosed transaction handle in btrfs_cont_expand · 5b397377
      Miao Xie 提交于
      The function - btrfs_cont_expand() forgot to close the transaction handle before
      it jump out the while loop. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5b397377
    • S
      btrfs: fix warning in iput for bad-inode · e0b6d65b
      Sergei Trofimovich 提交于
      iput() shouldn't be called for inodes in I_NEW state.
      We need to mark inode as constructed first.
      
      WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
      Call Trace:
       [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
       [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
       [<ffffffff810eaf0b>] iput+0x20b/0x210
       [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
       [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
       [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
       [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
       [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
       [<ffffffff8105af86>] kthread+0x96/0xa0
       [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
       [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
       [<ffffffff814336d0>] ? gs_change+0xb/0xb
      Signed-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      CC: Josef Bacik <josef@redhat.com>
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e0b6d65b
  3. 18 8月, 2011 1 次提交
    • J
      btrfs: btrfs_permission's RO check shouldn't apply to device nodes · cb6db4e5
      Jeff Mahoney 提交于
      This patch tightens the read-only access checks in btrfs_permission to
       match the constraints in inode_permission. Currently, even though the
       device node itself will be unmodified, read-write access to device nodes
       is denied to when the device node resides on a read-only subvolume or a
       is a file that has been marked read-only by the btrfs conversion utility.
      
       With this patch applied, the check only affects regular files,
       directories, and symlinks. It also restructures the code a bit so that
       we don't duplicate the MAY_WRITE check for both tests.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cb6db4e5
  4. 02 8月, 2011 3 次提交
  5. 01 8月, 2011 1 次提交
    • J
      Btrfs: load the key from the dir item in readdir into a fake dentry · b4aff1f8
      Josef Bacik 提交于
      In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
      sequential order and works out brilliantly for readdir.  However if you use ls,
      it usually stat's each file it gets from readdir.  This is where the second
      index comes in, which is based on a hash of the name of the file.  So then the
      lookup has to lookup this index, and then lookup the inode.  The index lookup is
      going to be in random order (since its based on the name hash), which gives us
      less than stellar performance.  Since we know the inode location from the
      readdir index, I create a dummy dentry and copy the location key into
      dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
      lookup the inode, avoiding looking up the other directory index.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b4aff1f8
  6. 28 7月, 2011 4 次提交
    • C
      Btrfs: use the commit_root for reading free_space_inode crcs · 2cf8572d
      Chris Mason 提交于
      Now that we are using regular file crcs for the free space cache,
      we can deadlock if we try to read the free_space_inode while we are
      updating the crc tree.
      
      This commit fixes things by using the commit_root to read the crcs.  This is
      safe because we the free space cache file would already be loaded if
      that block group had been changed in the current transaction.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2cf8572d
    • C
      Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason 提交于
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a6591715
    • J
      Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik 提交于
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do it's allocation and you
      get those lovely warnings.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9e0baf60
    • J
      Btrfs: use find_or_create_page instead of grab_cache_page · a94733d0
      Josef Bacik 提交于
      grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
      GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
      need GFP_NOFS so we don't deadlock.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      a94733d0
  7. 27 7月, 2011 1 次提交
  8. 26 7月, 2011 1 次提交
  9. 21 7月, 2011 1 次提交
  10. 20 7月, 2011 5 次提交
  11. 15 7月, 2011 3 次提交
    • M
      btrfs: Don't BUG_ON alloc_path errors in btrfs_read_locked_inode · 1748f843
      Mark Fasheh 提交于
      btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
      are caught and bubbled back up.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      1748f843
    • M
      btrfs: Don't BUG_ON alloc_path errors in btrfs_truncate_inode_items · 0eb0e19c
      Mark Fasheh 提交于
      I moved the path allocation up a few lines to the top of the function so
      that we couldn't get into the state where we've dropped delayed items and
      the extent cache but fail due to -ENOMEM.
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      0eb0e19c
    • M
      btrfs: don't BUG_ON btrfs_alloc_path() errors · d8926bb3
      Mark Fasheh 提交于
      This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
      failure. All the sites that are fixed in this patch were checked by me to
      be fairly trivial to fix because of at least one of two criteria:
      
       - Callers of the function catch errors from it already so bubbling the
         error up will be handled.
       - Callers of the function might BUG_ON any nonzero return code in which
         case there is no behavior changed (but we still got to remove a BUG_ON)
      
      The following functions were updated:
      
      btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
      btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
      btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
      insert_reserved_file_extent, and run_delalloc_nocow
      Signed-off-by: NMark Fasheh <mfasheh@suse.com>
      d8926bb3
  12. 07 7月, 2011 1 次提交
    • M
      btrfs: fix oops when doing space balance · 149e2d76
      Miao Xie 提交于
      We need to make sure the data relocation inode doesn't go through
      the delayed metadata updates, otherwise we get an oops during balance:
      
      kernel BUG at fs/btrfs/relocation.c:4303!
      [SNIP]
      Call Trace:
       [<ffffffffa03143fd>] ? update_ref_for_cow+0x22d/0x330 [btrfs]
       [<ffffffffa0314951>] __btrfs_cow_block+0x451/0x5e0 [btrfs]
       [<ffffffffa031355d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
       [<ffffffffa0314beb>] btrfs_cow_block+0x10b/0x240 [btrfs]
       [<ffffffffa031acae>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
       [<ffffffffa032d8af>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
       [<ffffffff8147bf0e>] ? mutex_lock+0x1e/0x50
       [<ffffffffa0380cf1>] btrfs_update_delayed_inode+0x71/0x160 [btrfs]
       [<ffffffffa037ff27>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs]
       [<ffffffffa0381cf8>] btrfs_run_delayed_items+0xe8/0x120 [btrfs]
       [<ffffffffa03365e0>] btrfs_commit_transaction+0x250/0x850 [btrfs]
       [<ffffffff810f91d9>] ? find_get_pages+0x39/0x130
       [<ffffffffa0336cd5>] ? join_transaction+0x25/0x250 [btrfs]
       [<ffffffff81081de0>] ? wake_up_bit+0x40/0x40
       [<ffffffffa03785fa>] prepare_to_relocate+0xda/0xf0 [btrfs]
       [<ffffffffa037f2bb>] relocate_block_group+0x4b/0x620 [btrfs]
       [<ffffffffa0334cf5>] ? btrfs_clean_old_snapshots+0x35/0x150 [btrfs]
       [<ffffffffa037fa43>] btrfs_relocate_block_group+0x1b3/0x2e0 [btrfs]
       [<ffffffffa0368ec0>] ? btrfs_tree_unlock+0x50/0x50 [btrfs]
       [<ffffffffa035e39b>] btrfs_relocate_chunk+0x8b/0x670 [btrfs]
       [<ffffffffa031303d>] ? btrfs_set_path_blocking+0x3d/0x50 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa031bea1>] ? btrfs_previous_item+0xb1/0x150 [btrfs]
       [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
       [<ffffffffa035f5aa>] btrfs_balance+0x21a/0x2b0 [btrfs]
       [<ffffffffa0368898>] btrfs_ioctl+0x798/0xd20 [btrfs]
       [<ffffffff8111e358>] ? handle_mm_fault+0x148/0x270
       [<ffffffff814809e8>] ? do_page_fault+0x1d8/0x4b0
       [<ffffffff81160d6a>] do_vfs_ioctl+0x9a/0x540
       [<ffffffff811612b1>] sys_ioctl+0xa1/0xb0
       [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
      [SNIP]
      RIP  [<ffffffffa037c1cc>] btrfs_reloc_cow_block+0x22c/0x270 [btrfs]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      149e2d76
  13. 27 6月, 2011 1 次提交
    • M
      btrfs: fix inconsonant inode information · 2f7e33d4
      Miao Xie 提交于
      When iputting the inode, We may leave the delayed nodes if they have some
      delayed items that have not been dealt with. So when the inode is read again,
      we must look up the relative delayed node, and use the information in it to
      initialize the inode. Or we will get inconsonant inode information, it may
      cause that the same directory index number is allocated again, and hit the
      following oops:
      
      [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
      insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
      [ 5447.569766] ------------[ cut here ]------------
      [ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
      [SNIP]
      [ 5447.790721] Call Trace:
      [ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
      [ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
      [ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
      [ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
      [ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
      [ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
      [ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
      [ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
      [ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
      [ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
      [ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
      [ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
      [ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
      [ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
      [ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
      [ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
      [ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
      [ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
      
      Fix it by reusing the old delayed node.
      Reported-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Tested-by: NJim Schutt <jaschut@sandia.gov>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2f7e33d4
  14. 25 6月, 2011 1 次提交
  15. 16 6月, 2011 1 次提交
  16. 11 6月, 2011 2 次提交
  17. 04 6月, 2011 1 次提交
  18. 27 5月, 2011 2 次提交
    • C
      fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Christoph Hellwig 提交于
      Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      aa385729
    • C
      Btrfs: add mount -o auto_defrag · 4cb5300b
      Chris Mason 提交于
      This will detect small random writes into files and
      queue the up for an auto defrag process.  It isn't well suited to
      database workloads yet, but works for smaller files such as rpm, sqlite
      or bdb databases.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4cb5300b
  19. 26 5月, 2011 3 次提交
  20. 24 5月, 2011 3 次提交