1. 06 Nov, 2011 1 commit
    • Btrfs: release metadata from global reserve if we have to fallback for unlink · 5a77d76c
      Josef Bacik committed
      I fixed a problem where we weren't reserving space for an orphan item when we
      had to fall back to using the global reserve for an unlink, but I introduced
      another problem.  I was migrating the bytes from the transaction reserve to the
      global reserve and then releasing from the global reserve in
      btrfs_end_transaction().  The problem with this is that a migrate will jack up
      the size for the destination, but leave the size alone for the source, with the
      idea that you can do a release normally on the source and it all washes out, and
      then you can do a release again on the destination and it works out right.  My
      way was skipping the release on the trans_block_rsv which still had the jacked
      up size from our original reservation.  So instead release manually from the
      global reserve if this transaction was using it, and then set the
      trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
      cleans everything up properly.  With this patch xfstest 83 doesn't emit warnings
      about leaking space.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
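The size/reserved bookkeeping described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `struct toy_rsv`, `toy_reserve`, `toy_migrate`, and `toy_release` are hypothetical names, not the kernel's functions, and the real block reserve code also deals with locking and other fields.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a block reserve (hypothetical, not the kernel struct):
 * "size" is what the reserve is expected to hold, "reserved" is what it
 * actually holds right now. */
struct toy_rsv {
    uint64_t size;
    uint64_t reserved;
};

/* A normal reservation bumps both counters on the reserve. */
static void toy_reserve(struct toy_rsv *rsv, uint64_t bytes)
{
    rsv->size += bytes;
    rsv->reserved += bytes;
}

/* A migrate moves the reserved bytes and grows the destination's size,
 * but leaves the source's size alone, as the commit message describes. */
static void toy_migrate(struct toy_rsv *src, struct toy_rsv *dst,
                        uint64_t bytes)
{
    assert(src->reserved >= bytes);
    src->reserved -= bytes;
    dst->reserved += bytes;
    dst->size += bytes;
}

/* A release shrinks the size and frees whatever is now reserved beyond
 * it. Returns the number of bytes handed back. */
static uint64_t toy_release(struct toy_rsv *rsv, uint64_t bytes)
{
    uint64_t freed = 0;

    if (bytes > rsv->size)
        bytes = rsv->size;
    rsv->size -= bytes;
    if (rsv->reserved > rsv->size) {
        freed = rsv->reserved - rsv->size;
        rsv->reserved = rsv->size;
    }
    return freed;
}
```

In this model, skipping `toy_release()` on the source after a migrate leaves its `size` inflated, which is the kind of leftover accounting the message attributes to the earlier fix.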
  2. 21 Oct, 2011 2 commits
    • Btrfs: fix direct-io vs nodatacow · f0dd9592
      Li Zefan committed
      To reproduce the bug:
      
        # mount -o nodatacow /dev/sda7 /mnt/
        # dd if=/dev/zero of=/mnt/tmp bs=4K count=1
        1+0 records in
        1+0 records out
        4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
        # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
        dd: writing `/mnt/tmp': Input/output error
        1+0 records in
        0+0 records out
      
      btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
      mistakenly takes it as an error.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
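The bug class here is a caller treating any non-zero return as a failure when a positive value is a legitimate "nothing to update" result. A minimal sketch, assuming a hypothetical helper `toy_update_i_size()` that mimics only the return convention described above (the `toy_*` names and the choice of `-EIO` in the buggy caller are illustrative, not the kernel code):

```c
#include <errno.h>

/* Hypothetical helper mimicking the return convention in the message:
 *   0  -> the on-disk size was advanced
 *   1  -> the on-disk size did not need to change
 *  <0  -> a real error (negative errno) */
static int toy_update_i_size(long *disk_i_size, long new_end)
{
    if (new_end < 0)
        return -EINVAL;
    if (new_end <= *disk_i_size)
        return 1;
    *disk_i_size = new_end;
    return 0;
}

/* Buggy caller: any non-zero return is reported as an I/O error. */
static int toy_endio_buggy(long *disk_i_size, long new_end)
{
    int ret = toy_update_i_size(disk_i_size, new_end);

    if (ret)
        return -EIO;
    return 0;
}

/* Fixed caller: only negative values are errors. */
static int toy_endio_fixed(long *disk_i_size, long new_end)
{
    int ret = toy_update_i_size(disk_i_size, new_end);

    if (ret < 0)
        return ret;
    return 0;
}
```

Overwriting an existing 4K block in place leaves the size unchanged, so the buggy caller reports an error for what is a successful write, matching the `dd` reproduction above.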
    • Btrfs: remove BUG_ON() in compress_file_range() · 560f7d75
      Li Zefan committed
      It's not a big deal if we fail to allocate the array, and instead of
      panicking we can just give up compressing.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
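The pattern is to treat an allocation failure on an optional path as "skip the optimization" rather than a fatal condition. A small user-space sketch, assuming a hypothetical `toy_compress_range()` that takes an injectable allocator so the failure path can be exercised (none of these names come from the kernel):

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocator signature compatible with calloc(), injectable for testing. */
typedef void *(*toy_alloc_fn)(size_t nmemb, size_t size);

enum toy_result { TOY_COMPRESSED, TOY_UNCOMPRESSED };

/* Hypothetical sketch: allocate the page-pointer array, and if that
 * fails, give up on compression instead of aborting. Assumes "alloc"
 * returns memory that free() can release. */
static enum toy_result toy_compress_range(size_t nr_pages, toy_alloc_fn alloc)
{
    void **pages = alloc(nr_pages, sizeof(*pages));

    if (!pages)
        return TOY_UNCOMPRESSED;  /* fall back to writing the data as-is */

    /* ... a real implementation would compress into "pages" here ... */

    free(pages);
    return TOY_COMPRESSED;
}

/* Test allocator that always fails. */
static void *toy_failing_alloc(size_t nmemb, size_t size)
{
    (void)nmemb;
    (void)size;
    return NULL;
}
```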
  3. 20 Oct, 2011 19 commits
    • Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions · 36ba022a
      Josef Bacik committed
      Currently btrfs_block_rsv_check does 2 things, it will either refill a block
      reserve like in the truncate or refill case, or it will check to see if there is
      enough space in the global reserve and possibly refill it.  However because of
      overcommit we could be well overcommitting ourselves just to try and refill the
      global reserve, when really we should just be committing the transaction.  So
      break this out into btrfs_block_rsv_refill and btrfs_block_rsv_check.  Refill
      will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
      it will only tell you if the factor of the total space is still reserved.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: reserve some space for an orphan item when unlinking · 3880a1b4
      Josef Bacik committed
      In __unlink_start_trans() if we don't have enough room for a reservation we will
      check to see if the unlink will free up space.  If it does that's great, but we
      could still add an orphan item, so we need to reserve enough space to add
      the orphan item.  Do this and migrate the space to the global reserve so it all
      works out right.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix the amount of space reserved for unlink · e70bea5f
      Josef Bacik committed
      Our unlink reservations were a bit much, we were reserving 10 and I only count 8
      possible items we're touching, so comment what we're reserving for and fix the
      count value.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: inline checksums into the disk free space cache · 5b0e95bf
      Josef Bacik committed
      Yeah yeah I know this is how we used to do it and then I changed it, but damnit
      I'm changing it back.  The fact is that writing out checksums will modify
      metadata, which could cause us to dirty a block group we've already written out,
      so we have to truncate it and all of its checksums and re-write it which will
      write new checksums which could dirty a block group that has already been
      written and you see where I'm going with this?  This can cause unmount or really
      anything that depends on a transaction to commit to take its sweet damned time
      to happen.  So go back to the way it was, only this time we're specifically
      setting NODATACOW because we can't go through the COW pathway anyway and we're
      doing our own built-in cow'ing by truncating the free space cache.  The other
      new thing is once we truncate the old cache and preallocate the new space, we
      don't need to do that song and dance at all for the rest of the transaction, we
      can just overwrite the existing space with the new cache if the block group
      changes for whatever reason, and the NODATACOW will let us do this fine.  So
      keep track of which transaction we last cleared our cache in and if we cleared
      it in this transaction just say we're all setup and carry on.  This survives
      xfstests and stress.sh.
      
      The inode cache will continue to use the normal csum infrastructure since it
      only gets written once and there will be no more modifications to the fs tree in
      a transaction commit.
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: break out of orphan cleanup if we can't make progress · 8f6d7f4f
      Josef Bacik committed
      I noticed while running xfstests 83 that if we didn't have enough space to
      delete our inode the orphan cleanup would just loop.  This is because it keeps
      finding the same orphan item and keeps trying to kill it but can't because we
      don't get an error back from iput for deleting the inode.  So keep track of the
      last guy we tried to kill, if it's the same as the one we're trying to kill
      currently we know we are having problems and can just error out.  I don't have a
      way to test this so look hard and make sure it's right.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
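The progress check described here amounts to remembering the last item the loop tried to remove and bailing out if the same item comes back. A minimal sketch, assuming a hypothetical in-memory orphan list whose delete step can fail silently (the `toy_*` names and the `-EINVAL` return value are assumptions for illustration):

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical orphan list: ids[0..count) are pending orphans. Ids are
 * assumed unique and never equal to UINT64_MAX. */
struct toy_orphans {
    uint64_t ids[8];
    size_t count;
    int can_delete;  /* 0 simulates "not enough space to delete" */
};

/* Like the iput() path in the message, this reports no error when the
 * delete cannot happen; the orphan simply stays on the list. */
static void toy_try_delete(struct toy_orphans *o)
{
    if (o->can_delete && o->count > 0)
        o->count--;
}

/* Walk the list, erroring out if the same orphan shows up twice in a
 * row, which means the previous delete made no progress. */
static int toy_orphan_cleanup(struct toy_orphans *o)
{
    uint64_t last = UINT64_MAX;  /* sentinel: nothing tried yet */

    while (o->count > 0) {
        uint64_t cur = o->ids[o->count - 1];

        if (cur == last)
            return -EINVAL;  /* stuck on the same item, stop looping */
        last = cur;
        toy_try_delete(o);
    }
    return 0;
}
```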
    • Btrfs: use the global reserve as a backup for deleting inodes · 726c35fa
      Josef Bacik committed
      Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
      with the mixed block group stuff.  Because of this we can run into a situation
      where we don't have enough space to delete inodes, or even worse we can't free
      the inodes when we next mount the fs which causes the orphan code to lose its
      mind.  So if we fail to make our reservation, steal from the global reserve.
      The global reserve will end up taking up the entire rest of the free space on
      the fs in this worst case so there really is no other option.  With this patch
      test 83 doesn't freak out.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix orphan cleanup regression · a8c9e576
      Josef Bacik committed
      In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
      code, since it expects to get a bad inode back.  So fix it to deal with getting
      -ESTALE back by deleting the orphan item manually and moving on.  Thanks,
      Reported-by: Simon Kirby <sim@hostway.ca>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: use the inode's mapping mask for allocating pages · 3b16a4e3
      Josef Bacik committed
      Johannes pointed out we were allocating only kernel pages for doing writes,
      which is kind of a big deal if you are on 32bit and have more than a gig of ram.
      So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
      don't re-enter.  Thanks,
      Reported-by: Johannes Weiner <jweiner@redhat.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: stop passing a trans handle all around the reservation code · 4a92b1b8
      Josef Bacik committed
      The only thing that we need to have a trans handle for is in
      reserve_metadata_bytes and that's to know how much flushing we can do.  So
      instead of passing it around, just check current->journal_info for a
      trans_handle so we know if we can commit a transaction to try and free up space
      or not.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: handle enospc accounting for free space inodes · c09544e0
      Josef Bacik committed
      Since free space inodes now use normal checksumming we need to make sure to
      account for their metadata use.  So reserve metadata space, and then if we fail
      to write out the metadata we can just release it, otherwise it will be freed up
      when the io completes.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: set truncate block rsv's size · 4a338542
      Josef Bacik committed
      While debugging a different issue I noticed that we were always reserving space
      when we tried to use our truncate block rsv's.  This is because they didn't have
      a ->size value, so use_block_rsv just assumes there is nothing reserved and it
      does a reserve_metadata_bytes.  This is because btrfs_check_block_rsv() doesn't
      actually add to the size of the block rsv.  That seems to be the right thing to
      do so set ->size to the minimum truncate size we need, since we will always only
      refill to that size anyway, and this way everything works out correctly.
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check · 482e6dc5
      Josef Bacik committed
      If you run xfstest 224 you will get lots of messages about not being able to
      delete inodes and that they will be cleaned up next mount.  This is because
      btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
      flush, so if there was not enough space, it simply failed.  But in the truncate and
      evict cases we could easily flush space to try and get enough space to do our
      work, so make btrfs_block_rsv_check take a flush argument to pass down to
      reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
      complaints.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: reduce the amount of space needed for truncates · 07127184
      Josef Bacik committed
      With btrfs_truncate_inode_items we always return if we have to go to another
      leaf, which makes us do our reservation again.  This means we will only ever
      modify one leaf at a time, so we only need 1 item's worth of slack space.  Also,
      since we are deleting we will not be creating nodes as we go down, if anything
      we'll be freeing them as we merge them together, so make a different
      calculation for truncate which will only have the worst case usage of COW'ing
      the entire path down to the leaf.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: optimize how we account for space in truncate · 907cbceb
      Josef Bacik committed
      Currently we're starting and stopping a transaction for no real reason, so kill
      that and just reserve enough space as if we can truncate all in one transaction.
      Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
      we may have to allocate for our slack space.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix how we reserve space for deleting inodes · 4289a667
      Josef Bacik committed
      I converted btrfs_truncate to do sane reservations for truncate, but didn't
      convert btrfs_evict_inode.  Basically we need to save the orphan_rsv for
      deleting the orphan item, and do normal reservations for our truncate.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: kill the durable block rsv stuff · 37be25bc
      Josef Bacik committed
      This is confusing code and isn't used by anything anymore, so delete it.
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: kill the orphan space calculation for snapshots · dba68306
      Josef Bacik committed
      This patch kills off the calculation for the amount of space needed for the
      orphan operations during a snapshot.  The thing is we only do snapshots on
      commit, so any space that is in the block_rsv->freed[] isn't going to be in the
      new snapshot anyway, so there isn't any reason to require that space to be
      reserved for the snapshot to occur.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: calculate checksum space correctly · 7709cde3
      Josef Bacik committed
      We have not been reserving enough space for checksums.  We were just reserving
      bytes for the checksum items themselves, we were not taking into account having
      to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
      keeping track of the number of bytes outstanding we have for checksums.  Then we
      calculate how many leaves would be required for the checksums we are given and
      use that to reserve space.  This adds a significant amount of bytes to our
      reservations, but we will handle this later.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: kill reserved_bytes in inode · 0cbbdf7c
      Josef Bacik committed
      reserved_bytes is not used for anything in the inode, remove it.
      Signed-off-by: Josef Bacik <josef@redhat.com>
  4. 18 Sep, 2011 2 commits
    • Btrfs: only clear the need lookup flag after the dentry is setup · a66e7cc6
      Josef Bacik committed
      We can race with readdir and the RCU path walking stuff.  This is because we
      clear the need lookup flag before actually instantiating the inode.  This will
      lead the RCU path walk stuff to find a dentry it thinks is valid without a
      d_inode attached.  So instead unhash the dentry when we first start the lookup,
      and then clear the flag after we've instantiated the dentry so we're guaranteed
      to either try the slow lookup, or have the d_inode set properly.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix d_off in the first dirent · 3765fefa
      Hidetoshi Seto committed
      Since the d_off in the first dirent for "." (that originates from
      the 4th argument "offset" of filldir() for the 2nd dirent for "..")
      is wrongly assigned in btrfs_real_readdir(), telldir returns the same
      offset for different locations.
      
       | # mkfs.btrfs /dev/sdb1
       | # mount /dev/sdb1 fs0
       | # cd fs0
       | # touch file0 file1
       | # ../test
       | telldir: 0
       | readdir: d_off = 2, d_name = "."
       | telldir: 2
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       | telldir: 3
       | readdir: d_off = 2147483647, d_name = "file1"
       | telldir: 2147483647
      
      To fix this problem, pass filp->f_pos (which is loff_t) instead.
      
       | # ../test
       | telldir: 0
       | readdir: d_off = 1, d_name = "."
       | telldir: 1
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       :
      
      At the moment the "offset" for "." is unused because there is no
      preceding dirent, however it is better to pass filp->f_pos to follow
      grammatical usage.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
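The source of the `../test` program used above is not included in the commit message. A guess at what such a program could look like, using only `opendir()`, `readdir()`, `telldir()`, and `closedir()`: the function name `toy_dump_dir` is hypothetical, and the `d_off` field is assumed to be available as it is on Linux with glibc.

```c
#define _DEFAULT_SOURCE  /* exposes telldir() and d_off under strict -std modes */
#include <dirent.h>
#include <stdio.h>

/* Hypothetical stand-in for the "../test" program: print the directory
 * stream position before and after every entry, together with the d_off
 * value reported for that entry. Returns the number of entries seen, or
 * -1 if the directory cannot be opened. */
static int toy_dump_dir(const char *path, FILE *out)
{
    DIR *dir = opendir(path);
    struct dirent *ent;
    int count = 0;

    if (!dir)
        return -1;

    fprintf(out, "telldir: %ld\n", telldir(dir));
    while ((ent = readdir(dir)) != NULL) {
        fprintf(out, "readdir: d_off = %lld, d_name = \"%s\"\n",
                (long long)ent->d_off, ent->d_name);
        fprintf(out, "telldir: %ld\n", telldir(dir));
        count++;
    }
    closedir(dir);
    return count;
}
```

Calling `toy_dump_dir(".", stdout)` from inside the mounted directory would print lines in the same shape as the listings above. The actual offsets depend on the filesystem and kernel.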
  5. 11 Sep, 2011 3 commits
    • Btrfs: fix wrong nbytes information of the inode · a39f7521
      Miao Xie committed
      If we write some data into a hole in a file (with no preallocation for the
      hole), Btrfs will allocate some disk space and update nbytes of the inode, but
      the other field, disk_i_size, needn't be updated. In this case, we must
      update the inode metadata even though disk_i_size is unchanged
      (btrfs_ordered_update_i_size() returns 1).
      
       # mkfs.btrfs /dev/sdb1
       # mount /dev/sdb1 /mnt
       # touch /mnt/a
       # truncate -s 856002 /mnt/a
       # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
       # umount /mnt
       # btrfsck /dev/sdb1
       root 5 inode 257 errors 400
       found 32768 bytes used err is 1
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix unclosed transaction handle in btrfs_cont_expand · 5b397377
      Miao Xie committed
      btrfs_cont_expand() forgot to close the transaction handle before
      jumping out of the while loop. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: fix warning in iput for bad-inode · e0b6d65b
      Sergei Trofimovich committed
      iput() shouldn't be called for inodes in I_NEW state.
      We need to mark inode as constructed first.
      
      WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
      Call Trace:
       [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
       [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
       [<ffffffff810eaf0b>] iput+0x20b/0x210
       [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
       [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
       [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
       [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
       [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
       [<ffffffff8105af86>] kthread+0x96/0xa0
       [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
       [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
       [<ffffffff814336d0>] ? gs_change+0xb/0xb
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: David Sterba <dsterba@suse.cz>
      CC: Josef Bacik <josef@redhat.com>
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  6. 18 Aug, 2011 1 commit
    • btrfs: btrfs_permission's RO check shouldn't apply to device nodes · cb6db4e5
      Jeff Mahoney committed
      This patch tightens the read-only access checks in btrfs_permission to
      match the constraints in inode_permission. Currently, even though the
      device node itself will be unmodified, read-write access to device nodes
      is denied when the device node resides on a read-only subvolume or
      is a file that has been marked read-only by the btrfs conversion utility.
      
      With this patch applied, the check only affects regular files,
      directories, and symlinks. It also restructures the code a bit so that
      we don't duplicate the MAY_WRITE check for both tests.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  7. 02 Aug, 2011 3 commits
  8. 01 Aug, 2011 1 commit
    • Btrfs: load the key from the dir item in readdir into a fake dentry · b4aff1f8
      Josef Bacik committed
      In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
      sequential order and works out brilliantly for readdir.  However if you use ls,
      it usually stat's each file it gets from readdir.  This is where the second
      index comes in, which is based on a hash of the name of the file.  So then the
      lookup has to look up this index, and then look up the inode.  The index lookup is
      going to be in random order (since it's based on the name hash), which gives us
      less than stellar performance.  Since we know the inode location from the
      readdir index, I create a dummy dentry and copy the location key into
      dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
      look up the inode, avoiding looking up the other directory index.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 28 Jul, 2011 4 commits
    • Btrfs: use the commit_root for reading free_space_inode crcs · 2cf8572d
      Chris Mason committed
      Now that we are using regular file crcs for the free space cache,
      we can deadlock if we try to read the free_space_inode while we are
      updating the crc tree.
      
      This commit fixes things by using the commit_root to read the crcs.  This is
      safe because the free space cache file would already be loaded if
      that block group had been changed in the current transaction.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: stop using highmem for extent_buffers · a6591715
      Chris Mason committed
      The extent_buffers have a very complex interface where
      we use HIGHMEM for metadata and try to cache a kmap mapping
      to access the memory.
      
      The next commit adds reader/writer locks, and concurrent use
      of this kmap cache would make it even more complex.
      
      This commit drops the ability to use HIGHMEM with extent buffers,
      and rips out all of the related code.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix enospc problems with delalloc · 9e0baf60
      Josef Bacik committed
      So I had this brilliant idea to use atomic counters for outstanding and reserved
      extents, but this turned out to be a bad idea.  Consider this where we have 1
      outstanding extent and 1 reserved extent
      
      Reserver				Releaser
      					atomic_dec(outstanding) now 0
      atomic_read(outstanding)+1 get 1
      atomic_read(reserved) get 1
      don't actually reserve anything because
      they are the same
      					atomic_cmpxchg(reserved, 1, 0)
      atomic_inc(outstanding)
      atomic_add(0, reserved)
      					free reserved space for 1 extent
      
      Then the reserver now has no actual space reserved for it, and when it goes to
      finish the ordered IO it won't have enough space to do its allocation and you
      get those lovely warnings.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
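The interleaving in the table above can be replayed deterministically in a single thread to see where the counters end up. This is a sketch only: plain `int` fields stand in for the kernel's atomic counters, and `struct toy_counters` and `toy_replay_race` are hypothetical names used for illustration.

```c
/* Plain ints stand in for the atomic counters; the steps run in the
 * exact order shown in the Reserver/Releaser table. */
struct toy_counters {
    int outstanding;    /* outstanding extents */
    int reserved;       /* reserved extents */
    int freed_extents;  /* extents' worth of space the releaser freed */
};

static void toy_replay_race(struct toy_counters *c)
{
    int want, have, to_reserve, took = 0;

    /* Releaser: atomic_dec(outstanding) */
    c->outstanding--;

    /* Reserver: atomic_read(outstanding) + 1 and atomic_read(reserved) */
    want = c->outstanding + 1;
    have = c->reserved;
    to_reserve = want > have ? want - have : 0;  /* equal, so nothing */

    /* Releaser: atomic_cmpxchg(reserved, 1, 0) */
    if (c->reserved == 1) {
        c->reserved = 0;
        took = 1;
    }

    /* Reserver: atomic_inc(outstanding), atomic_add(0, reserved) */
    c->outstanding++;
    c->reserved += to_reserve;

    /* Releaser: free the reserved space for the extent it took */
    if (took)
        c->freed_extents++;
}
```

Starting from one outstanding and one reserved extent, the replay ends with one outstanding extent and zero reserved, which is the state the message describes: the reserver believes it is covered but nothing is actually held for it.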
    • Btrfs: use find_or_create_page instead of grab_cache_page · a94733d0
      Josef Bacik committed
      grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
      GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
      need GFP_NOFS so we don't deadlock.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  10. 27 Jul, 2011 1 commit
  11. 26 Jul, 2011 1 commit
  12. 21 Jul, 2011 1 commit
  13. 20 Jul, 2011 1 commit