1. 11 11月, 2011 2 次提交
    • M
      Btrfs: fix unreleased path in btrfs_orphan_cleanup() · 3254c876
      Miao Xie 提交于
      When we did stress test for the space relocation, the deadlock happened.
      By debugging, We found it was caused by the carelessness that we forgot
      to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
      before we end the transaction handle, so the transaction commit task waited
      the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
      but that task waited the commit task to end the transaction commit, and
      the deadlock happened. Fix it.
      Signed-ff-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3254c876
    • C
      Btrfs: tweak the delayed inode reservations again · 2115133f
      Chris Mason 提交于
      Josef sent along an incremental to the inode reservation
      code to make sure we try and fall back to directly updating
      the inode item if things go horribly wrong.
      
      This reworks that patch slightly, adding a fallback function
      that will always try to update the inode item directly without
      going through the delayed_inode code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      2115133f
  2. 09 11月, 2011 2 次提交
    • J
      Btrfs: fix our reservations for updating an inode when completing io · 7fd2ae21
      Josef Bacik 提交于
      People have been reporting ENOSPC crashes in finish_ordered_io.  This is because
      we try to steal from the delalloc block rsv to satisfy a reservation to update
      the inode.  The problem with this is we don't explicitly save space for updating
      the inode when doing delalloc.  This is kind of a problem and we've gotten away
      with this because way back when we just stole from the delalloc reserve without
      any questions, and this worked out fine because generally speaking the leaf had
      been modified either by the mtime update when we did the original write or
      because we just updated the leaf when we inserted the file extent item, only on
      rare occasions had the leaf not actually been modified, and that was still ok
      because we'd just use a block or two out of the over-reservation that is
      delalloc.
      
      Then came the delayed inode stuff.  This is amazing, except it wants a full
      reservation for updating the inode since it may do it at some point down the
      road after we've written the blocks and we have to recow everything again.  This
      worked out because the delayed inode stuff just stole from the global reserve,
      that is until recently when I changed that because it caused other problems.
      
      So here we are, we're doing everything right and being screwed for it.  So take
      an extra reservation for the inode at delalloc reservation time and carry it
      through the life of the delalloc reservation.  If we need it we can steal it in
      the delayed inode stuff.  If we have already stolen it try and do a normal
      metadata reservation.  If that fails try to steal from the delalloc reservation.
      If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
      solve this and in the meantime we'll steal from the global reserve.
      
      With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
      any problems.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7fd2ae21
    • C
      Btrfs: fix oops on NULL trans handle in btrfs_truncate · 917c16b2
      Chris Mason 提交于
      If we fail to reserve space in the transaction during truncate, we can
      error out with a NULL trans handle.  The cleanup code needs an extra
      check to make sure we aren't trying to use the bad handle.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      917c16b2
  3. 06 11月, 2011 2 次提交
    • D
      btrfs: separate superblock items out of fs_info · 6c41761f
      David Sterba 提交于
      fs_info has now ~9kb, more than fits into one page. This will cause
      mount failure when memory is too fragmented. Top space consumers are
      super block structures super_copy and super_for_commit, ~2.8kb each.
      Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
      
      Add a wrapper for freeing fs_info and all of it's dynamically allocated
      members.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      6c41761f
    • J
      Btrfs: release metadata from global reserve if we have to fallback for unlink · 5a77d76c
      Josef Bacik 提交于
      I fixed a problem where we weren't reserving space for an orphan item when we
      had to fallback to using the global reserve for an unlink, but I introduced
      another problem.  I was migrating the bytes from the transaction reserve to the
      global reserve and then releasing from the global reserve in
      btrfs_end_transaction().  The problem with this is that a migrate will jack up
      the size for the destination, but leave the size alone for the source, with the
      idea that you can do a release normally on the source and it all washes out, and
      then you can do a release again on the destination and it works out right.  My
      way was skipping the release on the trans_block_rsv which still had the jacked
      up size from our original reservation.  So instead release manually from the
      global reserve if this transaction was using it, and then set the
      trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
      cleans everything up properly.  With this patch xfstest 83 doesn't emit warnings
      about leaking space.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5a77d76c
  4. 21 10月, 2011 2 次提交
    • L
      Btrfs: fix direct-io vs nodatacow · f0dd9592
      Li Zefan 提交于
      To reproduce the bug:
      
        # mount -o nodatacow /dev/sda7 /mnt/
        # dd if=/dev/zero of=/mnt/tmp bs=4K count=1
        1+0 records in
        1+0 records out
        4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
        # dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
        dd: writing `/mnt/tmp': Input/output error
        1+0 records in
        0+0 records out
      
      btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
      mistakenly takes it as an error.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      f0dd9592
    • L
      Btrfs: remove BUG_ON() in compress_file_range() · 560f7d75
      Li Zefan 提交于
      It's not a big deal if we fail to allocate the array, and instead of
      panic we can just give up compressing.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      560f7d75
  5. 20 10月, 2011 19 次提交
    • J
      Btrfs: seperate out btrfs_block_rsv_check out into 2 different functions · 36ba022a
      Josef Bacik 提交于
      Currently btrfs_block_rsv_check does 2 things, it will either refill a block
      reserve like in the truncate or refill case, or it will check to see if there is
      enough space in the global reserve and possibly refill it.  However because of
      overcommit we could be well overcommitting ourselves just to try and refill the
      global reserve, when really we should just be committing the transaction.  So
      breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check.  Refill
      will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
      it will only tell you if the factor of the total space is still reserved.
      Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      36ba022a
    • J
      Btrfs: reserve some space for an orphan item when unlinking · 3880a1b4
      Josef Bacik 提交于
      In __unlink_start_trans() if we don't have enough room for a reservation we will
      check to see if the unlink will free up space.  If it does that's great, but we
      will still could add an orphan item, so we need to reserve enough space to add
      the orphan item.  Do this and migrate the space the global reserve so it all
      works out right.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3880a1b4
    • J
      Btrfs: fix the amount of space reserved for unlink · e70bea5f
      Josef Bacik 提交于
      Our unlink reservations were a bit much, we were reserving 10 and I only count 8
      possible items we're touching, so comment what we're reserving for and fix the
      count value.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      e70bea5f
    • J
      Btrfs: inline checksums into the disk free space cache · 5b0e95bf
      Josef Bacik 提交于
      Yeah yeah I know this is how we used to do it and then I changed it, but damnit
      I'm changing it back.  The fact is that writing out checksums will modify
      metadata, which could cause us to dirty a block group we've already written out,
      so we have to truncate it and all of it's checksums and re-write it which will
      write new checksums which could dirty a blockg roup that has already been
      written and you see where I'm going with this?  This can cause unmount or really
      anything that depends on a transaction to commit to take it's sweet damned time
      to happen.  So go back to the way it was, only this time we're specifically
      setting NODATACOW because we can't go through the COW pathway anyway and we're
      doing our own built-in cow'ing by truncating the free space cache.  The other
      new thing is once we truncate the old cache and preallocate the new space, we
      don't need to do that song and dance at all for the rest of the transaction, we
      can just overwrite the existing space with the new cache if the block group
      changes for whatever reason, and the NODATACOW will let us do this fine.  So
      keep track of which transaction we last cleared our cache in and if we cleared
      it in this transaction just say we're all setup and carry on.  This survives
      xfstests and stress.sh.
      
      The inode cache will continue to use the normal csum infrastructure since it
      only gets written once and there will be no more modifications to the fs tree in
      a transaction commit.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5b0e95bf
    • J
      Btrfs: break out of orphan cleanup if we can't make progress · 8f6d7f4f
      Josef Bacik 提交于
      I noticed while running xfstests 83 that if we didn't have enough space to
      delete our inode the orphan cleanup would just loop.  This is because it keeps
      finding the same orphan item and keeps trying to kill it but can't because we
      don't get an error back from iput for deleting the inode.  So keep track of the
      last guy we tried to kill, if it's the same as the one we're trying to kill
      currently we know we are having problems and can just error out.  I don't have a
      way to test this so look hard and make sure it's right.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      8f6d7f4f
    • J
      Btrfs: use the global reserve as a backup for deleting inodes · 726c35fa
      Josef Bacik 提交于
      Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
      with the mixed block group stuff.  Because of this we can run into a situation
      where we don't have enough space to delete inodes, or even worse we can't free
      the inodes when we next mount the fs which causes the orphan code to lose its
      mind.  So if we fail to make our reservation, steal from the global reserve.
      The global reserve will end up taking up the entire rest of the free space on
      the fs in this worst case so there really is no other option.  With this patch
      test 83 doesn't freak out.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      726c35fa
    • J
      Btrfs: fix orphan cleanup regression · a8c9e576
      Josef Bacik 提交于
      In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
      code, since it expects to get a bad inode back.  So fix it to deal with getting
      -ESTALE back by deleting the orphan item manually and moving on.  Thanks,
      Reported-by: NSimon Kirby <sim@hostway.ca>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      a8c9e576
    • J
      Btrfs: use the inode's mapping mask for allocating pages · 3b16a4e3
      Josef Bacik 提交于
      Johannes pointed out we were allocating only kernel pages for doing writes,
      which is kind of a big deal if you are on 32bit and have more than a gig of ram.
      So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
      don't re-enter.  Thanks,
      Reported-by: NJohannes Weiner <jweiner@redhat.com>
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      3b16a4e3
    • J
      Btrfs: stop passing a trans handle all around the reservation code · 4a92b1b8
      Josef Bacik 提交于
      The only thing that we need to have a trans handle for is in
      reserve_metadata_bytes and thats to know how much flushing we can do.  So
      instead of passing it around, just check current->journal_info for a
      trans_handle so we know if we can commit a transaction to try and free up space
      or not.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4a92b1b8
    • J
      Btrfs: handle enospc accounting for free space inodes · c09544e0
      Josef Bacik 提交于
      Since free space inodes now use normal checksumming we need to make sure to
      account for their metadata use.  So reserve metadata space, and then if we fail
      to write out the metadata we can just release it, otherwise it will be freed up
      when the io completes.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      c09544e0
    • J
      Btrfs: set truncate block rsv's size · 4a338542
      Josef Bacik 提交于
      While debugging a different issue I noticed that we were always reserving space
      when we tried to use our truncate block rsv's.  This is because they didn't have
      a ->size value, so use_block_rsv just assumes there is nothing reserved and it
      does a reserve_metadata_bytes.  This is because btrfs_check_block_rsv() doesn't
      actually add to the size of the block rsv.  That seems to be the right thing to
      do so set ->size to the minimum truncate size we need, since we will always only
      refill to that size anyway, and this way everything works out correctly.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4a338542
    • J
      Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check · 482e6dc5
      Josef Bacik 提交于
      If you run xfstest 224 it you will get lots of messages about not being able to
      delete inodes and that they will be cleaned up next mount.  This is because
      btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
      flush, so if there was not enough space, it simply failed.  But in truncate and
      evict case we could easily flush space to try and get enough space to do our
      work, so make btrfs_block_rsv_check take a flush argument to pass down to
      reserve_metadata_bytes.  Now xfstests 224 runs fine without all those
      complaints.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      482e6dc5
    • J
      Btrfs: reduce the amount of space needed for truncates · 07127184
      Josef Bacik 提交于
      With btrfs_truncate_inode_items we always return if we have to go to another
      leaf, which makes us do our reservation again.  This means we will only ever
      modify one leaf at a time, so we only need 1 items worth of slack space.  Also,
      since we are deleting we will not be creating nodes as we go down, if anything
      we'll be free'ing them as we merge them together, so make a different
      calculation for truncate which will only have the worst case useage of COW'ing
      the entire path down to the leaf.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      07127184
    • J
      Btrfs: optimize how we account for space in truncate · 907cbceb
      Josef Bacik 提交于
      Currently we're starting and stopping a transaction for no real reason, so kill
      that and just reserve enough space as if we can truncate all in one transaction.
      Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
      we may have to allocate for our slack space.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      907cbceb
    • J
      Btrfs: fix how we reserve space for deleting inodes · 4289a667
      Josef Bacik 提交于
      I converted btrfs_truncate to do sane reservations for truncate, but didn't
      convert btrfs_evict_inode.  Basically we need to save the orphan_rsv for
      deleting the orphan item, and do normal reservations for our truncate.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      4289a667
    • J
      Btrfs: kill the durable block rsv stuff · 37be25bc
      Josef Bacik 提交于
      This is confusing code and isn't used by anything anymore, so delete it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      37be25bc
    • J
      Btrfs: kill the orphan space calculation for snapshots · dba68306
      Josef Bacik 提交于
      This patch kills off the calculation for the amount of space needed for the
      orphan operations during a snapshot.  The thing is we only do snapshots on
      commit, so any space that is in the block_rsv->freed[] isn't going to be in the
      new snapshot anyway, so there isn't any reason to require that space to be
      reserved for the snapshot to occur.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      dba68306
    • J
      Btrfs: calculate checksum space correctly · 7709cde3
      Josef Bacik 提交于
      We have not been reserving enough space for checksums.  We were just reserving
      bytes for the checksum items themselves, we were not taking into account having
      to cow the tree and such.  This patch adds a csum_bytes counter to the inode for
      keeping track of the number of bytes outstanding we have for checksums.  Then we
      calculate how many leaves would be required for the checksums we are given and
      use that to reserve space.  This adds a significant amount of bytes to our
      reservations, but we will handle this later.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      7709cde3
    • J
      Btrfs: kill reserved_bytes in inode · 0cbbdf7c
      Josef Bacik 提交于
      reserved_bytes is not used for anything in the inode, remove it.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      0cbbdf7c
  6. 29 9月, 2011 3 次提交
    • J
      btrfs: Moved repair code from inode.c to extent_io.c · 4a54c8c1
      Jan Schmidt 提交于
      The raid-retry code in inode.c can be generalized so that it works for
      metadata as well. Thus, this patch moves it to extent_io.c and makes the
      raid-retry code a raid-repair code.
      
      Repair works that way: Whenever a read error occurs and we have more
      mirrors to try, note the failed mirror, and retry another. If we find a
      good one, check if we did note a failure earlier and if so, do not allow
      the read to complete until after the bad sector was written with the good
      data we just fetched. As we have the extent locked while reading, no one
      can change the data in between.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      4a54c8c1
    • J
      btrfs: Do not use bio->bi_bdev after submission · 1503140d
      Jan Schmidt 提交于
      The block layer modifies bio->bi_bdev and bio->bi_sector while working on
      the bio, they do _not_ come back unmodified in the completion callback.
      
      To call add_page, we need at least some bi_bdev set, which is why the code
      was working, previously. With this patch, we use the latest_bdev from
      fsinfo instead of the leftover in the bio. This gives us the possibility to
      use the bi_bdev field for another purpose.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      1503140d
    • J
      btrfs: add mirror_num to extent_read_full_page · 8ddc7d9c
      Jan Schmidt 提交于
      Currently, extent_read_full_page always assumes we are trying to read mirror
      0, which generally is the best we can do. To add flexibility, pass it as a
      parameter. This will be needed by scrub fixup code.
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      8ddc7d9c
  7. 18 9月, 2011 2 次提交
    • J
      Btrfs: only clear the need lookup flag after the dentry is setup · a66e7cc6
      Josef Bacik 提交于
      We can race with readdir and the RCU path walking stuff.  This is because we
      clear the need lookup flag before actually instantiating the inode.  This will
      lead the RCU path walk stuff to find a dentry it thinks is valid without a
      d_inode attached.  So instead unhash the dentry when we first start the lookup,
      and then clear the flag after we've instantiated the dentry so we're garunteed
      to either try the slow lookup, or have the d_inode set properly.
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a66e7cc6
    • H
      btrfs: fix d_off in the first dirent · 3765fefa
      Hidetoshi Seto 提交于
      Since the d_off in the first dirent for "." (that originates from
      the 4th argument "offset" of filldir() for the 2nd dirent for "..")
      is wrongly assigned in btrfs_real_readdir(), telldir returns same
      offset for different locations.
      
       | # mkfs.btrfs /dev/sdb1
       | # mount /dev/sdb1 fs0
       | # cd fs0
       | # touch file0 file1
       | # ../test
       | telldir: 0
       | readdir: d_off = 2, d_name = "."
       | telldir: 2
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       | telldir: 3
       | readdir: d_off = 2147483647, d_name = "file1"
       | telldir: 2147483647
      
      To fix this problem, pass filp->f_pos (which is loff_t) instead.
      
       | # ../test
       | telldir: 0
       | readdir: d_off = 1, d_name = "."
       | telldir: 1
       | readdir: d_off = 2, d_name = ".."
       | telldir: 2
       | readdir: d_off = 3, d_name = "file0"
       :
      
      At the moment the "offset" for "." is unused because there is no
      preceding dirent, however it is better to pass filp->f_pos to follow
      grammatical usage.
      Signed-off-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      3765fefa
  8. 11 9月, 2011 3 次提交
    • M
      Btrfs: fix wrong nbytes information of the inode · a39f7521
      Miao Xie 提交于
      If we write some data into the data hole of the file(no preallocation for this
      hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
      the other element--disk_i_size needn't be updated. At this condition, we must
      update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
      return 1).
      
       # mkfs.btrfs /dev/sdb1
       # mount /dev/sdb1 /mnt
       # touch /mnt/a
       # truncate -s 856002 /mnt/a
       # dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
       # umount /mnt
       # btrfsck /dev/sdb1
       root 5 inode 257 errors 400
       found 32768 bytes used err is 1
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a39f7521
    • M
      Btrfs: fix unclosed transaction handle in btrfs_cont_expand · 5b397377
      Miao Xie 提交于
      The function - btrfs_cont_expand() forgot to close the transaction handle before
      it jump out the while loop. Fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5b397377
    • S
      btrfs: fix warning in iput for bad-inode · e0b6d65b
      Sergei Trofimovich 提交于
      iput() shouldn't be called for inodes in I_NEW state.
      We need to mark inode as constructed first.
      
      WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
      Call Trace:
       [<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
       [<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
       [<ffffffff810eaf0b>] iput+0x20b/0x210
       [<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
       [<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
       [<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
       [<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
       [<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
       [<ffffffff8105af86>] kthread+0x96/0xa0
       [<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
       [<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
       [<ffffffff814336d0>] ? gs_change+0xb/0xb
      Signed-off-by: NSergei Trofimovich <slyfox@gentoo.org>
      CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Tested-by: NDavid Sterba <dsterba@suse.cz>
      CC: Josef Bacik <josef@redhat.com>
      CC: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      e0b6d65b
  9. 18 8月, 2011 1 次提交
    • J
      btrfs: btrfs_permission's RO check shouldn't apply to device nodes · cb6db4e5
      Jeff Mahoney 提交于
      This patch tightens the read-only access checks in btrfs_permission to
       match the constraints in inode_permission. Currently, even though the
       device node itself will be unmodified, read-write access to device nodes
       is denied to when the device node resides on a read-only subvolume or a
       is a file that has been marked read-only by the btrfs conversion utility.
      
       With this patch applied, the check only affects regular files,
       directories, and symlinks. It also restructures the code a bit so that
       we don't duplicate the MAY_WRITE check for both tests.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cb6db4e5
  10. 02 8月, 2011 3 次提交
  11. 01 8月, 2011 1 次提交
    • J
      Btrfs: load the key from the dir item in readdir into a fake dentry · b4aff1f8
      Josef Bacik 提交于
      In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
      sequential order and works out brilliantly for readdir.  However if you use ls,
      it usually stat's each file it gets from readdir.  This is where the second
      index comes in, which is based on a hash of the name of the file.  So then the
      lookup has to lookup this index, and then lookup the inode.  The index lookup is
      going to be in random order (since its based on the name hash), which gives us
      less than stellar performance.  Since we know the inode location from the
      readdir index, I create a dummy dentry and copy the location key into
      dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
      lookup the inode, avoiding looking up the other directory index.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b4aff1f8