1. 21 Oct, 2011 1 commit
    • B
      GFS2: Use rbtree for resource groups and clean up bitmap buffer ref count scheme · 7c9ca621
      Bob Peterson committed
      Here is an update of Bob's original rbtree patch which, in addition, also
      resolves the rather strange ref counting that was being done relating to
      the bitmap blocks.
      
      Originally we had a dual system for journaling resource groups. The metadata
      blocks were journaled and also the rgrp itself was added to a list. The reason
      for adding the rgrp to the list in the journal was so that the "repolish
      clones" code could be run to update the free space, and potentially send any
      discard requests when the log was flushed. This was done by comparing the
      "cloned" bitmap with what had been written back on disk during the transaction
      commit.
      
      Due to this, there was a requirement to hang on to the rgrps' bitmap buffers
      until the journal had been flushed. For that reason, there was a rather
      complicated set up in the ->go_lock ->go_unlock functions for rgrps involving
      both a mutex and a spinlock (the ->sd_rindex_spin) to maintain a reference
      count on the buffers.
      
      However, the journal maintains a reference count on the buffers anyway, since
      they are being journaled as metadata buffers. So by moving the code which deals
      with the post-journal accounting for bitmap blocks to the metadata journaling
      code, we can entirely dispense with the rather strange buffer ref counting
      scheme and also the requirement to journal the rgrps.
      
      The net result of all this is that the ->sd_rindex_spin is left to do exactly
      one job, and that is to look after the rbtree of rgrps.
      
      This patch is designed to be a stepping stone towards using RCU for the rbtree
      of resource groups. However, the reduction in the number of uses of the
      ->sd_rindex_spin is likely to have benefits for multi-threaded workloads
      anyway.
      
      The patch retains ->go_lock and ->go_unlock for rgrps; however, these may also
      be removed in future in favour of calling the functions directly where required
      in the code. That will allow locking of resource groups without needing to
      actually read them in - something that could be useful in speeding up statfs.
      
      In the meantime, though, it is valid to dereference ->bi_bh only when the rgrp
      is locked. This is basically the same rule as before, modulo the references not
      being valid until the following journal flush.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
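The rbtree described in this commit exists to answer one question quickly: which resource group contains a given block? The following is a minimal sketch of that ordered lookup, not the kernel code. It swaps in a sorted list with binary search in place of a red-black tree, and the names `Rgrp`, `RgrpIndex`, `addr` and `length` are assumptions made for illustration.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Rgrp:
    addr: int    # first block covered by the resource group (hypothetical name)
    length: int  # number of blocks covered (hypothetical name)


class RgrpIndex:
    """Ordered index of resource groups keyed by start address.

    A sorted list stands in for the rbtree; both give O(log n) lookup.
    """

    def __init__(self):
        self._starts = []
        self._rgrps = []

    def insert(self, rgd):
        # Keep both lists sorted by start address.
        i = bisect.bisect_left(self._starts, rgd.addr)
        self._starts.insert(i, rgd.addr)
        self._rgrps.insert(i, rgd)

    def lookup(self, blk):
        # Find the last rgrp whose start address is <= blk.
        i = bisect.bisect_right(self._starts, blk) - 1
        if i < 0:
            return None
        rgd = self._rgrps[i]
        # The block may fall in a gap after that rgrp.
        return rgd if blk < rgd.addr + rgd.length else None
```

In the kernel, the equivalent tree walk would be the only work done under ->sd_rindex_spin, which is the point the commit message makes about the lock being left with a single job.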
  2. 15 Jul, 2011 1 commit
  3. 21 May, 2011 1 commit
    • S
      GFS2: Wipe directory hash table metadata when deallocating a directory · 6d3117b4
      Steven Whitehouse committed
      The deallocation code for directories in GFS2 is largely divided into
      two parts. The first part deallocates any directory leaf blocks and
      marks the directory as being a regular file when that is complete. The
      second stage was identical to deallocating regular files.
      
      Regular files have their data blocks in a different
      address space to directories, and thus what would have been normal data
      blocks in a regular file (the hash table in a GFS2 directory) were
      deallocated correctly. However, a reference to these blocks was left in the
      journal (assuming of course that some previous activity had resulted in
      those blocks being in the journal or ail list).
      
      This patch uses the i_depth as a test of whether the inode is an
      exhash directory (we cannot test the inode type as that has already
      been changed to a regular file at this stage in deallocation).
      
      The original issue was reported by Chris Hertel as an issue he encountered
      while running bonnie++.
      Reported-by: Christopher R. Hertel <crh@samba.org>
      Cc: Abhijith Das <adas@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  4. 20 Apr, 2011 2 commits
  5. 18 Apr, 2011 1 commit
    • B
      GFS2: filesystem hang caused by incorrect lock order · 44ad37d6
      Bob Peterson committed
      This patch fixes a deadlock in GFS2 where two processes are trying
      to reclaim an unlinked dinode:
      One holds the inode glock and calls gfs2_lookup_by_inum trying to look
      up the inode, which it can't, due to I_FREEING.  The other has set
      I_FREEING from vfs and is at the beginning of gfs2_delete_inode
      waiting for the glock, which is held by the first.  The solution is to
      add a new non_block parameter to the gfs2_iget function that causes it
      to return -ENOENT if the inode is being freed.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
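The fix above amounts to letting the lookup fail fast instead of waiting on an inode that is being freed. Below is a small model of that behaviour, offered as a sketch under stated assumptions rather than the gfs2_iget code: `Inode`, `iget`, the dictionary cache and the `freeing` attribute are all hypothetical stand-ins, with `freeing` playing the role of I_FREEING.

```python
import errno


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.freeing = False  # stands in for the VFS I_FREEING state


def iget(cache, ino, non_block):
    """Look up an inode, creating it on first use.

    With non_block set, an inode that is being freed yields -ENOENT
    instead of a wait that could deadlock against the glock holder.
    """
    inode = cache.get(ino)
    if inode is None:
        inode = cache[ino] = Inode(ino)
        return inode
    if inode.freeing:
        if non_block:
            return -errno.ENOENT
        # A real blocking lookup would sleep here; the model just reports it.
        raise RuntimeError("would block waiting for the inode to be freed")
    return inode
```

A caller that already holds the inode glock would pass `non_block=True` and treat the -ENOENT result as "someone else is already freeing this inode".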
  6. 24 Feb, 2011 1 commit
    • B
      GFS2: deallocation performance patch · 4c16c36a
      Bob Peterson committed
      This patch is a performance improvement to GFS2's dealloc code.
      Rather than update the quota file and statfs file for every
      single block that's stripped off in unlink function do_strip,
      this patch keeps track and updates them once for every layer
      that's stripped.  This is done entirely inside the existing
      transaction, so there should be no risk of corruption.
      The other functions that deallocate blocks will be unaffected
      because they are using wrapper functions that do the same
      thing that they do today.
      
      I tested this code on my roth cluster by creating 200
      files in a directory, each of which is 100MB, then on
      four nodes, I simultaneously deleted the files, thus competing
      for GFS2 resources (but different files).  The commands
      I used were:
      
      [root@roth-01]# time for i in `seq 1 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-02]# time for i in `seq 2 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-03]# time for i in `seq 3 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      [root@roth-05]# time for i in `seq 4 4 200` ; do rm /mnt/gfs2/bigdir/gfs2.$i; done
      
      The performance increase was significant:
      
                   roth-01     roth-02     roth-03     roth-05
                   ---------   ---------   ---------   ---------
      old: real    0m34.027s   0m25.021s   0m23.906s   0m35.646s
      new: real    0m22.379s   0m24.362s   0m24.133s   0m18.562s
      
      Total time spent deleting:
      old: 118.6s
      new:  89.4s
      
      For this particular case, this showed a 25% performance increase for
      GFS2 unlinks.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
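The saving described in this commit comes from batching: the quota and statfs files are touched once per stripped layer rather than once per block. The sketch below only counts those updates to make the difference visible. It is not the do_strip code, and `Accounting`, `strip_per_block` and `strip_per_layer` are names invented for the illustration.

```python
class Accounting:
    """Counts how often the (modelled) quota/statfs files are updated."""

    def __init__(self):
        self.freed_blocks = 0
        self.updates = 0

    def adjust(self, nblocks):
        self.freed_blocks += nblocks
        self.updates += 1


def strip_per_block(layers, acct):
    # Old behaviour: one accounting update for every block freed.
    for nblocks in layers:
        for _ in range(nblocks):
            acct.adjust(1)


def strip_per_layer(layers, acct):
    # New behaviour: tally the blocks freed in a layer, then update once.
    for nblocks in layers:
        freed = 0
        for _ in range(nblocks):
            freed += 1
        if freed:
            acct.adjust(freed)
```

For a file whose metadata layers hold 3, 40 and 500 blocks, the first function performs 543 updates and the second performs 3, while both account for the same 543 freed blocks.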
  7. 08 Dec, 2010 1 commit
    • B
      GFS2: fsck.gfs2 reported statfs error after gfs2_grow · bcd7278d
      Bob Peterson committed
      After a gfs2_grow, GFS2 failed to take the very last
      rgrp into account when adding up the new free space due
      to an off-by-one error.  It was not reading the last
      rgrp from the rindex because of a check for "<=" that
      should have been "<".  Therefore, fsck.gfs2 was finding
      (and fixing) an error with the system statfs file.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
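An off-by-one of this kind is easy to reproduce with any loop over fixed-size records. The following is a hypothetical reconstruction of the general pattern, not the code the patch changed: the record size, the `total_free` function and the way the stop test is written are assumptions chosen so that the faulty comparison is a "<=" and the corrected one a "<", as the message describes.

```python
REC_SIZE = 96  # illustrative size of one index record, in bytes


def total_free(free_counts, file_size, buggy):
    """Sum free space over fixed-size records stored back to back.

    free_counts[i] is the free space recorded in entry i.
    """
    pos = 0
    index = 0
    total = 0
    while True:
        end = pos + REC_SIZE
        if buggy:
            # Wrong: also stops when the record ends exactly at end of file,
            # so the very last record is never read.
            stop = file_size <= end
        else:
            # Right: stop only when the record would run past end of file.
            stop = file_size < end
        if stop:
            break
        total += free_counts[index]
        index += 1
        pos = end
    return total
```

With three records holding 10, 20 and 30 free blocks in a 288-byte file, the faulty test yields 30 and the corrected test yields 60.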
  8. 30 Nov, 2010 2 commits
  9. 15 Nov, 2010 1 commit
    • S
      GFS2: Fix inode deallocation race · 044b9414
      Steven Whitehouse committed
      This area of the code has always been a bit delicate due to the
      subtleties of lock ordering. The problem is that for "normal"
      alloc/dealloc, we always grab the inode locks first and the rgrp lock
      later.
      
      In order to ensure no races in looking up the unlinked, but still
      allocated inodes, we need to hold the rgrp lock when we do the lookup,
      which means that we can't take the inode glock.
      
      The solution is to borrow the technique already used by NFS to solve
      what is essentially the same problem (given an inode number, look up
      the inode carefully, checking that it really is in the expected
      state).
      
      We cannot do that directly from the allocation code (lock ordering
      again) so we give the job to the pre-existing delete workqueue and
      carry on with the allocation as normal.
      
      If we find there is no space, we do a journal flush (required anyway
      if space from a deallocation is to be released) which should block
      against the pending deallocations, so we should always get the space
      back.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  10. 01 Oct, 2010 1 commit
    • B
      GFS2 fatal: filesystem consistency error on rename · 46290341
      Bob Peterson committed
      This patch fixes a GFS2 problem whereby the first rename after a
      mount can result in a file system consistency error being flagged
      improperly and cause the file system to withdraw.  The problem is
      that the rename code tries to run the rgrp list with function
      gfs2_blk2rgrpd before the rgrp list is guaranteed to be read in
      from disk.  The patch makes the rename function hold the rindex
      glock (as the gfs2_unlink code does today) which reads in the rgrp
      list if need be.  There were a total of three places in the rename
      code that improperly referenced the rgrp list without the rindex
      glock and this patch fixes all three.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  11. 20 Sep, 2010 3 commits
    • B
      GFS2: fallocate support · 3921120e
      Benjamin Marzinski committed
      This patch adds support for fallocate to gfs2.  Since gfs2 does not support
      uninitialized data blocks, it must write out zeros to all the blocks.  However,
      since it does not need to lock any pages to read from, gfs2 can write out the
      zero blocks much more efficiently.  On a moderately full filesystem, fallocate
      works around 5 times faster on average.  The fallocate call also allows gfs2 to
      add blocks to the file without changing the filesize, which will make it
      possible for gfs2 to preallocate space for the rindex file, so that gfs2 can
      grow a completely full filesystem.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • S
      GFS2: Add a bug trap in allocation code · 9a3f236d
      Steven Whitehouse committed
      This adds a check to ensure that, if we reach the block allocator,
      we do not try to proceed when there is no alloc structure
      hanging off the inode. This should only happen if there is a bug
      in GFS2. The error return code is distinctive in order that it
      will be easily spotted.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • S
      GFS2: Remove i_disksize · a2e0f799
      Steven Whitehouse committed
      With the update of the truncate code, ip->i_disksize and
      inode->i_size are merely copies of each other. This means
      we can remove ip->i_disksize and use inode->i_size exclusively,
      reducing the size of a GFS2 inode by 8 bytes.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  12. 17 Sep, 2010 1 commit
    • C
      block: remove BLKDEV_IFL_WAIT · dd3932ed
      Christoph Hellwig committed
      All the blkdev_issue_* helpers can only sanely be used for synchronous
      callers.  To issue cache flushes or barriers asynchronously the caller needs
      to set up a bio by itself with a completion callback to move the asynchronous
      state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
      specified when calling blkdev_issue_* and also remove the now unused flags
      argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
      blkdev_issue_discard we need to keep it for the secure discard flag, which
      gains a more descriptive name and loses the bitops vs flag confusion.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  13. 10 Sep, 2010 1 commit
  14. 21 May, 2010 1 commit
  15. 12 May, 2010 1 commit
  16. 29 Apr, 2010 1 commit
  17. 14 Apr, 2010 1 commit
    • B
      GFS2: glock livelock · 1a0eae88
      Bob Peterson committed
      This patch fixes several gfs2 problems with the reclaiming of
      unlinked dinodes.  First, there were a couple of livelocks where
      everything would come to a halt waiting for a glock that was
      seemingly held by a process that no longer existed.  In fact, the
      process did exist, it just had the wrong pid number in the holder
      information.  Second, there was a lock ordering problem between
      inode locking and glock locking.  Third, glock/inode contention
      could sometimes cause inodes to be improperly marked invalid by
      iget_failed.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  18. 01 Feb, 2010 3 commits
  19. 03 Dec, 2009 1 commit
  20. 21 Sep, 2009 1 commit
  21. 14 Sep, 2009 2 commits
  22. 09 Sep, 2009 1 commit
    • S
      GFS2: Be extra careful about deallocating inodes · acf7e244
      Steven Whitehouse committed
      There is a potential race in the inode deallocation code if two
      nodes try to deallocate the same inode at the same time. Most of
      the issue is solved by the iopen locking. There is still a small
      window which is not covered by the iopen lock. This patch fixes
      that and also makes the deallocation code more robust in the face of
      any errors in the rgrp bitmaps, or erroneous iopen callbacks from
      other nodes.
      
      This does introduce one extra disk read, but that is generally not
      an issue since it's the same block that must be written to later
      in the deallocation process. The total disk accesses therefore stay
      the same.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  23. 27 Aug, 2009 1 commit
    • S
      GFS2: Remove no_formal_ino generating code · 8d8291ae
      Steven Whitehouse committed
      The inum structure used throughout GFS2 has two fields. One,
      no_addr, is the disk block number of the inode in question and
      is used everywhere as the inode number. The other, no_formal_ino,
      is used only as the generation number for NFS.
      
      Historically the no_formal_ino field was set using a complicated
      system of one global and one per-node file containing inode numbers
      in order to ensure that each no_formal_ino was unique. Also this
      code made no provision for what would happen when eventually the
      (64 bit) numbers ran out. Now I know that is pretty unlikely to
      happen given the large space of numbers, but it is possible
      nevertheless.
      
      The only guarantee required for no_formal_ino is that, for any
      single inode, the same number doesn't get reused too quickly.
      
      We already have a generation number which is kept in the inode
      and initialised from a counter in the resource group (almost
      no overhead, since we have to touch the resource group anyway
      in order to allocate an inode in the first place). Aside from
      ensuring that we never use the value 0 in the no_formal_ino
      field, we can use that counter directly.
      
      As a result of that change, we lose about 200 lines of code and
      also gain about 10 creates/sec on the postmark benchmark (on
      my test machine).
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
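The scheme in this commit reduces to "take the next value of the per-resource-group counter, but never hand out 0". Here is a minimal sketch of that rule, assuming a 64-bit counter that wraps around. `RgrpCounter`, `igeneration` and `next_formal_ino` are illustrative names, not the kernel's.

```python
MASK64 = (1 << 64) - 1  # the counter is modelled as an unsigned 64-bit value


class RgrpCounter:
    def __init__(self, igeneration=0):
        self.igeneration = igeneration  # hypothetical per-rgrp counter


def next_formal_ino(rgd):
    """Return the next generation value from the rgrp counter, skipping 0."""
    gen = rgd.igeneration
    rgd.igeneration = (rgd.igeneration + 1) & MASK64
    if gen == 0:
        # 0 is reserved, so consume one more value from the counter.
        gen = rgd.igeneration
        rgd.igeneration = (rgd.igeneration + 1) & MASK64
    return gen
```

Because the counter lives in the resource group that has to be modified for the allocation anyway, the extra cost is a single increment, which matches the "almost no overhead" remark in the message.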
  24. 17 Aug, 2009 2 commits
  25. 30 Jul, 2009 2 commits
  26. 12 Jun, 2009 1 commit
    • S
      GFS2: Add tracepoints · 63997775
      Steven Whitehouse committed
      This patch adds the ability to trace various aspects of the GFS2
      filesystem. The trace points are divided into three groups:
      glocks, logging and bmap. These points have been chosen because
      they allow inspection of the major internal functions of GFS2
      and they are also generic enough that they are unlikely to need
      any major changes as the filesystem evolves.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  27. 23 May, 2009 1 commit
  28. 22 May, 2009 1 commit
    • S
      GFS2: Clean up some file names · b1e71b06
      Steven Whitehouse committed
      This patch renames the ops_*.c files which have no counterpart
      without the ops_ prefix in order to shorten the name and make
      it more readable. In addition, ops_address.h (which was very
      small) is moved into inode.h and inode.h is cleaned up by
      adding extern where required.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  29. 21 May, 2009 2 commits
    • S
      GFS2: Be more aggressive in reclaiming unlinked inodes · 1ce97e56
      Steven Whitehouse committed
      This patch increases the frequency with which gfs2 looks
      for unlinked, but still allocated inodes. It's the equivalent
      operation to ext3's orphan list, but done with bitmaps in
      the resource groups.
      
      This also fixes a bug where a field in the rgrp was too small.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • S
      GFS2: Add a rgrp bitmap full flag · 60a0b8f9
      Steven Whitehouse committed
      During block allocation, it is useful to know if sections of disk
      are full on a finer grained basis than a single resource group.
      This can make a performance difference when resource groups have
      larger numbers of bitmap blocks, since we no longer have to search
      them all block by block in each individual bitmap.
      
      The full flag is set on a per-bitmap basis when it has been
      searched and found to have no free space. It is then skipped in
      subsequent searches until the flag is reset. The resetting
      occurs if we have to drop the glock on the resource group for any
      reason, or if we deallocate some blocks within that resource
      group and thus free up some space.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
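The full flag works like a small negative cache on each bitmap. The model below is a sketch of the rules stated in the message (set on a fruitless search, skip while set, reset on free or on dropping the glock). The classes, the list-of-booleans bitmap and the `scans` counter are assumptions added so that the skipping can be observed.

```python
class Bitmap:
    def __init__(self, bits):
        self.bits = list(bits)  # True means the block is in use
        self.full = False       # set once a search finds no free block
        self.scans = 0          # instrumentation: how often we searched


class Rgrp:
    def __init__(self, bitmaps):
        self.bitmaps = bitmaps

    def alloc(self):
        """Allocate one block; return (bitmap index, offset) or None."""
        for bi, bm in enumerate(self.bitmaps):
            if bm.full:
                continue  # known to be full, skip the block-by-block search
            bm.scans += 1
            for off, used in enumerate(bm.bits):
                if not used:
                    bm.bits[off] = True
                    return (bi, off)
            bm.full = True  # searched and found nothing free
        return None

    def free(self, bi, off):
        bm = self.bitmaps[bi]
        bm.bits[off] = False
        bm.full = False  # space was released, so the flag no longer holds

    def drop_glock(self):
        # Another node may change the bitmaps once the lock is dropped.
        for bm in self.bitmaps:
            bm.full = False
```

After one failed search, a full bitmap costs a single flag test on later allocations until something in it is freed or the glock is dropped.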
  30. 20 May, 2009 1 commit
    • S
      GFS2: Improve resource group error handling · 09010978
      Steven Whitehouse committed
      This patch improves the error handling in the case where we
      discover that the summary information in the resource group
      doesn't match the bitmap information while in the process of
      allocating blocks. Originally this resulted in a kernel bug,
      but this patch changes that so that we return -EIO and print
      some messages explaining what went wrong, and how to fix it.
      
      We also remember locally not to try and allocate from the
      same rgrp again, so that a subsequent allocation in a
      different rgrp should succeed.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>