1. 03 Jan, 2014: 7 commits
    • GFS2: Use range based functions for rgrp sync/invalidation · 7005c3e4
      Committed by Steven Whitehouse
      Each rgrp header is represented as a single extent on disk, so we
      can calculate the position within the address space, since we are
      using address spaces mapped 1:1 to the disk. This means that it
      is possible to use the range based versions of filemap_fdatawrite/wait
      and for invalidating the page cache.
      
      Our eventual intent is to then be able to merge the address spaces
      used for rgrps into a single address space, rather than to have
      one for each glock, saving memory and reducing complexity.
      
      Since during umount, the rgrp structures are disposed of before
      the glocks, we need to store the extent information in the glock
      so that it is available for a final invalidation. This patch uses
      a field which is otherwise unused in rgrp glocks to do that, so
      that we do not have to expand the size of a glock.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
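      A minimal userspace sketch of the offset arithmetic this relies on,
      assuming a 1:1 block-to-address-space mapping and a power-of-two block
      size. The struct and function names (rgrp_extent, rgrp_byte_range) are
      hypothetical; in the kernel, the resulting inclusive [start, end] pair
      is the kind of range that filemap_fdatawrite_range() and
      filemap_fdatawait_range() accept.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one rgrp header extent on disk. */
struct rgrp_extent {
    uint64_t first_block; /* first filesystem block of the rgrp header */
    uint32_t length;      /* number of blocks in the header extent */
};

/*
 * Assuming the address space maps 1:1 to the disk, a block's byte offset
 * is block_number << block_size_shift. The range is inclusive,
 * [*start, *end], matching the convention of range-based writeback/wait.
 */
static void rgrp_byte_range(const struct rgrp_extent *ext,
                            unsigned int bsize_shift,
                            uint64_t *start, uint64_t *end)
{
    assert(ext->length > 0); /* an empty extent has no valid range */
    *start = ext->first_block << bsize_shift;
    *end = *start + ((uint64_t)ext->length << bsize_shift) - 1;
}
```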
    • GFS2: Remove test which is always true · 7de41d36
      Committed by Steven Whitehouse
      Since gfs2_inplace_reserve() is always called with a valid
      alloc parms structure, there is no need to test for this
      within the function itself; in any case, the test came only after
      the structure had already been dereferenced.
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Remove gfs2_quota_change_host structure · 7aed98fb
      Committed by Steven Whitehouse
      There is only one place this is used, when reading in the quota
      changes at mount time. It is not really required and much
      simpler to just convert the fields from the on-disk structure
      as required.
      
      There should be no functional change as a result of this patch.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
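      A sketch of converting on-disk fields at the point of use instead of
      via an intermediate host-endian structure. The record layout and helper
      names here are assumptions for illustration, not the real
      gfs2_quota_change definition; kernel code would use be64_to_cpu() and
      be32_to_cpu() rather than the hand-written byte loads below.

```c
#include <stdint.h>

/* Hypothetical on-disk quota change record; every field is big-endian. */
struct quota_change_disk {
    uint8_t change[8]; /* signed 64-bit change */
    uint8_t flags[4];
    uint8_t id[4];
};

/* Portable big-endian loads (stand-ins for be64_to_cpu()/be32_to_cpu()). */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Convert each field where it is needed, rather than first copying the
 * whole record into a separate host-endian structure. */
static int64_t qc_change(const struct quota_change_disk *qc)
{
    return (int64_t)load_be64(qc->change);
}

static uint32_t qc_flags(const struct quota_change_disk *qc)
{
    return load_be32(qc->flags);
}

static uint32_t qc_id(const struct quota_change_disk *qc)
{
    return load_be32(qc->id);
}
```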
    • GFS2: Clean up releasepage · e4f29206
      Committed by Steven Whitehouse
      For historical reasons, we drop and retake the log lock in ->releasepage().
      However, there is no reason why we cannot hold the log lock over the
      whole function, and doing so allows some simplification. In particular,
      pinning a buffer is only ever done under the log lock, so the test for
      pinned buffers in the second loop can be removed: it can never be true
      there (and the same condition is already tested in the first loop).
      
      As a result, two tests made later in the second loop become constants
      and can also be reduced to the only possible branch. So the net result
      is to remove various bits of unreachable code and make this more
      readable.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Implement a "rgrp has no extents longer than X" scheme · 5ea5050c
      Committed by Bob Peterson
      With the preceding patch, we started accepting block reservations
      smaller than the ideal size, which requires a lot more parsing of the
      bitmaps. To reduce the amount of bitmap searching, this patch
      implements a scheme whereby each rgrp keeps track of the point
      at which multi-block reservations will fail.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
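      One way to express the "no extents longer than X" idea, as a sketch
      under stated assumptions: the field name extfail_pt and the
      reset-on-free policy are hypothetical, chosen only to show how a cached
      failure point lets a request skip the bitmap search entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-rgrp state for the "no extents longer than X" hint. */
struct rgrp {
    uint32_t free;       /* free blocks in the rgrp */
    uint32_t extfail_pt; /* requests of this many blocks or more will fail */
};

static void rgrp_init(struct rgrp *rg, uint32_t free)
{
    rg->free = free;
    rg->extfail_pt = free + 1; /* no extent can exceed the free count */
}

/* Cheap pre-check done before any bitmap search. */
static bool rgrp_worth_searching(const struct rgrp *rg, uint32_t want)
{
    return want < rg->extfail_pt;
}

/* Record that a search for `want` contiguous blocks failed. */
static void rgrp_search_failed(struct rgrp *rg, uint32_t want)
{
    if (want < rg->extfail_pt)
        rg->extfail_pt = want;
}

/* Freeing blocks may merge extents, so conservatively forget the hint. */
static void rgrp_blocks_freed(struct rgrp *rg, uint32_t n)
{
    rg->free += n;
    rg->extfail_pt = rg->free + 1;
}
```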
    • GFS2: Drop inadequate rgrps from the reservation tree · 1330edbe
      Committed by Bob Peterson
      This is just basically a resend of a patch I posted earlier.
      It didn't change from its original, except in diff offsets, etc:
      
      This patch fixes a bug in the GFS2 block allocation code. The problem
      starts if a process already has a multi-block reservation, but for
      some reason, another process disqualifies it from further allocations.
      For example, the other process might set the GFS2_RDF_ERROR bit.
      The process holding the reservation jumps to label skip_rgrp, but
      that label comes after the code that removes the reservation from the
      tree. Therefore, the no-longer-usable reservation is not removed from
      the rgrp's reservations tree; it's lost. Over time, the lost reservation
      throws off the count of reserved blocks, which eventually triggers
      BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free).
      This patch moves the removal call to after label skip_rgrp so that the
      disqualified reservation is properly removed from the tree, thus keeping
      the rgrp rd_reserved count sane.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
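      A simplified model of the control-flow fix, with hypothetical types
      standing in for the rgrp and reservation structures. It shows only why
      the removal has to sit after the skip_rgrp label; it is not the actual
      allocation loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the rgrp and the multi-block reservation. */
struct rgrp {
    bool error;        /* models GFS2_RDF_ERROR being set by another process */
    unsigned reserved; /* models rd_reserved */
};

struct reservation {
    struct rgrp *rgd;  /* rgrp holding the reservation, or NULL */
    unsigned free;     /* blocks still reserved (models rs_free) */
};

static void rs_remove(struct reservation *rs)
{
    if (rs->rgd) {
        rs->rgd->reserved -= rs->free; /* keep the per-rgrp count sane */
        rs->rgd = NULL;
        rs->free = 0;
    }
}

/* Returns the index of the first usable rgrp, or -1. */
static int pick_rgrp(struct rgrp *rgs, int n, struct reservation *rs)
{
    for (int i = 0; i < n; i++) {
        if (rgs[i].error)
            goto skip_rgrp;
        return i; /* usable: allocate here */
skip_rgrp:
        /* The removal sits after the label, so a reservation in a
         * disqualified rgrp is dropped from the tree instead of lost. */
        if (rs->rgd == &rgs[i])
            rs_remove(rs);
    }
    return -1;
}
```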
    • GFS2: If requested is too large, use the largest extent in the rgrp · 5ce13431
      Committed by Bob Peterson
      Here is a second try at a patch I posted earlier, which also implements
      suggestions Steve made:
      
      Before this patch, GFS2 would keep searching through all the rgrps
      until it found one that had a chunk of free blocks big enough to
      satisfy the size hint, which is based on the file write size,
      regardless of whether the chunk was big enough to perform the write.
      However, when doing big writes there may not be a large enough
      chunk of free blocks in any rgrp, due to file system fragmentation.
      The largest chunk may be big enough to satisfy the write request,
      but it may not meet the ideal reservation size from the "size hint".
      The writes would slow to a crawl because every write would search
      every rgrp, then finally give up and default to a single-block write.
      In my case, performance would drop from 425MB/s to 18KB/s, or 24000
      times slower.
      
      With this patch, if we can't find a contiguous chunk of blocks big
      enough to satisfy the size hint, we use the largest chunk of blocks
      we found that will still contain the write.
      It does so by keeping track of the largest run of blocks within the
      rgrp.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
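      A sketch of the fallback, assuming a simplified bitmap with one bit per
      block (0 = free). The function find_extent and its parameters are
      hypothetical: hint plays the role of the size hint, and min_len is the
      smallest run that would still contain the write.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look for a run of `hint` free blocks. While scanning, remember the
 * largest free run seen; if no run reaches `hint`, fall back to that
 * largest run provided it is at least `min_len` (hint >= 1 assumed).
 * Returns the run's first block, or -1; *len receives the run length.
 */
static long find_extent(const uint8_t *bitmap, size_t nbits, size_t hint,
                        size_t min_len, size_t *len)
{
    size_t best_start = 0, best_len = 0, run_start = 0, run_len = 0;

    for (size_t i = 0; i < nbits; i++) {
        int used = (bitmap[i / 8] >> (i % 8)) & 1;
        if (used) {
            run_len = 0; /* a used block ends the current run */
            continue;
        }
        if (run_len == 0)
            run_start = i;
        run_len++;
        if (run_len > best_len) {
            best_len = run_len;
            best_start = run_start;
        }
        if (run_len == hint) { /* ideal size found: stop searching */
            *len = hint;
            return (long)run_start;
        }
    }
    if (best_len > 0 && best_len >= min_len) {
        *len = best_len;
        return (long)best_start;
    }
    *len = 0;
    return -1;
}
```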
  2. 02 Jan, 2014: 1 commit
  3. 20 Dec, 2013: 2 commits
    • GFS2: Wait for async DIO in glock state changes · 582d2f7a
      Committed by Steven Whitehouse
      We need to wait for any outstanding DIO to complete in a couple
      of situations. Firstly, in case we are changing out of deferred
      mode (in inode_go_sync) where GLF_DIRTY will not be set. That
      call could be prefixed with a test for gl_state == LM_ST_DEFERRED
      but it doesn't seem worth it bearing in mind that the test for
      outstanding DIO is very quick anyway, in the usual case that there
      is none.
      
      The second case is in inode_go_lock which will catch the cases
      where we have a cached EX lock, but where we grant deferred locks
      against it so that there is no glock state transition. We only
      need to wait if the state is not deferred, since DIO is valid
      anyway in that state.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix incorrect invalidation for DIO/buffered I/O · dfd11184
      Committed by Steven Whitehouse
      In patch 209806ab we allowed
      local deferred locks to be granted against a cached exclusive
      lock. That opened up a corner case which this patch now
      fixes.
      
      The solution to the problem is to check whether we have cached
      pages each time we do direct I/O and if so to unmap, flush
      and invalidate those pages. Since the glock state machine
      normally does that for us, mostly the code will be a no-op.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  4. 14 Dec, 2013: 3 commits
    • GFS2: Fix slab memory leak in gfs2_bufdata · 502be2a3
      Committed by Bob Peterson
      This patch fixes a slab memory leak that sometimes can occur
      for files with a very short lifespan. The problem occurs when
      a dinode is deleted before it has gotten to the journal properly.
      In the leak scenario, the bd object is pinned for journal
      commitment (queued to the metadata buffers queue: sd_log_le_buf)
      but is subsequently unpinned and dequeued before it finds its way
      to the ail or the revoke queue. In this rare circumstance, the bd
      object needs to be freed from slab memory, or it is forgotten.
      We have to be very careful how we do it, though, because
      multiple processes can call gfs2_remove_from_journal. In order to
      avoid double-frees, only the process that does the unpinning is
      allowed to free the bd.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
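      The rule "only the process that does the unpinning frees the bd" can be
      illustrated with a compare-and-swap. This sketch swaps in a C11 atomic
      purely for illustration (the GFS2 patch relies on its own locking), and
      struct bufdata here is a hypothetical stand-in for gfs2_bufdata.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a gfs2_bufdata ("bd") object. */
struct bufdata {
    atomic_bool pinned;
};

/*
 * Exactly one caller can flip pinned from true to false, and only that
 * caller frees the object, so racing callers cannot double-free it.
 * Returns true if this call performed the unpin and freed bd.
 */
static bool unpin_and_maybe_free(struct bufdata *bd)
{
    bool expected = true;

    if (atomic_compare_exchange_strong(&bd->pinned, &expected, false)) {
        free(bd);
        return true;
    }
    return false; /* someone else unpinned it; they own the free */
}
```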
    • GFS2: Fix use-after-free race when calling gfs2_remove_from_ail · 9290a9a7
      Committed by Bob Peterson
      Function gfs2_remove_from_ail drops the reference on the bh via
      brelse. This patch fixes a race condition whereby bh is dereferenced
      after the brelse when setting bd->bd_blkno = bh->b_blocknr;
      Under certain rare circumstances, bh might be gone or reused,
      and bd->bd_blkno is set to whatever that memory happens to be,
      which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
      the test "bd->bd_blkno >= blkno" which causes it to never be freed.
      The end result is that the bd is never freed from the bufdata cache,
      which results in this error:
      slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
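      The general fix for this kind of use-after-free is to copy what you
      need out of the buffer before dropping the reference. The types and
      function names below are hypothetical, simplified stand-ins for
      buffer_head, gfs2_bufdata and brelse().

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for buffer_head and gfs2_bufdata. */
struct buffer {
    uint64_t blocknr;
    int refcount;
};

struct bufdata {
    struct buffer *bh;
    uint64_t blkno;
};

/* Drops a reference; the buffer may be freed (or reused) afterwards. */
static void buffer_release(struct buffer *bh)
{
    if (--bh->refcount == 0)
        free(bh);
}

static void remove_from_ail(struct bufdata *bd)
{
    struct buffer *bh = bd->bh;

    /* Read everything needed from bh *before* releasing it. */
    bd->blkno = bh->blocknr;
    bd->bh = NULL;
    buffer_release(bh); /* bh must not be touched after this point */
}
```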
    • GFS2: don't hold s_umount over blkdev_put · dfe5b9ad
      Committed by Steven Whitehouse
      This is a GFS2 version of Tejun's patch:
      4f331f01
      vfs: don't hold s_umount over close_bdev_exclusive() call
      
      In this case it is blkdev_put itself that is the issue, and this
      patch uses the same solution of dropping and retaking s_umount.
      Reported-by: Tejun Heo <tj@kernel.org>
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  5. 22 Nov, 2013: 1 commit
  6. 21 Nov, 2013: 1 commit
  7. 16 Nov, 2013: 1 commit
  8. 04 Nov, 2013: 3 commits
  9. 25 Oct, 2013: 1 commit
  10. 15 Oct, 2013: 1 commit
    • GFS2: Use lockref for glocks · e66cf161
      Committed by Steven Whitehouse
      Currently glocks have an atomic reference count and also a spinlock
      which covers various internal fields, such as the state. The intent of
      this patch is to replace the spinlock and the atomic reference count
      with a lockref structure. This contains a spinlock which we can continue
      to use as before, and a reference counter which is used in conjunction
      with the spinlock to replace the previous atomic counter.
      
      As a result of this there are some new rules for reference counting on
      glocks. We need to distinguish between reference count changes under
      gl_spin (which are now just increment or decrement of the new counter,
      provided the count cannot hit zero) and those which are outside of
      gl_spin, but which now take gl_spin internally.
      
      The conversion is relatively straightforward. There is probably some
      further clean up which can be done, but the priority at this stage is to
      make the change in as simple a manner as possible.
      
      A consequence of this change is that the reference count is being
      decoupled from the lru list processing. This should allow future
      adoption of the lru_list code with glocks in due course.
      
      The reason for using the "dead" state and not just relying on 0 being
      the "invalid state" is so that in due course 0 ref counts can be
      allowable. The intent is to eventually be able to remove the ref count
      changes which are currently hidden away in state_change().
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
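      A userspace model of the lockref idea, assuming a pthread mutex in place
      of the spinlock. The names and the -128 dead marker are illustrative
      assumptions; the point is that "get" checks for the dead state rather
      than for a zero count, so a live object may legitimately sit at zero.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of a lockref: one lock guards both the object's
 * internal fields and its reference count. */
struct lockref_model {
    pthread_mutex_t lock;
    int count;
};

#define LOCKREF_DEAD (-128) /* "dead" marker, distinct from a count of 0 */

/* Take a reference unless the object has been marked dead. */
static bool ref_get_not_dead(struct lockref_model *lr)
{
    bool ok;

    pthread_mutex_lock(&lr->lock);
    ok = lr->count >= 0; /* zero is still a live state */
    if (ok)
        lr->count++;
    pthread_mutex_unlock(&lr->lock);
    return ok;
}

/* Drop a reference. Under the current policy, the last put marks the
 * object dead and returns true so the caller can free it. */
static bool ref_put(struct lockref_model *lr)
{
    bool last;

    pthread_mutex_lock(&lr->lock);
    last = (--lr->count == 0);
    if (last)
        lr->count = LOCKREF_DEAD;
    pthread_mutex_unlock(&lr->lock);
    return last;
}
```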
  11. 04 Oct, 2013: 4 commits
  12. 02 Oct, 2013: 3 commits
  13. 27 Sep, 2013: 1 commit
    • GFS2: Clean up reservation removal · af5c2697
      Committed by Steven Whitehouse
      The reservation for an inode should be cleared when it is truncated so
      that we can start again at a different offset for future allocations.
      We could try and do better than that, by resetting the search based on
      where the truncation started from, but this is only a first step.
      
      In addition, there are three callers of gfs2_rs_delete() but only one
      of those should really be testing the value of i_writecount. While
      we get away with that in the other cases currently, I think it would
      be better if we made that test specific to the one case which
      requires it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  14. 23 Sep, 2013: 1 commit
  15. 18 Sep, 2013: 2 commits
    • GFS2: new function gfs2_rbm_incr · 149ed7f5
      Committed by Bob Peterson
      Since the previous patch eliminated bi in favor of bii, this follow-on
      patch needed to be adjusted accordingly. Here is the revised version.
      
      This patch adds a new function, gfs2_rbm_incr, which increments
      an rbm structure. This is more efficient than calling gfs2_rbm_to_block,
      incrementing, then calling gfs2_rbm_from_block.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
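      A sketch of the increment, assuming an rbm made of a bitmap index (bii)
      and an offset within that bitmap. The structures and the return
      convention (false when already at the last block) are hypothetical
      simplifications, not the real gfs2_rbm_incr().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified rgrp with an array of bitmaps. */
struct bitmap_info { uint32_t nblocks; }; /* blocks covered by one bitmap */
struct rgrp_model { const struct bitmap_info *bits; uint32_t nbits; };

/* Position = (which bitmap, which block within it). */
struct rbm { const struct rgrp_model *rgd; uint32_t bii; uint32_t offset; };

/*
 * Step to the next block directly, instead of converting the rbm to an
 * absolute block number, adding one and converting back.
 */
static bool rbm_incr(struct rbm *rbm)
{
    if (rbm->offset + 1 < rbm->rgd->bits[rbm->bii].nblocks) {
        rbm->offset++;              /* common case: same bitmap */
        return true;
    }
    if (rbm->bii + 1 == rbm->rgd->nbits)
        return false;               /* last block of the rgrp */
    rbm->bii++;                     /* move to the start of the next bitmap */
    rbm->offset = 0;
    return true;
}
```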
    • GFS2: Introduce rbm field bii · e579ed4f
      Committed by Bob Peterson
      This is a respin of the original patch. As Steve pointed out, the
      introduction of field bii makes it easy to eliminate bi itself.
      This revised patch does just that, replacing bi with bii.
      
      This patch adds a new field to the rbm structure, called bii,
      which is an index into the array of bitmaps for an rgrp.
      This replaces *bi which was a pointer to the bitmap.
      This is being done for further optimizations.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  16. 17 Sep, 2013: 5 commits
  17. 13 Sep, 2013: 1 commit
  18. 11 Sep, 2013: 2 commits
    • fs: convert fs shrinkers to new scan/count API · 1ab6c499
      Committed by Dave Chinner
      Convert the filesystem shrinkers to use the new API, and standardise some
      of the behaviours of the shrinkers at the same time.  For example,
      nr_to_scan means the number of objects to scan, not the number of objects
      to free.
      
      I refactored the CIFS idmap shrinker a little - it really needs to be
      broken up into a shrinker per tree and keep an item count with the tree
      root so that we don't need to walk the tree every time the shrinker needs
      to count the number of objects in the tree (i.e.  all the time under
      memory pressure).
      
      [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
      [assorted fixes folded in]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
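      A userspace model of the count/scan split, showing that nr_to_scan
      bounds how many objects are examined, not how many are freed. In the
      kernel these roles belong to the shrinker's ->count_objects and
      ->scan_objects callbacks; the cache structure and function names below
      are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache: each object is either busy (not freeable) or idle. */
struct cache_model {
    bool busy[64];
    size_t nr; /* objects currently cached: busy[0..nr-1] */
};

/* "count": report how many objects could be freed right now. */
static size_t cache_count(const struct cache_model *c)
{
    size_t n = 0;

    for (size_t i = 0; i < c->nr; i++)
        n += !c->busy[i];
    return n;
}

/*
 * "scan": examine at most nr_to_scan objects and free the idle ones among
 * them. Busy objects are skipped, so fewer than nr_to_scan may be freed.
 * Returns the number actually freed.
 */
static size_t cache_scan(struct cache_model *c, size_t nr_to_scan)
{
    size_t freed = 0, out = 0;

    for (size_t i = 0; i < c->nr; i++) {
        if (i < nr_to_scan && !c->busy[i]) {
            freed++;                    /* drop this object */
            continue;
        }
        c->busy[out++] = c->busy[i];    /* keep it, compacting the array */
    }
    c->nr = out;
    return freed;
}
```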
    • super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Committed by Glauber Costa
      The sysctl knob sysctl_vfs_cache_pressure is used to determine what
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the approximation errors will be negligible. But if
      this is not the case, especially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0 if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
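      The truncation is easy to reproduce with integer arithmetic. The sketch
      below contrasts dividing first with a quotient/remainder split (the same
      idea as the kernel's mult_frac() macro). It is one way to avoid the
      problem, not necessarily the patch's exact code, and the function names
      are hypothetical.

```c
/* Divide first: any total below 100 collapses to 0 objects. */
static unsigned long scale_naive(unsigned long total, unsigned long pressure)
{
    return (total / 100) * pressure;
}

/*
 * Split total into quotient and remainder so the remainder still
 * contributes, without multiplying the full total first.
 */
static unsigned long scale_fixed(unsigned long total, unsigned long pressure)
{
    unsigned long quot = total / 100;
    unsigned long rem = total % 100;

    return quot * pressure + (rem * pressure) / 100;
}
```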