1. 14 Jun, 2013 1 commit
    • S
      GFS2: Add atomic_open support · 6d4ade98
      Steven Whitehouse committed
      I've restricted atomic_open to only operate on regular files, although
      I still don't understand why atomic_open should not be possible also for
      directories on GFS2. That can always be added in later though, if it
      makes sense.
      
      The ->atomic_open function can be passed negative dentries, which
      in most cases means either ENOENT (->lookup) or a call to d_instantiate
      (->create). In the GFS2 case though, we need to actually perform the
      lookup, since we do not know whether there has been a new inode created
      on another node. The lookup calls d_splice_alias which then tries to
      rehash the dentry - so the solution here is to simply check for that
      in d_splice_alias. The same issue is likely to affect any other cluster
      filesystem implementing ->atomic_open.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "J. Bruce Fields" <bfields fieldses org>
      Cc: Jeff Layton <jlayton@redhat.com>
      6d4ade98
  2. 03 Jun, 2013 1 commit
  3. 08 May, 2013 1 commit
  4. 04 Apr, 2013 1 commit
  5. 23 Feb, 2013 1 commit
  6. 22 Feb, 2013 1 commit
    • D
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong committed
      Create a helper function to check if a backing device requires stable
      page writes and, if so, perform the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
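The helper described in this commit message can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct backing_dev`, `struct page_model`, `wait_on_writeback_model()` and `wait_for_stable_page_sketch()` are hypothetical stand-ins invented for illustration, not the kernel's types or functions, and "waiting" is simulated by flipping a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backing device; not the kernel structure. */
struct backing_dev {
    bool requires_stable_writes;  /* device needs pages unchanged during I/O */
};

/* Hypothetical stand-in for a page that may be under writeback. */
struct page_model {
    const struct backing_dev *bdi;
    bool under_writeback;
    int waits;                    /* how many times a writer had to block */
};

/* Simulates blocking until writeback of the page has finished. */
static void wait_on_writeback_model(struct page_model *pg)
{
    if (pg->under_writeback) {
        pg->waits++;
        pg->under_writeback = false;  /* pretend the I/O completed */
    }
}

/* The helper idea: wait only when the backing device asks for stable pages. */
static void wait_for_stable_page_sketch(struct page_model *pg)
{
    assert(pg != NULL && pg->bdi != NULL);
    if (pg->bdi->requires_stable_writes)
        wait_on_writeback_model(pg);
}
```

Every path that makes a page writable would call the one helper, so a device that does not need stable pages never causes a wait.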
  7. 29 Jan, 2013 1 commit
    • S
      GFS2: Split gfs2_trans_add_bh() into two · 350a9b0a
      Steven Whitehouse committed
      There is little common content in gfs2_trans_add_bh() between the data
      and meta classes by the time that the functions which it calls are
      taken into account. The intent here is to split this into two
      separate functions. Stage one is to introduce gfs2_trans_add_data()
      and gfs2_trans_add_meta() and update the callers accordingly.
      
      Later patches will then pull in the content of gfs2_trans_add_bh()
      and its dependent functions in order to clean up the code in this
      area.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      350a9b0a
  8. 18 Dec, 2012 1 commit
  9. 07 Nov, 2012 3 commits
    • S
      GFS2: Add Orlov allocator · 9dbe9610
      Steven Whitehouse committed
      Just like ext3, this works on the root directory and any directory
      with the +T flag set. Also, just like ext3, any subdirectory created
      in one of the just mentioned cases will be allocated to a random
      resource group (GFS2 equivalent of a block group).
      
      If you are creating a set of directories, each of which will contain a
      job running on a different node, then by setting +T on the parent
      directory before creating the subdirectories, each will end up in a
      different resource group, and thus resource group contention between
      nodes will be kept to a minimum.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      9dbe9610
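The placement rule described in this commit message can be sketched as a tiny decision function. This is an illustrative model only: `struct dir_model` and `pick_rgrp_for_subdir()` are hypothetical names, the random value is passed in by the caller so the result is reproducible, and a real Orlov-style allocator weighs more factors than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a directory; not a GFS2 structure. */
struct dir_model {
    bool is_root;
    bool topdir;        /* the +T flag from the commit message */
    unsigned rgrp;      /* resource group that holds this directory */
};

/*
 * Choose the resource group for a new subdirectory of `parent`.
 * Under the root or a +T directory the choice is spread out using `rnd`;
 * otherwise the subdirectory stays in the parent's resource group.
 */
static unsigned pick_rgrp_for_subdir(const struct dir_model *parent,
                                     unsigned nr_rgrps, unsigned rnd)
{
    assert(nr_rgrps > 0);
    if (parent->is_root || parent->topdir)
        return rnd % nr_rgrps;
    return parent->rgrp;
}
```

With +T set on a parent, per-node job directories created under it land in different resource groups with high probability, which is the contention-avoidance effect the message describes.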
    • B
      GFS2: Don't call file_accessed() with a shared glock · 3d162688
      Benjamin Marzinski committed
      file_accessed() was being called by gfs2_mmap() with a shared glock. If it
      needed to update the atime, it was crashing because it dirtied the inode in
      gfs2_dirty_inode() without holding an exclusive lock. gfs2_dirty_inode()
      checked if the caller was already holding a glock, but it didn't make sure that
      the glock was in the exclusive state. Now, instead of calling file_accessed()
      while holding the shared lock in gfs2_mmap(), file_accessed() is called after
      grabbing and releasing the glock to update the inode.  If file_accessed() needs
      to update the atime, it will grab an exclusive lock in gfs2_dirty_inode().
      
      gfs2_dirty_inode() now also checks to make sure that if the calling process has
      already locked the glock, it has an exclusive lock.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      3d162688
    • A
      GFS2: Clean up some unused assignments · 73738a77
      Andrew Price committed
      Cleans up two cases where variables were assigned values but then never
      used again.
      Signed-off-by: Andrew Price <anprice@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      73738a77
  10. 09 Oct, 2012 1 commit
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov committed
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support;
      a filesystem that uses filemap_fault() can use generic_file_remap_pages().
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
  11. 24 Sep, 2012 1 commit
    • S
      GFS2: Remove rs_requested field from reservations · 71f890f7
      Steven Whitehouse committed
      The rs_requested field is left over from the original allocation
      code, however this should have been a parameter passed to the
      various functions from gfs2_inplace_reserve() and not a member of the
      reservation structure as the value is not required after the
      initial allocation.
      
      This also helps simplify the code since we no longer need to set
      rs_requested to zero. The gfs2_inplace_release() function can be
      simplified as well, since the reservation structure will always be
      defined when it is called, and the only remaining task is to unlock
      the rgrp if required. It can now also be called unconditionally,
      resulting in a further simplification.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      71f890f7
  12. 13 Sep, 2012 1 commit
  13. 31 Jul, 2012 2 commits
  14. 21 Jul, 2012 1 commit
  15. 19 Jul, 2012 1 commit
    • B
      GFS2: Reduce file fragmentation · 8e2e0047
      Bob Peterson committed
      This patch reduces GFS2 file fragmentation by pre-reserving blocks. The
      resulting improved on disk layout greatly speeds up operations in cases
      which would have resulted in interlaced allocation of blocks previously.
      A typical example of this is 10 parallel dd processes, each writing to a
      file in a common directory.
      
      The implementation uses an rbtree of reservations attached to each
      resource group (and each inode).
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      8e2e0047
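The effect of pre-reserving blocks can be seen in a toy model of two writers appending to different files in turn. This sketch is an assumption-laden illustration, not GFS2's implementation: it uses a trivial bump allocator instead of an rbtree of reservations, and `struct allocator_model`, `struct writer_model`, `write_one_block()` and `count_extents()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BLOCKS 64

/* Hypothetical writer: remembers its reservation and the blocks it received. */
struct writer_model {
    unsigned long res_start;          /* next block inside the reservation */
    unsigned long res_left;           /* blocks still reserved */
    unsigned long blocks[MAX_BLOCKS]; /* block numbers in write order */
    size_t nblocks;
};

/* Hypothetical allocator: hands out blocks in increasing order. */
struct allocator_model {
    unsigned long next_free;
    unsigned long res_size;           /* blocks to pre-reserve; 0 disables */
};

/* Allocate one block for `w`, taking it from its reservation when enabled. */
static void write_one_block(struct allocator_model *al, struct writer_model *w)
{
    unsigned long blk;

    assert(w->nblocks < MAX_BLOCKS);
    if (al->res_size == 0) {
        blk = al->next_free++;        /* no reservation: next free block */
    } else {
        if (w->res_left == 0) {       /* reservation used up: take a new run */
            w->res_start = al->next_free;
            w->res_left = al->res_size;
            al->next_free += al->res_size;
        }
        blk = w->res_start++;
        w->res_left--;
    }
    w->blocks[w->nblocks++] = blk;
}

/* Count contiguous runs (extents) in the blocks a writer received. */
static size_t count_extents(const struct writer_model *w)
{
    size_t i, extents = w->nblocks ? 1 : 0;

    for (i = 1; i < w->nblocks; i++)
        if (w->blocks[i] != w->blocks[i - 1] + 1)
            extents++;
    return extents;
}
```

When two writers alternate single-block writes, the no-reservation case gives each file one extent per block, while a reservation of four blocks gives each file one extent per four blocks.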
  16. 06 Jun, 2012 3 commits
    • S
      GFS2: Add "top dir" flag support · 23d0bb83
      Steven Whitehouse committed
      This patch adds support for the "top dir" flag. Currently this is unused
      but a subsequent patch is planned which will add support for the
      Orlov allocation policy when allocating subdirectories in a parent
      with this flag set.
      
      In order to ensure backward compatible behaviour, mkfs.gfs2 does
      not currently tag the root directory with this flag; it must always be
      set manually.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      23d0bb83
    • B
      GFS2: Fold quota data into the reservations struct · 5407e242
      Bob Peterson committed
      This patch moves the ancillary quota data structures into the
      block reservations structure. This saves GFS2 some time and
      effort in allocating and deallocating the qadata structure.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      5407e242
    • B
      GFS2: Extend the life of the reservations · 0a305e49
      Bob Peterson committed
      This patch lengthens the lifespan of the reservations structure for
      inodes. Before, they were allocated and deallocated for every write
      operation. With this patch, they are allocated when the first write
      occurs, and deallocated when the last process closes the file.
      It's more efficient to do it this way because it saves GFS2 a lot of
      unnecessary allocates and frees. It also gives us more flexibility
      for the future: (1) we can now fold the qadata structure back into
      the structure and save those alloc/frees, (2) we can use this for
      multi-block reservations.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      0a305e49
  17. 24 Apr, 2012 1 commit
  18. 01 Apr, 2012 1 commit
  19. 09 Mar, 2012 1 commit
    • B
      GFS2: call gfs2_write_alloc_required for each chunk · 58a7d5fb
      Benjamin Marzinski committed
      gfs2_fallocate was calling gfs2_write_alloc_required() once at the start of
      the function. This caused problems since gfs2_write_alloc_required used a
      long unsigned int for the len, but gfs2_fallocate could allocate a much
      larger amount.  This patch will move the call into the loop where the
      chunks are actually allocated and zeroed out. This will keep the allocation
      size under the limit, and also allow gfs2_fallocate to quickly skip over
      sections of the file that are already completely allocated.
      
      fallocate_chunk was also not correctly setting the file size.  It was using the
      len variable to find the last block written to, but by the time it was setting
      the size, the len variable had already been decremented to 0.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      58a7d5fb
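Both points in this commit message, the per-chunk check and the file size computed from a length that has already reached zero, can be illustrated with a short userspace model. This is a sketch under assumptions: `struct falloc_file`, `CHUNK_MAX` and `fallocate_sketch()` are hypothetical, the chunk limit is arbitrary, and the counter merely stands in for the per-chunk "is allocation required?" check.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_MAX 4096u           /* hypothetical per-chunk size limit */

/* Hypothetical file model; not a GFS2 structure. */
struct falloc_file {
    uint64_t size;                /* current file size in bytes */
    unsigned alloc_checks;        /* number of per-chunk checks performed */
};

/* Allocate [offset, offset + len) in bounded chunks, then fix up the size. */
static void fallocate_sketch(struct falloc_file *f, uint64_t offset, uint64_t len)
{
    /* Save the end position first: the loop below consumes len. */
    const uint64_t end = offset + len;

    assert(end >= offset);        /* reject ranges that wrap around */
    while (len > 0) {
        uint64_t chunk = len < CHUNK_MAX ? len : CHUNK_MAX;

        f->alloc_checks++;        /* check runs once per chunk, inside the loop */
        offset += chunk;
        len -= chunk;
    }
    /* Using len here would be wrong: it is 0 by now. */
    if (end > f->size)
        f->size = end;
}
```

Calling `fallocate_sketch(&f, 100, 10000)` on an empty file walks three chunks and leaves the size at 10100, because the end position was captured before the loop ran.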
  20. 29 Feb, 2012 1 commit
    • S
      GFS2: FITRIM ioctl support · 66fc061b
      Steven Whitehouse committed
      The FITRIM ioctl provides an alternative way to send discard requests to
      the underlying device. Using the discard mount option results in every
      freed block generating a discard request to the block device. This can
      be slow, since many block devices can only process discard requests of
      larger sizes, and also such operations can be time consuming.
      
      Rather than using the discard mount option, FITRIM allows a sweep of the
      filesystem on an occasional basis, and also to optionally avoid sending
      down discard requests for smaller regions.
      
      In GFS2 FITRIM will work at resource group granularity. There is a flag
      for each resource group which keeps track of which resource groups have
      been trimmed. This flag is reset whenever a deallocation occurs in the
      resource group, and set whenever a successful FITRIM of that resource
      group has taken place. This helps to reduce repeated discard requests
      for the same block ranges, again improving performance.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      66fc061b
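The per-resource-group "trimmed" flag described in this commit message behaves roughly like the model below. It is a sketch only: `struct rgrp_trim_model`, `rgrp_dealloc_model()` and `fitrim_sweep_model()` are hypothetical names, and issuing a discard is simulated by incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-resource-group state; not a GFS2 structure. */
struct rgrp_trim_model {
    bool trimmed;          /* set after a successful trim, cleared on dealloc */
    unsigned discards;     /* discard passes issued for this group */
};

/* Any deallocation in the group makes it eligible for trimming again. */
static void rgrp_dealloc_model(struct rgrp_trim_model *rg)
{
    rg->trimmed = false;
}

/* Sweep all groups, skipping those already trimmed; returns groups trimmed. */
static size_t fitrim_sweep_model(struct rgrp_trim_model *rgs, size_t n)
{
    size_t i, done = 0;

    assert(rgs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (rgs[i].trimmed)
            continue;      /* nothing freed since the last trim */
        rgs[i].discards++; /* stand-in for sending discard requests */
        rgs[i].trimmed = true;
        done++;
    }
    return done;
}
```

A second sweep with no intervening deallocation does no work, which is how the flag avoids repeated discard requests for the same block ranges.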
  21. 28 Feb, 2012 2 commits
    • S
      GFS2: Read resource groups on mount · a365fbf3
      Steven Whitehouse committed
      This makes mount take slightly longer, but at the same time, the first
      write to the filesystem will be faster too. It also means that if there
      is a problem in the resource index, then we can refuse to mount rather
      than having to try and report that when the first write occurs.
      
      In addition, to avoid recursive locking, we have to take account of
      instances when the rindex glock may already be held when we are
      trying to update the rbtree of resource groups.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      a365fbf3
    • B
      GFS2: Ensure rindex is uptodate for fallocate · 9e73f571
      Bob Peterson committed
      This patch fixes a problem whereby gfs2_grow was failing and causing GFS2
      to assert. The problem was that when GFS2's fallocate operation tried to
      acquire an "allocation" it made sure the rindex was up to date, and if not,
      it called gfs2_rindex_update. However, if the file being fallocated was
      the rindex itself, it was already locked at that point. By calling
      gfs2_rindex_update at an earlier point in time, we bring rindex up to date
      and thereby avoid trying to lock it when the "allocation" is acquired.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      9e73f571
  22. 04 Jan, 2012 2 commits
  23. 22 Nov, 2011 1 commit
  24. 21 Nov, 2011 1 commit
    • S
      GFS2: O_(D)SYNC support for fallocate · 4442f2e0
      Steven Whitehouse committed
      Add sync of metadata after fallocate for O_SYNC files to ensure that we
      meet expectations for everything being on disk in this case.
      Unfortunately, the offset and len parameters are modified during the
      course of the fallocate function, so I've had to add a couple of new
      variables to call generic_write_sync() at the end.
      
      I know that potentially this will sync data as well within the range,
      but I think that is a fairly harmless side-effect overall, since we
      would not normally expect there to be any dirty data within the range in
      question.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Benjamin Marzinski <bmarzins@redhat.com>
      4442f2e0
  25. 08 Nov, 2011 2 commits
  26. 28 Oct, 2011 1 commit
    • A
      vfs: do (nearly) lockless generic_file_llseek · ef3d0fd2
      Andi Kleen committed
      The i_mutex lock use of generic_file_llseek hurts.  Independent processes
      accessing the same file synchronize over a single lock, even though
      they have no need for synchronization at all.
      
      Under high utilization this can cause llseek to scale very poorly on larger
      systems.
      
      This patch does some rethinking of the llseek locking model:
      
      First the 64bit f_pos is not necessarily atomic without locks
      on 32bit systems. This can already cause races with read() today.
      This was discussed on linux-kernel in the past and deemed acceptable.
      The patch does not change that.
      
      Let's look at the different seek variants:
      
      SEEK_SET: Doesn't really need any locking.
      If there's a race one writer wins, the other loses.
      
      For 32bit the non atomic update races against read()
      stay the same. Without a lock they can also happen
      against write() now.  The read() race was deemed
      acceptable in past discussions, and I think if it's
      ok for read it's ok for write too.
      
      => Don't need a lock.
      
      SEEK_END: This behaves like SEEK_SET plus it reads
      the maximum size too. Reading the maximum size would have the
      32bit atomic problem. But luckily we already have a way to read
      the maximum size without locking (i_size_read), so we
      can just use that instead.
      
      Without i_mutex there is no synchronization with write() anymore,
      however since the write() update is atomic on 64bit it just behaves
      like another racy SEEK_SET.  On non atomic 32bit it's the same
      as SEEK_SET.
      
      => Don't need a lock, but need to use i_size_read()
      
      SEEK_CUR: This has a read-modify-write race window
      on the same file. One could argue that any application
      doing unsynchronized seeks on the same file is already broken.
      But for the sake of not adding a regression here I'm
      using the file->f_lock to synchronize this. Using this
      lock is much better than the inode mutex because it doesn't
      synchronize between processes.
      
      => So still need a lock, but can use a f_lock.
      
      This patch implements this new scheme in generic_file_llseek.
      I dropped generic_file_llseek_unlocked and changed all callers.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      ef3d0fd2
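The three seek variants discussed in this commit message can be sketched in userspace C as follows. This is a simplified model, not the kernel code: `struct seek_file` and `llseek_sketch()` are hypothetical, the per-file lock is modelled with a C11 `atomic_flag` spinlock, the plain read of `size` stands in for i_size_read(), and errors are reported as -1 rather than kernel error codes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>      /* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical open-file model; not the kernel's struct file. */
struct seek_file {
    int64_t pos;        /* current file position */
    int64_t size;       /* file size, read without a lock in this model */
    atomic_flag f_lock; /* per-file lock, used only for SEEK_CUR */
};

static int64_t llseek_sketch(struct seek_file *f, int64_t offset, int whence)
{
    int64_t newpos;

    assert(f != NULL);
    switch (whence) {
    case SEEK_SET:                       /* no lock: last writer wins */
        newpos = offset;
        break;
    case SEEK_END:                       /* no lock: one read of the size */
        newpos = f->size + offset;
        break;
    case SEEK_CUR:                       /* read-modify-write: take the file lock */
        while (atomic_flag_test_and_set(&f->f_lock))
            ;                            /* spin until the lock is free */
        newpos = f->pos + offset;
        if (newpos >= 0)
            f->pos = newpos;
        atomic_flag_clear(&f->f_lock);
        return newpos < 0 ? -1 : newpos;
    default:
        return -1;
    }
    if (newpos < 0)
        return -1;
    f->pos = newpos;
    return newpos;
}
```

Only the SEEK_CUR path serializes, and it does so on a lock private to the open file, so unrelated processes seeking on the same inode do not contend with one another.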
  27. 21 Oct, 2011 6 commits
    • B
      GFS2: rewrite fallocate code to write blocks directly · 64dd153c
      Benjamin Marzinski committed
      GFS2's fallocate code currently goes through the page cache. Since it's only
      writing to the end of the file or to holes in it, it doesn't need to, and it
      was causing issues on low memory environments. This patch pulls in some of
      Steve's block allocation work, and uses it to simply allocate the blocks for
      the file, and zero them out at allocation time.  It provides a slight
      performance increase, and it dramatically simplifies the code.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      64dd153c
    • S
      GFS2: Clean up ->page_mkwrite · 13d921e3
      Steven Whitehouse committed
      This patch brings gfs2's ->page_mkwrite uptodate with respect to the
      expectations set by the VM. Also added is a check to wait if the fs
      is frozen, before we attempt to get a glock. This will only work on
      the node which initiates the freeze, but that's ok since the transaction
      lock will still provide the expected barrier on other nodes.
      
      The major change here is that we return a locked page now, except when
      we don't return a page at all (error cases). This removes the race
      which required rechecking the page after it was returned.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      13d921e3
    • S
      GFS2: Fix AIL flush issue during fsync · b5b24d7a
      Steven Whitehouse committed
      Unfortunately, it is not enough to just ignore locked buffers during
      the AIL flush from fsync. We need to be able to ignore all buffers
      which are locked, dirty or pinned at this stage as they might have
      been added subsequent to the log flush earlier in the fsync function.
      
      In addition, this means that we no longer need to rely on i_mutex to
      keep out writes during fsync, so we can, as a side-effect, remove
      that protection too.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Tested-By: Abhijith Das <adas@redhat.com>
      b5b24d7a
    • S
      GFS2: Cache the most recently used resource group in the inode · 54335b1f
      Steven Whitehouse committed
      This means that after the initial allocation for any inode, the
      last used resource group is cached in the inode for future use.
      This drastically reduces the number of lookups of resource
      groups in the common case, and thus the contention on that
      data structure.
      
      The allocation algorithm is the same as previously, except that we
      always check to see if the goal block is within the cached rgrp
      first before going to the rbtree to look one up.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      54335b1f
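The caching scheme described in this commit message amounts to checking a per-inode pointer before falling back to the shared lookup structure. The sketch below assumes a linear scan as a stand-in for the rbtree search; `struct rgrp_range`, `struct fs_model`, `struct inode_model`, `lookup_rgrp()` and `rgrp_for_goal()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical resource group: a contiguous range of block numbers. */
struct rgrp_range {
    unsigned long start, len;
};

/* Hypothetical filesystem-wide table of resource groups. */
struct fs_model {
    const struct rgrp_range *rgrps;
    size_t nr;
    unsigned long lookups;        /* counts trips to the shared structure */
};

/* Hypothetical inode carrying the most recently used resource group. */
struct inode_model {
    const struct rgrp_range *cached;
};

static int rgrp_contains(const struct rgrp_range *rg, unsigned long blk)
{
    return blk >= rg->start && blk - rg->start < rg->len;
}

/* Slow path: search the shared structure (linear scan in this model). */
static const struct rgrp_range *lookup_rgrp(struct fs_model *fs, unsigned long blk)
{
    size_t i;

    fs->lookups++;
    for (i = 0; i < fs->nr; i++)
        if (rgrp_contains(&fs->rgrps[i], blk))
            return &fs->rgrps[i];
    return NULL;
}

/* Fast path first: reuse the cached group when the goal block lies inside it. */
static const struct rgrp_range *rgrp_for_goal(struct fs_model *fs,
                                              struct inode_model *ip,
                                              unsigned long goal)
{
    assert(fs != NULL && ip != NULL);
    if (ip->cached && rgrp_contains(ip->cached, goal))
        return ip->cached;
    ip->cached = lookup_rgrp(fs, goal);
    return ip->cached;
}
```

Repeated allocations whose goal block stays within one resource group hit the shared structure only once, which is where the reduced contention comes from.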
    • S
      GFS2: Fix lseek after SEEK_DATA, SEEK_HOLE have been added · 9453615a
      Steven Whitehouse committed
      We need to take the inode's glock whenever the inode's size
      is referenced, otherwise it might not be uptodate. Even
      though generic_file_llseek_unlocked() doesn't implement
      SEEK_DATA, SEEK_HOLE directly, it does reference the inode's
      size in those cases, so we need to add them to the list
      of origins which need the glock.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      9453615a
    • S
      GFS2: Use ->dirty_inode() · ab9bbda0
      Steven Whitehouse committed
      The aim of this patch is to use the newly enhanced ->dirty_inode()
      super block operation to deal with atime updates, rather than
      piggy backing that code into ->write_inode() as is currently
      done.
      
      The net result is a simplification of the code in various places
      and a reduction of the number of gfs2_dinode_out() calls since
      this is now implied by ->dirty_inode().
      
      Some of the mark_inode_dirty() calls have been moved under glocks
      in order to take advantage of then being able to avoid locking in
      ->dirty_inode() when we already have suitable locks.
      
      One consequence is that generic_write_end() now correctly deals
      with file size updates, so that we do not need a separate check
      for that afterwards. This also, indirectly, means that fdatasync
      should work correctly on GFS2 - the current code always syncs the
      metadata whether it needs to or not.
      
      Has survived testing with postmark (with and without atime) and
      also fsx.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      ab9bbda0