1. 09 8月, 2017 6 次提交
  2. 21 7月, 2017 4 次提交
    • B
      GFS2: Set gl_object in inode lookup only after block type check · 4d7c18c7
      Bob Peterson 提交于
      Before this patch, the inode glock's gl_object was set after a
      reference was acquired, but before the block type was verified.
      In cases where the block was unlinked, then freed and reused on
      another node, a residule delete callback (delete_work) would try
      to look up the inode, eventually failing the block check, but
      only after it overwrites gl_object with a pointer to the wrong
      inode. This patch moves the assignment of gl_object after the
      block check so it won't be improperly overwritten.
      
      Likewise, at the end of the function, gfs2_inode_lookup was
      clearing gl_object after it unlocked the glock, which meant
      another process might free the glock in the meantime. This
      patch guards against that case.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
      4d7c18c7
    • B
      GFS2: Introduce helper for clearing gl_object · df3d87bd
      Bob Peterson 提交于
      This patch introduces a new helper function in glock.h that
      clears gl_object, with an added integrity check. An additional
      integrity check has been added to glock_set_object, plus comments.
      This is step 1 in a series to ensure gl_object integrity.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
      df3d87bd
    • C
      gfs2: add flag REQ_PRIO for metadata I/O · e477b24b
      Coly Li 提交于
      When gfs2 does metadata I/O, only REQ_META is used as a metadata hint of
      the bio. But flag REQ_META is just a hint for block trace, not for block
      layer code to handle a bio as metadata request.
      
      For some of metadata I/Os of gfs2, A REQ_PRIO flag on the metadata bio
      would be very informative to block layer code. For example, if bcache is
      used as a I/O cache for gfs2, it will be possible for bcache code to get
      the hint and cache the pre-fetched metadata blocks on cache device. This
      behavior may be helpful to improve metadata I/O performance if the
      following requests hit the cache.
      
      Here are the locations in gfs2 code where a REQ_PRIO flag should be added,
      - All places where REQ_READAHEAD is used, gfs2 code uses this flag for
        metadata read ahead.
      - In gfs2_meta_rq() where the first metadata block is read in.
      - In gfs2_write_buf_to_page(), read in quota metadata blocks to have them
        up to date.
      These metadata blocks are probably to be accessed again in future, adding
      a REQ_PRIO flag may have bcache to keep such metadata in fast cache
      device. For system without a cache layer, REQ_PRIO can still provide hint
      to block layer to handle metadata requests more properly.
      Signed-off-by: NColy Li <colyli@suse.de>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      e477b24b
    • W
      GFS2: fix code parameter error in inode_go_lock · e7cb550d
      Wang Xibo 提交于
      In inode_go_lock() function, the parameter order of list_add() is error.
      According to the define of list_add(), the first parameter is new entry
      and the second is the list head, so ip->i_trunc_list should be the
      first parameter and the sdp->sd_trunc_list should be second.
      
      Signed-off-by: Wang Xibo<wang.xibo@zte.com.cn>
      Signed-off-by: Xiao Likun<xiao.likun@zte.com.cn>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      e7cb550d
  3. 20 7月, 2017 1 次提交
  4. 19 7月, 2017 1 次提交
    • J
      gfs2: Don't clear SGID when inheriting ACLs · 914cea93
      Jan Kara 提交于
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __gfs2_set_acl() into gfs2_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      914cea93
  5. 18 7月, 2017 1 次提交
  6. 17 7月, 2017 1 次提交
    • B
      GFS2: Prevent double brelse in gfs2_meta_indirect_buffer · 61eaadcd
      Bob Peterson 提交于
      Before this patch, problems reading in indirect buffers would send
      an IO error back to the caller, and release the buffer_head with
      brelse() in function gfs2_meta_indirect_buffer, however, it would
      still return the address of the buffer_head it released. After the
      error was discovered, function gfs2_block_map would call function
      release_metapath to free all buffers. That checked:
      if (mp->mp_bh[i] == NULL) but since the value was set after the
      error, it was non-zero, so brelse was called a second time. This
      resulted in the following error:
      
      kernel: WARNING: at fs/buffer.c:1224 __brelse+0x3a/0x40() (Tainted: G        W  -- ------------   )
      kernel: Hardware name: RHEV Hypervisor
      kernel: VFS: brelse: Trying to free free buffer
      
      This patch changes gfs2_meta_indirect_buffer so it only sets
      the buffer_head pointer in cases where it isn't released.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      61eaadcd
  7. 08 7月, 2017 1 次提交
  8. 06 7月, 2017 1 次提交
    • J
      buffer: set errors in mapping at the time that the error occurs · 87354e5d
      Jeff Layton 提交于
      I noticed on xfs that I could still sometimes get back an error on fsync
      on a fd that was opened after the error condition had been cleared.
      
      The problem is that the buffer code sets the write_io_error flag and
      then later checks that flag to set the error in the mapping. That flag
      perisists for quite a while however. If the file is later opened with
      O_TRUNC, the buffers will then be invalidated and the mapping's error
      set such that a subsequent fsync will return error. I think this is
      incorrect, as there was no writeback between the open and fsync.
      
      Add a new mark_buffer_write_io_error operation that sets the flag and
      the error in the mapping at the same time. Replace all calls to
      set_buffer_write_io_error with mark_buffer_write_io_error, and remove
      the places that check this flag in order to set the error in the
      mapping.
      
      This sets the error in the mapping earlier, at the time that it's first
      detected.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      87354e5d
  9. 05 7月, 2017 5 次提交
  10. 20 6月, 2017 1 次提交
  11. 13 6月, 2017 2 次提交
  12. 09 6月, 2017 2 次提交
  13. 05 6月, 2017 1 次提交
  14. 24 5月, 2017 1 次提交
    • J
      gfs2: Make flush bios explicitely sync · 0f0b9b63
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      Fixes: b685d3d6
      CC: Steven Whitehouse <swhiteho@redhat.com>
      CC: cluster-devel@redhat.com
      CC: stable@vger.kernel.org
      Acked-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      0f0b9b63
  15. 09 5月, 2017 1 次提交
  16. 06 5月, 2017 1 次提交
    • B
      GFS2: Allow glocks to be unlocked after withdraw · ed17545d
      Bob Peterson 提交于
      This bug fixes a regression introduced by patch 0d1c7ae9.
      
      The intent of the patch was to stop promoting glocks after a
      file system is withdrawn due to a variety of errors, because doing
      so results in a BUG(). (You should be able to unmount after a
      withdraw rather than having the kernel panic.)
      
      Unfortunately, it also stopped demotions, so glocks could not be
      unlocked after withdraw, which means the unmount would hang.
      
      This patch allows function do_xmote to demote locks to an
      unlocked state after a withdraw, but not promote them.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      ed17545d
  17. 21 4月, 2017 2 次提交
  18. 19 4月, 2017 1 次提交
    • B
      GFS2: Non-recursive delete · d552a2b9
      Bob Peterson 提交于
      Implement truncate/delete as a non-recursive algorithm. The older
      algorithm was implemented with recursion to strip off each layer
      at a time (going by height, starting with the maximum height.
      This version tries to do the same thing but without recursion,
      and without needing to allocate new structures or lists in memory.
      
      For example, say you want to truncate a very large file to 1 byte,
      and its end-of-file metapath is: 0.505.463.428. The starting
      metapath would be 0.0.0.0. Since it's a truncate to non-zero, it
      needs to preserve that byte, and all metadata pointing to it.
      So it would start at 0.0.0.0, look up all its metadata buffers,
      then free all data blocks pointed to at the highest level.
      After that buffer is "swept", it moves on to 0.0.0.1, then
      0.0.0.2, etc., reading in buffers and sweeping them clean.
      When it gets to the end of the 0.0.0 metadata buffer (for 4K
      blocks the last valid one is 0.0.0.508), it backs up to the
      previous height and starts working on 0.0.1.0, then 0.0.1.1,
      and so forth. After it reaches the end and sweeps 0.0.1.508,
      it continues with 0.0.2.0, and so on. When that height is
      exhausted, and it reaches 0.0.508.508 it backs up another level,
      to 0.1.0.0, then 0.1.0.1, through 0.1.0.508. So it has to keep
      marching backwards and forwards through the metadata until it's
      all swept clean. Once it has all the data blocks freed, it
      lowers the strip height, and begins the process all over again,
      but with one less height. This time it sweeps 0.0.0 through
      0.505.463. When that's clean, it lowers the strip height again
      and works to free 0.505. Eventually it strips the lowest height, 0.
      For a delete or truncate to 0, all metadata for all heights of
      0.0.0.0 would be freed. For a truncate to 1 byte, 0.0.0.0 would
      be preserved.
      
      This isn't much different from normal integer incrementing,
      where an integer gets incremented from 0000 (0.0.0.0) to 3021
      (3.0.2.1). So 0000 gets increments to 0001, 0002, up to 0009,
      then on to 0010, 0011 up to 0099, then 0100 and so forth. It's
      just that each "digit" goes from 0 to 508 (for a total of 509
      pointers) rather than from 0 to 9.
      
      Note that the dinode will only have 483 pointers due to the
      dinode structure itself.
      
      Also note: this is just an example. These numbers (509 and 483)
      are based on a standard 4K block size. Smaller block sizes will
      yield smaller numbers of indirect pointers accordingly.
      
      The truncation process is accomplished with the help of two
      major functions and a few helper functions.
      
      Functions do_strip and recursive_scan are obsolete, so removed.
      
      New function sweep_bh_for_rgrps cleans a buffer_head pointed to
      by the given metapath and height. By cleaning, I mean it frees
      all blocks starting at the offset passed in metapath. It starts
      at the first block in the buffer pointed to by the metapath and
      identifies its resource group (rgrp). From there it frees all
      subsequent block pointers that lie within that rgrp. If it's
      already inside a transaction, it stays within it as long as it
      can. In other words, it doesn't close a transaction until it knows
      it's freed what it can from the resource group. In this way,
      multiple buffers may be cleaned in a single transaction, as long
      as those blocks in the buffer all lie within the same rgrp.
      
      If it's not in a transaction, it starts one. If the buffer_head
      has references to blocks within multiple rgrps, it frees all the
      blocks inside the first rgrp it finds, then closes the
      transaction. Then it repeats the cycle: identifies the next
      unfreed block, uses it to find its rgrp, then starts a new
      transaction for that set. It repeats this process repeatedly
      until the buffer_head contains no more references to any blocks
      past the given metapath.
      
      Function trunc_dealloc has been reworked into a finite state
      automaton. It has basically 3 active states:
      DEALLOC_MP_FULL, DEALLOC_MP_LOWER, and DEALLOC_FILL_MP:
      
      The DEALLOC_MP_FULL state implies the metapath has a full set
      of buffers out to the "shrink height", and therefore, it can
      call function sweep_bh_for_rgrps to free the blocks within the
      highest height of the metapath. If it's just swept the lowest
      level (or an error has occurred) the state machine is ended.
      Otherwise it proceeds to the DEALLOC_MP_LOWER state.
      
      The DEALLOC_MP_LOWER state implies we are finished with a given
      buffer_head, which may now be released, and therefore we are
      then missing some buffer information from the metapath. So we
      need to find more buffers to read in. In most cases, this is
      just a matter of releasing the buffer_head and moving to the
      next pointer from the previous height, so it may be read in and
      swept as well. If it can't find another non-null pointer to
      process, it checks whether it's reached the end of a height
      and needs to lower the strip height, or whether it still needs
      move forward through the previous height's metadata. In this
      state, all zero-pointers are skipped. From this state, it can
      only loop around (once more backing up another height) or,
      once a valid metapath is found (one that has non-zero
      pointers), proceed to state DEALLOC_FILL_MP.
      
      The DEALLOC_FILL_MP state implies that we have a metapath
      but not all its buffers are read in. So we must proceed to read
      in buffer_heads until the metapath has a valid buffer for every
      height. If the previous state backed us up 3 heights, we may
      need to read in a buffer, increment the height, then repeat the
      process until buffers have been read in for all required heights.
      If it's successful reading a buffer, and it's at the highest
      height we need, it proceeds back to the DEALLOC_MP_FULL state.
      If it's unable to fill in a buffer, (encounters a hole, etc.)
      it tries to find another non-zero block pointer. If they're all
      zero, it lowers the height and returns to the DEALLOC_MP_LOWER
      state. If it finds a good non-null pointer, it loops around and
      reads it in, while keeping the metapath in lock-step with the
      pointers it examines.
      
      The state machine runs until the truncation request is
      satisfied. Then any transactions are ended, the quota and
      statfs data are updated, and the function is complete.
      
      Helper function metaptr1 was introduced to be an easy way to
      determine the start of a buffer_head's indirect pointers.
      
      Helper function lookup_mp_height was introduced to find a
      metapath index and read in the buffer that corresponds to it.
      In this way, function lookup_metapath becomes a simple loop to
      call it for every height.
      
      Helper function fillup_metapath is similar to lookup_metapath
      except it can do partial lookups. If the state machine
      backed up multiple levels (like 2999 wrapping to 3000) it
      needs to find out the next starting point and start issuing
      metadata reads at that point.
      
      Helper function hptrs is a shortcut to determine how many
      pointers should be expected in a buffer. Height 0 is the dinode
      which has fewer pointers than the others.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      d552a2b9
  19. 05 4月, 2017 1 次提交
  20. 03 4月, 2017 2 次提交
  21. 17 3月, 2017 1 次提交
    • B
      GFS2: Temporarily zero i_no_addr when creating a dinode · cc963a11
      Bob Peterson 提交于
      Before this patch i_no_addr was not initialized until after the
      return from allocating its block. That meant the i_no_addr was
      temporarily uninitialized storage. Ordinarily that's not a concern,
      but if inplace_reserve can't find space, it can call try_rgrp_unlink
      which references i_no_addr as a block to avoid. That can result in
      unpredictable behavior. More importantly, the trace point in
      gfs2_alloc_blocks references ip->i_no_addr before it is set, which
      is misleading when reading the kernel traces. This patch makes it
      look like the new dinode block was assigned in the name of inode 0
      rather than a random inode that's completely unrelated.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      cc963a11
  22. 16 3月, 2017 3 次提交