1. 05 9月, 2019 1 次提交
    • B
      gfs2: Use async glocks for rename · ad26967b
      Bob Peterson 提交于
      Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
      reverse the roles of which directories are "old" and which are "new" for
      the purposes of rename. This can cause deadlocks where two nodes end up
      waiting for each other.
      
      There can be several layers of directory dependencies across many nodes.
      
      This patch fixes the problem by acquiring all gfs2_rename's inode glocks
      asychronously and waiting for all glocks to be acquired.  That way all
      inodes are locked regardless of the order.
      
      The timeout value for multiple asynchronous glocks is calculated to be
      the total of the individual wait times for each glock times two.
      
      Since gfs2_exchange is very similar to gfs2_rename, both functions are
      patched in the same way.
      
      A new async glock wait queue, sd_async_glock_wait, keeps a list of
      waiters for these events. If gfs2's holder_wake function detects an
      async holder, it wakes up any waiters for the event. The waiter only
      tests whether any of its requests are still pending.
      
      Since the glocks are sent to dlm asychronously, the wait function needs
      to check to see which glocks, if any, were granted.
      
      If a glock is granted by dlm (and therefore held), its minimum hold time
      is checked and adjusted as necessary, as other glock grants do.
      
      If the event times out, all glocks held thus far must be dequeued to
      resolve any existing deadlocks.  Then, if there are any outstanding
      locking requests, we need to loop around and wait for dlm to respond to
      those requests too.  After we release all requests, we return -ESTALE to
      the caller (vfs rename) which loops around and retries the request.
      
          Node1           Node2
          ---------       ---------
      1.  Enqueue A       Enqueue B
      2.  Enqueue B       Enqueue A
      3.  A granted
      6.                  B granted
      7.  Wait for B
      8.                  Wait for A
      9.                  A times out (since Node 1 holds A)
      10.                 Dequeue B (since it was granted)
      11.                 Wait for all requests from DLM
      12. B Granted (since Node2 released it in step 10)
      13. Rename
      14. Dequeue A
      15.                 DLM Grants A
      16.                 Dequeue A (due to the timeout and since we
                          no longer have B held for our task).
      17. Dequeue B
      18.                 Return -ESTALE to vfs
      19.                 VFS retries the operation, goto step 1.
      
      This release-all-locks / acquire-all-locks may slow rename / exchange
      down as both nodes struggle in the same way and do the same thing.
      However, this will only happen when there is contention for the same
      inodes, which ought to be rare.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      ad26967b
  2. 28 6月, 2019 3 次提交
    • B
      gfs2: dump fsid when dumping glock problems · 3792ce97
      Bob Peterson 提交于
      Before this patch, if a glock error was encountered, the glock with
      the problem was dumped. But sometimes you may have lots of file systems
      mounted, and that doesn't tell you which file system it was for.
      
      This patch adds a new boolean parameter fsid to the dump_glock family
      of functions. For non-error cases, such as dumping the glocks debugfs
      file, the fsid is not dumped in order to keep lock dumps and glocktop
      as clean as possible. For all error cases, such as GLOCK_BUG_ON, the
      file system id is now printed. This will make it easier to debug.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      3792ce97
    • B
      gfs2: Rename SDF_SHUTDOWN to SDF_WITHDRAWN · 04aea0ca
      Bob Peterson 提交于
      Before this patch, the superblock flag indicating when a file system
      is withdrawn was called SDF_SHUTDOWN. This patch simply renames it to
      the more obvious SDF_WITHDRAWN.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      04aea0ca
    • B
      gfs2: eliminate tr_num_revoke_rm · e955537e
      Bob Peterson 提交于
      For its journal processing, gfs2 kept track of the number of buffers
      added and removed on a per-transaction basis. These values are used
      to calculate space needed in the journal. But while these calculations
      make sense for the number of buffers, they make no sense for revokes.
      Revokes are managed in their own list, linked from the superblock.
      So it's entirely unnecessary to keep separate per-transaction counts
      for revokes added and removed. A single count will do the same job.
      Therefore, this patch combines the transaction revokes into a single
      count.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      e955537e
  3. 06 6月, 2019 1 次提交
  4. 05 6月, 2019 1 次提交
  5. 08 5月, 2019 4 次提交
    • A
      gfs2: fix race between gfs2_freeze_func and unmount · 8f918219
      Abhi Das 提交于
      As part of the freeze operation, gfs2_freeze_func() is left blocking
      on a request to hold the sd_freeze_gl in SH. This glock is held in EX
      by the gfs2_freeze() code.
      
      A subsequent call to gfs2_unfreeze() releases the EXclusively held
      sd_freeze_gl, which allows gfs2_freeze_func() to acquire it in SH and
      resume its operation.
      
      gfs2_unfreeze(), however, doesn't wait for gfs2_freeze_func() to complete.
      If a umount is issued right after unfreeze, it could result in an
      inconsistent filesystem because some journal data (statfs update) isn't
      written out.
      
      Refer to commit 24972557 for a more detailed explanation of how
      freeze/unfreeze work.
      
      This patch causes gfs2_unfreeze() to wait for gfs2_freeze_func() to
      complete before returning to the user.
      Signed-off-by: NAbhi Das <adas@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      8f918219
    • A
      gfs2: Rename sd_log_le_{revoke,ordered} · a5b1d3fc
      Andreas Gruenbacher 提交于
      Rename sd_log_le_revoke to sd_log_revokes and sd_log_le_ordered to
      sd_log_ordered: not sure what le stands for here, but it doesn't add
      clarity, and if it stands for list entry, it's actually confusing as
      those are both list heads but not list entries.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      a5b1d3fc
    • B
      gfs2: Replace gl_revokes with a GLF flag · 73118ca8
      Bob Peterson 提交于
      The gl_revokes value determines how many outstanding revokes a glock has
      on the superblock revokes list; this is used to avoid unnecessary log
      flushes.  However, gl_revokes is only ever tested for being zero, and it's
      only decremented in revoke_lo_after_commit, which removes all revokes
      from the list, so we know that the gl_revoke values of all the glocks on
      the list will reach zero.  Therefore, we can replace gl_revokes with a
      bit flag. This saves an atomic counter in struct gfs2_glock.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      73118ca8
    • B
      gfs2: clean_journal improperly set sd_log_flush_head · 7c70b896
      Bob Peterson 提交于
      This patch fixes regressions in 588bff95.
      Due to that patch, function clean_journal was setting the value of
      sd_log_flush_head, but that's only valid if it is replaying the node's
      own journal. If it's replaying another node's journal, that's completely
      wrong and will lead to multiple problems. This patch tries to clean up
      the mess by passing the value of the logical journal block number into
      gfs2_write_log_header so the function can treat non-owned journals
      generically. For the local journal, the journal extent map is used for
      best performance. For other nodes from other journals, new function
      gfs2_lblk_to_dblk is called to figure it out using gfs2_iomap_get.
      
      This patch also tries to establish more consistency when passing journal
      block parameters by changing several unsigned int types to a consistent
      u32.
      
      Fixes: 588bff95 ("GFS2: Reduce code redundancy writing log headers")
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      7c70b896
  6. 23 1月, 2019 1 次提交
  7. 12 12月, 2018 2 次提交
  8. 12 10月, 2018 2 次提交
  9. 05 10月, 2018 1 次提交
    • B
      gfs2: slow the deluge of io error messages · b524abcc
      Bob Peterson 提交于
      When an io error is hit, it calls gfs2_io_error_bh_i for every
      journal buffer it can't write. Since we changed gfs2_io_error_bh_i
      recently to withdraw later in the cycle, it sends a flood of
      errors to the console. This patch checks for the file system already
      being withdrawn, and if so, doesn't send more messages. It doesn't
      stop the flood of messages, but it slows it down and keeps it more
      reasonable.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b524abcc
  10. 07 8月, 2018 1 次提交
    • B
      gfs2: Fix gfs2_testbit to use clone bitmaps · dffe12a8
      Bob Peterson 提交于
      Function gfs2_testbit is called in three places. Two of those places,
      gfs2_alloc_extent and gfs2_unaligned_extlen, should be using the clone
      bitmaps, not the "real" bitmaps. Function gfs2_unaligned_extlen is used
      by the block reservations scheme to determine the length of an extent of
      free blocks. Before this patch, it wasn't using the clone bitmap, which
      means recently-freed blocks were treated as free blocks for the purposes
      of an allocation.
      
      This patch adds a new parameter to gfs2_testbit to indicate whether or
      not the clone bitmaps should be used (if available).
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      Reviewed-by: NAndreas Gruenbacher <agruenba@redhat.com>
      dffe12a8
  11. 05 7月, 2018 1 次提交
  12. 21 6月, 2018 1 次提交
  13. 04 6月, 2018 1 次提交
    • B
      GFS2: gfs2_free_extlen can return an extent that is too long · dc8fbb03
      Bob Peterson 提交于
      Function gfs2_free_extlen calculates the length of an extent of
      free blocks that may be reserved. The end pointer was calculated as
      end = start + bh->b_size but b_size is incorrect because the
      bitmap usually stops prior to the end of the buffer data on
      the last bitmap.
      
      What this means is that when you do a write, you can reserve a
      chunk of blocks that runs off the end of the last bitmap. For
      example, I've got a file system where there is only one bitmap
      for each rgrp, so ri_length==1. I saw cases in which iozone
      tried to do a big write, grabbed a large block reservation,
      chose rgrp 5464152, which has ri_data0 5464153 and ri_data 8188.
      So 5464153 + 8188 = 5472341 which is the end of the rgrp.
      
      When it grabbed a reservation it got back: 5470936, length 7229.
      But 5470936 + 7229 = 5478165. So the reservation starts inside
      the rgrp but runs 5824 blocks past the end of the bitmap.
      
      This patch fixes the calculation so it won't exceed the last
      bitmap. It also adds a BUG_ON to guard against overflows in the
      future.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      dc8fbb03
  14. 17 4月, 2018 1 次提交
    • A
      gfs2: Remove sdp->sd_jheightsize · 9a38662b
      Andreas Gruenbacher 提交于
      GFS2 keeps two arrarys in the superblock that define the maximum size of
      an inode depending on the inode's height: sdp->sd_heightsize defines the
      heights in units of sb->s_blocksize; sdp->sd_jheightsize defines them in
      units of sb->s_blocksize - sizeof(struct gfs2_meta_header).  These
      arrays are used to determine when additional layers of indirect blocks
      are needed.  The second array is used for directories which have an
      additional gfs2_meta_header at the beginning of each block.
      
      Distinguishing between these two cases makes no sense: the height
      required for representing N blocks will come out the same no matter if
      the calculation is done in gross (sb->s_blocksize) or net
      (sb->s_blocksize - sizeof(struct gfs2_meta_header)) units.
      
      Stuffed directories don't have an additional gfs2_meta_header, but the
      stuffed case is handled separately for both files and directories,
      anyway.
      
      Remove the unncessary sdp->sd_jheightsize array.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      9a38662b
  15. 29 3月, 2018 1 次提交
  16. 22 1月, 2018 1 次提交
  17. 19 1月, 2018 1 次提交
  18. 25 8月, 2017 2 次提交
    • A
      gfs2: Silence gcc format-truncation warning · 561b7969
      Andreas Gruenbacher 提交于
      Enlarge sd_fsname to be big enough for the longest long lock table name
      and an arbitrary journal number.  This silences two -Wformat-truncation
      warnings with gcc 7.1.1.
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      561b7969
    • B
      GFS2: Withdraw for IO errors writing to the journal or statfs · 942b0cdd
      Bob Peterson 提交于
      Before this patch, if GFS2 encountered IO errors while writing to
      the journal, it would not report the problem, so they would go
      unnoticed, sometimes for many hours. Sometimes this would only be
      noticed later, when recovery tried to do journal replay and failed
      due to invalid metadata at the blocks that resulted in IO errors.
      
      This patch makes GFS2's log daemon check for IO errors. If it
      encounters one, it withdraws from the file system and reports
      why in dmesg. A similar action is taken when IO errors occur when
      writing to the system statfs file.
      
      These errors are also reported back to any callers of fsync, since
      that requires the journal to be flushed. Therefore, any IO errors
      that would previously go unnoticed are now noticed and the file
      system is withdrawn as early as possible, thus preventing further
      file system damage.
      
      Also note that this reintroduces superblock variable sd_log_error,
      which Christoph removed with commit f729b66f.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      942b0cdd
  19. 10 8月, 2017 1 次提交
    • A
      gfs2: forcibly flush ail to relieve memory pressure · b066a4ee
      Abhi Das 提交于
      On systems with low memory, it is possible for gfs2 to infinitely
      loop in balance_dirty_pages() under heavy IO (creating sparse files).
      
      balance_dirty_pages() attempts to write out the dirty pages via
      gfs2_writepages() but none are found because these dirty pages are
      being used by the journaling code in the ail. Normally, the journal
      has an upper threshold which when hit triggers an automatic flush
      of the ail. But this threshold can be higher than the number of
      allowable dirty pages and result in the ail never being flushed.
      
      This patch forces an ail flush when gfs2_writepages() fails to write
      anything. This is a good indication that the ail might be holding
      some dirty pages.
      Signed-off-by: NAbhi Das <adas@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b066a4ee
  20. 08 7月, 2017 1 次提交
  21. 05 7月, 2017 2 次提交
  22. 20 6月, 2017 1 次提交
  23. 13 6月, 2017 1 次提交
  24. 09 6月, 2017 1 次提交
  25. 16 3月, 2017 1 次提交
  26. 15 3月, 2017 1 次提交
  27. 27 1月, 2017 1 次提交
  28. 06 1月, 2017 1 次提交
  29. 15 3月, 2016 1 次提交
  30. 15 12月, 2015 2 次提交
    • B
      gfs2: change gfs2 readdir cookie · 471f3db2
      Benjamin Marzinski 提交于
      gfs2 currently returns 31 bits of filename hash as a cookie that readdir
      uses for an offset into the directory.  When there are a large number of
      directory entries, the likelihood of a collision goes up way too
      quickly.  GFS2 will now return cookies that are guaranteed unique for a
      while, and then fail back to using 30 bits of filename hash.
      Specifically, the directory leaf blocks are divided up into chunks based
      on the minimum size of a gfs2 directory entry (48 bytes). Each entry's
      cookie is based off the chunk where it starts, in the linked list of
      leaf blocks that it hashes to (there are 131072 hash buckets). Directory
      entries will have unique names until they take reach chunk 8192.
      Assuming the largest filenames possible, and the least efficient spacing
      possible, this new method will still be able to return unique names when
      the previous method has statistically more than a 99% chance of a
      collision.  The non-unique names it fails back to are guaranteed to not
      collide with the unique names.
      
      unique cookies will be in this format:
      - 1 bit "0" to make sure the the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "0"
      - 13 bits for the offset
      
      non-unique cookies will be in this format:
      - 1 bit "0" to make sure the the returned cookie is positive
      - 17 bits for the hash table index
      - 1 bit for the mode "1"
      - 13 more bits of the name hash
      
      Another benefit of location based cookies, is that once a directory's
      exhash table is fully extended (so that multiple hash table indexs do
      not use the same leaf blocks), gfs2 can skip sorting the directory
      entries until it reaches the non-unique ones, and then it only needs to
      sort these. This provides a significant speed up for directory reads of
      very large directories.
      
      The only issue is that for these cookies to continue to point to the
      correct entry as files are added and removed from the directory, gfs2
      must keep the entries at the same offset in the leaf block when they are
      split (see my previous patch). This means that until all the nodes in a
      cluster are running with code that will split the directory leaf blocks
      this way, none of the nodes can use the new cookie code. To deal with
      this, gfs2 now has the mount option loccookie, which, if set, will make
      it return these new location based cookies.  This option must not be set
      until all nodes in the cluster are at least running this version of the
      kernel code, and you have guaranteed that there are no outstanding
      cookies required by other software, such as NFS.
      
      gfs2 uses some of the extra space at the end of the gfs2_dirent
      structure to store the calculated readdir cookies. This keeps us from
      needing to allocate a seperate array to hold these values.  gfs2
      recomputes the cookie stored in de_cookie for every readdir call.  The
      time it takes to do so is small, and if gfs2 expected this value to be
      saved on disk, the new code wouldn't work correctly on filesystems
      created with an earlier version of gfs2.
      
      One issue with adding de_cookie to the union in the gfs2_dirent
      structure is that it caused the union to align itself to a 4 byte
      boundary, instead of its previous 2 byte boundary. This changed the
      offset of de_rahead. To solve that, I pulled de_rahead out of the union,
      since it does not need to be there.
      Signed-off-by: NBenjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      471f3db2
    • B
      GFS2: Reduce size of incore inode · b58bf407
      Bob Peterson 提交于
      This patch makes no functional changes. Its goal is to reduce the
      size of the gfs2 inode in memory by rearranging structures and
      changing the size of some variables within the structure.
      Signed-off-by: NBob Peterson <rpeterso@redhat.com>
      b58bf407