1. 25 Oct, 2021 3 commits
    • gfs2: Eliminate GIF_INVALID flag · ec1d398d
      Bob Peterson committed
      With the addition of the new GLF_INSTANTIATE_NEEDED flag, the
      GIF_INVALID flag is now redundant. This patch removes it.
      Since inode_instantiate is only called when instantiation is needed,
      the check in inode_instantiate is removed too.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: fix GL_SKIP node_scope problems · f2e70d8f
      Bob Peterson committed
      Before this patch, when a glock was locked, the very first holder on the
      queue would unlock the lockref and call the go_instantiate glops function
      (if one existed), unless GL_SKIP was specified. When we introduced the new
      node-scope concept, we allowed multiple holders to lock glocks in EX mode
      and share the lock.
      
      But node-scope introduced a new problem: if the first holder has GL_SKIP
      and the next one does NOT, since it is not the first holder on the queue,
      the go_instantiate op was not called. Eventually the GL_SKIP holder may
      call the instantiate sub-function (e.g. gfs2_rgrp_bh_get) but there was
      still a window of time in which another non-GL_SKIP holder assumes the
      instantiate function had been called by the first holder. In the case of
      rgrp glocks, this led to a NULL pointer dereference on the buffer_heads.
      
      This patch tries to fix the problem by introducing two new glock flags:
      
      GLF_INSTANTIATE_NEEDED, which keeps track of when the instantiate function
      needs to be called to "fill in" or "read in" the object before it is
      referenced.
      
      GLF_INSTANTIATE_IN_PROG, which indicates that a process is currently
      reading in the object. Whenever a function needs to
      reference the object, it checks the GLF_INSTANTIATE_NEEDED flag, and if
      set, it sets GLF_INSTANTIATE_IN_PROG and calls the glops "go_instantiate"
      function.
      
      As before, the gl_lockref spin_lock is unlocked during the IO operation,
      which may take a relatively long amount of time to complete. While
      unlocked, if another process determines go_instantiate is still needed,
      it sees GLF_INSTANTIATE_IN_PROG is set, and waits for the go_instantiate
      glop operation to be completed. Once GLF_INSTANTIATE_IN_PROG is cleared,
      it needs to check GLF_INSTANTIATE_NEEDED again because the other process's
      go_instantiate operation may not have been successful.
      
      Functions that previously called the instantiate sub-functions now call
      directly into gfs2_instantiate so the new bits are managed properly.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
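The flag protocol described in the commit above can be sketched in plain C. This is a single-threaded model under stated assumptions, not the kernel code: `toy_glock`, the `FLAG_*` constants, `toy_instantiate_step` and the two reader callbacks are hypothetical names, and real waiting is replaced by a `STEP_WAIT` return value that tells the caller to wait and then re-check the "needed" flag.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two glock flags named in the commit. */
enum {
    FLAG_INSTANTIATE_NEEDED  = 1u << 0, /* object must be read in before use */
    FLAG_INSTANTIATE_IN_PROG = 1u << 1, /* someone is reading it in right now */
};

struct toy_glock {
    unsigned flags;
    int (*go_instantiate)(struct toy_glock *gl); /* returns 0 on success */
    int object_ready;
};

enum step { STEP_DONE, STEP_WAIT, STEP_FAILED };

/* Example callbacks: one that succeeds and one that fails. */
static int toy_read_ok(struct toy_glock *gl)   { gl->object_ready = 1; return 0; }
static int toy_read_fail(struct toy_glock *gl) { (void)gl; return -1; }

/* One attempt to make the object usable. */
static enum step toy_instantiate_step(struct toy_glock *gl)
{
    if (!(gl->flags & FLAG_INSTANTIATE_NEEDED))
        return STEP_DONE;                 /* already instantiated */
    if (gl->flags & FLAG_INSTANTIATE_IN_PROG)
        return STEP_WAIT;                 /* wait, then re-check NEEDED */

    gl->flags |= FLAG_INSTANTIATE_IN_PROG;
    int err = gl->go_instantiate ? gl->go_instantiate(gl) : 0;
    if (!err)
        gl->flags &= ~FLAG_INSTANTIATE_NEEDED; /* cleared only on success */
    gl->flags &= ~FLAG_INSTANTIATE_IN_PROG;
    return err ? STEP_FAILED : STEP_DONE;
}
```

Because the "needed" flag is cleared only on success, a caller that waited on a failed attempt will find the flag still set and retry, which is the re-check the commit message calls for.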
    • gfs2: change go_lock to go_instantiate · 3278b977
      Bob Peterson committed
      Before this patch, the go_lock glock operations (glops) did not do
      any actual locking. They were used to instantiate objects, like reading
      in dinodes and rgrps from the media.
      
      This patch renames the functions to go_instantiate for clarity.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  2. 21 Oct, 2021 2 commits
    • gfs2: Eliminate ip->i_gh · 1b223f70
      Andreas Gruenbacher committed
      Now that gfs2_file_buffered_write is the only remaining user of
      ip->i_gh, we can move the glock holder to the stack (or rather, use the
      one we already have on the stack); there is no need for keeping the
      holder in the inode anymore.
      
      This is slightly complicated by the fact that we're using ip->i_gh for
      the statfs inode in gfs2_file_buffered_write as well.  Writing to the
      statfs inode isn't very common, so allocate the statfs holder
      dynamically when needed.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Introduce flag for glock holder auto-demotion · dc732906
      Bob Peterson committed
      This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
      will allow glocks to be demoted automatically on locking conflicts.
      When a locking request comes in that isn't compatible with the locking
      state of an active holder and that holder has the HIF_MAY_DEMOTE flag
      set, the holder will be demoted before the incoming locking request is
      granted.
      
      Note that this mechanism demotes active holders (with the HIF_HOLDER
      flag set), while before we were only demoting glocks without any active
      holders.  This allows processes to keep hold of locks that may form a
      cyclic locking dependency; the core glock logic will then break those
      dependencies in case a conflicting locking request occurs.  We'll use
      this to avoid giving up the inode glock proactively before faulting in
      pages.
      
      Processes that allow a glock holder to be taken away indicate this by
      calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
      Later, they call gfs2_holder_disallow_demote() to clear the flag again,
      and then they check if their holder is still queued: if it is, they are
      still holding the glock; if it isn't, they can re-acquire the glock (or
      abort).
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
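A rough model of the auto-demotion idea may help here. It is a sketch under simplifying assumptions: all `toy_*` names are hypothetical, only SH and EX states exist, and two requests are treated as compatible only when both are shared. The real glock state machine is considerably richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_state { TOY_SH, TOY_EX };

enum {
    TOY_HIF_HOLDER     = 1u << 0, /* holder currently holds the lock */
    TOY_HIF_MAY_DEMOTE = 1u << 1, /* holder may be taken away on conflict */
};

struct toy_holder {
    enum toy_state state;
    unsigned flags;
};

static void toy_allow_demote(struct toy_holder *gh)    { gh->flags |= TOY_HIF_MAY_DEMOTE; }
static void toy_disallow_demote(struct toy_holder *gh) { gh->flags &= ~TOY_HIF_MAY_DEMOTE; }

/* After disallowing demotion, a process checks whether it still holds the lock. */
static bool toy_holder_queued(const struct toy_holder *gh)
{
    return (gh->flags & TOY_HIF_HOLDER) != 0;
}

/* Simplification: only two shared requests are compatible. */
static bool toy_compatible(enum toy_state a, enum toy_state b)
{
    return a == TOY_SH && b == TOY_SH;
}

/*
 * Try to grant a request for state `want`. Conflicting active holders are
 * demoted if every one of them allows it; otherwise the request is refused.
 */
static bool toy_request(struct toy_holder *holders, size_t n, enum toy_state want)
{
    for (size_t i = 0; i < n; i++) {
        bool active = (holders[i].flags & TOY_HIF_HOLDER) != 0;
        if (active && !toy_compatible(holders[i].state, want) &&
            !(holders[i].flags & TOY_HIF_MAY_DEMOTE))
            return false;
    }
    for (size_t i = 0; i < n; i++) {
        if ((holders[i].flags & TOY_HIF_HOLDER) &&
            !toy_compatible(holders[i].state, want))
            holders[i].flags &= ~TOY_HIF_HOLDER; /* demote this holder */
    }
    return true;
}
```

A caller would bracket the risky region with `toy_allow_demote()` and `toy_disallow_demote()`, then use `toy_holder_queued()` to decide whether it must re-acquire the lock or abort.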
  3. 20 Aug, 2021 2 commits
  4. 04 Apr, 2021 2 commits
  5. 23 Feb, 2021 1 commit
    • gfs2: Per-revoke accounting in transactions · 2129b428
      Andreas Gruenbacher committed
      In the log, revokes are stored as a revoke descriptor (struct
      gfs2_log_descriptor), followed by zero or more additional revoke blocks
      (struct gfs2_meta_header).  On filesystems with a blocksize of 4k, the
      revoke descriptor contains up to 503 revokes, and the metadata blocks
      contain up to 509 revokes each.  We've so far been reserving space for
      revokes in transactions in block granularity, so a lot more space than
      necessary was being allocated and then released again.
      
      This patch switches to assigning revokes to transactions individually
      instead.  Initially, space for the revoke descriptor is reserved and
      handed out to transactions.  When more revokes than that are reserved,
      additional revoke blocks are added.  When the log is flushed, the space
      for the additional revoke blocks is released, but we keep the space for
      the revoke descriptor block allocated.
      
      Transactions may still reserve more revokes than they will actually need
      in the end, but now we won't overshoot the target as much, and by only
      returning the space for excess revokes at log flush time, we further
      reduce the amount of contention between processes.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
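The block arithmetic in the commit above can be made concrete with a small helper. This is only an illustration: `revoke_blocks` is a hypothetical name, the capacities 503 and 509 are the 4k-blocksize figures quoted in the commit message, and treating zero revokes as needing zero blocks is an assumption of the sketch.

```c
#include <assert.h>

/* Capacities quoted in the commit message for a 4k block size. */
#define REVOKES_IN_DESCRIPTOR   503u
#define REVOKES_PER_EXTRA_BLOCK 509u

/* Number of log blocks needed to store `revokes` revoke entries. */
static unsigned revoke_blocks(unsigned revokes)
{
    if (revokes == 0)
        return 0; /* assumption: nothing is written for zero revokes */
    if (revokes <= REVOKES_IN_DESCRIPTOR)
        return 1; /* the descriptor block alone is enough */

    unsigned extra = revokes - REVOKES_IN_DESCRIPTOR;
    /* Round up to whole additional revoke blocks. */
    return 1 + (extra + REVOKES_PER_EXTRA_BLOCK - 1) / REVOKES_PER_EXTRA_BLOCK;
}
```

With block-granularity reservations, a transaction needing a handful of revokes would tie up a whole block; counting individual revokes lets many transactions share the same descriptor block.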
  6. 18 Feb, 2021 3 commits
    • gfs2: Add local resource group locking · 9e514605
      Andreas Gruenbacher committed
      Prepare for treating resource group glocks as exclusive among nodes but
      shared among all tasks running on a node: introduce another layer of
      node-specific locking that the local tasks can use to coordinate their
      accesses.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Add per-reservation reserved block accounting · 725d0e9d
      Andreas Gruenbacher committed
      Add a rs_reserved field to struct gfs2_blkreserv to keep track of the number of
      blocks reserved by this particular reservation, and a rd_reserved field to
      struct gfs2_rgrpd to keep track of the total number of reserved blocks in the
      resource group.  Those blocks are exclusively reserved, as opposed to the
      rs_requested / rd_requested blocks which are tracked in the reservation tree
      (rd_rstree) and which can be stolen if necessary.
      
      When making a reservation with gfs2_inplace_reserve, rs_reserved is set to
      somewhere between ap->min_target and ap->target depending on the number of free
      blocks in the resource group.  When allocating blocks with gfs2_alloc_blocks,
      rs_reserved is decremented accordingly.  Eventually, any reserved but not
      consumed blocks are returned to the resource group by gfs2_inplace_release.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
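The reserve, allocate and release life cycle described above might look roughly like this. It is a toy model: the `toy_*` structures and functions are hypothetical, the field names merely echo rs_reserved and rd_reserved, and the exact policy for choosing a value between min_target and target (here: as much as is available, capped at target) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_alloc_parms { uint32_t target; uint32_t min_target; };
struct toy_rgrp        { uint32_t free; uint32_t rd_reserved; };
struct toy_rsv         { uint32_t rs_reserved; };

/* Reserve between min_target and target blocks, depending on what is left. */
static bool toy_inplace_reserve(struct toy_rgrp *rgd, struct toy_rsv *rs,
                                const struct toy_alloc_parms *ap)
{
    uint32_t avail = rgd->free - rgd->rd_reserved;

    if (avail < ap->min_target)
        return false;
    rs->rs_reserved = avail < ap->target ? avail : ap->target;
    rgd->rd_reserved += rs->rs_reserved;
    return true;
}

/* Consume n blocks from the reservation. */
static bool toy_alloc_blocks(struct toy_rgrp *rgd, struct toy_rsv *rs, uint32_t n)
{
    if (n > rs->rs_reserved)
        return false;
    rs->rs_reserved -= n;
    rgd->rd_reserved -= n;
    rgd->free -= n;
    return true;
}

/* Return reserved-but-unused blocks to the resource group. */
static void toy_inplace_release(struct toy_rgrp *rgd, struct toy_rsv *rs)
{
    rgd->rd_reserved -= rs->rs_reserved;
    rs->rs_reserved = 0;
}
```

The invariant the model keeps is that rd_reserved equals the sum of all outstanding rs_reserved values, so exclusively reserved blocks can never be handed out twice.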
    • gfs2: Rename rs_{free -> requested} and rd_{reserved -> requested} · 07974d2a
      Andreas Gruenbacher committed
      We keep track of what we've so far been referring to as reservations in
      rd_rstree: the nodes in that tree indicate where in a resource group we'd
      like to allocate the next couple of blocks for a particular inode.  Local
      processes take those as hints, but they may still "steal" blocks from those
      extents, so when actually allocating a block, we must double check in the
      bitmap whether that block is actually still free.  Likewise, other cluster
      nodes may "steal" such blocks as well.
      
      One of the following patches introduces resource group glock sharing, i.e.,
      sharing of an exclusively locked resource group glock among local processes to
      speed up allocations.  To make that work, we'll need to keep track of how many
      blocks we've actually reserved for each inode, so we end up with two different
      kinds of reservations.
      
      Distinguish these two kinds by referring to blocks which are reserved but may
      still be "stolen" as "requested".  This rename also makes it more obvious that
      rs_requested and rd_requested are strongly related.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  7. 08 Feb, 2021 1 commit
  8. 04 Feb, 2021 3 commits
    • gfs2: Get rid of current_tail() · 5cb738b5
      Andreas Gruenbacher committed
      Keep the current value of the updated log tail in the super block as
      sb_log_flush_tail instead of computing it on the fly.  This avoids
      unnecessary sd_ail_lock taking and cleans up the code.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Get rid of sd_reserving_log · f3708fb5
      Andreas Gruenbacher committed
      This counter and the associated wait queue are only used so that
      gfs2_make_fs_ro can efficiently wait for all pending log space
      allocations to fail after setting the filesystem to read-only.  This
      comes at the cost of waking up that wait queue very frequently.
      
      Instead, when gfs2_log_reserve fails because the filesystem has become
      read-only, wake up sd_log_waitq.  In gfs2_make_fs_ro, set the file
      system read-only and then wait until all the log space has been
      released.  Give up and report the problem after a while.  With that,
      sd_reserving_log and sd_reserving_log_wait can be removed.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Clean up on-stack transactions · c968f578
      Andreas Gruenbacher committed
      Replace the TR_ALLOCED flag by its inverse, TR_ONSTACK: that way, the flag only
      needs to be set in the exceptional case of on-stack transactions.  Split off
      __gfs2_trans_begin from gfs2_trans_begin and use it to replace the open-coded
      version in gfs2_ail_empty_gl.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  9. 25 Jan, 2021 1 commit
    • gfs2: keep bios separate for each journal · 82218943
      Bob Peterson committed
      The recovery func can recover multiple journals, but they were all using
      the same bio. This resulted in use-after-free related to sdp->sd_log_bio.
      This patch moves the variable to the journal descriptor, jd, so that
      every recovery can operate on its own bio. And hopefully we never run out.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  10. 18 Jan, 2021 1 commit
    • gfs2: Only use struct gfs2_rbm for bitmap manipulations · c65b76b8
      Andreas Gruenbacher committed
      GFS2 uses struct gfs2_rbm to represent a filesystem block number as a
      bit position within a resource group.  This representation is used in
      the bitmap manipulation code to prevent excessive conversions between
      block numbers and bit positions, but also in struct gfs2_blkreserv which
      is part of struct gfs2_inode, to mark the start of a reservation.  In
      the inode, the bit position representation makes less sense: first, the
      start position is used as a block number about as often as a bit
      position; second, the bit position representation makes the code
      unnecessarily complicated and difficult to read.
      
      Therefore, change struct gfs2_blkreserv to represent the start of a
      reservation as a block number instead of a bit position.  (This requires
      keeping track of the resource group in gfs2_blkreserv separately.) With
      that change, various things can be slightly simplified, and struct
      gfs2_rbm can be moved to rgrp.c.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  11. 01 Dec, 2020 1 commit
  12. 25 Nov, 2020 1 commit
    • gfs2: set lockdep subclass for iopen glocks · 515b269d
      Alexander Aring committed
      This patch introduces a new glops attribute to define the subclass of the
      glock lockref spinlock. This avoids the following lockdep warning, which
      occurs when we lock an inode lock while an iopen lock is held:
      
      ============================================
      WARNING: possible recursive locking detected
      5.10.0-rc3+ #4990 Not tainted
      --------------------------------------------
      kworker/0:1/12 is trying to acquire lock:
      ffff9067d45672d8 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: lockref_get+0x9/0x20
      
      but task is already holding lock:
      ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&gl->gl_lockref.lock);
        lock(&gl->gl_lockref.lock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by kworker/0:1/12:
       #0: ffff9067c1bfdd38 ((wq_completion)delete_workqueue){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #1: ffffac594006be70 ((work_completion)(&(&gl->gl_delete)->work)){+.+.}-{0:0}, at: process_one_work+0x1b7/0x540
       #2: ffff9067da308588 (&gl->gl_lockref.lock){+.+.}-{3:3}, at: delete_work_func+0x164/0x260
      
      stack backtrace:
      CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.10.0-rc3+ #4990
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: delete_workqueue delete_work_func
      Call Trace:
       dump_stack+0x8b/0xb0
       __lock_acquire.cold+0x19e/0x2e3
       lock_acquire+0x150/0x410
       ? lockref_get+0x9/0x20
       _raw_spin_lock+0x27/0x40
       ? lockref_get+0x9/0x20
       lockref_get+0x9/0x20
       delete_work_func+0x188/0x260
       process_one_work+0x237/0x540
       worker_thread+0x4d/0x3b0
       ? process_one_work+0x540/0x540
       kthread+0x127/0x140
       ? __kthread_bind_mask+0x60/0x60
       ret_from_fork+0x22/0x30
      Suggested-by: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Alexander Aring <aahringo@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  13. 23 Oct, 2020 1 commit
  14. 21 Oct, 2020 2 commits
    • gfs2: Add fields for statfs info in struct gfs2_log_header_host · 73092698
      Abhi Das committed
      And read these in __get_log_header() from the log header.
      Also make gfs2_statfs_change_out() non-static so it can be used
      outside of super.c
      Signed-off-by: Abhi Das <adas@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Eliminate gl_vm · 23cfb0c3
      Bob Peterson committed
      The gfs2_glock structure has a gl_vm member, introduced in commit 7005c3e4
      ("GFS2: Use range based functions for rgrp sync/invalidation"), which stores
      the location of resource groups within their address space.  This structure is
      in a union with iopen glock specific fields.  It was introduced because at
      unmount time, the resource group objects were destroyed before flushing out any
      pending resource group glock work, and flushing out such work could require
      flushing / truncating the address space.
      
      Since commit b3422cac ("gfs2: Rework how rgrp buffer_heads are managed"),
      any pending resource group glock work is flushed out before destroying the
      resource group objects.  So the resource group objects will now always exist in
      rgrp_go_sync and rgrp_go_inval, and we now simply compute the gl_vm values
      where needed instead of caching them.  This also eliminates the union.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  15. 15 Oct, 2020 2 commits
    • gfs2: eliminate GLF_QUEUED flag in favor of list_empty(gl_holders) · e2c6c8a7
      Bob Peterson committed
      Before this patch, glock.c maintained a flag, GLF_QUEUED, which indicated
      when a glock had a holder queued. It was only checked for inode glocks,
      although set and cleared by all glocks, and it was only used to determine
      whether the glock should be held for the minimum hold time before releasing.
      
      The problem is that the flag is not accurate at all. If a process holds
      the glock, the flag is set. When the process dequeues the glock, the flag
      is only cleared if the state actually changed. So if the state doesn't
      change, the flag may still be set, even when nothing is queued.
      
      This happens to iopen glocks often: they get held in SH, then the file is
      closed, but the glock remains in SH mode.
      
      We don't need a special flag to indicate this: we can simply tell whether
      the glock has any items queued to the holders queue. It's a waste of cpu
      time to maintain it.
      
      This patch eliminates the flag in favor of simply checking list_empty
      on the glock holders.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
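The reasoning in the commit above, that list membership already carries the information the flag tried to cache, can be illustrated with a tiny circular list in the style of the kernel's list_head. All `toy_*` names are hypothetical; this is a user-space sketch, not the kernel's list implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node. */
struct toy_list { struct toy_list *next, *prev; };

static void toy_list_init(struct toy_list *h) { h->next = h->prev = h; }

static bool toy_list_empty(const struct toy_list *h) { return h->next == h; }

static void toy_list_add_tail(struct toy_list *n, struct toy_list *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Unlink a node and leave it pointing at itself (i.e. "empty"). */
static void toy_list_del_init(struct toy_list *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    toy_list_init(n);
}

struct toy_glock  { struct toy_list holders; };
struct toy_holder { struct toy_list link; };

/* No separate "queued" flag: the list itself is the source of truth. */
static bool toy_glock_has_holders(const struct toy_glock *gl)
{
    return !toy_list_empty(&gl->holders);
}
```

Since there is no second piece of state to update on dequeue, the stale-flag situation described in the commit message cannot arise in this model.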
    • gfs2: use-after-free in sysfs deregistration · c2a04b02
      Jamie Iles committed
      syzkaller found the following splat with CONFIG_DEBUG_KOBJECT_RELEASE=y:
      
        Read of size 1 at addr ffff000028e896b8 by task kworker/1:2/228
      
        CPU: 1 PID: 228 Comm: kworker/1:2 Tainted: G S                5.9.0-rc8+ #101
        Hardware name: linux,dummy-virt (DT)
        Workqueue: events kobject_delayed_cleanup
        Call trace:
         dump_backtrace+0x0/0x4d8
         show_stack+0x34/0x48
         dump_stack+0x174/0x1f8
         print_address_description.constprop.0+0x5c/0x550
         kasan_report+0x13c/0x1c0
         __asan_report_load1_noabort+0x34/0x60
         memcmp+0xd0/0xd8
         gfs2_uevent+0xc4/0x188
         kobject_uevent_env+0x54c/0x1240
         kobject_uevent+0x2c/0x40
         __kobject_del+0x190/0x1d8
         kobject_delayed_cleanup+0x2bc/0x3b8
         process_one_work+0x96c/0x18c0
         worker_thread+0x3f0/0xc30
         kthread+0x390/0x498
         ret_from_fork+0x10/0x18
      
        Allocated by task 1110:
         kasan_save_stack+0x28/0x58
         __kasan_kmalloc.isra.0+0xc8/0xe8
         kasan_kmalloc+0x10/0x20
         kmem_cache_alloc_trace+0x1d8/0x2f0
         alloc_super+0x64/0x8c0
         sget_fc+0x110/0x620
         get_tree_bdev+0x190/0x648
         gfs2_get_tree+0x50/0x228
         vfs_get_tree+0x84/0x2e8
         path_mount+0x1134/0x1da8
         do_mount+0x124/0x138
         __arm64_sys_mount+0x164/0x238
         el0_svc_common.constprop.0+0x15c/0x598
         do_el0_svc+0x60/0x150
         el0_svc+0x34/0xb0
         el0_sync_handler+0xc8/0x5b4
         el0_sync+0x15c/0x180
      
        Freed by task 228:
         kasan_save_stack+0x28/0x58
         kasan_set_track+0x28/0x40
         kasan_set_free_info+0x24/0x48
         __kasan_slab_free+0x118/0x190
         kasan_slab_free+0x14/0x20
         slab_free_freelist_hook+0x6c/0x210
         kfree+0x13c/0x460
      
      Use the same pattern as f2fs + ext4 where the kobject destruction must
      complete before allowing the FS itself to be freed.  This means that we
      need an explicit free_sbd in the callers.
      
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Jamie Iles <jamie@nuviainc.com>
      [Also go to fail_free when init_names fails.]
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  16. 03 Jul, 2020 1 commit
    • gfs2: eliminate GIF_ORDERED in favor of list_empty · 7542486b
      Bob Peterson committed
      In several places, we used the GIF_ORDERED inode flag to determine
      if an inode was on the ordered writes list. However, since we always
      held the sd_ordered_lock spin_lock during the manipulation, we can
      just as easily check list_empty(&ip->i_ordered) instead.
      This allows us to keep more than one ordered writes list to make
      journal writing improvements.
      
      This patch eliminates GIF_ORDERED in favor of checking list_empty.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  17. 06 Jun, 2020 3 commits
    • gfs2: Check inode generation number in delete_work_func · b0dcffd8
      Andreas Gruenbacher committed
      In delete_work_func, if the iopen glock still has an inode attached,
      limit the inode lookup to that specific generation number: in the likely
      case that the inode was deleted on the node on which the inode's link
      count dropped to zero, we can skip verifying the on-disk block type and
      reading in the inode.  The same applies if another node that had the
      inode open managed to delete the inode before us.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Give up the iopen glock on contention · 8c7b9262
      Andreas Gruenbacher committed
      When there's contention on the iopen glock, it means that the link count
      of the corresponding inode has dropped to zero on a remote node which is
      now trying to delete the inode.  In that case, try to evict the inode so
      that the iopen glock will be released, which will allow the remote node
      to do its job.
      
      When the inode is still open locally, the inode's reference count won't
      drop to zero and so we'll keep holding the inode and its iopen glock.
      The remote node will time out its request to grab the iopen glock, and
      when the inode is finally closed locally, we'll try to delete it
      ourselves.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
    • gfs2: Turn gl_delete into a delayed work · a0e3cc65
      Andreas Gruenbacher committed
      This requires flushing delayed work items in gfs2_make_fs_ro (which is called
      before unmounting a filesystem).
      
      When inodes are deleted and then recreated, pending gl_delete work items would
      have no effect because the inode generations will have changed, so we can
      cancel any pending gl_delete works before reusing iopen glocks.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  18. 28 Mar, 2020 1 commit
    • gfs2: Change inode qa_data to allow multiple users · 2fba46a0
      Bob Peterson committed
      Before this patch, multiple users called gfs2_qa_alloc which allocated
      a qadata structure to the inode, if quotas are turned on. Later, in
      file close or evict, the structure was deleted with gfs2_qa_delete.
      But there can be several competing processes who need access to the
      structure. There were races between file close (release) and the others.
      Thus, a release could delete the structure out from under a process
      that relied upon its existence. For example, chown.
      
      This patch changes the management of the qadata structures to be
      a get/put scheme. Function gfs2_qa_alloc has been changed to gfs2_qa_get
      and if the structure is allocated, the count essentially starts out at
      1. Function gfs2_qa_delete has been renamed to gfs2_qa_put, and the
      last guy to decrement the count to 0 frees the memory.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
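The get/put scheme described above follows the usual reference-counting pattern, sketched here. The names `toy_inode`, `toy_qadata`, `toy_qa_get` and `toy_qa_put` are hypothetical, and the sketch is single-threaded: the kernel code would need locking around the count, which is omitted.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_qadata { int placeholder; };

struct toy_inode {
    struct toy_qadata *qadata; /* shared by all users of the inode */
    int qa_ref;                /* number of users currently relying on it */
};

/* Take a reference, allocating the structure on first use. */
static int toy_qa_get(struct toy_inode *ip)
{
    if (ip->qadata == NULL) {
        ip->qadata = calloc(1, sizeof(*ip->qadata));
        if (ip->qadata == NULL)
            return -1;
    }
    ip->qa_ref++;
    return 0;
}

/* Drop a reference; the last user to leave frees the memory. */
static void toy_qa_put(struct toy_inode *ip)
{
    if (ip->qadata != NULL && --ip->qa_ref == 0) {
        free(ip->qadata);
        ip->qadata = NULL;
    }
}
```

With this scheme a file release can no longer free the structure while another user, such as chown in the commit's example, still holds a reference.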
  19. 27 Feb, 2020 2 commits
    • gfs2: Do proper error checking for go_sync family of glops functions · 1c634f94
      Bob Peterson committed
      Before this patch, function do_xmote would try to sync out the glock
      dirty data by calling the appropriate glops function XXX_go_sync()
      but it did not check for a good return code. If the sync was not
      possible due to an io error or whatever, do_xmote would continue on
      and call go_inval and release the glock to other cluster nodes.
      When those nodes go to replay the journal, they may already be holding
      glocks for the journal records that should have been synced, but were
      not due to the ignored error.
      
      This patch introduces proper error code checking to the go_sync
      family of glops functions.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
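The control-flow change described above amounts to not proceeding past a failed sync. A minimal sketch follows, with hypothetical names (`toy_glops`, `toy_do_xmote`, the example callbacks). It simply returns the error and leaves the lock unreleased; what the real patch does after detecting the failure is not stated in the commit message above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_glops {
    int  (*go_sync)(void);  /* write out dirty data; 0 on success */
    void (*go_inval)(void); /* invalidate cached data */
};

static int toy_inval_calls;

/* Example callbacks for the sketch. */
static int  toy_sync_ok(void)  { return 0; }
static int  toy_sync_eio(void) { return -EIO; }
static void toy_inval(void)    { toy_inval_calls++; }

/* Sync, then invalidate and release, but only if the sync succeeded. */
static int toy_do_xmote(const struct toy_glops *ops, int *released)
{
    if (ops->go_sync) {
        int ret = ops->go_sync();
        if (ret)
            return ret; /* do not invalidate or release on error */
    }
    if (ops->go_inval)
        ops->go_inval();
    *released = 1;
    return 0;
}
```

The point of the early return is that other nodes never get the lock for data that was not actually written out.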
    • gfs2: Force withdraw to replay journals and wait for it to finish · 601ef0d5
      Bob Peterson committed
      When a node withdraws from a file system, it often leaves its journal
      in an incomplete state. This is especially true when the withdraw is
      caused by io errors writing to the journal. Before this patch, a
      withdraw would try to write a "shutdown" record to the journal and tell
      dlm it's done with the file system, but none of the other nodes
      would know about the problem. Later, when the problem is fixed and the
      withdrawn node is rebooted, it would then discover that its own
      journal was incomplete, and replay it. However, replaying it at this
      point is almost guaranteed to introduce corruption because the other
      nodes are likely to have used affected resource groups that appeared
      in the journal since the time of the withdraw. Replaying the journal
      later will overwrite any changes made, and not through any fault of
      dlm, which was instructed during the withdraw to release those
      resources.
      
      This patch makes file system withdraws seen by the entire cluster.
      Withdrawing nodes dequeue their journal glock to allow recovery.
      
      The remaining nodes check all the journals to see if they are
      clean or in need of replay. They try to replay dirty journals, but
      only the journals of withdrawn nodes will be "not busy" and
      therefore available for replay.
      
      Until the journal replay is complete, no i/o related glocks may be
      given out, to ensure that the replay does not cause the
      aforementioned corruption: We cannot allow any journal replay to
      overwrite blocks associated with a glock once it is held.
      
      The "live" glock is now used to signal when a withdraw
      occurs. When a withdraw occurs, the node signals its withdraw by
      dequeueing the "live" glock and trying to enqueue it in EX mode,
      thus forcing the other nodes to all see a demote request, by way
      of a "1CB" (one callback) try lock. The "live" glock is not
      granted in EX; the callback is only just used to indicate a
      withdraw has occurred.
      
      Note that all nodes in the cluster must wait for the recovering
      node to finish replaying the withdrawing node's journal before
      continuing. To this end, they check that the journals are clean
      multiple times in a retry loop.
      
      Also note that the withdraw function may be called from a wide
      variety of situations, and therefore, we need to take extra
      precautions to make sure pointers are valid before using them in
      many circumstances.
      
      We also need to take care when glocks decide to withdraw, since
      the withdraw code now uses glocks.
      
      Also, before this patch, if a process encountered an error and
      decided to withdraw, if another process was already withdrawing,
      the second withdraw would be silently ignored, which set it free
      to unlock its glocks. That's correct behavior if the original
      withdrawer encounters further errors down the road. But if
      secondary waiters don't wait for the journal replay, unlocking
      glocks will allow other nodes to use them, despite the fact that
      the journal containing those blocks is being replayed. The
      replay needs to finish before our glocks are released to other
      nodes. IOW, secondary withdraws need to wait for the first
      withdraw to finish.
      
      For example, if an rgrp glock is unlocked by a process that didn't
      wait for the first withdraw, a journal replay could introduce file
      system corruption by replaying a rgrp block that has already been
      granted to a different cluster node.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  20. 21 Feb, 2020 1 commit
    • gfs2: Allow some glocks to be used during withdraw · a72d2401
      Bob Peterson committed
      We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
      when we're withdrawn. For example, to maintain metadata integrity, we should
      disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
      iopen or the transaction glocks may be safely used because none of their
      metadata goes through the journal. So in general, we should disallow all
      glocks with an address space, and allow all the others. One exception is:
      we need to allow our active journal to be demoted so others may recover it.
      
      Allowing glocks after withdraw gives us the ability to take appropriate
      action (in a following patch) to have our journal properly replayed by
      another node rather than just abandoning the current transactions and
      pretending nothing bad happened, leaving the other nodes free to modify
      the blocks we had in our journal, which may result in file system
      corruption.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
  21. 10 Feb, 2020 3 commits
    • gfs2: log error reform · 036330c9
      Bob Peterson committed
      Before this patch, gfs2 kept track of journal io errors in two
      places: sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags.
      This patch consolidates the two into sd_log_error so that it
      reflects the first error encountered writing to the journal.
      In future patches, we will take advantage of this by checking
      this value rather than having to check both when reacting to
      io errors.
      
      In addition, this fixes a tight loop in unmount: If buffers
      get on the ail1 list and an io error occurs elsewhere, the
      ail1 list would never be cleared because they were always busy.
      So unmount would hang, waiting for the ail1 list to empty.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
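One common way to make a single field "reflect the first error encountered" is a compare-and-swap from zero. The sketch below shows that technique with C11 atomics. It is an assumption about how such a field could be maintained, not a statement of how the patch does it, and `toy_record_log_error` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Record err only if no error has been recorded yet, so the field keeps
 * the first failure and later ones do not overwrite it.
 */
static void toy_record_log_error(atomic_int *log_error, int err)
{
    int expected = 0;

    /* Succeeds only while *log_error is still 0. */
    atomic_compare_exchange_strong(log_error, &expected, err);
}
```

Readers then only need to test one value for non-zero instead of checking both an error field and a separate flag.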
    • gfs2: Rework how rgrp buffer_heads are managed · b3422cac
      Bob Peterson committed
      Before this patch, the rgrp code had a serious problem related to
      how it managed buffer_heads for resource groups. The problem caused
      file system corruption, especially in cases of journal replay.
      
      When an rgrp glock was demoted to transfer ownership to a
      different cluster node, do_xmote() first called rgrp_go_sync and then
      rgrp_go_inval, as expected. rgrp_go_sync in turn called
      gfs2_rgrp_brelse(), which dropped the buffer_head reference count.
      In most cases, the reference count went to zero, which is right.
      However, there were other places where the buffers were handled
      differently.
      
      After rgrp_go_sync, do_xmote called rgrp_go_inval which called
      gfs2_rgrp_brelse a second time, then rgrp_go_inval's call to
      truncate_inode_pages_range would get rid of the pages in memory,
      but only if the reference count had dropped to 0.
      
      Unfortunately, gfs2_rgrp_brelse was setting bi->bi_bh = NULL.
      So when rgrp_go_sync called gfs2_rgrp_brelse, it lost the pointer
      to the buffer_heads in cases where the reference count was still 1.
      Therefore, when rgrp_go_inval called gfs2_rgrp_brelse a second time,
      it failed the check for "if (bi->bi_bh)" and thus failed to call
      brelse a second time. Because of that, the reference count on those
      buffers sometimes failed to drop from 1 to 0. And that caused
      function truncate_inode_pages_range to keep the pages in page cache
      rather than freeing them.
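
      The sequence above can be modelled with a plain counter. The sketch
      below is a user-space illustration only: the types and function
      names are hypothetical, and the "release" helper mimics just the
      behaviour described here (drop one reference, then clear the
      pointer).

```c
#include <stddef.h>

/* Hypothetical stand-ins for a buffer_head and a bitmap descriptor. */
struct bh_sketch { int refcount; };
struct bitmap_sketch { struct bh_sketch *bh; };

/* Old behaviour as described: drop one reference and forget the pointer. */
static void brelse_and_forget(struct bitmap_sketch *bi)
{
	if (bi->bh) {
		bi->bh->refcount--;
		bi->bh = NULL;
	}
}

/* Returns the reference count left after the sync and inval steps. */
static int refs_left_after_sync_and_inval(int initial_refs)
{
	struct bh_sketch bh = { .refcount = initial_refs };
	struct bitmap_sketch bi = { .bh = &bh };

	brelse_and_forget(&bi);  /* sync step: drops one ref, loses pointer */
	brelse_and_forget(&bi);  /* inval step: pointer is NULL, nothing dropped */
	return bh.refcount;
}
```

      Starting from two references, one is left behind, which is the case
      in which the pages would stay in the page cache.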
      
      The next time the rgrp glock was acquired, the metadata read of
      the rgrp buffers re-used the pages in memory, which were now
      wrong because they were likely modified by the other node who
      acquired the glock in EX (which is why we demoted the glock).
      This re-use of the page cache caused corruption because changes
      made by the other nodes were never seen, so the bitmaps were
      inaccurate.
      
      For some reason, the problem became most apparent when journal
      replay forced the replay of rgrps in memory, which caused newer
      rgrp data to be overwritten by the older in-core pages.
      
      A big part of the problem was that the rgrp buffers were released
      in multiple places: The go_unlock function would release them when
      the glock was released rather than when the glock is demoted,
      which is clearly wrong because our intent was to cache them until
      the glock is demoted from SH or EX.
      
      This patch attempts to clean up the mess and make one consistent
      and centralized mechanism for managing the rgrp buffer_heads by
      implementing several changes:
      
      1. It eliminates the call to gfs2_rgrp_brelse() from rgrp_go_sync.
         We don't want to release the buffers or zero the pointers when
         syncing for the reasons stated above. It only makes sense to
         release them when the glock is actually invalidated (go_inval).
         And when we do, then we set the bh pointers to NULL.
      2. The go_unlock function (which was only used for rgrps) is
         eliminated, as we've talked about doing many times before.
         The go_unlock function was called too early in the glock dq
         process, and should not happen until the glock is invalidated.
      3. It also eliminates the call to rgrp_brelse in gfs2_clear_rgrpd.
         That will now happen automatically when the rgrp glocks are
         demoted, and shouldn't happen any sooner or later than that.
         Instead, function gfs2_clear_rgrpd has been modified to demote
         the rgrp glocks, and therefore, free those pages, before the
         remaining glocks are culled by gfs2_gl_hash_clear. This
         prevents the gl_object from hanging around when the glocks are
         culled.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      b3422cac
    • B
      gfs2: Introduce concept of a pending withdraw · 69511080
      Bob Peterson committed
      File system withdraws can be delayed when inconsistencies are
      discovered at a point where we cannot withdraw immediately, for example, when
      critical spin_locks are held. But delaying the withdraw can cause
      gfs2 to ignore the error and keep running for a short period of time.
      For example, an rgrp glock may be dequeued and demoted while there
      are still buffers that haven't been properly revoked, due to io
      errors writing to the journal.
      
      This patch introduces a new concept of a pending withdraw, which
      means an inconsistency has been discovered and we need to withdraw
      at the earliest possible opportunity. In these cases, we aren't
      quite withdrawn yet, but we still need to not dequeue glocks and
      other critical things. If we dequeue the glocks and the withdraw
      results in our journal being replayed, the replay could overwrite
      data that's been modified by a different node that acquired the
      glock in the meantime.
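
      One way to picture the pending state is as a second flag that is
      checked together with the withdrawn flag. The sketch below is an
      assumption-laden illustration: the struct, field, and function names
      are hypothetical and do not claim to match the flags or helpers the
      patch adds.

```c
#include <stdbool.h>

/* Hypothetical per-filesystem state; not the kernel's struct gfs2_sbd. */
struct sb_state_sketch {
	bool withdrawn;         /* the withdraw has completed */
	bool withdraw_pending;  /* inconsistency seen, withdraw still to come */
};

/* True once an inconsistency has been seen, withdrawn yet or not. */
static bool withdrawn_or_pending(const struct sb_state_sketch *s)
{
	return s->withdrawn || s->withdraw_pending;
}

/* Glocks must not be dequeued from the moment a withdraw is pending. */
static bool may_dequeue_glock(const struct sb_state_sketch *s)
{
	return !withdrawn_or_pending(s);
}
```

      The point of the second flag is that the refusal starts as soon as
      the problem is detected, not only once the withdraw has finished.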
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
      69511080
  22. 28 Jan, 2020 1 commit
    • B
      Revert "gfs2: eliminate tr_num_revoke_rm" · a31b4ec5
      Bob Peterson committed
      This reverts commit e955537e.
      
      Before patch e955537e, tr_num_revoke tracked the number of revokes
      added to the transaction, and tr_num_revoke_rm tracked how many
      revokes were removed. But since revokes are queued off the sdp
      (superblock) pointer, some transactions could remove more revokes
      than they added (e.g. revokes added by a different process).
      Commit e955537e eliminated transaction variable tr_num_revoke_rm,
      but in order to do so, it changed the accounting to always use
      tr_num_revoke for its math. Since you can remove more revokes than
      you add, tr_num_revoke could now become a negative value.
      This negative value broke the assert in function gfs2_trans_end:
      
      	if (gfs2_assert_withdraw(sdp, (nbuf <= tr->tr_blocks) &&
      			       (tr->tr_num_revoke <= tr->tr_revokes)))
      
      One way to fix this is to simply remove the tr_num_revoke clause
      from the assert and allow the value to become negative. Andreas
      didn't like that idea, so instead, we decided to revert e955537e.
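
      The failure mode can be reproduced in a few lines of user-space C.
      This sketch assumes the counters are unsigned, so that "negative"
      shows up as a wrapped, very large value; that type choice and all
      names below are assumptions made for illustration, not taken from
      the message.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two revoke fields of a transaction. */
struct trans_sketch {
	unsigned int num_revoke;  /* single counter: added minus removed */
	unsigned int revokes;     /* revokes reserved for the transaction */
};

/* Mirrors the second clause of the assert quoted above. */
static bool revoke_check_passes(const struct trans_sketch *tr)
{
	return tr->num_revoke <= tr->revokes;
}

/* Removing revokes with only one counter: subtract from num_revoke. */
static void remove_revokes(struct trans_sketch *tr, unsigned int n)
{
	tr->num_revoke -= n;  /* wraps around if n exceeds the revokes added */
}
```

      With one revoke added and three removed, the counter wraps and the
      check fails, even though nothing was over-reserved.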
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      a31b4ec5
  23. 20 Jan, 2020 2 commits