Avoid I/U failure due to out-of-sync AO segfile state between QD & QE
Issue:
When the QE cannot acquire an exclusive lock on the AO/AOCO relation
during the drop phase of vacuum, it skips dropping the file and does not
update the segfile state from AOSEG_STATE_AWAITING_DROP to AVAILABLE.
The QD nevertheless moves forward and transitions its own segfile state
to AVAILABLE. The master and the segment then hold inconsistent states
for the file, so the master may erroneously schedule that segfile for
I/U, causing ERRORs.
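The divergence described above can be illustrated with a small model. This is only a sketch: `SegState`, `SegfileView`, `drop_phase_old()` and `states_in_sync()` are hypothetical names made up for illustration and are not the actual Greenplum structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real segfile states. */
typedef enum { SEG_AVAILABLE, SEG_AWAITING_DROP } SegState;

/* One segfile as seen from both sides (illustrative only). */
typedef struct {
    SegState qd_state;  /* state tracked on the QD (master) */
    SegState qe_state;  /* state tracked on the QE (segment) */
} SegfileView;

/*
 * Models the behaviour before the fix: the QE skips the drop when it
 * cannot get the lock, but the QD marks the segfile AVAILABLE anyway.
 */
void drop_phase_old(SegfileView *sf, bool qe_lock_available)
{
    assert(sf != NULL);

    if (qe_lock_available)
        sf->qe_state = SEG_AVAILABLE;  /* file dropped, state updated */
    /* else: the QE skips silently and stays in AWAITING_DROP */

    sf->qd_state = SEG_AVAILABLE;      /* QD moves forward regardless */
}

/* True when the QD and the QE agree on the segfile state. */
bool states_in_sync(const SegfileView *sf)
{
    assert(sf != NULL);
    return sf->qd_state == sf->qe_state;
}
```

Calling `drop_phase_old()` with `qe_lock_available = false` on a segfile that is AWAITING_DROP on both sides leaves the QD at AVAILABLE and the QE at AWAITING_DROP, which is the out-of-sync state this commit avoids.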
High-level vacuum flow for an AO table:
    Prepare phase
    while (num_of_segfiles_for_table)
    {
        Compaction phase
        Drop phase
    }
    Cleanup phase
Our fix:
On the QE, wait to acquire the lock instead of skipping the vacuum drop
phase when a concurrent read query is running. The QD still skips if a
read query is running and leaves the segfile in
AOSEG_STATE_AWAITING_DROP. This aligns the code with current gpdb master
and 6X_STABLE. Since few, if any, operations acquire a lock on the QE
without first acquiring the same lock on the QD, it is okay to introduce
the wait on the QE.
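A minimal sketch of the chosen behaviour, under stated assumptions: the lock wait is modelled as a simple tick counter rather than a real lock manager, and `Segfile`, `qe_reader_ticks` and `drop_phase_wait()` are hypothetical names, not Greenplum APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real segfile states. */
typedef enum { SEG_AVAILABLE, SEG_AWAITING_DROP } SegState;

typedef struct {
    SegState qd_state;        /* state tracked on the QD */
    SegState qe_state;        /* state tracked on the QE */
    int      qe_reader_ticks; /* assumed: ticks until a QE-only reader
                               * releases its lock (0 = no reader) */
} Segfile;

/*
 * Models the drop phase after the fix.
 * - If the QD cannot get the lock, it skips and both sides stay in
 *   AWAITING_DROP.
 * - Otherwise the QE waits for any reader instead of skipping, then
 *   both sides move to AVAILABLE together.
 * Returns the number of ticks the QE spent waiting.
 */
int drop_phase_wait(Segfile *sf, bool qd_lock_available)
{
    int waited = 0;

    assert(sf != NULL);

    if (!qd_lock_available)
        return 0;              /* QD skips; nothing changes on either side */

    while (sf->qe_reader_ticks > 0) {
        sf->qe_reader_ticks--; /* QE blocks until the reader is done */
        waited++;
    }

    sf->qe_state = SEG_AVAILABLE;
    sf->qd_state = SEG_AVAILABLE;
    return waited;
}
```

In this model the two states can never diverge: either both remain AWAITING_DROP or both become AVAILABLE. The cost is the wait itself, which grows with how long readers hold the lock on the QE.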
The downside of our approach is that if there are read queries running
on QE without a lock on QD, vacuum will keep waiting to acquire the lock
for every segment file. This could lead to very long vacuum runtime. An
example of such a workload is a concurrent COPY for partitioned tables.
COPY acquires the lock for both root and child partitions on QE but only
acquires the lock for the root on the QD.
`AOCSDrop()` now waits to acquire the lock in the same way. In practice
this changes nothing, because a table-level AccessExclusive lock is
always acquired before `AOCSDrop()` is called. Still, it is better to
remove the skipping code, and doing so aligns `AOCSDrop()` with
`AppendOnlyDrop()`.
Alternative fixes:
1. elog(ERROR) on the QE when it cannot acquire the lock. This correctly
aborts the drop phase transaction on both the QE and the QD, leaving the
segfile state as AOSEG_STATE_AWAITING_DROP on both.
We don't like this solution because a) it unnecessarily emits an error to
the user for vacuum, and b) if it errors out for one segfile, the whole
command terminates, so vacuum will not proceed to compact the other
segfiles of the table.
2. If we can't acquire the lock on the QE, report it back to QD to
correctly update the state of the segfile. This would ensure that the QD
doesn't mark the segfile as AVAILABLE (The state would be
AOSEG_STATE_AWAITING_DROP for both QD and QE).
The first downside is that this would introduce considerable complexity
to the already complex legacy vacuum code for a rare scenario. It may
also leave many segfiles in the AOSEG_STATE_AWAITING_DROP state for a
prolonged period, which could leave us with no segfiles to I/U into.
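The second alternative and its downside can be sketched as follows. This is an assumption-laden illustration, not code from the patch: `qd_state_after_drop()` and `count_available()` are hypothetical helpers, and the per-QE result array stands in for whatever reporting mechanism such a fix would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real segfile states. */
typedef enum { SEG_AVAILABLE, SEG_AWAITING_DROP } SegState;

/*
 * Alternative 2, QD side: mark the segfile AVAILABLE only if every QE
 * reported that it actually dropped the file.
 */
SegState qd_state_after_drop(const bool *qe_dropped, size_t num_qes)
{
    assert(qe_dropped != NULL || num_qes == 0);

    for (size_t i = 0; i < num_qes; i++) {
        if (!qe_dropped[i])
            return SEG_AWAITING_DROP; /* at least one QE skipped */
    }
    return SEG_AVAILABLE;
}

/* Number of segfiles the QD could still schedule for I/U. */
size_t count_available(const SegState *states, size_t num_segfiles)
{
    size_t available = 0;

    assert(states != NULL || num_segfiles == 0);

    for (size_t i = 0; i < num_segfiles; i++) {
        if (states[i] == SEG_AVAILABLE)
            available++;
    }
    return available;
}
```

If QEs repeatedly fail to get the lock, more and more entries stay in AWAITING_DROP and `count_available()` trends toward zero, which is the "run out of segfiles" concern raised above.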
Conclusion:
Our chosen fix involves the least complexity and is aligned with
behavior on 6X+. Based on feedback, we may incorporate one of the
alternative fixes in the future.
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>