Commit aee8cac8 authored by Soumyadeep Chakraborty

Avoid I/U failure due to out-of-sync AO segfile state between QD & QE

Issue:
The QE, when unable to acquire an exclusive lock on an AO/AOCO relation
during the drop phase of vacuum, skips dropping the file, and the update of
its state from AOSEG_STATE_AWAITING_DROP to AVAILABLE is not performed.
Despite that, the QD moves forward and transitions the segfile state
to AVAILABLE. This leaves the master and segment with out-of-sync states
for the file, and hence the master might erroneously schedule that
segfile for insert/update (I/U), causing ERRORs.

High-level vacuum flow for an AO table:
Prepare phase
while (num_of_segfiles_for_table)
{
    Compaction phase
    Drop phase
}
Cleanup phase
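The divergence described above can be seen as a two-party state machine that only stays consistent if both sides take the same transition. A toy Python model (illustrative only, not gpdb code; the state names mirror gpdb's, everything else is hypothetical):

```python
# Toy model of the AO segfile drop-phase transition before the fix.
AVAILABLE = 1
AWAITING_DROP = 2

def qe_drop_phase(state, lock_acquired):
    """Pre-fix QE behavior: without the lock, the drop is skipped and the
    segfile stays in AWAITING_DROP."""
    if state == AWAITING_DROP and lock_acquired:
        return AVAILABLE
    return state

def qd_drop_phase(state):
    """The QD transitions the segfile unconditionally after a successful
    dispatch, assuming the QE dropped the file."""
    return AVAILABLE if state == AWAITING_DROP else state

# A concurrent reader holds a lock on the QE, so the QE cannot acquire it.
qd = qd_drop_phase(AWAITING_DROP)
qe = qe_drop_phase(AWAITING_DROP, lock_acquired=False)
assert qd != qe  # out of sync: QD will schedule I/U into a segfile the QE never dropped
```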

Our fix:
Wait to acquire the lock on the QE instead of skipping the vacuum drop
phase if a concurrent read query is running. The QD still skips if a read
query is running and leaves the segfile in AOSEG_STATE_AWAITING_DROP. This
way the code is aligned with current gpdb master and 6X_STABLE code. Since
operations that acquire a lock on a QE without first acquiring the same
lock on the QD should be rare or nonexistent, it is okay to introduce the
wait on the QE.
The downside of this approach is that if read queries run on a QE
without a lock on the QD, vacuum will keep waiting to acquire the lock
for every segment file, which could lead to a very long vacuum runtime. An
example of such a workload is a concurrent COPY into a partitioned table:
COPY acquires the lock for both root and child partitions on the QE but
only acquires the lock for the root on the QD.
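The role-dependent lock policy boils down to a single predicate on the process role; a minimal sketch (hypothetical helper name, not gpdb code):

```python
# Hypothetical helper mirroring the fix's policy for the dontWait flag:
# only the dispatcher (QD) may give up on the segfile lock and defer the
# drop; executors (QE) must block, so that a successful dispatch always
# implies a successful drop on the QE.
def dont_wait_for_segfile_lock(role):
    return role == "QD"

assert dont_wait_for_segfile_lock("QD") is True   # QD: try-lock, skip on failure
assert dont_wait_for_segfile_lock("QE") is False  # QE: block until acquired
```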

Also, `AOCSDrop()` now waits to acquire the lock in the same way. In
practice this doesn't cause any issue, as a table-level AccessExclusive
lock is always acquired before calling `AOCSDrop()`, but it is better to
remove the skipping code and align this with `AppendOnlyDrop()`.

Alternative fixes:
1. elog(ERROR) on the QE when unable to acquire the lock. This correctly
aborts the drop-phase transaction on both the QE and the QD, leaving the
segfile state as AOSEG_STATE_AWAITING_DROP on both. We rejected this
solution because a) it unnecessarily emits an error to the user for a
vacuum, and b) if it errors out for one segfile, the whole command
terminates, meaning vacuum will not proceed to compact the table's other
segfiles.

2. If we can't acquire the lock on the QE, report that back to the QD so
it can correctly update the state of the segfile. This would ensure that
the QD doesn't mark the segfile as AVAILABLE (the state would be
AOSEG_STATE_AWAITING_DROP on both QD and QE).
The first downside is that this would add considerable complexity to the
already complex legacy vacuum code for a rare scenario. It may also leave
many segfiles in the AOSEG_STATE_AWAITING_DROP state for a prolonged
period, which could lead to a state where we run out of segfiles to I/U
into.

Conclusion:
Our chosen fix involves the least complexity and is aligned with
behavior on 6X+. Based on feedback, we may incorporate one of the
alternative fixes in the future.
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Parent 934a67b2
@@ -431,7 +431,6 @@ AOCSDrop(Relation aorel,
 	int			total_segfiles;
 	AOCSFileSegInfo **segfile_array;
 	int			i, segno;
-	LockAcquireResult acquireResult;
 	AOCSFileSegInfo *fsinfo;

 	Assert(Gp_role == GP_ROLE_EXECUTE || Gp_role == GP_ROLE_UTILITY);
@@ -455,21 +454,14 @@ AOCSDrop(Relation aorel,
 	}

 	/*
-	 * Try to get the transaction write-lock for the Append-Only segment file.
+	 * Get the transaction write-lock for the Append-Only segment file.
	 *
	 * NOTE: This is a transaction scope lock that must be held until commit / abort.
	 */
-	acquireResult = LockRelationAppendOnlySegmentFile(
-											&aorel->rd_node,
-											segfile_array[i]->segno,
-											AccessExclusiveLock,
-											/* dontWait */ true);
-	if (acquireResult == LOCKACQUIRE_NOT_AVAIL)
-	{
-		elog(DEBUG5, "drop skips AOCS segfile %d, "
-			 "relation %s", segfile_array[i]->segno, relname);
-		continue;
-	}
+	LockRelationAppendOnlySegmentFile(&aorel->rd_node,
+									  segfile_array[i]->segno,
+									  AccessExclusiveLock,
+									  /* dontWait */ false);

 	/* Re-fetch under the write lock to get latest committed eof. */
 	fsinfo = GetAOCSFileSegInfo(aorel, SnapshotNow, segno);
@@ -5490,7 +5490,19 @@ open_relation_and_check_permission(VacuumStmt *vacstmt,
		 * marked.
		 */
		lmode = AccessExclusiveLock;
-		dontWait = true;
+		/*
+		 * We don't block trying to acquire the Access Exclusive lock on QD.
+		 * Since lazy vacuum is a best-effort to reclaim space, it is okay to
+		 * defer a drop of the segfile and leave the segment file in the
+		 * AOSEG_STATE_AWAITING_DROP state. If this is the QE, we should not
+		 * skip the drop phase as the current QD code (post drop phase update of
+		 * AppendOnlyHash) assumes that if a dispatch is successful, the segfile
+		 * always is dropped successfully and put into the AVAILABLE state on
+		 * the QE (unless the drop phase transaction aborts on the QE). Hence,
+		 * we block on QE but not on QD.
+		 */
+		if (Gp_role == GP_ROLE_DISPATCH)
+			dontWait = true;
		SIMPLE_FAULT_INJECTOR(VacuumRelationOpenRelationDuringDropPhase);
	}
	else if (!vacstmt->vacuum)
-- @Description Assert that QEs don't skip a vacuum drop phase (unless we have
-- an abort), thereby guaranteeing that segfile states are consistent across QD/QE.
-- Given we have an AO table
1: CREATE TABLE ao_test_drop_phase (a INT, b INT) WITH (appendonly=true);
CREATE
-- And the AO table has all tuples on primary with dbid = 2
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(1, 5)i;
INSERT 5
-- We should see 1 pg_aoseg catalog table tuple in state 1 (AVAILABLE) for
-- segno = 1
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |128|5 |1 |128 |1 |3 |1
(1 row)
-- And we create a utility mode session on the primary with dbid = 2 in order
-- to take an access shared lock.
2U: BEGIN;
BEGIN
2U: SELECT COUNT(*) FROM ao_test_drop_phase;
count
-----
5
(1 row)
-- And we delete 4/5 rows to trigger vacuum's compaction phase.
1: DELETE FROM ao_test_drop_phase where b != 5;
DELETE 4
-- We should see that VACUUM blocks while the utility mode session holds the
-- access shared lock
1&: VACUUM ao_test_drop_phase; <waiting ...>
2U: END;
END
1<: <... completed>
VACUUM
-- We should see that the one visible tuple left after the DELETE gets compacted
-- from segno = 1 to segno = 2.
-- Also, segno = 1 should be empty and in state 1 (AVAILABLE)
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |0 |0 |0 |0 |1 |3 |1
2 |40 |1 |1 |40 |1 |3 |1
(2 rows)
-- We should see that the master's hash table matches dbid = 2's pg_aoseg catalog
1: SELECT segno, total_tupcount, state FROM gp_toolkit.__gp_get_ao_entry_from_cache('ao_test_drop_phase'::regclass::oid) WHERE segno IN (1, 2);
segno|total_tupcount|state
-----+--------------+-----
1 |0 |1
2 |1 |1
(2 rows)
-- We should see that a subsequent insert succeeds and lands on segno = 1
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(11, 15)i;
INSERT 5
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |128|5 |1 |128 |2 |3 |1
2 |40 |1 |1 |40 |1 |3 |1
(2 rows)
1: SELECT * FROM ao_test_drop_phase;
a|b
-+--
1|11
1|12
1|13
1|14
1|15
1|5
(6 rows)
@@ -10,7 +10,7 @@ test: pg_views_concurrent_drop
 test: resource_queue
 test: alter_blocks_for_update_and_viceversa
 test: reader_waits_for_lock
-test: drop_rename concurrent_schema_drop
+test: drop_rename concurrent_schema_drop vacuum_drop_phase_ao
 test: instr_in_shmem_setup
 test: instr_in_shmem_terminate
 test: instr_in_shmem_cleanup
-- @Description Assert that QEs don't skip a vacuum drop phase (unless we have
-- an abort), thereby guaranteeing that segfile states are consistent across QD/QE.
-- Given we have an AO table
1: CREATE TABLE ao_test_drop_phase (a INT, b INT) WITH (appendonly=true);
-- And the AO table has all tuples on primary with dbid = 2
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(1, 5)i;
-- We should see 1 pg_aoseg catalog table tuple in state 1 (AVAILABLE) for
-- segno = 1
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
-- And we create a utility mode session on the primary with dbid = 2 in order
-- to take an access shared lock.
2U: BEGIN;
2U: SELECT COUNT(*) FROM ao_test_drop_phase;
-- And we delete 4/5 rows to trigger vacuum's compaction phase.
1: DELETE FROM ao_test_drop_phase where b != 5;
-- We should see that VACUUM blocks while the utility mode session holds the
-- access shared lock
1&: VACUUM ao_test_drop_phase;
2U: END;
1<:
-- We should see that the one visible tuple left after the DELETE gets compacted
-- from segno = 1 to segno = 2.
-- Also, segno = 1 should be empty and in state 1 (AVAILABLE)
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
-- We should see that the master's hash table matches dbid = 2's pg_aoseg catalog
1: SELECT segno, total_tupcount, state
FROM gp_toolkit.__gp_get_ao_entry_from_cache('ao_test_drop_phase'::regclass::oid)
WHERE segno IN (1, 2);
-- We should see that a subsequent insert succeeds and lands on segno = 1
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(11, 15)i;
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
1: SELECT * FROM ao_test_drop_phase;