Commit aee8cac8 authored by Soumyadeep Chakraborty

Avoid I/U failure due to out-of-sync AO segfile state between QD & QE

Issue:
The QE, when unable to acquire an exclusive lock on an AO/AOCO relation
during the drop phase of vacuum, skips dropping the file, and the update of
its state from AOSEG_STATE_AWAITING_DROP to AVAILABLE is not performed.
Despite that, the QD moves forward and transitions the segfile state
to AVAILABLE. This leaves the master and segment with out-of-sync states
for the file, and hence the master might erroneously schedule that
segfile for insert/update (I/U), causing ERRORs.

High-level vacuum flow for an AO table:
Prepare phase
while (num_of_segfiles_for_table)
{
    Compaction phase
    Drop phase
}
Cleanup phase
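The divergence described above can be seen as a two-party state machine that only stays consistent if both sides take the same transition. A toy Python model (illustrative only, not gpdb code; the state names mirror gpdb's, everything else is hypothetical):

```python
# Toy model of the AO segfile drop-phase transition before the fix.
AVAILABLE = 1
AWAITING_DROP = 2

def qe_drop_phase(state, lock_acquired):
    """Pre-fix QE behavior: without the lock, the drop is skipped and the
    segfile stays in AWAITING_DROP."""
    if state == AWAITING_DROP and lock_acquired:
        return AVAILABLE
    return state

def qd_drop_phase(state):
    """The QD transitions the segfile unconditionally after a successful
    dispatch, assuming the QE dropped the file."""
    return AVAILABLE if state == AWAITING_DROP else state

# A concurrent reader holds a lock on the QE, so the QE cannot acquire it.
qd = qd_drop_phase(AWAITING_DROP)
qe = qe_drop_phase(AWAITING_DROP, lock_acquired=False)
assert qd != qe  # out of sync: QD will schedule I/U into a segfile the QE never dropped
```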

Our fix:
Wait to acquire the lock on the QE instead of skipping the vacuum drop
phase if a concurrent read query is running. The QD still skips if a read
query is running and leaves the segfile in AOSEG_STATE_AWAITING_DROP. This
way the code is aligned with current gpdb master and 6X_STABLE code. Since
operations that acquire a lock on a QE without first acquiring the same
lock on the QD should be rare or nonexistent, it is okay to introduce the
wait on the QE.
The downside of this approach is that if read queries run on a QE
without a lock on the QD, vacuum will keep waiting to acquire the lock
for every segment file, which could lead to a very long vacuum runtime. An
example of such a workload is a concurrent COPY into a partitioned table:
COPY acquires the lock for both root and child partitions on the QE but
only acquires the lock for the root on the QD.
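The role-dependent lock policy boils down to a single predicate on the process role; a minimal sketch (hypothetical helper name, not gpdb code):

```python
# Hypothetical helper mirroring the fix's policy for the dontWait flag:
# only the dispatcher (QD) may give up on the segfile lock and defer the
# drop; executors (QE) must block, so that a successful dispatch always
# implies a successful drop on the QE.
def dont_wait_for_segfile_lock(role):
    return role == "QD"

assert dont_wait_for_segfile_lock("QD") is True   # QD: try-lock, skip on failure
assert dont_wait_for_segfile_lock("QE") is False  # QE: block until acquired
```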

Also, `AOCSDrop()` now waits to acquire the lock in the same way. In
practice this doesn't cause any issue, as a table-level AccessExclusive
lock is always acquired before calling `AOCSDrop()`, but it is better to
remove the skipping code and align this with `AppendOnlyDrop()`.

Alternative fixes:
1. elog(ERROR) on the QE when unable to acquire the lock. This correctly
aborts the drop-phase transaction on both the QE and the QD, leaving the
segfile state as AOSEG_STATE_AWAITING_DROP on both. We rejected this
solution because a) it unnecessarily emits an error to the user for a
vacuum, and b) if it errors out for one segfile, the whole command
terminates, meaning vacuum will not proceed to compact the table's other
segfiles.

2. If we can't acquire the lock on the QE, report that back to the QD so
it can correctly update the state of the segfile. This would ensure that
the QD doesn't mark the segfile as AVAILABLE (the state would be
AOSEG_STATE_AWAITING_DROP on both QD and QE).
The first downside is that this would add considerable complexity to the
already complex legacy vacuum code for a rare scenario. It may also leave
many segfiles in the AOSEG_STATE_AWAITING_DROP state for a prolonged
period, which could lead to a state where we run out of segfiles to I/U
into.

Conclusion:
Our chosen fix involves the least complexity and is aligned with
behavior on 6X+. Based on feedback, we may incorporate one of the
alternative fixes in the future.
Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Parent 934a67b2
@@ -431,7 +431,6 @@ AOCSDrop(Relation aorel,
 	int			total_segfiles;
 	AOCSFileSegInfo **segfile_array;
 	int			i, segno;
-	LockAcquireResult acquireResult;
 	AOCSFileSegInfo *fsinfo;

 	Assert(Gp_role == GP_ROLE_EXECUTE || Gp_role == GP_ROLE_UTILITY);
@@ -455,21 +454,14 @@ AOCSDrop(Relation aorel,
 	}

 	/*
-	 * Try to get the transaction write-lock for the Append-Only segment file.
+	 * Get the transaction write-lock for the Append-Only segment file.
	 *
	 * NOTE: This is a transaction scope lock that must be held until commit / abort.
	 */
-	acquireResult = LockRelationAppendOnlySegmentFile(
-											&aorel->rd_node,
-											segfile_array[i]->segno,
-											AccessExclusiveLock,
-											/* dontWait */ true);
-	if (acquireResult == LOCKACQUIRE_NOT_AVAIL)
-	{
-		elog(DEBUG5, "drop skips AOCS segfile %d, "
-			 "relation %s", segfile_array[i]->segno, relname);
-		continue;
-	}
+	LockRelationAppendOnlySegmentFile(&aorel->rd_node,
+									  segfile_array[i]->segno,
+									  AccessExclusiveLock,
+									  /* dontWait */ false);

 	/* Re-fetch under the write lock to get latest committed eof. */
 	fsinfo = GetAOCSFileSegInfo(aorel, SnapshotNow, segno);
@@ -5490,7 +5490,19 @@ open_relation_and_check_permission(VacuumStmt *vacstmt,
		 * marked.
		 */
		lmode = AccessExclusiveLock;
-		dontWait = true;
+		/*
+		 * We don't block trying to acquire the Access Exclusive lock on QD.
+		 * Since lazy vacuum is a best-effort to reclaim space, it is okay to
+		 * defer a drop of the segfile and leave the segment file in the
+		 * AOSEG_STATE_AWAITING_DROP state. If this is the QE, we should not
+		 * skip the drop phase as the current QD code (post drop phase update of
+		 * AppendOnlyHash) assumes that if a dispatch is successful, the segfile
+		 * always is dropped successfully and put into the AVAILABLE state on
+		 * the QE (unless the drop phase transaction aborts on the QE). Hence,
+		 * we block on QE but not on QD.
+		 */
+		if (Gp_role == GP_ROLE_DISPATCH)
+			dontWait = true;
		SIMPLE_FAULT_INJECTOR(VacuumRelationOpenRelationDuringDropPhase);
	}
	else if (!vacstmt->vacuum)
-- @Description Assert that QEs don't skip a vacuum drop phase (unless we have
-- an abort), thereby guaranteeing that segfile states are consistent across QD/QE.
-- Given we have an AO table
1: CREATE TABLE ao_test_drop_phase (a INT, b INT) WITH (appendonly=true);
CREATE
-- And the AO table has all tuples on primary with dbid = 2
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(1, 5)i;
INSERT 5
-- We should see 1 pg_aoseg catalog table tuple in state 1 (AVAILABLE) for
-- segno = 1
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |128|5 |1 |128 |1 |3 |1
(1 row)
-- And we create a utility mode session on the primary with dbid = 2 in order
-- to take an access shared lock.
2U: BEGIN;
BEGIN
2U: SELECT COUNT(*) FROM ao_test_drop_phase;
count
-----
5
(1 row)
-- And we delete 4/5 rows to trigger vacuum's compaction phase.
1: DELETE FROM ao_test_drop_phase where b != 5;
DELETE 4
-- We should see that VACUUM blocks while the utility mode session holds the
-- access shared lock
1&: VACUUM ao_test_drop_phase; <waiting ...>
2U: END;
END
1<: <... completed>
VACUUM
-- We should see that the one visible tuple left after the DELETE gets compacted
-- from segno = 1 to segno = 2.
-- Also, segno = 1 should be empty and in state 1 (AVAILABLE)
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |0 |0 |0 |0 |1 |3 |1
2 |40 |1 |1 |40 |1 |3 |1
(2 rows)
-- We should see that the master's hash table matches dbid = 2's pg_aoseg catalog
1: SELECT segno, total_tupcount, state FROM gp_toolkit.__gp_get_ao_entry_from_cache('ao_test_drop_phase'::regclass::oid) WHERE segno IN (1, 2);
segno|total_tupcount|state
-----+--------------+-----
1 |0 |1
2 |1 |1
(2 rows)
-- We should see that a subsequent insert succeeds and lands on segno = 1
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(11, 15)i;
INSERT 5
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
segno|eof|tupcount|varblockcount|eof_uncompressed|modcount|formatversion|state
-----+---+--------+-------------+----------------+--------+-------------+-----
1 |128|5 |1 |128 |2 |3 |1
2 |40 |1 |1 |40 |1 |3 |1
(2 rows)
1: SELECT * FROM ao_test_drop_phase;
a|b
-+--
1|11
1|12
1|13
1|14
1|15
1|5
(6 rows)
@@ -10,7 +10,7 @@ test: pg_views_concurrent_drop
 test: resource_queue
 test: alter_blocks_for_update_and_viceversa
 test: reader_waits_for_lock
-test: drop_rename concurrent_schema_drop
+test: drop_rename concurrent_schema_drop vacuum_drop_phase_ao
 test: instr_in_shmem_setup
 test: instr_in_shmem_terminate
 test: instr_in_shmem_cleanup
-- @Description Assert that QEs don't skip a vacuum drop phase (unless we have
-- an abort), thereby guaranteeing that segfile states are consistent across QD/QE.
-- Given we have an AO table
1: CREATE TABLE ao_test_drop_phase (a INT, b INT) WITH (appendonly=true);
-- And the AO table has all tuples on primary with dbid = 2
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(1, 5)i;
-- We should see 1 pg_aoseg catalog table tuple in state 1 (AVAILABLE) for
-- segno = 1
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
-- And we create a utility mode session on the primary with dbid = 2 in order
-- to take an access shared lock.
2U: BEGIN;
2U: SELECT COUNT(*) FROM ao_test_drop_phase;
-- And we delete 4/5 rows to trigger vacuum's compaction phase.
1: DELETE FROM ao_test_drop_phase where b != 5;
-- We should see that VACUUM blocks while the utility mode session holds the
-- access shared lock
1&: VACUUM ao_test_drop_phase;
2U: END;
1<:
-- We should see that the one visible tuple left after the DELETE gets compacted
-- from segno = 1 to segno = 2.
-- Also, segno = 1 should be empty and in state 1 (AVAILABLE)
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
-- We should see that the master's hash table matches dbid = 2's pg_aoseg catalog
1: SELECT segno, total_tupcount, state
FROM gp_toolkit.__gp_get_ao_entry_from_cache('ao_test_drop_phase'::regclass::oid)
WHERE segno IN (1, 2);
-- We should see that a subsequent insert succeeds and lands on segno = 1
1: INSERT INTO ao_test_drop_phase SELECT 1,i from generate_series(11, 15)i;
2U: SELECT * FROM gp_toolkit.__gp_aoseg_name('ao_test_drop_phase');
1: SELECT * FROM ao_test_drop_phase;