Do not terminate the connection and quit in SyncRepWaitForLSN().
We previously had the code to terminate the connection if needed on QE to avoid potential data inconsistency. This is gpdb specific since upstream code there seems to be not friendly for failover + data consistency. However that introduces various abort or assert failures since apparently some shm exit callback functions are not friendly to the current transaction state. Below are some stack examples. Originally I fixed them in those callback functions but I found on gpdb6 after I fixed one, another one (in another callback function) come out. That's why I could collect so many gpdb 6 stacks below. I just collect one gpdb master stack, but I it should have more stacks also if we fixing in those callbacks one by one. Anyway finally I decide to fix by delaying the ereport(FATAL) exec_mpp_dtx_protocol_command() instead, and let QD retry 2PC to ensure the data consistency. Note 1PC retry is currently not implemented but this should be in another PR. gpdb master (7) stack: 2 0x0000000000b48ddc in ExceptionalCondition (conditionName=0xe527e8 "!(ShmemAddrIsValid(nextElem))", errorType=0xe527bd "FailedAssertion", fileName=0xe527b2 "shmqueue.c", lineNumber=74) at assert.c:66 3 0x0000000000996311 in SHMQueueDelete (queue=0x7ff5e6676da8) at shmqueue.c:74 4 0x00000000009689de in SyncRepCleanupAtProcExit () at syncrep.c:436 5 0x00000000009a7b49 in ProcKill (code=1, arg=0) at proc.c:949 6 0x000000000098c001 in shmem_exit (code=1) at ipc.c:288 7 0x000000000098be5f in proc_exit_prepare (code=1) at ipc.c:212 8 0x000000000098bd64 in proc_exit (code=1) at ipc.c:104 9 0x0000000000b4a7d4 in errfinish (dummy=0) at elog.c:738 10 0x000000000096860e in SyncRepWaitForLSN (lsn=210148624, commit=1 '\001') at syncrep.c:303 11 0x000000000055c082 in RecordTransactionCommitPrepared (xid=638, gid=0x2c603ad "1575462785-0000000012", nchildren=0, children=0x2c6d2d0, nrels=0, rels=0x2c6d2d0, ndeldbs=0, deldbs=0x2c6d2d0, ninvalmsgs=0, invalmsgs=0x2c6d2d0, initfileinval=0 '\000') at twophase.c:2283 12 0x000000000055aae3 in FinishPreparedTransaction (gid=0x2c603ad "1575462785-0000000012", isCommit=1 '\001', raiseErrorIfNotFound=0 '\000') at twophase.c:1493 13 0x0000000000c4e4fe in performDtxProtocolCommitPrepared (gid=0x2c603ad "1575462785-0000000012", raiseErrorIfNotFound=0 '\000') at cdbtm.c:2037 14 0x0000000000c4e9d5 in performDtxProtocolCommand (dtxProtocolCommand=DTX_PROTOCOL_COMMAND_RECOVERY_COMMIT_PREPARED, gid=0x2c603ad "1575462785-0000000012", contextInfo=0x1220f20) at cdbtm.c:2215 gpdb 6 stacks: 2 0x0000000000ad9ea5 in ExceptionalCondition (conditionName=0xdbfddb "!(MyProc->syncRepState == 0)", errorType=0xdbfd28 "FailedAssertion", fileName=0xdbfcd0 "syncrep.c", lineNumber=130) at assert.c:66 3 0x000000000091ce81 in SyncRepWaitForLSN (XactCommitLSN=3400317528) at syncrep.c:130 4 0x000000000053991a in RecordTransactionCommit () at xact.c:1663 5 0x000000000053b0b2 in CommitTransaction () at xact.c:2756 6 0x000000000053c024 in CommitTransactionCommand () at xact.c:3646 7 0x00000000005c6c25 in RemoveTempRelationsCallback (code=1, arg=0) at namespace.c:4107 8 0x000000000093c353 in shmem_exit (code=1) at ipc.c:257 9 0x000000000093c248 in proc_exit_prepare (code=1) at ipc.c:214 10 0x000000000093c146 in proc_exit (code=1) at ipc.c:104 11 0x0000000000adb93d in errfinish (dummy=0) at elog.c:754 12 0x000000000091d2ef in SyncRepWaitForLSN (XactCommitLSN=3400294096) at syncrep.c:284 13 0x0000000000549d8e in EndPrepare (gxact=0x7f8a7d5fa0e0) at twophase.c:1241 3 0x0000000000ade6d1 in elog_finish (elevel=22, fmt=0xc3a898 "cannot abort transaction %u, it was already committed") at elog.c:1735 4 0x0000000000539d22 in RecordTransactionAbort (isSubXact=0 '\000') at xact.c:1923 5 0x000000000053b95c in AbortTransaction () at xact.c:3340 6 0x000000000053e0a7 in AbortOutOfAnyTransaction () at xact.c:5248 7 0x00000000005c68b9 in RemoveTempRelationsCallback (code=1, arg=0) at namespace.c:4088 8 0x000000000093c371 in shmem_exit (code=1) at ipc.c:257 9 0x000000000093c266 in proc_exit_prepare (code=1) at ipc.c:214 10 0x000000000093c164 in proc_exit (code=1) at ipc.c:104 11 0x0000000000adb94e in errfinish (dummy=0) at elog.c:754 12 0x000000000091d30d in SyncRepWaitForLSN (XactCommitLSN=19529538376) at syncrep.c:284 13 0x000000000053985a in RecordTransactionCommit () at xact.c:1663 2 0x0000000000adb9a9 in ExceptionalCondition (conditionName=0xdb2560 "!(entry->trans == ((void *)0))", errorType=0xdb2550 "FailedAssertion", fileName=0xdb216a "pgstat.c", lineNumber=842) at assert.c:66 3 0x00000000008d3391 in pgstat_report_stat (force=1 '\001') at pgstat.c:842 4 0x00000000008d65e8 in pgstat_beshutdown_hook (code=1, arg=0) at pgstat.c:2685 5 0x000000000093deba in shmem_exit (code=1) at ipc.c:290 6 0x000000000093dd1a in proc_exit_prepare (code=1) at ipc.c:214 7 0x000000000093dc18 in proc_exit (code=1) at ipc.c:104 8 0x0000000000add441 in errfinish (dummy=0) at elog.c:750 9 0x000000000091ee5c in SyncRepWaitForLSN (XactCommitLSN=225227432) at syncrep.c:333 10 0x0000000000549dd8 in EndPrepare (gxact=0x7f02508680e0) at twophase.c:1241 2 0x0000000000adb9c2 in ExceptionalCondition (conditionName=0xdcb458 "!(!((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", errorType=0xdcb408 "FailedAssertion", fileName=0xdcb3d9 "procarray.c", lineNumber=369) at assert.c:66 3 0x000000000093f614 in ProcArrayRemove (proc=0x7f4f1f5a05d0, latestXid=0) at procarray.c:369 4 0x00000000009586ec in RemoveProcFromArray (code=1, arg=0) at proc.c:904 5 0x000000000093ded3 in shmem_exit (code=1) at ipc.c:290 6 0x000000000093dd33 in proc_exit_prepare (code=1) at ipc.c:214 7 0x000000000093dc31 in proc_exit (code=1) at ipc.c:104 8 0x0000000000add45a in errfinish (dummy=0) at elog.c:750 9 0x000000000091ee75 in SyncRepWaitForLSN (XactCommitLSN=348629504) at syncrep.c:333 10 0x0000000000549dd8 in EndPrepare (gxact=0x7f4f1fa8cce0) at twophase.c:1241 11 0x000000000053b621 in PrepareTransaction () at xact.c:3115 Reviewed-by: NAshwin Agrawal <aagrawal@pivotal.io> Reviewed-by: NAsim R P <apraveen@pivotal.io> Cherry-picked from 7b761730
Showing
想要评论请 注册 或 登录