Fix the postmaster reset failure on master node
If a QD crashes for reasons such as SIGSEGV, SIGKILL or PANIC, the postmaster reset sometimes fails. The root cause is as follows. The postmaster first tells its child processes to exit, then waits for the termination of important processes such as AutoVacuum, BgWriter and CheckPoint before it resets shared memory and restarts the auxiliary processes. However, the WAL writer process is missing from that waiting list. As a result, the postmaster can spawn StartupProcess and only afterwards notice the exit of the WAL writer, at which point it tells StartupProcess to exit. The postmaster then notices the abnormal exit of StartupProcess, treats it as a recovery failure, and calls exit() itself. We thus end up with no postmaster process on the master node at all. This happens almost every time when the master host machine has poor performance.
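The race can be illustrated with a minimal, self-contained sketch. It is not the actual postmaster code: `ChildKind`, `ChildTable` and both `ready_to_reset_*` functions are hypothetical names invented for illustration, and a pid of 0 is assumed to mean "this child has already exited".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the children the postmaster waits for before a reset. */
typedef enum
{
    CHILD_AUTOVACUUM,
    CHILD_BGWRITER,
    CHILD_CHECKPOINT,
    CHILD_WALWRITER,
    CHILD_COUNT
} ChildKind;

/* Assumption: pid == 0 means the child has already been reaped. */
typedef struct
{
    int pid[CHILD_COUNT];
} ChildTable;

/* Behaviour before the fix: the WAL writer is not in the waiting list. */
static bool
ready_to_reset_buggy(const ChildTable *t)
{
    assert(t != NULL);
    return t->pid[CHILD_AUTOVACUUM] == 0 &&
           t->pid[CHILD_BGWRITER] == 0 &&
           t->pid[CHILD_CHECKPOINT] == 0;
}

/* Behaviour after the fix: every tracked child, WAL writer included, must be gone. */
static bool
ready_to_reset_fixed(const ChildTable *t)
{
    assert(t != NULL);
    for (int kind = 0; kind < CHILD_COUNT; kind++)
    {
        if (t->pid[kind] != 0)
            return false;   /* still running: keep waiting */
    }
    return true;
}
```

With a table in which only the WAL writer is still alive, the first predicate already allows the reset (and therefore the launch of StartupProcess), while the second keeps waiting until the WAL writer has exited.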