Fix postmaster reset failure on segment nodes with mirror configured
If a QE crashes for reasons such as SIGSEGV, SIGKILL or PANIC, segment postmaster reset sometimes fails.

Root cause: the primary segment postmaster first tells its child processes to exit, then starts a filerep peer reset process to instruct the mirror postmaster to reset. The filerep peer reset process exits only when the mirror postmaster finishes or fails its reset procedure. The primary postmaster waits for the termination of important processes (AutoVacuum, BgWriter, CheckPoint, the filerep peer reset process, etc.) before it resets shared memory and restarts auxiliary processes. If the mirror postmaster is hanging or waiting for some event, the primary postmaster gets stuck in the filerep peer reset step: the filerep peer reset process waits until it times out (1 hour) and retries 10 times before reporting failure to the primary postmaster. As a result, the primary postmaster takes 10 hours in total to report the reset failure.

This happens almost every time when the mirror segment host machine performs poorly, for the following reason. The mirror postmaster follows a reset procedure similar to the primary's: it notifies its child processes to exit, waits for them to terminate, and then restarts auxiliary processes. The filerep peer reset process first connects to the mirror postmaster to request a postmaster reset, then checks the mirror's reset status every 10ms by connecting to the mirror postmaster again. These repeated connections make the mirror postmaster fork dead_end backend processes continuously, while at the same time it is waiting for all dead_end backend processes to exit. New dead_end processes can therefore be generated faster than existing ones exit, and the mirror postmaster never sees its child processes cleared.

All in all, this can lead to a hang and to failure of postmaster reset.
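The race between status polling and dead_end process exit can be illustrated with a toy discrete-time model. This is only a sketch under simplifying assumptions, not the postmaster code: the function name, the fixed per-child lifetime, and the 1 ms time step are all hypothetical. Only the 10ms polling interval comes from the description above.

```python
def first_quiet_time(poll_interval_ms, child_lifetime_ms, horizon_ms):
    """Return the first millisecond at which no dead_end child is alive,
    or None if that never happens within horizon_ms.

    Hypothetical model: every poll forks one dead_end child that exits
    child_lifetime_ms later; the postmaster can only proceed with its
    reset once it observes zero live children.
    """
    exit_times = []  # scheduled exit time of each live dead_end child
    for now in range(horizon_ms):
        # Reap children whose exit time has been reached.
        exit_times = [t for t in exit_times if t > now]
        # A status poll arrives and forks a new dead_end child.
        if now % poll_interval_ms == 0:
            exit_times.append(now + child_lifetime_ms)
        # The postmaster checks whether all children are gone.
        if not exit_times:
            return now
    return None


# Fast host: each child exits well before the next poll arrives.
fast = first_quiet_time(poll_interval_ms=10, child_lifetime_ms=5, horizon_ms=1000)

# Slow host: children outlive the polling interval, so one is always alive.
slow = first_quiet_time(poll_interval_ms=10, child_lifetime_ms=25, horizon_ms=1000)
```

In this model the postmaster sees a quiet moment only when a child exits before the next poll arrives. Once the child lifetime reaches or exceeds the polling interval, as could be the case on a slow mirror host, there is always at least one live dead_end child and the reset never makes progress.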
This issue also affects master postmaster reset under heavy workload.