Unverified commit 934d87c6, authored by Paul Guo, committed by GitHub

Enlarge timeout in isolation2:pg_ctl UDF (#9991)

Currently this UDF may falsely report success when the node is still starting
up after the timeout expires, because pg_ctl returns 0 in that case. Upstream
changed this behavior in the patch below:

commit f13ea95f
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Wed Jun 28 17:31:24 2017 -0400

    Change pg_ctl to detect server-ready by watching status in postmaster.pid.

We've seen some test flakiness caused by this issue, since pg_ctl restart
sometimes needs more time on the pipeline (the default pg_ctl timeout is 60
seconds). Yesterday I found, in a hung job, that a primary needed about 4
minutes to finish recovery during 'pg_ctl restart'. (The test was
ao_same_trans_truncate_crash, which enables fsync. Even though it launches a
checkpoint before pg_ctl restart, the restart still takes a long time.)

Enlarge the pg_ctl timeout to 600 seconds and check pg_ctl's stdout before
returning OK in the UDF. This check can be removed once the PG 12 merge
finishes, so I added a FIXME there.

Here is the output of the pg_ctl experiment:

$ pg_ctl -l postmaster.log -D /data/gpdb7/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0 -w -m immediate restart -t 1
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
server is still starting up
$ echo $?
0
Reviewed-by: NAsim R P <apraveen@pivotal.io>
Parent 5a855614
@@ -20,14 +20,22 @@ returns text as $$
         cmd = 'pg_ctl promote -D %s' % datadir
     elif command in ('stop', 'restart'):
         cmd = 'pg_ctl -l postmaster.log -D %s ' % datadir
-        cmd = cmd + '-w -m %s %s' % (command_mode, command)
+        cmd = cmd + '-w -t 600 -m %s %s' % (command_mode, command)
     else:
         return 'Invalid command input'
     proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                             shell=True)
     stdout, stderr = proc.communicate()
-    if proc.returncode == 0:
+    # GPDB_12_MERGE_FIXME: upstream patch f13ea95f9e473a43ee4e1baeb94daaf83535d37c
+    # (Change pg_ctl to detect server-ready by watching status in postmaster.pid.)
+    # makes pg_ctl return 1 when the postgres is still starting up after timeout
+    # so there is only need of checking of returncode then. For now we still
+    # need to check stdout additionally since if the postgres is starting up
+    # pg_ctl still returns 0 after timeout.
+    if proc.returncode == 0 and stdout.find("server is still starting up") == -1:
         return 'OK'
     else:
         raise PgCtlError(stdout+'|'+stderr)
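
The success check introduced by this change can be looked at in isolation. The sketch below is not part of the commit: `pg_ctl_succeeded` and `STILL_STARTING` are hypothetical names chosen for illustration, and the bytes handling is an assumption added so the snippet also works with Python 3's `subprocess`, where `communicate()` returns bytes unless text mode is requested. It is fed the output captured in the experiment from the commit message.

```python
# Hypothetical helper, not taken from the commit: decides whether a pg_ctl
# run really succeeded when pg_ctl lacks upstream commit f13ea95f.
STILL_STARTING = "server is still starting up"


def pg_ctl_succeeded(returncode, stdout):
    """Return True only if pg_ctl exited 0 and did not give up waiting."""
    # Assumption: accept bytes as well, since Python 3 pipes yield bytes.
    if isinstance(stdout, bytes):
        stdout = stdout.decode("utf-8", "replace")
    return returncode == 0 and STILL_STARTING not in stdout


# Output from the experiment in the commit message; pg_ctl exited with 0.
timed_out = (
    "waiting for server to shut down.... done\n"
    "server stopped\n"
    "waiting for server to start.... stopped waiting\n"
    "server is still starting up\n"
)

print(pg_ctl_succeeded(0, timed_out))  # False: exit code alone is misleading
```

Matching on a message string is fragile, which is presumably why the commit marks the check with a FIXME to drop it once pg_ctl itself returns a non-zero code in this situation.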