Enlarge timeout in isolation2:pg_ctl UDF (#9991)

Currently this UDF might report a false positive if the node is still starting up when the timeout expires, because pg_ctl returns 0 in that case. This behavior is changed upstream by the patch below:

    commit f13ea95f
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Wed Jun 28 17:31:24 2017 -0400

        Change pg_ctl to detect server-ready by watching status in postmaster.pid.

We've seen some test flakiness due to this issue, since pg_ctl restart sometimes needs more time on the pipeline (the default pg_ctl timeout is 60 seconds). Yesterday I found in a hung job that a primary needed about 4 minutes to finish recovery during 'pg_ctl restart'. (The test was ao_same_trans_truncate_crash, which enables fsync. Even though it launches a checkpoint before pg_ctl restart, the restart still needs a lot of time.)

Enlarge the pg_ctl timeout to 600 seconds and add a check of the pg_ctl stdout before returning OK in the UDF. (This check could be removed after the PG 12 merge finishes, so I added a FIXME there.)

Here is the output of the pg_ctl experiment:

    $ pg_ctl -l postmaster.log -D /data/gpdb7/gpAux/gpdemo/datadirs/dbfast1/demoDataDir0 -w -m immediate restart -t 1
    waiting for server to shut down.... done
    server stopped
    waiting for server to start.... stopped waiting
    server is still starting up
    $ echo $?
    0

Reviewed-by: N Asim R P <apraveen@pivotal.io>

934d87c6 · Paul Guo · GitHub · 5a855614 · 934d87c6

Showing with 10 additions and 2 deletions

src/test/isolation2/helpers/server_helpers.sql +10 -2

--- a/src/test/isolation2/helpers/server_helpers.sql
+++ b/src/test/isolation2/helpers/server_helpers.sql
@@ -20,14 +20,22 @@ returns text as $$
        cmd = 'pg_ctl promote -D %s' % datadir
    elif command in ('stop', 'restart'):
        cmd = 'pg_ctl -l postmaster.log -D %s ' % datadir
-        cmd = cmd + '-w -m %s %s' % (command_mode, command)
+        cmd = cmd + '-w -t 600 -m %s %s' % (command_mode, command)
    else:
        return 'Invalid command input'

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            shell=True)
    stdout, stderr = proc.communicate()
-    if proc.returncode == 0:
+
+    # GPDB_12_MERGE_FIXME: upstream patch f13ea95f9e473a43ee4e1baeb94daaf83535d37c
+    # (Change pg_ctl to detect server-ready by watching status in postmaster.pid.)
+    # makes pg_ctl return 1 when the postgres is still starting up after timeout
+    # so there is only need of checking of returncode then. For now we still
+    # need to check stdout additionally since if the postgres is starting up
+    # pg_ctl still returns 0 after timeout.
+
+    if proc.returncode == 0 and stdout.find("server is still starting up") == -1:
        return 'OK'
    else:
        raise PgCtlError(stdout+'|'+stderr)
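
The success check introduced by this patch can be exercised in isolation. The following is a minimal sketch, not the UDF itself: the helper name `pg_ctl_succeeded` and the constant `STILL_STARTING` are hypothetical, and the sample strings mirror the pg_ctl output quoted in the commit message rather than output captured from a live server. It assumes Python 3, so it decodes bytes before searching, whereas the UDF above uses `stdout.find(...)` directly.

```python
# Marker text printed by pg_ctl when it gives up waiting for startup
# (taken from the experiment in the commit message).
STILL_STARTING = "server is still starting up"


def pg_ctl_succeeded(returncode, stdout):
    """Hypothetical helper: True only if pg_ctl exited 0 AND did not time out.

    Before upstream commit f13ea95f, pg_ctl exits 0 even when the server is
    still starting up after the timeout, so the exit code alone is not enough.
    """
    if isinstance(stdout, bytes):
        # subprocess pipes yield bytes on Python 3 unless text mode is used.
        stdout = stdout.decode("utf-8", errors="replace")
    return returncode == 0 and STILL_STARTING not in stdout


# Output shaped like the '-t 1' experiment: exit code 0, but startup unfinished.
timed_out = (
    "waiting for server to shut down.... done\n"
    "server stopped\n"
    "waiting for server to start.... stopped waiting\n"
    "server is still starting up\n"
)
# Illustrative output for a restart that completed within the timeout.
started = "waiting for server to start.... done\nserver started\n"

print(pg_ctl_succeeded(0, timed_out))  # False: false positive is caught
print(pg_ctl_succeeded(0, started))    # True
print(pg_ctl_succeeded(1, ""))         # False: non-zero exit code
```

Once the upstream change is merged, the string check becomes redundant and the exit code alone should suffice, which is what the GPDB_12_MERGE_FIXME comment records.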