Commits · c91e320c3469fac656ee35a65ffc20102a6d7c6a · Greenplum / Gpdb

13 Jan, 2018 40 commits

Make newly-added gp_bloat_diag test less sensitive. · c91e320c

Committed by Heikki Linnakangas on Jan 02, 2018

The test was sensitive to the number of pages in the pg_rewrite system
table's index, for no good reason. Also, don't create a new database for
it, to speed it up.

c91e320c

Remove LWLockWaitCancel(). · 67795819

Committed by Heikki Linnakangas on Jan 02, 2018

It was only used by the filerep code. Now that that's gone, this was just
dead code.

67795819

Remove dead/unreferenced db state code (#4225) · 145eb5d6

Committed by Jacob Champion on Jan 01, 2018

The DB_IN_STANDBY_NEW_TLI_SET state doesn't really seem to do anything
anymore, as of commit 813b817cc. Remove it entirely to get rid of an
assertion during standby tests. Also remove multipass function
declarations; they're gone.

145eb5d6

Disable new added 2PC test as failing in CI. · 6494b02c

Committed by Ashwin Agrawal on Jan 01, 2018

This test passed locally, and also in the PR pipeline and a forked pipeline multiple
times, but it is intermittently failing in the main CI pipeline, hence disabling
it. The failure happens when the master connects to the segments for the first time
after a PANIC, and it fails with

```
+LOG:  could not connect to segment: initialization of segworker group failed (cdbgang.c:235)
+LOG:  could not connect to segment: initialization of segworker group failed (cdbgang.c:235)

2018-01-02 02:44:16.225927 UTC,"gpadmin","isolation2test",p33565,th-1615808736,"[local]",,2018-01-02 02:44:16 UTC,0,con640,,seg-1,,,,sx1,"FATAL","XX000","DTM initialization: failure during startup recovery, retry failed, check segment status (cdbtm.c:1537)",,"Process 33565 will wait for gp_debug_linger=120 seconds before termination.
Note that its locks and other resources will not be released until then.",,,,,0,,"cdbtm.c",1537,"Stack trace:
1    0x9c1afb postgres errstart + 0x1db
2    0x9c3ca9 postgres elog_finish + 0xb9
3    0xadedea postgres initTM (cdbtm.c:1536)
4    0x9dac77 postgres InitPostgres + 0x857
5    0x8867a7 postgres PostgresMain + 0x207
6    0x81c97d postgres <symbol not found> (postmaster.c:0)
7    0x81eea2 postgres PostmasterMain + 0xc42
8    0x73f0a1 postgres main (main.c:206)
9    0x7f909b4fbd1d libc.so.6 __libc_start_main + 0xfd
10   0x4bf2b5 postgres <symbol not found> + 0x4bf2b5

```

Investigating; will re-enable this newly added test once we find out why the first
connection from the master is failing.

6494b02c

Revert "cs_walrep_1: disable gpactivatestandby tests for now" · c28fd064

Committed by Jacob Champion on Jan 01, 2018

Unfortunately the cluster crashes anyway two tests later. Rather than
comment out half the tests to get a fake green, put this set of tests
back. We'll just have to solve this one problem at a time.

This reverts commit 5982a72614492916187ca27fc660d7cc7e3b69e1.

c28fd064

cs_walrep_1: disable gpactivatestandby tests for now · 1384a094

Committed by Jacob Champion on Jan 01, 2018

The promotion logic that gpactivatestandby relies on doesn't work yet,
and when these tests fail, they leave the cluster completely unusable.

1384a094

Rewrite a 2PC test in isolation2. · c20ac186

Committed by Ashwin Agrawal on Dec 29, 2017

This test in TINC is very shaky, as it brings down the primary and mirror and hence
affects gp_segment_configuration.

The test intends to fail the broadcast of COMMIT PREPARED to one segment and hence
trigger a PANIC in the master while completing phase 2 of 2PC. The master's recovery
cycle should correctly broadcast COMMIT PREPARED again, because the master should
find the distributed commit record in its xlog during recovery. Verify that the
transaction is committed after recovery. This scenario used to create cluster
inconsistency due to a bug that is now fixed: the transaction used to get committed
on all segments except the one where the COMMIT PREPARED broadcast failed before
recovery. The master used to miss sending COMMIT PREPARED across the restart and
instead aborted the transaction after querying in-doubt prepared transactions from
the segments.

c20ac186

Add retry in isolation2 test framework for database restart. · b26fe4eb

Committed by Ashwin Agrawal on Dec 29, 2017

To support writing tests where a session can cause a PANIC of the master, add retry
logic while establishing a connection in isolation2. This helps to keep the tests
simple.
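
A minimal sketch of the kind of connection retry described above, not the actual isolation2 code. The helper name `connect_with_retry`, its parameters, and the use of `ConnectionError` are assumptions made for illustration; the real framework connects through its own database driver.

```python
import time


def connect_with_retry(connect, retries=5, delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or the retries are exhausted.

    `connect` is any zero-argument callable that returns a connection and
    raises ConnectionError while the server is still restarting
    (hypothetical stand-in for the real driver call).
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except ConnectionError as err:
            # The master may still be recovering from the PANIC; wait and retry.
            last_error = err
            sleep(delay)
    raise last_error
```

With a helper like this, a test that deliberately crashes the master can keep issuing statements afterwards without hand-written wait loops.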

b26fe4eb

Adding new faultinjector at start of FinishPreparedTransaction. · 9586ea99

Committed by Ashwin Agrawal on Dec 29, 2017

9586ea99

Add retries using grace period for declaring mirror down. · cd647b1f

Committed by Ashwin Agrawal on Dec 28, 2017

If FTS detects a primary as down, it retries n times before marking it down. But
a mirror gets marked as down if its connection to the primary has not been made or
has been lost. This surfaced as a problem mostly during cluster start (gpstart), where
the sequence is to start primary and mirror, followed by master. In many instances,
when the master probed the primary, the mirror connection was yet to be made, and
hence an up mirror in the configuration unnecessarily got marked down, even if the
mirror established its connection to the primary just a few seconds later.

So, to avoid such situations, and to make it a little more resilient against minor
network glitches, add a variable to record when initialization or disconnection
happened. Using it, the FTS probe can now find how long the mirror has not
shown up. Only if the mirror did not show up for the allowed period (30 secs for now)
is it reported as down; otherwise FTS is requested to retry the probe. This logic
doesn't affect the regular flow and also avoids any waiting in utilities for specific
states after cluster restart.
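
The grace-period decision can be sketched roughly as follows. This is an illustration in Python, not the C code from the commit; the function name, the string results, and the choice of `>=` at the boundary are assumptions. Only the 30-second period comes from the message above.

```python
MIRROR_GRACE_PERIOD_SECS = 30  # allowed period mentioned in the commit message


def mirror_probe_result(mirror_connected, disconnected_since, now,
                        grace=MIRROR_GRACE_PERIOD_SECS):
    """Decide what a probe should report about the mirror.

    `disconnected_since` is the recorded time of initialization or of the
    last disconnection; `now` is the probe time (both in seconds).
    """
    if mirror_connected:
        return "up"
    if now - disconnected_since >= grace:
        # The mirror has not shown up for the whole grace period.
        return "down"
    # Still inside the grace period: ask FTS to probe again later.
    return "retry"
```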

cd647b1f

gpstop -u should not specifically check for "No such process" · bb0b9cb5

Committed by Ashwin Agrawal on Dec 27, 2017

If the postmaster.pid file is present, reload will get the error "No such
process". But if postmaster.pid is not present, the error returned is
"pg_ctl: PID file "......../postmaster.pid" does not exist". So, it's better not
to check for any particular error message but to report the segments that failed
to be reloaded.
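
One way to express this idea is to classify reload results by exit status instead of matching error text. The sketch below is hypothetical and is not the gpstop implementation; the tuple layout and the function name are assumptions.

```python
def failed_reloads(results):
    """Return the names of segments whose reload did not succeed.

    `results` is an iterable of (segment_name, return_code, stderr_text)
    tuples (a hypothetical shape). The error text is deliberately ignored,
    so both "No such process" and a missing postmaster.pid count as failures.
    """
    return [name for name, return_code, _stderr in results if return_code != 0]
```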

bb0b9cb5

Restore logic to skip databases cannot connect for oldest database. · 52884f32

Committed by Ashwin Agrawal on Dec 27, 2017

GPDB skips databases that cannot be connected to when computing the oldest
database in vac_truncate_clog(). Make write_database_file(), which was
reverted to the upstream version, do the same. This helps to get the storage
tests green for now.

Later we can figure out how to uniformly remove this code from vac_truncate_clog()
and write_database_file(), if a better solution is found to the original issue for
which this check was added.

52884f32

gparray only add standbyMaster's host if not same as master. · 6554324b

Committed by Ashwin Agrawal on Dec 27, 2017

6554324b

Add walrep specific states to gparray.py. · f812f6d6

Committed by Ashwin Agrawal on Dec 27, 2017

With walrep we have a new state, 'n' (not in sync). So, add the valid states
corresponding to it to let some tests pass. A lot more cleanup needs to happen in
this area to remove filerep-specific states, but that's work for a different
commit.

f812f6d6

Remove tests for cross check between gp_relation_node and pg_aoseg · d97dfc1e

Committed by Ashwin Agrawal on Dec 27, 2017

Since the gp_relation_node table no longer exists, there is no point testing
whether an ERROR is reported if pg_aoseg and gp_relation_node are not in sync.

d97dfc1e

Delete duplicate_entries tests as its specific to filerep. · 629b7b3f

Committed by Ashwin Agrawal on Dec 27, 2017

The test is specific to filerep behavior where truncate was not properly
resynced, causing the problem.

629b7b3f

Fix expected return code for mm_gpcheckcat. · 40ba2864

Committed by Ashwin Agrawal on Dec 27, 2017

40ba2864

Force FTS scan after stopping the mirror that failed to recover incrementally. · cc8240a9

Committed by Asim R P on Dec 27, 2017

FTS scan marks the stopped mirror as down so that subsequent recoverfull works.

cc8240a9

Remove filerep specific tests from Storage suite. · 2aa2e656

Committed by Ashwin Agrawal on Dec 26, 2017

This removes the make target storage_filerep.

2aa2e656

Delete test_AOCOAlterColumnChangeTracking. · c56c93a7

Committed by Ashwin Agrawal on Dec 26, 2017

This test is no longer relevant with wal replication. This should get the
`aocoalter_catalog_loaders` task in the Storage schedule green.

c56c93a7

Start running segspace test for wal replication. · acec2c16

Committed by Ashwin Agrawal on Dec 26, 2017

Now that gpstop/gpstart works for wal replication, remove segspace from
--exclude-tests. filespace is the only one that remains in the --exclude-tests
list, and it should go away soon as well.

acec2c16

Make gpstart work for walrep mirrors · 2b5720bf

Committed by Jimmy Yih on Dec 21, 2017

All that was needed was to make sure mirrors are not started with
pg_ctl -w flag since the mirror is in recovery mode and will not
respond to PQPing messages.
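
As a rough illustration, the start command could be assembled like this. The helper is hypothetical and far simpler than what gpstart really passes to pg_ctl; only the `start` action and the `-D` and `-w` options are actual pg_ctl options.

```python
def pg_ctl_start_args(datadir, is_mirror):
    """Build a pg_ctl start command line for one segment (illustrative only)."""
    args = ["pg_ctl", "start", "-D", datadir]
    if not is_mirror:
        # Wait for startup to complete only on primaries; a mirror in
        # recovery mode would never answer the readiness check.
        args.append("-w")
    return args
```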

Author: Jimmy Yih <jyih@pivotal.io>
Author: Marbin Tan <mtan@pivotal.io>

2b5720bf

gpinitsystem with walrep mirrors instead of filerep mirrors · ce4d96b6

Committed by Jimmy Yih on Dec 19, 2017

With file replication gone, gpinitsystem should no longer try to
initialize the cluster through filerep sequence.

The sequence now goes as follows:
1. Create and start master in master-only mode
2. Create primaries and register to master
3. Stop master.
4. Run gpstart to start master and primaries.
5. Create mirrors w/ pg_basebackup and register to master.
6. Start the mirrors and wait until primaries and mirrors sync.

Author: Jimmy Yih <jyih@pivotal.io>
Author: Marbin Tan <mtan@pivotal.io>

ce4d96b6

Try to avoid race condition in test, when querying pg_partitions. · 3b74382b

Committed by Heikki Linnakangas on Dec 22, 2017

pg_partitions contains calls to pg_get_expr() function. That function
suffers from a race condition: If the relation is dropped between the
get_rel_name() call, and another syscache lookup in pg_get_expr_worker(),
you get a "relation not found" error. The error message is reasonable,
and I don't see any easy fix for the pg_partitions view itself, so just
try to avoid hitting that in the tests.

For some reason we are hitting that frequently in this particular query.
Change it to query pg_class instead, it doesn't use any of the more
complicated fields from pg_partitions, anyway.

I'm pushing this to the 'walreplication' branch first, because for some
reason, we're seeing the failure there more often than on 'master'. If
this fixes the problem, I'll push this to 'master', too.

3b74382b

Remove remnants of multi-pass startup. · 69578765

Committed by Heikki Linnakangas on Dec 21, 2017

69578765

Shutdown standby master to make walrep related test pass. · 40bc027a

Committed by Max Yang on Dec 21, 2017

Author: Max Yang <myang@pivotal.io>
Author: Xiaoran Wang <xiwang@pivotal.io>

40bc027a

Add answer file for restart_standup case · 2943e824

Committed by Max Yang on Dec 21, 2017

Author: Max Yang <myang@pivotal.io>
Author: Xiaoran Wang <xiwang@pivotal.io>

2943e824

Fix walrep test case failure. · a1990a80

Committed by Max Yang on Dec 21, 2017

Currently we start the standby master when WITH_MIRRORS=true, which
makes the fake wal receiver error out:
number of requested standby connections exceeds max_wal_senders (currently 1)
because the standby master already uses one wal_sender.
To make the test pass, we remove the standby master at the beginning of this test
and recover it at the end of the test.
A better solution may be to make this value configurable at startup time,
but this is just a simple fix to get the test passing.

Author: Max Yang <myang@pivotal.io>
Author: Xiaoran Wang <xiwang@pivotal.io>

a1990a80

Fix bgwriter_checkpoint test case · 5aa2f649

Committed by Max Yang on Dec 21, 2017

Since we start the standby master if WITH_MIRRORS=true, the number of entries
in gp_segment_configuration changes, which changes the answer file.

Author: Max Yang <myang@pivotal.io>
Author: Xiaoran Wang <xiwang@pivotal.io>

5aa2f649

Start standby master in create-demo-cluster when WITH_MIRRORS = true. · 0c7f1281

Committed by Max Yang on Dec 20, 2017

Author: Max Yang <myang@pivotal.io>
Author: Xiaoran Wang <xiwang@pivotal.io>

0c7f1281

gpaddmirrors: fix unit tests · 5cc2ddd0

Committed by Asim R P on Dec 20, 2017

The last commit removed the replication ports (replacing them with -1 in
the Python utilities), and those numbers were being checked as part of
this test. Comment the checks out and tag with FIXMEs.

Author: Asim R P <apraveen@pivotal.io>
Author: Jacob Champion <pchampion@pivotal.io>

5cc2ddd0

Quick fix to make gpstart work. · cadf63a8

Committed by Heikki Linnakangas on Dec 20, 2017

At least with gpdemo, on my laptop.

We really shouldn't need these filerep port numbers anymore, right?

cadf63a8

Remove unused gp_initdb_mirrored variable. · c50fa05d

Committed by Heikki Linnakangas on Dec 20, 2017

And the mechanism in initdb and gpinitsystem to set it. It's no longer
used for anything.

c50fa05d

Remove leftover LWLocks that are now unused. · 8a46029f

Committed by Heikki Linnakangas on Dec 20, 2017

8a46029f

Remove GUCs and fault injection points related to PT and filerep. · e8dc97d4

Committed by Heikki Linnakangas on Dec 20, 2017

These were left over when Persistent Tables and Filerep were removed.

e8dc97d4

Remove cdbmirroredappendonly.[ch]. · 334d41a8

Committed by Heikki Linnakangas on Dec 19, 2017

What was left of it, was a very thin and leaky abstraction, plus WAL-logging
functions. Move the WAL-logging functions to a new file called
cdbappendonlyxlog.c, and dismantle the MirroredAppendOnlyOpen abstraction.

334d41a8

Add more robust retry logic to gp_replica_check, so that it can be run online. · f6d42b45

Committed by Heikki Linnakangas on Dec 19, 2017

Instead of waiting for the primary and mirror to have the exact same LSN,
add logic to retry the file comparisons a few times if there are any
differences. This is a natural continuation of the earlier retry-loops I
added there, but now the LSN checks are made so that we don't even expect
the primary and mirror to sync on a particular value, and we retry not
while trying to sync the LSNs, but during the comparison itself.

This makes it possible to run gp_replica_check on a running cluster, while
modifying tables. (The extra checkpoints it emits will have a performance
impact on the other queries, though.) I tested this by running pgbench at the
same time. You'll get a few NOTICEs about mismatches, but those are
harmless. After a few automatic retries, it eventually passes.
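
The retry-during-comparison idea can be sketched as follows. This is an illustration only: gp_replica_check itself is not written this way, and `list_mismatches`, `max_attempts`, and `on_retry` are hypothetical names.

```python
def compare_with_retries(list_mismatches, max_attempts=3, on_retry=None):
    """Repeat a primary/mirror comparison until it is clean or attempts run out.

    `list_mismatches` is a zero-argument callable returning the files that
    currently differ (hypothetical). Differences seen while the cluster is
    busy are treated as transient and the comparison is simply run again.
    """
    for attempt in range(1, max_attempts + 1):
        mismatches = list_mismatches()
        if not mismatches:
            return True
        if on_retry is not None:
            # Plays the role of the harmless NOTICE about a mismatch.
            on_retry(attempt, mismatches)
    return False
```

A caller would treat a `False` result as a real inconsistency, since the differences persisted across every attempt.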

f6d42b45

Remove MirroredAppendOnly_Truncate() function. · f882ee40

Committed by Heikki Linnakangas on Dec 19, 2017

Might as well call FileTruncate directly.

f882ee40

Remove unused fields, and README. · e95a8ada

Committed by Heikki Linnakangas on Dec 19, 2017

e95a8ada

Remove some remnants of multi-pass recovery from postmaster.c. · 2aa8a67e

Committed by Heikki Linnakangas on Dec 18, 2017

2aa8a67e