提交 · e7a531a0a3a488f9290cf399997b676828fbb0db · Greenplum / Gpdb

01 8月, 2017 3 次提交

STDIN and STDOUT should be rejected by 'COPY ON SEGMENT' · 2d59ed76

由 Xiaoran Wang 提交于 7月 28, 2017

If 'COPY ON SEGMENT', STDIN / STDOUT refers to segments' own STDIN /
STDOUT, which are not available for data stream.
Signed-off-by: NXiaoran Wang <xiwang@pivotal.io>

2d59ed76

Choose segment randomly to server as singleton reader gang · 06f56fe8

由 Pengzhou Tang 提交于 7月 19, 2017

This is a typo issue that cause segment 0 was always assigned
as singleton reader. It existed for a long time with no
functional issue, but may result in performance issue somehow.

Beside, root->config->cdbpath_segments is tuneable by GUC
gp_segments_for_planner, so gp_singleton_segindex may point to
an invalid segment, we use real segment count instead to avoid
mismatch.

06f56fe8

Log interconnect ports info on DEBUG1 level · 5c642b07

由 Pengzhou Tang 提交于 7月 27, 2017

This is a mistake introduced by 353a937d and it
flood the pg_log at a noticeable rate, change it back
to DEBUG1.

5c642b07

31 7月, 2017 3 次提交

Detect cgroup mount point in test code and fix some bugs. · 2f6e842d

由 Zhenghua Lyu 提交于 7月 31, 2017

1. Detect cgroup mount point in test code.
2. Fix bug when buflen is 0.
3. Check cgroup status on master in gpconfig.
4. Fix coverity warnings.

2f6e842d

Find out correct system memory when run in a container. · adaca0c6

由 Zhenghua Lyu 提交于 7月 30, 2017

When GPDB is running in a container, the swap and ram read via sysinfo is the value of host machine. To find the correct value of swap and ram in the container context, we take both the value from sysinfo and the value from cgroup into account.

adaca0c6

Implement "COPY ... FROM ... ON SEGMENT' · e254287e

由 Ming LI 提交于 7月 31, 2017

Support COPY statement that imports the data file on segments directly
parallel. It could be used to import data files generated by "COPY ...
to ... ON SEGMENT'.

This commit also supports all kinds of data file formats which "COPY ...
TO" supports, processes reject limit numbers and logs errors accordingly.

Key workflow:
   a) For COPY FROM, nothing changed by this commit, dispatch modified
   COPY command to segments at first, then read data file on master, and
   dispatch the data to relevant segment to process.

   b) For COPY FROM ON SEGMENT, on QD, read dummy data file, other parts
   keep unchanged, on QE, process the data stream (empty) dispatched
   from QD at first, then re-do the same workflow to read and process
   the local segment data file.
Signed-off-by: NMing LI <mli@pivotal.io>
Signed-off-by: NAdam Lee <ali@pivotal.io>
Signed-off-by: NHaozhou Wang <hawang@pivotal.io>
Signed-off-by: NXiaoran Wang <xiwang@pivotal.io>

e254287e

27 7月, 2017 7 次提交

Use xl_heaptid_set() in heap_update_internal. · f1d1d55b

由 Ashwin Agrawal 提交于 7月 26, 2017

Commit d50f429c added xlog lock record, but
missed to tune into for Greenplum which is to add persistent table
information. Hence caused failure during recovery with FATAL message "xlog
record with zero persistenTID". Using xl_heaptid_set() which calls
`RelationGetPTInfo()` making sure PT info is populated for xlog record.

f1d1d55b

A

Fix error in schedule file · 138141f8
由 Asim R P 提交于 7月 26, 2017

138141f8

Move dtm test to pg_regress from its own contrib module · c10e75fd

由 Asim R P 提交于 7月 25, 2017

The gp_inject_fault() function is now available in pg_regress so a contrib
module is not required. The test was not being run, it trips an assertion. So
it is not added to greenplum_schedule.

c10e75fd

A

Update fsync test to use SQL UDF to inject faults · 9bd14bd3
由 Asim R P 提交于 7月 24, 2017

9bd14bd3

Make SQL based fault injection function available to all tests. · b23680d6

由 Asim R P 提交于 7月 24, 2017

The function gp_inject_fault() was defined in a test specific contrib module
(src/test/dtm).  It is moved to a dedicated contrib module gp_inject_fault.
All tests can now make use of it.  Two pg_regress tests (dispatch and cursor)
are modified to demonstrate the usage.  The function is modified so that it can
inject fault in any segment, specified by dbid.  No more invoking
gpfaultinjector python script from SQL files.

The new module is integrated into top level build so that it is included in
make and make install.

b23680d6

Ensure Execution of Shared Scan Writer On Squelch [#149182449] · 9fbd2da5

由 Jesse Zhang 提交于 7月 25, 2017

SharedInputScan (a.k.a. "Shared Scan" in EXPLAIN) is the operator
through which Greenplum implements Common Table Expression execution. It
executes in two modes: writer (a.k.a. producer) and reader (a.k.a.
consumer). Writers will execute the common table expression definition
and materialize the output, and readers can read the materialized output
(potentially in parallel).

Because of the parallel nature of Greenplum execution, slices containing
Shared Scans need to synchronize among themselves to ensure that readers
don't start until writers are finished writing. Specifically, a slice
with readers depending on writers on a different slice will block during
`ExecutorRun`, before even pulling the first tuple from the executor
tree.

Greenplum's Hash Join implementation will skip executing its outer
("probe side") subtree if it detects an empty inner ("hash side"), and
declare all motions in the skipped subtree as "stopped" (we call this
"squelching"). That means we can potentially squelch a subtree that
contains a shared scan writer, leaving cross-slice readers waiting
forever.

For example, with ORCA enabled, the following query:

```SQL
CREATE TABLE foo (a int, b int);
CREATE TABLE bar (c int, d int);
CREATE TABLE jazz(e int, f int);

INSERT INTO bar  VALUES (1, 1), (2, 2), (3, 3);
INSERT INTO jazz VALUES (2, 2), (3, 3);

ANALYZE foo;
ANALYZE bar;
ANALYZE jazz;

SET statement_timeout = '15s';

SELECT * FROM
        (
        WITH cte AS (SELECT * FROM foo)
        SELECT * FROM (SELECT * FROM cte UNION ALL SELECT * FROM cte)
        AS X
        JOIN bar ON b = c
        ) AS XY
        JOIN jazz on c = e AND b = f;
```
leads to a plan that will expose this problem:

```
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice2; segments: 3)  (cost=0.00..2155.00 rows=1 width=24)
   ->  Hash Join  (cost=0.00..2155.00 rows=1 width=24)
         Hash Cond: bar.c = jazz.e AND share0_ref2.b = jazz.f AND share0_ref2.b = jazz.e AND bar.c = jazz.f
         ->  Sequence  (cost=0.00..1724.00 rows=1 width=16)
               ->  Shared Scan (share slice:id 2:0)  (cost=0.00..431.00 rows=1 width=1)
                     ->  Materialize  (cost=0.00..431.00 rows=1 width=1)
                           ->  Table Scan on foo  (cost=0.00..431.00 rows=1 width=8)
               ->  Hash Join  (cost=0.00..1293.00 rows=1 width=16)
                     Hash Cond: share0_ref2.b = bar.c
                     ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=0.00..862.00 rows=1 width=8)
                           Hash Key: share0_ref2.b
                           ->  Append  (cost=0.00..862.00 rows=1 width=8)
                                 ->  Shared Scan (share slice:id 1:0)  (cost=0.00..431.00 rows=1 width=8)
                                 ->  Shared Scan (share slice:id 1:0)  (cost=0.00..431.00 rows=1 width=8)
                     ->  Hash  (cost=431.00..431.00 rows=1 width=8)
                           ->  Table Scan on bar  (cost=0.00..431.00 rows=1 width=8)
         ->  Hash  (cost=431.00..431.00 rows=1 width=8)
               ->  Table Scan on jazz  (cost=0.00..431.00 rows=1 width=8)
                     Filter: e = f
 Optimizer status: PQO version 2.39.1
(20 rows)
```
where processes executing slice1 on the segments that have an empty
`jazz` will hang.

We fix this by ensuring we execute the Shared Scan writer even if it's
in the sub tree that we're squelching.
Signed-off-by: NMelanie Plageman <mplageman@pivotal.io>
Signed-off-by: NSambitesh Dash <sdash@pivotal.io>

9fbd2da5

Fix torn-page, unlogged xid and further risks from heap_update(). · d50f429c

由 Andres Freund 提交于 7月 15, 2016

When heap_update needs to look for a page for the new tuple version,
because the current one doesn't have sufficient free space, or when
columns have to be processed by the tuple toaster, it has to release the
lock on the old page during that. Otherwise there'd be lock ordering and
lock nesting issues.

To avoid concurrent sessions from trying to update / delete / lock the
tuple while the page's content lock is released, the tuple's xmax is set
to the current session's xid.

That unfortunately was done without any WAL logging, thereby violating
the rule that no XIDs may appear on disk, without an according WAL
record.  If the database were to crash / fail over when the page level
lock is released, and some activity lead to the page being written out
to disk, the xid could end up being reused; potentially leading to the
row becoming invisible.

There might be additional risks by not having t_ctid point at the tuple
itself, without having set the appropriate lock infomask fields.

To fix, compute the appropriate xmax/infomask combination for locking
the tuple, and perform WAL logging using the existing XLOG_HEAP_LOCK
record. That allows the fix to be backpatched.

This issue has existed for a long time. There appears to have been
partial attempts at preventing dangers, but these never have fully been
implemented, and were removed a long time ago, in
11919160 (cf. HEAP_XMAX_UNLOGGED).

In master / 9.6, there's an additional issue, namely that the
visibilitymap's freeze bit isn't reset at that point yet. Since that's a
new issue, introduced only in a892234f, that'll be fixed in a
separate commit.

Author: Masahiko Sawada and Andres Freund
Reported-By: Different aspects by Thomas Munro, Noah Misch, and others
Discussion: CAEepm=3fWAbWryVW9swHyLTY4sXVf0xbLvXqOwUoDiNCx9mBjQ@mail.gmail.com
Backpatch: 9.1/all supported versions

d50f429c

25 7月, 2017 2 次提交

I

Update branding on sample config · aa4c8075
由 Ivan Novick 提交于 7月 24, 2017

aa4c8075

Fix resgroup ICW failures · 4165a543

由 Ning Yu 提交于 7月 25, 2017

* Fix the resgroup assert failure on CREATE INDEX CONCURRENTLY syntax.

When resgroup is enabled an assertion failure will be encountered with
below case:

    SET gp_create_index_concurrently TO true;
    DROP TABLE IF EXISTS concur_heap;
    CREATE TABLE concur_heap (f1 text, f2 text, dk text) distributed by (dk);
    CREATE INDEX CONCURRENTLY concur_index1 ON concur_heap(f2,f1);

The root cause is that we had the assumption on QD that a command is
dispatched to QEs when assigned to a resgroup, but this is false with
CREATE INDEX CONCURRENTLY syntax.

To fix it we have to make necessary check and cleanup on QEs.

* Do not assign a resource group in SIGUSR1 handler.

When assigning a resource group on master it might call WaitLatch() to
wait for a free slot. However as WaitLatch() expects to be waken by the
SIGUSR1 signal, it will run into endless waiting when SIGUSR1 is
blocked.

One scenario is the catch up handler. Catch up handler is triggered and
executed directly in the SIGUSR1 handler, so during its execution
SIGUSR1 is blocked. And as catch up handler will begin a transaction so
it will try to assign a resource group and trigger the endless waiting.

To fix this we add the check to not assign a resource group when running
inside the SIGUSR1 handler. As signal handlers are supposed to be light
and short and safe, so skip resource group in such a case shall be
reasonable.

4165a543

24 7月, 2017 2 次提交

Use non-blocking recv() in internal_cancel() · 23e5a5ee

由 xiong-gang 提交于 7月 24, 2017

The issue of hanging on recv() in internal_cancel() are reported
serveral times, the socket status is shown 'ESTABLISHED' on master,
while the peer process on the segment has already exit. We are not
sure how exactly dose this happen, but we are able to simulate this
hang issue by dropping packet or reboot the system on the segment.

This patch use poll() to do non-blocking recv() in internal_cancel();
the timeout of poll() is set to the max value of authentication_timeout
to make sure the process on segment has already exit before attempting
another retry; and we expect retry on connect() can detect network issue.
Signed-off-by: NNing Yu <nyu@pivotal.io>

23e5a5ee

Detect cgroup mount point at runtime. (#2790) · 1b1b3a11

由 Zhenghua Lyu 提交于 7月 24, 2017

In the past, we use hard coded path "/sys/fs/cgroup" as cgroup mount
point, this can be wrong when 1) running on old kernels or 2) the
customer has special cgroup mount points.

Now we detect the mount point at runtime by checking /proc/self/mounts.
Signed-off-by: NNing Yu <nyu@pivotal.io>

1b1b3a11

22 7月, 2017 6 次提交

Revert "Make SQL based fault injection function available to all tests." · 582d0fd4

由 Asim R P 提交于 7月 21, 2017

Loading a C UDF within postgres binary is not a good idea.  The binary cannot
be loaded as a shared object on Linux (it works on OSX).

This reverts commit 9361a6dd.

582d0fd4

A
Revert "Move dtm test to pg_regress from its own contrib module" · 96c63912
由 Asim R P 提交于 7月 21, 2017
```
This reverts commit a021d1b5.
```
96c63912

Revert "Move gp_inject_fault function to pg_regress/regress.c" · d3c78a1a

由 Asim R P 提交于 7月 21, 2017

regress.c cannot include fmgroids.h because the header file is generated during
build process. The ICW jobs in CI checkout gpdb source code and run make from
within src/test/regress. That fails to find fmgroids.h. It seems we need a
dedicated contrib module for gp_inject_fault.

This reverts commit bd26a268.

d3c78a1a

Move gp_inject_fault function to pg_regress/regress.c · bd26a268

由 Asim R P 提交于 7月 21, 2017

This fixes ICW breakage due to "postgres" binary cannot be loaded as a shared
library. To run gp_fault_inject() function manually, generate regress.so by
running make in src/test/regress. Thereafter, create function command can be
used to create the function, as in create_fault_function.source.

bd26a268

Move dtm test to pg_regress from its own contrib module · a021d1b5

由 Asim R P 提交于 7月 20, 2017

The gp_inject_fault() function is now available in pg_regress so a contrib
module is not required. The test was not being run, it trips an assertion. So
it is not added to greenplum_schedule.

a021d1b5

Make SQL based fault injection function available to all tests. · 9361a6dd

由 Asim R P 提交于 7月 20, 2017

The function gp_inject_fault() was defined in a test specific contrib module
(src/test/dtm).  All tests can now make use of it.  Two pg_regress tests
(dispatch and cursor) are modified to demonstrate the usage.  The function is
also made capable to inject fault in any segment, specified by dbid.  No more
invoking gpfaultinjector python script from SQL files.

9361a6dd

21 7月, 2017 2 次提交

Improve partition selection logging (#2796) · 038aa959

由 Jesse Zhang 提交于 7月 20, 2017

Partition Selection is the process of determining at runtime ("execution
time") which leaf partitions we can skip scanning. Three types of Scan
operators benefit from partition selection: DynamicTableScan,
DynamicIndexScan, and BitmapTableScan.

Currently, there is a minimal amount of logging about what partitions
are selected, but they are scattered between DynamicIndexScan and
DynamicTableScan (and so we missed BitmapTableScan).

This commit moves the logging into the PartitionSelector operator
itself, when it exhausts its inputs. This also brings the nice side
effect of more granular information: the log now attributes the
partition selection to individual partition selectors.

038aa959

Fix rtable index of FunctionScan when translating GPORCA plan. · 3b24a561

由 Venkatesh Raghavan 提交于 7月 19, 2017

Arguments to the function scan can themselve have a subquery
that can create new rtable entries. Therefore, first translate all
arguments of the FunctionScan before setting scanrelid of the
FunctionScan.

3b24a561

20 7月, 2017 1 次提交
- T
  Dump database grants with CREATE DATABASE statement · bdafd0ce
  由 Tom Meyer 提交于 7月 17, 2017
```
Signed-off-by: NChris Hajas <chajas@pivotal.io>
```
  bdafd0ce
19 7月, 2017 3 次提交

Backport Postgres TAP SSL tests (#2765) · 57a11a94

由 Peifeng Qiu 提交于 7月 19, 2017

* Port Postgres TAP SSL tests
Signed-off-by: NYuan Zhao <yuzhao@pivotal.io>

* Add config to enable tap tests
Signed-off-by: NYuan Zhao <yuzhao@pivotal.io>

1. Add enable-tap-tests flag to control tests.
2. Add Perl checking module.
3. Enable tap tests for enterprise build by default.

* Adapt postgres tap tests to gpdb

1. Assume a running GPDB cluster instance(gpdemo), instead of
using temp installation. Remove most node init operation.
Disable environment variable override during test init.

2. Replace node control operation with GPDB counterpart:
start   -> gpstart -a
stop    -> gpstop -a
restart -> gpstop -arf
reload  -> gpstop -u

disable promote, add restart_qd.
restart_qd -> pg_ctl -w -t 3 -D $MASTER_DATA_DIRECTORY

3. Add default server key and certificate for GPDB.

4. Update server setup to work with running gpdemo.

5. Disable SSL alternative names cases.
Signed-off-by: NYuan Zhao <yuzhao@pivotal.io>

57a11a94

[#147774653] Implemented ValuesScan Operator in ORCA · 819107b7

由 Bhuvnesh Chaudhary 提交于 7月 18, 2017

This commit introduces a new operator for ValuesScan, earlier we
generated `UNION ALL` for cases where VALUES lists passed are all
constants, but now a new Operator CLogicalConstTable with an array of
const tuples will be generated

Once the plan is generated by ORCA, it will be translated to valuesscan
node in GPDB.

This enhancement helps significantly in improving the total run time for the queries
involving values scan in ORCA with const values.
Signed-off-by: NEkta Khanna <ekhanna@pivotal.io>

819107b7

Add support to enable singleton bucket in histogram · b31dcf31

由 Omer Arap 提交于 7月 17, 2017

If gpdb and orca is build with debugging enabled, there is an assert in
orca to check if the upper and lower bound are both closed.
`GPOS_ASSERT_IMP(FSingleton(), fLowerClosed && fUpperClosed);`

The histogram that is stored in pg_statistics might lead to have a
singleton buckets as follows: `10, 20, 20, 30, 40` which will lead to have
buckets in this format: `[0,10), [10, 20), [20,20), [20,30), [30,40]`

This will cause assert to fail since [20,20) is a singleton bucket but
its upper bound is open.

With this fix, the generated buckets will look like below:
`[0,10], [10,20), [20,20], (20,30), [30,40]`
Signed-off-by: NShreedhar Hardikar <shardikar@pivotal.io>

b31dcf31

18 7月, 2017 2 次提交

Restrict only one reader of pipe on Linux. (#2771) · 82329ca1

由 Ming LI 提交于 7月 18, 2017

If there are two external tables refer to the same PIPE file using gpfdist
or file protocol directly, concurrent read will result in wrong data format
or hang for gpfdist. Now before read the pipe, it will firstly flock the
pipe file (Windows not supported yet), other requests from gpdb will report
error.
Signed-off-by: NMing LI <mli@apache.org>
Signed-off-by: NXiaoran Wang <xiwang@pivotal.io>

82329ca1

Fix ICW dispatch failure · 41b5d563

由 Abhijit Subramanya 提交于 7月 13, 2017

Exclude wal sender process backends (which application is `walreceiver`) from
accounted as leftover backends.
Signed-off-by: NXin Zhang <xzhang@pivotal.io>

41b5d563

15 7月, 2017 1 次提交

Remove PartOidExpr, it's not used in GPDB. (#2481) · 941327cd

由 Heikki Linnakangas 提交于 7月 14, 2017

* Remove PartOidExpr, it's not used in GPDB.

The target lists of DML nodes that ORCA generates includes a column for the
target partition OID. It can then be referenced by PartOidExprs. ORCA uses
these to allow sorting the tuples by partition, before inserting them to the
underlying table. That feature is used by HAWQ, where grouping tuples that
go to the same output partition is cheaper.

Since commit adfad608, which removed the gp_parquet_insert_sort GUC, we
don't do that in GPDB, however. GPDB can hold multiple result relations open
at the same time, so there is no performance benefit to grouping the tuples
first (or at least not enough benefit to counterbalance the cost of a sort).

So remove the now unused support for PartOidExpr in the executor.

* Bump ORCA version to 2.37
Signed-off-by: NEkta Khanna <ekhanna@pivotal.io>

* Removed acceptedLeaf
Signed-off-by: NEkta Khanna <ekhanna@pivotal.io>

941327cd

14 7月, 2017 2 次提交

During replay of AO XLOG records, keep track of missing AO/AOCO segment files · b659d047

由 Jimmy Yih 提交于 7月 11, 2017

When a standby is shut down and restarted, WAL recovery starts from
the last restartpoint. If we replay an AO write record which has a
following drop record, the WAL replay of the AO write record will find
that the segment file does not exist. To fix this, we piggyback on top
of the heap solution of tracking invalid pages in the invalid_page_tab
hash table. The hash table key struct uses a block number which, for
AO's sake, we pretend is the segment file number for AO/AOCO
tables. This solution will be revisited to possibly create a separate
hash table for AO/AOCO tables with a proper key struct.

Big thanks to Heikki for pointing out the issue.

b659d047

Replay of AO XLOG records · cc9131ba

由 Ashwin Agrawal 提交于 6月 23, 2017

We generate AO XLOG records when --enable-segwalrep is configured. We
should now replay those records on the mirror or during recovery. The
replay is only performed for standby mode since promotion will not
execute until after there are no more XLOG records to read from the
WAL stream.

cc9131ba

13 7月, 2017 6 次提交

Harmonize error message, add test for external tables with too many URIs. · 4055ab3b

由 Heikki Linnakangas 提交于 7月 13, 2017

Seems like a good thing to test. To avoid having to have separate ORCA
and non-ORCA expected outputs, change the ORCA error message to match
that you get without ORCA.

4055ab3b

Remove unreachable and unused code (#2611) · f4e50a64

由 Daniel Gustafsson 提交于 7月 13, 2017

This removes code which is either unreachable due to prior identical
tests which break the codepath, or which is dead due to always being
true. Asserting that an unsigned integer is >= 0 will always be true,
so it's pointless.

Per "logically dead code" gripes by Coverity

f4e50a64

gpfaultinjector should work with filerep disabled · 41ba1012

由 Abhijit Subramanya 提交于 7月 10, 2017

If we try to inject certain faults when the system is initialized with filerep
disabled, we get the following error:

```
gpfaultinjector error: Injection Failed: Failure: could not insert fault
injection, segment not in primary or mirror role
Failure: could not insert fault injection, segment not in primary or mirror
role
```

This patch removes the check for the role for non-filerep faults so that they
don't fail on a cluster initialized without filerep.

41ba1012

Use block number instead of LSN to batch changed blocks in filerep · abe13c79

由 Asim R P 提交于 6月 30, 2017

Filerep resync logic to fetch changed blocks from changetracking (CT)
log is changed. LSN is no longer used to filter out blocks from CT
log. If a relation's changed blocks falls above the threshold number
of blocks that can be fetched at a time, the last fetched block number
is remembered and used to form subsequent batch.

abe13c79

TINC test to detect a bug in filerep resync logic. · 8e59eea3

由 Asim R P 提交于 6月 29, 2017

Filerep resync works by obtaining blocks changed since a mirror went down from
changetracking (CT) log. The changed blocks are obtained in fixed sized
batches. Blocks of the same relation are ordered by block number. The bug
occurs when a higher numbered block of a relation is changed such that it has
lower LSN as compared to lower numbered blocks. And the higher numbered blocks
is not included in the first batch of changed blocks for this relation. Such
blocks miss being resynchronized to mirror due to incorret filter based on
previously obtained changed blocks' LSN. That means the mirror is eventually
declared in-sync with primary but some changed blocks remain only on the
primary. This loss in data manifests only when the mirror takes over as
primary, upon rebalance or the primary going down.

8e59eea3

Add GUC to control number of blocks that a resync worker operates on · 2960bd7c

由 Asim R P 提交于 6月 27, 2017

The GUC gp_changetracking_max_rows replaces a compile time constant. Resync
worker obtains at the most gp_changetracking_max_rows number of changed blocks
from changetracking log at one time. Controling this with a GUC allows
exploiting bugs in resync logic around this area.

2960bd7c