1. 30 Jun 2019, 1 commit
    • Improve select statement with locking clause · 10052f48
      Committed by Zhenghua Lyu
      With GDD enabled, and in some simple cases (refer to commit 6ebce733
      and the function checkCanOptSelectLockingClause for details), we can
      also optimize select statements that have both a locking clause and
      a limit clause.
      
      Greenplum generates a two-stage merge sort or limit plan to implement
      sort or limit, and tuples can only be locked on segments. We prefer
      locking more tuples on the segments over locking the whole table;
      without a table-level lock, OLTP performance should improve.
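
      A query of the following shape is an illustrative target for this
      optimization (the table and column names are hypothetical); with GDD
      enabled, the qualifying rows are locked on the segments instead of
      taking a table-level lock:

      ```sql
      -- Hypothetical example: lock only the qualifying rows on the segments.
      SELECT * FROM t ORDER BY c LIMIT 10 FOR UPDATE;
      ```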
      
      Also, after LockRows the data can no longer be assumed to be in order,
      yet we perform the merge gather after LockRows. This is acceptable
      because even in Postgres
      `select * from t order by c for update` cannot guarantee the order of the result.
      10052f48
  2. 05 Jun 2019, 1 commit
    • Fix plan for select-statement with locking clause · 4183297f
      Committed by Zhenghua Lyu
      Commit 6ebce733 did some optimization for select statements with
      locking clauses in some simple cases, and also applied that
      optimization to select statements with a limit.
      
      In Greenplum the result of a limit can only be known on the QD,
      but tuples can only be locked on the QEs. This commit fixes that.
      
      For a replicated table, a select statement may execute on only one
      segment, while an update statement has to happen on all segments,
      so the locking-clause optimization is also turned off for
      replicated tables.
      
      Also, Greenplum currently uses a two-stage sort to implement
      order by, which might generate a gather motion to the QD.
      
      If the processing order is order-by first and row marks second,
      a LockRows plan node might be placed above a gather motion.
      However, tuples cannot be locked on the QD.
      
      This commit changes the order. The plan for the query
      `select * from t order by c for update` on a hash-distributed
      table or randomly-distributed table t is:
      
        previous:
                     QUERY PLAN
          ------------------------------------------
          LockRows
              ->Gather Motion 3:1  (slice1; segments: 3)
                Merge Key: c
                ->  Sort
                    Sort Key: c
                     ->  Seq Scan on t
      
        after this commit:
                   QUERY PLAN
          ------------------------------------------
          Gather Motion 3:1  (slice1; segments: 3)
              Merge Key: c
              ->  Sort
                  Sort Key: c
                   ->  LockRows
                       ->  Seq Scan on t
      4183297f
  3. 04 Jun 2019, 1 commit
    • Fix crash in transformSetOperationTree_internal() · cb46ac0e
      Committed by Adam Lee
      In a set-operation tree it is possible that a leaf query has a different
      number of columns than the previous ones. Greenplum used to fill the
      smaller number of leaf infos before realizing the counts differ and
      breaking out.
      
      This worked well until Greenplum started to support zero-target-list
      queries like "SELECT FROM t1"; in that case the smaller number is zero
      and the list of leaf infos points to 0x7f7f7f7f7f7f7f7e, coming from
      palloc0(0).
      
      Reproducer from Asim https://github.com/greenplum-db/gpdb/issues/7613
      
      ```
      create table foo (a int, b int);
      insert into foo select i,i from generate_series(1,21)i;
      
      select from foo union select * from foo;
      ```
      
      This commit fixes it by skipping the filling of leaf infos when they
      have different numbers of columns.
      cb46ac0e
  4. 24 May 2019, 1 commit
    • Fix locking clause issue · 6ebce733
      Committed by Zhenghua Lyu
      This commit corrects the behaviour of select statements with locking
      clauses and adds an optimization for very simple cases.
      
      There are four kinds of locking clause:
        1. for update
        2. for no key update
        3. for share
        4. for key share
      
      The key steps to implement the locking clause semantics in Postgres are:
         1. lock the table in RowShareLock mode during the parsing stage (this is the same
            for each type of locking clause)
         2. generate a LockRows node in the plan
         3. while executing the LockRows node, lock each tuple coming from the plan nodes below
      
      In Greenplum, things get more complicated.
      
      If the Global Deadlock Detector is disabled, we cannot simply lock tuples on segments
      without holding a high-level lock on the QD, because this may lead to global deadlocks.
      Even with the Global Deadlock Detector enabled, tuples may have been motioned to another
      node in the MPP environment, and it is not possible to lock remote tuples at the moment.
      
      But for the very simple case that involves only one table, we can behave like upstream
      when the Global Deadlock Detector is enabled, and almost every select-for-xxx query in
      an OLTP scenario is such a simple case. For these cases we behave just like Postgres:
      hold RowShareLock on the range table and lock the tuples during execution. This improves
      OLTP concurrency by not locking the table in Exclusive mode.
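
      A minimal sketch of such a simple case (table and column names are
      hypothetical); with GDD enabled this takes only RowShareLock on the
      table and locks the matching tuples on the segments:

      ```sql
      BEGIN;
      -- Locks only the fetched rows on the segments, not the whole table.
      SELECT * FROM t WHERE c = 1 FOR UPDATE;
      -- ... work with the fetched row ...
      COMMIT;
      ```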
      
      In summary:
          * With GDD disabled, Greenplum locks the entire table in ExclusiveLock mode for a select
             statement with a locking clause (for update | for no key update | for share | for key share)
             so that it conflicts with DML.
          * With GDD enabled, the behaviour stays the same, except for the very simple cases listed
             below (for these, hold RowShareLock on the table and generate a LockRows plan):
      
      A simple case must satisfy all of the following conditions:
          1. GDD is enabled
          2. it is a top-level select statement
          3. no set operations (union, intersect, ...)
          4. the FROM list contains exactly one RangeVar (and it is not a view)
          5. no sublinks or subqueries
          6. it has a locking clause
          7. the table in the locking clause is a heap table (AO tables cannot lock tuples yet)
      6ebce733
  5. 08 May 2019, 1 commit
  6. 11 Mar 2019, 1 commit
    • Retire the reshuffle method for table data expansion (#7091) · 1c262c6e
      Committed by Ning Yu
      This method was introduced to improve data redistribution
      performance during gpexpand phase 2, but the benchmark results did
      not meet our expectations. For example, when expanding a table from
      7 segments to 8 the reshuffle method is only 30% faster than the
      traditional CTAS method, and when expanding from 4 to 8 segments
      reshuffle is even 10% slower than CTAS. When there are indexes on
      the table the reshuffle performance can be worse, and an extra
      VACUUM is needed to actually free the disk space. According to our
      experiments the bottleneck of the reshuffle method is the tuple
      deletion operation, which is much slower than the insertion
      operation used by CTAS.
      
      The reshuffle method does have some benefits: it requires less extra
      disk space, and it also requires less network bandwidth (similar to
      the CTAS method with the new JCH reduce method, but less than
      CTAS + MOD). It can be faster in some cases, but as we cannot
      automatically determine when it is faster, it is hard to benefit
      from it in practice.
      
      On the other hand, the reshuffle method is less tested and may have
      bugs in corner cases, so it is not production ready yet.
      
      In such a case we decided to retire it entirely for now; we might add
      it back in the future if we can get rid of the slow deletion or find
      reliable ways to automatically choose between the reshuffle and CTAS
      methods.
      
      Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/8xknWag-SkI/5OsIhZWdDgAJ
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
      1c262c6e
  7. 27 Feb 2019, 1 commit
    • Refactor NUMSEGMENTS related macro (#7028) · d28b7057
      Committed by Jialun
      - Retire GP_POLICY_ALL_NUMSEGMENTS and GP_POLICY_ENTRY_NUMSEGMENTS,
        unify them to getgpsegmentCount()
      - Retire GP_POLICY_MINIMAL_NUMSEGMENTS and GP_POLICY_RANDOM_NUMSEGMENTS
      - Change the NUMSEGMENTS-related macros from variable-style macros to
        function-style macros
      - Change the default return value of getgpsegmentCount() to 1, which
        represents a standalone PostgreSQL in utility mode
      - Rename __GP_POLICY_INVALID_NUMSEGMENTS to GP_POLICY_INVALID_NUMSEGMENTS
      d28b7057
  8. 01 Feb 2019, 2 commits
    • Error out on multiple writers in CTE · bfcb7882
      Committed by Daniel Gustafsson
      While Greenplum can plan a CTE query with multiple writable expressions,
      it cannot execute it, because of the limitation of a single writer
      gang. Until we can support multiple writer gangs, let's error out with
      a graceful error message rather than failing during execution with a
      more cryptic internal error.
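
      The kind of query that now errors out gracefully is sketched below
      (table names are hypothetical); it contains more than one writable CTE:

      ```sql
      -- Two data-modifying CTEs in one statement would need two writer gangs.
      WITH ins1 AS (
          INSERT INTO t1 VALUES (1) RETURNING *
      ), ins2 AS (
          INSERT INTO t2 VALUES (2) RETURNING *
      )
      SELECT * FROM ins1, ins2;
      ```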
      
      Ideally this will be reverted in GPDB 7.X but right now it's much too
      close to release for attacking this.
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      bfcb7882
    • Use normal hash operator classes for data distribution. · 242783ae
      Committed by Heikki Linnakangas
      Replace the use of the built-in hashing support for built-in datatypes, in
      cdbhash.c, with the normal PostgreSQL hash functions. Now is a good time
      to do this, since we've already made the change to use jump consistent
      hashing in GPDB 6, so we'll need to deal with the upgrade problems
      associated with changing the hash functions, anyway.
      
      It is no longer enough to track which columns/expressions are used to
      distribute data. You also need to know the hash function used. For that,
      a new field is added to gp_distribution_policy, to record the hash
      operator class used for each distribution key column. In the planner,
      a new opfamily field is added to DistributionKey, to track that throughout
      the planning.
      
      Normally, if you do "CREATE TABLE ... DISTRIBUTED BY (column)", the
      default hash operator class for the datatype is used. But this patch
      extends the syntax so that you can specify the operator class explicitly,
      like "... DISTRIBUTED BY (column opclass)". This is similar to how an
      operator class can be specified for each column in CREATE INDEX.
      
      To support upgrade, the old hash functions have been converted to special
      (non-default) operator classes, named cdbhash_*_ops. For example, if you
      want to use the old hash function for an integer column, you could do
      "DISTRIBUTED BY (intcol cdbhash_int4_ops)". The old hard-coded whitelist
      of operators that have "compatible" cdbhash functions has been replaced
      by putting the compatible hash opclasses in the same operator family. For
      example, all the legacy integer operator classes, cdbhash_int2_ops,
      cdbhash_int4_ops and cdbhash_int8_ops, are part of the
      cdbhash_integer_ops operator family.
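
      A sketch of the extended syntax described above (table and column
      names are hypothetical):

      ```sql
      -- The default hash opclass for the datatype is used when none is given.
      CREATE TABLE orders (id int4, note text) DISTRIBUTED BY (id);

      -- Explicitly request the legacy (pre-jump-consistent-hash) opclass.
      CREATE TABLE orders_legacy (id int4, note text)
          DISTRIBUTED BY (id cdbhash_int4_ops);
      ```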
      
      This removes the pg_database.hashmethod field. The hash method is now
      tracked on a per-table and per-column basis, using the opclasses, so it's
      not needed anymore.
      
      To help with upgrade from GPDB 5, this introduces a new GUC called
      'gp_use_legacy_hashops'. If it's set, CREATE TABLE uses the legacy hash
      opclasses, instead of the default hash opclasses, if the opclass is not
      specified explicitly. pg_upgrade will set the new GUC, to force the use of
      legacy hashops, when restoring the schema dump. It will also set the GUC
      on all upgraded databases, as a per-database option, so any new tables
      created after upgrade will also use the legacy opclasses. It seems better
      to be consistent after upgrade, so that collocation between old and new
      tables work for example. The idea is that some time after the upgrade, the
      admin can reorganize all tables to use the default opclasses instead. At
      that point, he should also clear the GUC on the converted databases. (Or
      rather, the automated tool that hasn't been written yet, should do that.)
      
      ORCA doesn't know about hash operator classes, or the possibility that we
      might need to use a different hash function for two columns with the same
      datatype. Therefore, it cannot produce correct plans for queries that mix
      different distribution hash opclasses for the same datatype, in the same
      query. There are checks in the Query->DXL translation, to detect that
      case, and fall back to planner. As long as you stick to the default
      opclasses in all tables, we let ORCA create the plan without any regard
      to them, and use the default opclasses when translating the DXL plan to a
      Plan tree. We also allow the case that all tables in the query use the
      "legacy" opclasses, so that ORCA works after pg_upgrade. But a mix of the
      two, or using any non-default opclasses, forces ORCA to fall back.
      
      One curiosity with this is the "int2vector" and "aclitem" datatypes. They
      have a hash opclass, but no b-tree operators. GPDB 4 used to allow them
      as DISTRIBUTED BY columns, but we forbid that in GPDB 5, in commit
      56e7c16b. Now they are allowed again, so you can specify an int2vector
      or aclitem column in DISTRIBUTED BY, but it's still pretty useless,
      because the planner still can't form EquivalenceClasses on it, and will
      treat it as "strewn" distribution, and won't co-locate joins.
      
      Abstime, reltime, tinterval datatypes don't have default hash opclasses.
      They are being removed completely on PostgreSQL v12, and users shouldn't
      be using them in the first place, so instead of adding hash opclasses for
      them now, we accept that they can't be used as distribution key columns
      anymore. Add a check to pg_upgrade, to refuse upgrade if they are used
      as distribution keys in the old cluster. Do the same for 'money' datatype
      as well, although that's not being removed in upstream.
      
      The legacy hashing code for anyarray in GPDB 5 was actually broken. It
      could produce a different hash value for two arrays that are considered
      equal, according to the = operator, if there were differences in e.g.
      whether the null bitmap was stored or not. Add a check to pg_upgrade, to
      reject the upgrade if array types were used as distribution keys. The
      upstream hash opclass for anyarray works, though, so it is OK to use
      arrays as distribution keys in new tables. We just don't support binary
      upgrading them from GPDB 5. (See github issue
      https://github.com/greenplum-db/gpdb/issues/5467). The legacy hashing of
      'anyrange' had the same problem, but that was new in GPDB 6, so we don't
      need a pg_upgrade check for that.
      
      This also tightens the checks ALTER TABLE ALTER COLUMN and CREATE UNIQUE
      INDEX, so that you can no longer create a situation where a non-hashable
      column becomes the distribution key. (Fixes github issue
      https://github.com/greenplum-db/gpdb/issues/6317)
      
      Discussion: https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/4fZVeOpXllQ
      Co-authored-by: Mel Kiyama <mkiyama@pivotal.io>
      Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
      Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
      Co-authored-by: Chris Hajas <chajas@pivotal.io>
      Reviewed-by: Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
      Reviewed-by: Ning Yu <nyu@pivotal.io>
      Reviewed-by: Simon Gao <sgao@pivotal.io>
      Reviewed-by: Jesse Zhang <jzhang@pivotal.io>
      Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      Reviewed-by: Yandong Yao <yyao@pivotal.io>
      242783ae
  9. 11 Jan 2019, 1 commit
  10. 03 Dec 2018, 1 commit
  11. 22 Nov 2018, 1 commit
    • New extension to debug partially distributed tables · 3119009a
      Committed by Ning Yu
      Introduced a new debugging extension gp_debug_numsegments to get / set
      the default numsegments when creating tables.
      
      gp_debug_get_create_table_default_numsegments() gets the default
      numsegments.
      
      gp_debug_set_create_table_default_numsegments(text) sets the default
      numsegments in text format, valid values are:
      - 'FULL': all the segments;
      - 'RANDOM': pick a random set of segments each time;
      - 'MINIMAL': the minimal set of segments;
      
      gp_debug_set_create_table_default_numsegments(integer) sets the default
      numsegments directly, valid range is [1, gp_num_contents_in_cluster].
      
      gp_debug_reset_create_table_default_numsegments(text) or
      gp_debug_reset_create_table_default_numsegments(integer) resets the
      default numsegments to the specified value; the value can be reused
      later.
      
      gp_debug_reset_create_table_default_numsegments() resets the default
      numsegments to the value passed last time; if there has been no
      previous call, the value is 'FULL'.
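
      A usage sketch of the functions described above (assuming the
      extension has been installed; the table name is hypothetical):

      ```sql
      CREATE EXTENSION gp_debug_numsegments;

      SELECT gp_debug_get_create_table_default_numsegments();
      -- Create new tables on the minimal set of segments.
      SELECT gp_debug_set_create_table_default_numsegments('MINIMAL');
      CREATE TABLE t_partial (c1 int, c2 int) DISTRIBUTED BY (c1);
      -- Restore the previously used default.
      SELECT gp_debug_reset_create_table_default_numsegments();
      ```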
      
      Refactored ICG test partial_table.sql to create partial tables with this
      extension.
      3119009a
  12. 20 Nov 2018, 1 commit
  13. 13 Nov 2018, 1 commit
    • Support 'copy (select statement) to file on segment' (#6077) · bad6cebc
      Committed by Jinbao Chen
      In 'copy (select statement) to file', we generate a query plan, set its
      dest receiver to copy_dest_receive, and run the dest receiver on the QD.
      In 'copy (select statement) to file on segment', we modify the query plan,
      delete the gather motion, and let the dest receiver run on the QEs.
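
      An illustrative use of the feature (the path and table are
      hypothetical; the <SEGID> placeholder in the file name is assumed so
      that each segment writes its own file):

      ```sql
      COPY (SELECT * FROM sales WHERE amount > 100)
          TO '/tmp/sales_part_<SEGID>.csv' ON SEGMENT;
      ```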
      
      Change 'isCtas' in Query to 'parentStmtType' to be able to mark the upper
      utility statement type. Add a CopyIntoClause node to store the copy
      information. Add copyIntoClause to PlannedStmt.
      
      In Postgres we don't need a different query plan for the query inside
      the utility statement, but in Greenplum we do. So we use a field to
      indicate whether the query is contained in a utility statement, and
      the type of that utility statement.
      
      The behavior of 'copy (select statement) to file on segment' is actually
      very similar to 'SELECT ... INTO ...' and 'CREATE TABLE ... AS SELECT ...'.
      We use the distribution policy inherent in the query result as the final
      data distribution policy; if there is none, we use the first column in the
      target list as the key and redistribute. The only difference is that we use
      'copy_dest_receiver' instead of 'intorel_dest_receiver'.
      bad6cebc
  14. 31 Oct 2018, 1 commit
  15. 29 Oct 2018, 1 commit
  16. 25 Oct 2018, 1 commit
    • Unify the way to fetch/manage the number of segments (#6034) · 8eed4217
      Committed by Tang Pengzhou
      * Don't use GpIdentity.numsegments directly for the number of segments
      
      Use getgpsegmentCount() instead.
      
      * Unify the way to fetch/manage the number of segments
      
      Commit e0b06678 lets us expand a GPDB cluster without a restart; the
      number of segments may now change during a transaction, so we need to
      take care of numsegments.
      
      We now have two ways to get the number of segments: 1) from
      GpIdentity.numsegments, and 2) from gp_segment_configuration
      (cdb_component_dbs), which the dispatcher uses to decide the range of
      segments to dispatch to. We did a lot of hard work in e0b06678 to
      update GpIdentity.numsegments correctly, which made the management of
      segments more complicated; now we want an easier way to do it:
      
      1. We only allow getting segment info (including the number of segments)
      through gp_segment_configuration, which always has the newest segment
      info. There is no need to update GpIdentity.numsegments anymore; it is
      left only for debugging and can be removed entirely in the future.
      
      2. Each global transaction fetches/updates the newest snapshot of
      gp_segment_configuration and never changes it until the end of the
      transaction unless a writer gang is lost, so a global transaction sees
      a consistent state of the segments. We used to use gxidDispatched to do
      the same thing; now it can be removed.
      
      * Remove GpIdentity.numsegments
      
      GpIdentity.numsegments takes no effect now, so remove it. This commit
      does not remove gp_num_contents_in_cluster because that requires
      modifying utilities like gpstart, gpstop, gprecoverseg etc.; let's
      do such cleanup work in another PR.
      
      * Exchange the default UP/DOWN value in fts cache
      
      Previously, the FTS prober read gp_segment_configuration, checked the
      status of segments and then set the status of segments in the shared
      memory struct named ftsProbeInfo->fts_status[], so other components
      (mainly the dispatcher) could detect that a segment was down.
      
      All segments were initialized as down and then updated to up in the
      most common cases, which brought two problems:
      
      1. The fts_status is invalid until FTS does its first loop, so the QD
      needs to check ftsProbeInfo->fts_statusVersion > 0
      2. When gpexpand adds a new segment to gp_segment_configuration, the
      newly added segment may be marked as DOWN if FTS hasn't scanned it
      yet.
      
      This commit changes the default value from DOWN to UP, which resolves
      the problems mentioned above.
      
      * Fts should not be used to notify backends that a gpexpand occurs
      
      As Ashwin mentioned in PR#5679, "I don't think giving FTS responsibility to
      provide new segment count is right. FTS should only be responsible for HA
      of the segments. The dispatcher should independently figure out the count
      based on catalog. gp_segment_configuration should be the only way to get
      the segment count", FTS should decouple from gpexpand.
      
      * Access gp_segment_configuration inside a transaction
      
      * Upgrade the log level from ERROR to FATAL if the expand version changed
      
      * Modify gpexpand test cases according to new design
      8eed4217
  17. 23 Oct 2018, 1 commit
    • Table data should be reshuffled to new segments · f4f4bdcc
      Committed by ZhangJackey
      Each table has a `numsegments` attribute in the
      GP_DISTRIBUTION_POLICY table; it indicates that the table's data is
      distributed on the first N segments. In the common case,
      `numsegments` equals the total segment count of the cluster.
      
      When we add new segments to the cluster, `numsegments` no longer
      equals the actual segment count in the cluster, so we need to
      reshuffle the table data to all segments in two steps:
      
      	* Reshuffle the table data to all segments
      	* Update `numsegments`
      
      It is easy to update `numsegments`, so we focus on how to reshuffle
      the table data. There are three types of tables in Greenplum, and
      they are reshuffled in different ways. For a hash-distributed table,
      we reshuffle data based on an Update statement. Updating the hash
      keys of the table will generate a plan like:
      
      	Update
      		->Redistributed Motion
      			->SplitUpdate
      				->SeqScan
      
      We cannot use this plan to reshuffle table data directly. The
      problem is that the Motion node needs to know the segment count
      when it computes the destination segment: for the deleted tuple it
      needs the old segment count, which equals `numsegments`; for the
      inserted tuple, on the other hand, it needs the new segment count.
      So we add a new operator, Reshuffle, to compute the destination
      segment. It records O and N (O is the count of old segments and N
      is the count of new segments), and the plan is adjusted like:
      
      	Update
      		->Explicit Motion
      			->Reshuffle
      				->SplitUpdate
      					->SeqScan
      
      It can compute the destination segments directly from O and N. At
      the same time we change the Motion type to Explicit, so it can send
      a tuple to the destination segment computed in the Reshuffle node.
      
      With the hash method changed to jump consistent hash, not all of
      the table data needs to be reshuffled, so we add a new
      ReshuffleExpr to filter the tuples that need reshuffling. This
      expression computes the destination segment ahead of time; if the
      destination segment is the current segment, the tuple does not need
      to be reshuffled. With the ReshuffleExpr the plan is adjusted like:
      
      	Update
      		->Explicit Motion
      			->Reshuffle
      				->SplitUpdate
      					->SeqScan
      						|-ReshuffleExpr
      
      To reshuffle a table we use the SQL `ALTER TABLE xxx SET WITH
      (RESHUFFLE)`. It actually generates a new UpdateStmt parse tree,
      similar to the parse tree generated by the SQL `UPDATE xxx SET
      xxx.aaa = COALESCE(xxx.aaa...) WHERE ReshuffleExpr`. We set a
      reshuffle flag in the UpdateStmt so that a common update can be
      distinguished from reshuffling.
      
      In conclusion, we reshuffle a hash-distributed table with the
      Reshuffle node and the ReshuffleExpr: the ReshuffleExpr filters the
      tuples that need reshuffling and the Reshuffle node does the real
      reshuffling work. We can use the same framework to reshuffle
      randomly distributed tables and replicated tables.
      
      A randomly distributed table has no hash keys; each old segment
      needs to move (N - O) / N of its data to the new segments. In the
      ReshuffleExpr we generate a random value in [0, N); if the value
      falls in [O, N), the tuple needs to be reshuffled, so the SeqScan
      node returns it to the Reshuffle node. The Reshuffle node then
      generates a random value in [O, N), which picks the new segment the
      tuple is inserted into.
      
      For a replicated table, the data is the same on all old segments,
      so no tuples need to be deleted; we only need to copy the tuples
      from the old segments to the new segments. The ReshuffleExpr
      therefore does not filter any tuples, and in the Reshuffle node we
      drop the tuple generated for deletion and only return the inserted
      tuple to the motion. Let me illustrate this with an example:
      
      If there are 3 old segments in the cluster and we add 4 new
      segments, the segment IDs of the old segments are (0,1,2) and the
      segment IDs of the new segments are (3,4,5,6). When reshuffling the
      replicated table, seg#0 is responsible for copying data to seg#3
      and seg#6, seg#1 copies to seg#4, and seg#2 copies to seg#5.
      
      
      Co-authored-by: Ning Yu nyu@pivotal.io
      Co-authored-by: Zhenghua Lyu zlv@pivotal.io
      Co-authored-by: Shujie Zhang shzhang@pivotal.io
      f4f4bdcc
  18. 28 Sep 2018, 1 commit
    • Allow tables to be distributed on a subset of segments · 4eb65a53
      Committed by ZhangJackey
      There was an assumption in GPDB that a table's data is always
      distributed on all segments, but this is not always true: for example,
      when a cluster is expanded from M segments to N (N > M), all the tables
      are still on M segments. To work around the problem we used to have to
      alter all hash-distributed tables to randomly distributed to get
      correct query results, at the cost of bad performance.
      
      Now we support table data to be distributed on a subset of segments.
      
      A new column `numsegments` is added to the catalog table
      `gp_distribution_policy` to record how many segments a table's data is
      distributed on.  By doing so we can allow DMLs on M-segment tables, and
      joins between M-segment and N-segment tables are also supported.
      
      ```sql
      -- t1 and t2 are both distributed on (c1, c2),
      -- one on 1 segments, the other on 2 segments
      select localoid::regclass, attrnums, policytype, numsegments
          from gp_distribution_policy;
       localoid | attrnums | policytype | numsegments
      ----------+----------+------------+-------------
       t1       | {1,2}    | p          |           1
       t2       | {1,2}    | p          |           2
      (2 rows)
      
      -- t1 and t1 have exactly the same distribution policy,
      -- join locally
      explain select * from t1 a join t1 b using (c1, c2);
                         QUERY PLAN
      ------------------------------------------------
       Gather Motion 1:1  (slice1; segments: 1)
         ->  Hash Join
               Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
               ->  Seq Scan on t1 a
               ->  Hash
                     ->  Seq Scan on t1 b
       Optimizer: legacy query optimizer
      
      -- t1 and t2 are both distributed on (c1, c2),
      -- but as they have different numsegments,
      -- one has to be redistributed
      explain select * from t1 a join t2 b using (c1, c2);
                                QUERY PLAN
      ------------------------------------------------------------------
       Gather Motion 1:1  (slice2; segments: 1)
         ->  Hash Join
               Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
               ->  Seq Scan on t1 a
               ->  Hash
                     ->  Redistribute Motion 2:1  (slice1; segments: 2)
                           Hash Key: b.c1, b.c2
                           ->  Seq Scan on t2 b
       Optimizer: legacy query optimizer
      ```
      4eb65a53
  19. 31 Aug 2018, 1 commit
  20. 25 Aug 2018, 1 commit
    • Refactor parse analysis of DISTRIBUTED BY, to fix bug with inheritance. · c892d95f
      Committed by Heikki Linnakangas
      Previously, we would create a GpPolicy object to represent the DISTRIBUTED
      BY clause in the parse analysis stage, and store it in CreateStmt->policy.
      That's problematic with table inheritance, however. A GpPolicy refers to
      the columns by attribute numbers, and we don't know the final attribute
      numbers of the columns at parse analysis yet, because the table might
      inherit additional columns from a parent table. We tried to adjust the
      attribute numbers in MergeAttributes(), but that was complicated, and
      we got it wrong in the case reported in github issue #4462.
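
      A sketch of the kind of inheritance case that motivated the change
      (table and column names are hypothetical); the child's final attribute
      numbers are known only after the inherited columns have been merged in:

      ```sql
      CREATE TABLE parent (a int, b int) DISTRIBUTED BY (a);
      -- Column b's attribute number in the child is only known after the
      -- inherited columns from parent are merged in.
      CREATE TABLE child (c int) INHERITS (parent) DISTRIBUTED BY (b);
      ```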
      
      To fix, delay constructing the GpPolicy object until after we have
      processed table inheritance. transformDistributedBy() now returns
      a DistributedBy object, which is equivalent to the GpPolicy object we
      used to create there, but it refers to the columns by name rather than
      attribute number.
      
      With CREATE TABLE AS, we still create a GpPolicy to represent the
      DISTRIBUTED BY clause, if the user gave one explicitly. That works,
      because you cannot create an inherited table with CREATE TABLE AS.
      This is slightly inconsistent with the regular CREATE TABLE case, but
      the processing is quite different anyway. In the CTAS case, we deduce
      the default policy in the planner, rather than parse analysis, and
      we deal with a GpPolicy from that point on.
      
      Fixes https://github.com/greenplum-db/gpdb/issues/4462
      c892d95f
  21. 20 Aug 2018, 1 commit
    • Backport PG patch 67770803 to fix a panic in plancache code. (#5514) · 3883d52c
      Committed by Paul Guo
      Actually we saw this panic on GPDB5. This issue was fixed on PG
      in 2014 and gpdb master does not include this fix yet.
      
      Here is the crash stack on GPDB5.
          \#2  <signal handler called>
          \#3  transformStmt (pstate=pstate@entry=0x1464200, parseTree=parseTree@entry=0x0) at analyze.c:269
          \#4  0x00000000005b5852 in parse_analyze (parseTree=parseTree@entry=0x0, sourceText=sourceText@entry=0x207c8c0 "",
              paramTypes=0x0, numParams=0) at analyze.c:166
          \#5  0x000000000082cbc9 in pg_analyze_and_rewrite (parsetree=0x0, query_string=0x207c8c0 "",
              paramTypes=<optimized out>, numParams=<optimized out>) at postgres.c:811
          \#6  0x000000000092a232 in RevalidateCachedPlanWithParams (plansource=plansource@entry=0x207c830,
              useResOwner=useResOwner@entry=0 '\000', boundParams=boundParams@entry=0x0, intoClause=intoClause@entry=0x0)
              at plancache.c:556
          \#7  0x000000000092a462 in RevalidateCachedPlan (plansource=plansource@entry=0x207c830,
              useResOwner=useResOwner@entry=0 '\000') at plancache.c:665
          \#8  0x0000000000828b1a in exec_bind_message (input_message=input_message@entry=0x7fff0c00a300) at postgres.c:2521
          \#9  0x000000000082b935 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x12e1bb0, dbname=<optimized out>,
              username=<optimized out>) at postgres.c:5309
      
      Here is part of the PG commit message.
      
          commit 67770803
          Author: Tom Lane <tgl@sss.pgh.pa.us>
          Date:   Wed Nov 12 15:58:37 2014 -0500
      
              Explicitly support the case that a plancache's raw_parse_tree is NULL.
      
              This only happens if a client issues a Parse message with an empty query
              string, which is a bit odd; but since it is explicitly called out as legal
              by our FE/BE protocol spec, we'd probably better continue to allow it.
      
              ......
      3883d52c
  22. 15 Aug 2018, 1 commit
  23. 03 Aug 2018, 1 commit
  24. 02 Aug 2018, 1 commit
    • Merge with PostgreSQL 9.2beta2. · 4750e1b6
      Committed by Richard Guo
      This is the final batch of commits from PostgreSQL 9.2 development,
      up to the point where the REL9_2_STABLE branch was created, and 9.3
      development started on the PostgreSQL master branch.
      
      Notable upstream changes:
      
      * Index-only scan was included in the batch of upstream commits. It
        allows queries to retrieve data only from indexes, avoiding heap access.
      
      * Group commit was added to work effectively under heavy load. Previously,
        batching of commits became ineffective as the write workload increased,
        because of internal lock contention.
      
      * A new fast-path lock mechanism was added to reduce the overhead of
        taking and releasing certain types of locks which are taken and released
        very frequently but rarely conflict.
      
      * The new "parameterized path" mechanism was added. It allows inner index
        scans to use values from relations that are more than one join level up
        from the scan. This can greatly improve performance in situations where
        semantic restrictions (such as outer joins) limit the allowed join orderings.
      
      * SP-GiST (Space-Partitioned GiST) index access method was added to support
        unbalanced partitioned search structures. For suitable problems, SP-GiST can
        be faster than GiST in both index build time and search time.
      
      * Checkpoints now are performed by a dedicated background process. Formerly
        the background writer did both dirty-page writing and checkpointing. Separating
        this into two processes allows each goal to be accomplished more predictably.
      
      * Custom plan was supported for specific parameter values even when using
        prepared statements.
      
      * API for FDW was improved to provide multiple access "paths" for their tables,
        allowing more flexibility in join planning.
      
      * The security_barrier option was added for views to prevent optimizations that
        might allow view-protected data to be exposed to users.
      
      * Range data type was added to store a lower and upper bound belonging to its
        base data type.
      
      * CTAS (CREATE TABLE AS / SELECT INTO) is now treated as a utility statement. The
        SELECT query is planned during the execution of the utility. To conform to
        this change, GPDB executes the utility statement only on the QD and dispatches
        the plan of the SELECT query to the QEs.
      Co-authored-by: Adam Lee <ali@pivotal.io>
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Co-authored-by: Asim R P <apraveen@pivotal.io>
      Co-authored-by: Daniel Gustafsson <dgustafsson@pivotal.io>
      Co-authored-by: Gang Xiong <gxiong@pivotal.io>
      Co-authored-by: Haozhou Wang <hawang@pivotal.io>
      Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
      Co-authored-by: Jinbao Chen <jinchen@pivotal.io>
      Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
      Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
      Co-authored-by: Paul Guo <paulguo@gmail.com>
      Co-authored-by: Richard Guo <guofenglinux@gmail.com>
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
      4750e1b6
  25. 29 Mar 2018, 1 commit
    • Support replicated table in GPDB · 7efe3204
      Committed by Pengzhou Tang
      * Support replicated table in GPDB
      
      Currently, tables in GPDB are distributed across all segments by hash or randomly.
      There is a requirement to introduce a new table type, called a replicated table,
      where every segment holds a full, duplicate copy of the table data.
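
      An illustrative DDL sketch for this table type (the table name is
      hypothetical, and the clause shown assumes the DISTRIBUTED REPLICATED
      syntax used for replicated tables):

      ```sql
      -- Every segment stores a full copy of the table's data.
      CREATE TABLE ref_data (id int, val text) DISTRIBUTED REPLICATED;
      ```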
      
      To implement it, we added a new distribution policy named POLICYTYPE_REPLICATED to mark
      a replicated table and a new locus type named CdbLocusType_SegmentGeneral to describe
      the distribution of the tuples of a replicated table. CdbLocusType_SegmentGeneral implies
      that the data is generally available on all segments but not on the QD, so a plan node with
      this locus type can be flexibly planned to execute on either a single QE or all QEs. It is
      similar to CdbLocusType_General; the only difference is that a CdbLocusType_SegmentGeneral
      node can't be executed on the QD. To guarantee this, we try our best to add a gather motion
      on top of a CdbLocusType_SegmentGeneral node when planning motions for a join, even if the
      other rel has a bottleneck locus type. Such a motion may be redundant if the single QE
      is not promoted to execute on the QD in the end, so we detect that case and omit the
      redundant motion at the end of apply_motion(). We don't reuse CdbLocusType_Replicated since
      it always implies a broadcast motion below, and it's not easy to plan such a node as direct
      dispatch to avoid getting duplicate data.
      
      We don't support replicated tables with an inherit/partition by clause yet; the main
      problem is that update/delete on multiple result relations can't work correctly now.
      We can fix this later.
      
      * Allow spi_* to access replicated table on QE
      
      Previously, GPDB didn't allow a QE to access non-catalog tables because their data
      is incomplete; we can remove this limitation now if the QE only accesses replicated
      tables.
      
      One problem is that the QE needs to know whether a table is a replicated table.
      Previously, QEs didn't maintain the gp_distribution_policy catalog, so we need to
      pass the policy info to the QEs for replicated tables.
      
      * Change schema of gp_distribution_policy to identify replicated table
      
      Previously, we used a magic number -128 in the gp_distribution_policy table
      to identify replicated tables, which was quite a hack, so we add a new column
      to gp_distribution_policy to identify replicated tables and partitioned
      tables.
      
      This commit also abandons the old way of using a 1-length NULL list and a
      2-length NULL list to identify the DISTRIBUTED RANDOMLY and DISTRIBUTED
      FULLY clauses.
      
      Besides, this commit refactors the code to make the decision-making for the
      distribution policy clearer.
      
      * support COPY for replicated table
      
      * Disable the rowid unique path for replicated tables.
        Previously, GPDB used a special Unique path on rowid to address queries
        like "x IN (subquery)". For example, for
        select * from t1 where t1.c2 in (select c2 from t3), the plan looks
        like:
         ->  HashAggregate
               Group By: t1.ctid, t1.gp_segment_id
                  ->  Hash Join
                        Hash Cond: t2.c2 = t1.c2
                      ->  Seq Scan on t2
                      ->  Hash
                          ->  Seq Scan on t1
      
        Obviously, the plan is wrong if t1 is a replicated table, because ctid
        + gp_segment_id can't identify a tuple: in a replicated table a logical
        row may have different ctid and gp_segment_id values. So we disable such
        plans for replicated tables temporarily. This is not the best way,
        because the rowid unique path may be cheaper than a normal hash semi
        join, so we left a FIXME for later optimization.
      
      * ORCA related fix
        Reported and added by Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
        Fallback to legacy query optimizer for queries over replicated table
      
      * Adapt pg_dump/gpcheckcat to replicated tables
        gp_distribution_policy is no longer a master-only catalog; do the
        same check as for other catalogs.
      
      * Support gpexpand on replicated table && alter the dist policy of replicated table
      7efe3204
  26. 15 Feb 2018, 1 commit
    • Re-enable RETURNING. · 8a736a5f
      Committed by Heikki Linnakangas
      Now that we have merged the upstream changes that introduced the
      ModifyTable node, it seems like a good time to fix RETURNING. So, here
      we go.
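
      A small sketch of the re-enabled capability (table and column names
      are hypothetical):

      ```sql
      INSERT INTO t (a, b) VALUES (1, 2) RETURNING a, b;
      UPDATE t SET b = b + 1 WHERE a = 1 RETURNING *;
      ```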
      8a736a5f
  27. 24 Nov 2017, 3 commits
    • Centralize the logic for detecting misplaced aggregates, window funcs, etc. · fc8f849d
      Committed by Heikki Linnakangas
      This cherry-picks the following commit. This is needed because subsequent
      commits depend on this one.
      
      I took the EXPR_KIND_PARTITION_EXPRESSION value from PostgreSQL v10, where
      it's also for partition-related things. Seems like a good idea, even though
      our partitioning implementation is completely different.
      
      commit eaccfded
      Author: Tom Lane <tgl@sss.pgh.pa.us>
      Date:   Fri Aug 10 11:35:33 2012 -0400
      
          Centralize the logic for detecting misplaced aggregates, window funcs, etc.
      
          Formerly we relied on checking after-the-fact to see if an expression
          contained aggregates, window functions, or sub-selects when it shouldn't.
          This is grotty, easily forgotten (indeed, we had forgotten to teach
          DefineIndex about rejecting window functions), and none too efficient
          since it requires extra traversals of the parse tree.  To improve matters,
          define an enum type that classifies all SQL sub-expressions, store it in
          ParseState to show what kind of expression we are currently parsing, and
          make transformAggregateCall, transformWindowFuncCall, and transformSubLink
          check the expression type and throw error if the type indicates the
          construct is disallowed.  This allows removal of a large number of ad-hoc
          checks scattered around the code base.  The enum type is sufficiently
          fine-grained that we can still produce error messages of at least the
          same specificity as before.
      
          Bringing these error checks together revealed that we'd been none too
          consistent about phrasing of the error messages, so standardize the wording
          a bit.
      
          Also, rewrite checking of aggregate arguments so that it requires only one
          traversal of the arguments, rather than up to three as before.
      
          In passing, clean up some more comments left over from add_missing_from
          support, and annotate some tests that I think are dead code now that that's
          gone.  (I didn't risk actually removing said dead code, though.)
      
      Author: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Author: Ekta Khanna <ekhanna@pivotal.io>
      fc8f849d
    • Backport implementation of ORDER BY within aggregates, from PostgreSQL 9.0. · 4319b7bb
      Committed by Heikki Linnakangas
      This is functionality that was lost by the ripout & replace.
      
      commit 34d26872
      Author: Tom Lane <tgl@sss.pgh.pa.us>
      Date:   Tue Dec 15 17:57:48 2009 +0000
      
          Support ORDER BY within aggregate function calls, at long last providing a
          non-kluge method for controlling the order in which values are fed to an
          aggregate function.  At the same time eliminate the old implementation
          restriction that DISTINCT was only supported for single-argument aggregates.
      
          Possibly release-notable behavioral change: formerly, agg(DISTINCT x)
          dropped null values of x unconditionally.  Now, it does so only if the
          agg transition function is strict; otherwise nulls are treated as DISTINCT
          normally would, ie, you get one copy.
      
          Andrew Gierth, reviewed by Hitoshi Harada
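
      A sketch of the backported syntax (table and column names are
      hypothetical); ORDER BY inside the call controls the order in which
      values reach the aggregate's transition function:

      ```sql
      SELECT array_agg(c ORDER BY c DESC) FROM t;
      ```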
      4319b7bb
    • Remove PercentileExpr. · bb6a757e
      Committed by Heikki Linnakangas
      This loses the functionality, and leaves all the regression tests that used
      those functions failing.
      
      The plan is to later backport the upstream implementation of those
      functions from PostgreSQL 9.4. The feature is called "ordered set
      aggregates" there.
      bb6a757e
  28. 11 Nov 2017, 1 commit
  29. 27 Sep 2017, 1 commit
  30. 25 Sep 2017, 1 commit
  31. 23 Sep 2017, 1 commit
  32. 21 Sep 2017, 2 commits
    • Fix CURRENT OF to work with PL/pgSQL cursors. · 91411ac4
      Committed by Heikki Linnakangas
      Before, this only worked for cursors declared with DECLARE CURSOR; you
      got a "there is no parameter $0" error if you tried it with a PL/pgSQL
      cursor. This moves the decision on whether a plan is "simply updatable"
      from the parser to the planner. Doing it in the parser was awkward,
      because we only want to do it for queries that are used in a cursor,
      and for SPI queries we don't know that at that time yet.
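
      A minimal sketch of the PL/pgSQL pattern that used to fail with the
      "$0" error (function, table and column names are hypothetical):

      ```sql
      CREATE FUNCTION bump_one() RETURNS void AS $$
      DECLARE
          cur CURSOR FOR SELECT c FROM t;
          v   int;
      BEGIN
          OPEN cur;
          FETCH cur INTO v;
          -- Previously raised "there is no parameter $0".
          UPDATE t SET c = v + 1 WHERE CURRENT OF cur;
          CLOSE cur;
      END;
      $$ LANGUAGE plpgsql;
      ```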
      
      For some reason, the copy, out, read-functions of CurrentOfExpr were missing
      the cursor_param field. While we're at it, reorder the code to match
      upstream.
      
      This only makes the required changes to the Postgres planner. ORCA has never
      supported updatable cursors. In fact, it will fall back to the Postgres
      planner on any DECLARE CURSOR command, so that's why the existing tests
      have passed even with optimizer=off.
      91411ac4
    • Add support for CREATE FUNCTION EXECUTE ON [MASTER | ALL SEGMENTS] · aa148d2a
      Committed by Heikki Linnakangas
      We already had a hack for the EXECUTE ON ALL SEGMENTS case, by setting
      prodataaccess='s'. This exposes the functionality to users via DDL, and adds
      support for the EXECUTE ON MASTER case.
      
      There was discussion on gpdb-dev about also supporting ON MASTER AND ALL
      SEGMENTS, but that is not implemented yet. There is no handy "locus" in the
      planner to represent that. There was also discussion about making a
      gp_segment_id column implicitly available for functions, but that is also
      not implemented yet.
      
      The old behavior was that if a function was marked as IMMUTABLE, it
      could be executed anywhere; otherwise it was always executed on the
      master. For backwards compatibility, this keeps that behavior for
      EXECUTE ON ANY (the default), so even if a function is marked as
      EXECUTE ON ANY, it will always be executed on the master unless it's
      IMMUTABLE.
      
      There is no support for these new options in ORCA. Using any ON MASTER or
      ON ALL SEGMENTS functions in a query cause ORCA to fall back. This is the
      same as with the prodataaccess='s' hack that this replaces, but now that it
      is more user-visible, it would be nice to teach ORCA about it.
      
      The new options are only supported for set-returning functions, because for
      a regular function marked as EXECUTE ON ALL SEGMENTS, it's not clear how
      the results should be combined. ON MASTER would probably be doable, but
      there's no need for that right now, so punt.
      
      Another restriction is that a function with ON ALL SEGMENTS or ON MASTER can
      only be used in the FROM clause, or in the target list of a simple SELECT
      with no FROM clause. So "SELECT func()" is accepted, but "SELECT func() FROM
      foo" is not. "SELECT * FROM func(), foo" works, however. EXECUTE ON ANY
      functions, which is the default, work the same as before.
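
      A sketch of the new DDL (the function name and body are hypothetical);
      note that it is a set-returning function and is called in the FROM
      clause, per the restrictions above:

      ```sql
      CREATE FUNCTION seg_report() RETURNS SETOF text AS $$
          SELECT version();
      $$ LANGUAGE sql EXECUTE ON ALL SEGMENTS;

      SELECT * FROM seg_report();
      ```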
      aa148d2a
  33. 16 Sep 2017, 1 commit
    • Fix CREATE TABLE AS VALUES ... DISTRIBUTED BY · 47936ab2
      Committed by Heikki Linnakangas
      Should call setQryDistributionPolicy() after applyColumnNames(), otherwise
      the column names specified in the CREATE TABLE cannot be used in the
      DISTRIBUTED BY clause. Add test case.
      
      Fixes github issue #3285.
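
      A sketch of the now-working case (table and column names are
      hypothetical); the column aliases given in CREATE TABLE AS are visible
      to the DISTRIBUTED BY clause:

      ```sql
      CREATE TABLE t_vals (a, b) AS VALUES (1, 'x'), (2, 'y')
          DISTRIBUTED BY (a);
      ```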
      47936ab2
  34. 15 Sep 2017, 1 commit
  35. 12 Sep 2017, 1 commit
    • Split WindowSpec into separate before and after parse-analysis structs. · 789f443d
      Committed by Heikki Linnakangas
      In the upstream, two different structs are used to represent a window
      definition. WindowDef in the grammar, which is transformed into
      WindowClause during parse analysis. In GPDB, we've been using the same
      struct, WindowSpec, in both stages. Split it up, to match the upstream.
      
      The representation of the window frame, i.e. "ROWS/RANGE BETWEEN ..." was
      different between the upstream implementation and the GPDB one. We now use
      the upstream frameOptions+startOffset+endOffset representation in raw
      WindowDef parse node, but it's still converted to the WindowFrame
      representation for the later stages, so WindowClause still uses that. I
      will switch over the rest of the codebase to the upstream representation as
      a separate patch.
      
      Also, refactor WINDOW clause deparsing to be closer to upstream.
      
      One notable difference is that the old WindowSpec.winspec field corresponds
      to the winref field in WindowDef and WindowClause, except that the new
      'winref' is 1-based, while the old field was 0-based.
      
      Another noteworthy thing is that this forbids specifying "OVER (w
      ROWS/RANGE BETWEEN ...", if the window "w" already specified a window frame,
      i.e. a different ROWS/RANGE BETWEEN. There was one such case in the
      regression suite, in window_views, and this updates the expected output of
      that to be an error.
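
      A sketch of the construct that is now rejected (table and column names
      are hypothetical): the referenced window already carries a frame
      clause, so the OVER clause may not add another one:

      ```sql
      SELECT sum(c) OVER (w ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
      FROM t
      WINDOW w AS (ORDER BY c RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
      ```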
      789f443d
  36. 08 Sep 2017, 1 commit
    • Refactor window function syntax checks to match upstream. · f819890b
      Committed by Heikki Linnakangas
      Mostly, move the responsibilities of the check_call() function to the
      callers, transformAggregateCall() and transformWindowFuncCall().
      
      This fixes one long-standing, albeit harmless, bug. Previously, you got
      an "Unexpected internal error" if you tried to use a window function in
      the WHERE clause of a DELETE statement, instead of a user-friendly
      syntax error. Add a test case for that.
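
      A sketch of the query that now produces a user-friendly error instead
      of an internal one (table and column names are hypothetical):

      ```sql
      DELETE FROM t WHERE row_number() OVER (ORDER BY c) > 10;
      ```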
      
      Move a few similar tests from 'olap_window_seq' to 'qp_olap_windowerr'.
      Seems like a more appropriate place for them. Also, 'olap_window_seq' has
      an alternative expected output file for ORCA, so it's nice to keep tests
      that produce the same output with or without ORCA out of there. Also add a
      test query for creating an index on an expression containing a window
      function. There was a test for that already, but it was missing parens
      around the expression, and therefore produced an error already in the
      grammar.
      f819890b