1. 14 Mar 2019 (2 commits)
  2. 01 Feb 2019 (1 commit)
    • Use normal hash operator classes for data distribution. · 242783ae
      Committed by Heikki Linnakangas
      Replace the use of the built-in hashing support for built-in datatypes, in
      cdbhash.c, with the normal PostgreSQL hash functions. Now is a good time
      to do this, since we've already made the change to use jump consistent
      hashing in GPDB 6, so we'll need to deal with the upgrade problems
      associated with changing the hash functions, anyway.
      
      It is no longer enough to track which columns/expressions are used to
      distribute data. You also need to know the hash function used. For that,
      a new field is added to gp_distribution_policy, to record the hash
      operator class used for each distribution key column. In the planner,
      a new opfamily field is added to DistributionKey, to track that throughout
      the planning.
      
      Normally, if you do "CREATE TABLE ... DISTRIBUTED BY (column)", the
      default hash operator class for the datatype is used. But this patch
      extends the syntax so that you can specify the operator class explicitly,
      like "... DISTRIBUTED BY (column opclass)". This is similar to how an
      operator class can be specified for each column in CREATE INDEX.
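      For example, a minimal sketch (table and column names are illustrative;
      cdbhash_int4_ops is the legacy opclass described below):

      ```
      -- Default: c1 uses the default hash opclass for its datatype.
      CREATE TABLE t1 (c1 int, c2 text) DISTRIBUTED BY (c1);

      -- Explicit: name the hash operator class, as in CREATE INDEX.
      CREATE TABLE t2 (c1 int, c2 text) DISTRIBUTED BY (c1 cdbhash_int4_ops);
      ```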
      
      To support upgrade, the old hash functions have been converted to special
      (non-default) operator classes, named cdbhash_*_ops. For example, if you
      want to use the old hash function for an integer column, you could do
      "DISTRIBUTED BY (intcol cdbhash_int4_ops)". The old hard-coded whitelist
      of operators that have "compatible" cdbhash functions has been replaced
      by putting the compatible hash opclasses in the same operator family. For
      example, the legacy integer operator classes cdbhash_int2_ops,
      cdbhash_int4_ops and cdbhash_int8_ops are all part of the
      cdbhash_integer_ops operator family.
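      A hedged way to see that grouping from the catalogs (a sketch using the
      standard pg_opclass/pg_opfamily catalogs; the family name is from the
      text above):

      ```
      -- The legacy integer opclasses all belong to one operator family,
      -- so joins between legacy int2/int4/int8 keys can still co-locate:
      SELECT opf.opfname AS opfamily, opc.opcname AS opclass
      FROM pg_opclass opc
      JOIN pg_opfamily opf ON opf.oid = opc.opcfamily
      WHERE opf.opfname = 'cdbhash_integer_ops';
      ```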
      
      This removes the pg_database.hashmethod field. The hash method is now
      tracked on a per-table and per-column basis, using the opclasses, so it's
      not needed anymore.
      
      To help with upgrade from GPDB 5, this introduces a new GUC called
      'gp_use_legacy_hashops'. If it's set, CREATE TABLE uses the legacy hash
      opclasses, instead of the default hash opclasses, if the opclass is not
      specified explicitly. pg_upgrade will set the new GUC, to force the use of
      legacy hashops, when restoring the schema dump. It will also set the GUC
      on all upgraded databases, as a per-database option, so any new tables
      created after upgrade will also use the legacy opclasses. It seems better
      to be consistent after upgrade, so that collocation between old and new
      tables works, for example. The idea is that some time after the upgrade,
      the admin can reorganize all tables to use the default opclasses instead.
      At that point, they should also clear the GUC on the converted databases.
      (Or rather, the automated tool that hasn't been written yet should do
      that.)
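      A sketch of that workflow (database and table names are illustrative;
      ALTER TABLE ... SET WITH (REORGANIZE=true) is the usual GPDB
      redistribution syntax):

      ```
      -- pg_upgrade sets the GUC per database, so new tables stay legacy:
      ALTER DATABASE mydb SET gp_use_legacy_hashops = on;

      -- Later: convert a table to the default opclasses (with the GUC off
      -- in this session, so the plain syntax picks the defaults)...
      SET gp_use_legacy_hashops = off;
      ALTER TABLE t2 SET WITH (REORGANIZE = true) DISTRIBUTED BY (c1);

      -- ...and once everything is converted, clear the per-database GUC.
      ALTER DATABASE mydb RESET gp_use_legacy_hashops;
      ```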
      
      ORCA doesn't know about hash operator classes, or the possibility that we
      might need to use a different hash function for two columns with the same
      datatype. Therefore, it cannot produce correct plans for queries that mix
      different distribution hash opclasses for the same datatype, in the same
      query. There are checks in the Query->DXL translation, to detect that
      case, and fall back to the planner. As long as you stick to the default
      opclasses in all tables, we let ORCA create the plan without any regard
      to them, and use the default opclasses when translating the DXL plan to a
      Plan tree. We also allow the case that all tables in the query use the
      "legacy" opclasses, so that ORCA works after pg_upgrade. But a mix of the
      two, or using any non-default opclasses, forces ORCA to fall back.
      
      One curiosity with this is the "int2vector" and "aclitem" datatypes. They
      have a hash opclass, but no b-tree operators. GPDB 4 used to allow them
      as DISTRIBUTED BY columns, but we forbade that in GPDB 5, in commit
      56e7c16b. Now they are allowed again, so you can specify an int2vector
      or aclitem column in DISTRIBUTED BY, but it's still pretty useless,
      because the planner still can't form EquivalenceClasses on it, and will
      treat it as "strewn" distribution, and won't co-locate joins.
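      For instance (a sketch; the table name is illustrative):

      ```
      -- Allowed again, but of limited use: without b-tree operators the
      -- planner can't form EquivalenceClasses on k, treats the data as
      -- "strewn", and won't co-locate joins on it.
      CREATE TABLE t_vec (k int2vector, v int) DISTRIBUTED BY (k);
      ```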
      
      The abstime, reltime and tinterval datatypes don't have default hash
      opclasses. They are being removed completely in PostgreSQL v12, and users
      shouldn't be using them in the first place, so instead of adding hash
      opclasses for them now, we accept that they can't be used as distribution
      key columns anymore. Add a check to pg_upgrade, to refuse the upgrade if
      they are used as distribution keys in the old cluster. Do the same for
      the 'money' datatype as well, although that is not being removed
      upstream.
      
      The legacy hashing code for anyarray in GPDB 5 was actually broken. It
      could produce a different hash value for two arrays that are considered
      equal, according to the = operator, if there were differences in e.g.
      whether the null bitmap was stored or not. Add a check to pg_upgrade, to
      reject the upgrade if array types were used as distribution keys. The
      upstream hash opclass for anyarray works, though, so it is OK to use
      arrays as distribution keys in new tables. We just don't support binary
      upgrading them from GPDB 5. (See github issue
      https://github.com/greenplum-db/gpdb/issues/5467). The legacy hashing of
      'anyrange' had the same problem, but that was new in GPDB 6, so we don't
      need a pg_upgrade check for that.
      
      This also tightens the checks in ALTER TABLE ALTER COLUMN and CREATE
      UNIQUE INDEX, so that you can no longer create a situation where a
      non-hashable column becomes the distribution key. (Fixes github issue
      https://github.com/greenplum-db/gpdb/issues/6317)
      
      Discussion: https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/4fZVeOpXllQ
      Co-authored-by: Mel Kiyama <mkiyama@pivotal.io>
      Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
      Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
      Co-authored-by: Chris Hajas <chajas@pivotal.io>
      Reviewed-by: Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
      Reviewed-by: Ning Yu <nyu@pivotal.io>
      Reviewed-by: Simon Gao <sgao@pivotal.io>
      Reviewed-by: Jesse Zhang <jzhang@pivotal.io>
      Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      Reviewed-by: Yandong Yao <yyao@pivotal.io>
  3. 09 Jan 2019 (1 commit)
    • Do not enforce join ordering for ANTI and LASJ. (#6625) · 29daab51
      Committed by Richard Guo
      The following identity holds true:
      
      	(A antijoin B on (Pab)) innerjoin C on (Pac)
          	= (A innerjoin C on (Pac)) antijoin B on (Pab)
      
      So we should not enforce join ordering for ANTI. Instead we need to
      collapse ANTI join nodes so that they participate fully in the join
      order search.
      
      For example:
      
      	select * from a join b on a.i = b.i where
      		not exists (select i from c where a.i = c.i);
      
      For this query, the original join order is "(a innerjoin b) antijoin c".
      If we enforce ANTI join ordering, this will be the final join order. But
      another join order, "(a antijoin c) innerjoin b", is also legal. We
      should take this order into consideration and pick the cheaper one.
      
      LASJ is handled the same way as ANTI joins.
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
  4. 07 Jan 2019 (1 commit)
    • Refactor executor Squelch and EagerFree functions. · 6195b967
      Committed by Heikki Linnakangas
      Merge the two concepts, squelching and eager-freeing. There is now only
      one function, ExecSquelchNode(), that you can call. It recurses to
      children, as before, but it now also performs eager-freeing of resources.
      Previously that was done as a separate call, but that was an unnecessary
      distinction, because all callers of ExecSquelchNode() also called
      ExecEagerFree().
      
      The concept of eager-freeing still lives on, as ExecEagerFree*() functions
      specific to many node types. But it no longer recurses! The pattern is
      that ExecSquelchNode() always performs eager freeing of the node, and also
      recurses. In addition to that, some node types also call the node-specific
      EagerFree function of the same node, after reaching the end of tuples.
      
      This makes it more clear which function should be called when.
      
      ExecEagerWalker() used to have special handling for the pattern of:
        Result
        -> Material
          -> Broadcast Motion
      
      I tried removing that, but then I started to get "illegal rescan of motion
      node" errors in the regression tests, from queries with a LIMIT node in a
      subquery. Upon closer look, I believe that was because the Limit node was
      calling ExecSquelchNode() on the input, even though the Limit node was
      marked as rescannable. To fix that, I added delayEagerFree logic to the
      Limit node, to not call ExecSquelchNode() when the node might get
      rescanned later.
      
      The planstate_walk_kids() code did not know how to recurse into the
      children of a MergeAppend node. We missed adding that logic, when we
      merged the MergeAppend node type from upstream, in 9.1. We don't use that
      mechanism for recursing in ExecSquelchNode() anymore, but that probably
      should be fixed, anyway, as a separate commit later.
      
      Fixes https://github.com/greenplum-db/gpdb/issues/6602 and
      https://github.com/greenplum-db/gpdb/issues/6074.
      Reviewed-by: Tang Pengzhou <ptang@pivotal.io>
  5. 29 Oct 2018 (1 commit)
    • Simplify direct dispatch related code (#6080) · 576690f2
      Committed by Tang Pengzhou
      * Simplify direct dispatch related code
      
      This commit includes two parts:
      * simplify direct-dispatch dispatching code
      * simplify direct-dispatch DTM related code
      
      Previously, cdbdisp_dispatchToGang needed CdbDispatchDirectDesc info;
      now a gang only contains the in-use segments, so the direct-dispatch
      info is useless.

      Another thing is that we need to decide within dtmPreCommand whether
      DTM is available for direct dispatch. The logic is complex: you need to
      know whether the main plan is direct-dispatched and whether the init
      plan contains direct dispatch.
      
      One example is:
      "update foo set foo.c2 = 2
      where foo.c1 = 1 and exists (select * from bar where bar.c1=4)"

      The main plan can be direct-dispatched to segment 1 and the init plan
      to segment 2. With the old logic, DTM commands like PREPARE needed to
      be dispatched to all segments, so dtmPreCommand had to dispatch a DTM
      command named 'DTX_PROTOCOL_COMMAND_STAY_AT_OR_BECOME_IMPLIED' to every
      segment, so that segments like segment 3, which didn't receive the
      plan, could be ready for two-phase commit.
      
      With the new gang API, we can simplify this process: we add a list in
      currentGxact to record which segments actually get involved in a
      two-phase commit, and then we can dispatch DTM commands to them directly.
      This is also very useful for queries on tables that are not fully
      expanded yet.
      
      * support direct dispatch to more than one segment
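      A hedged sketch of the query shapes involved (assuming foo and bar are
      hash-distributed by c1; the IN example illustrates the
      more-than-one-segment case):

      ```
      -- Equality on the distribution key pins each plan to one segment;
      -- the two-phase commit now only involves the segments that took part.
      update foo set foo.c2 = 2
      where foo.c1 = 1 and exists (select * from bar where bar.c1 = 4);

      -- Multiple pinned values can now be direct-dispatched to just the
      -- segments that hold them, instead of the whole cluster.
      select * from foo where c1 in (1, 4);
      ```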
  6. 10 Oct 2018 (1 commit)
    • Enable ALL_SUBLINK pullup based on LHS input. · 996639e0
      Committed by Richard Guo
      This patch takes the LHS input into consideration when pulling up
      sublinks of type ALL_SUBLINK. As a result, more sublinks can be converted
      into joins, as in the query below:
      
      ```
      select * from A,B where exists (select * from C where C.j = A.j and B.i
          not in (select C.i from C where C.i != 10));
      ```
      
      This is only possible after commit eb919e8f, in which a set of
      cartesian-product joins would be generated as a last-ditch effort, if we failed
      to find any usable joins.
      Authored-by: Richard Guo <guofenglinux@gmail.com>
      Authored-by: Alexandra Wang <leiwangcheme@gmail.com>
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
  7. 25 Sep 2018 (1 commit)
    • Allow to add motion to unique-ify the path in create_unique_path(). (#5589) · e9fe4224
      Committed by Paul Guo
      create_unique_path() can be used to convert a semi join to an inner join.
      Previously, during the semi-join refactor in commit d4ce0921, creating a
      unique path was disabled for the case where duplicates might be on
      different QEs.

      In this patch we enable adding a motion to unique-ify the path, but only
      if the unique method is not UNIQUE_PATH_NOOP. We don't create a unique
      path in that case because, later on during plan creation, it would be
      possible to create a motion above this unique path whose subpath is
      already a motion; the unique path node would then be ignored and we would
      get a motion plan node directly above another motion plan node, which is
      bad. We could improve this further, but not in this patch.
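      A hedged example of the affected shape (table names are illustrative):

      ```
      -- A semi join the planner may convert to an inner join by
      -- unique-ifying the inner side; when duplicates of t2.j can sit on
      -- different QEs, a motion is now added before the deduplication.
      select * from t1 where t1.i in (select t2.j from t2);
      ```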
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      Co-authored-by: Paul Guo <paulguo@gmail.com>
  8. 22 Sep 2018 (1 commit)
    • Change pretty-printing of expressions in EXPLAIN to match upstream. · 4c54c894
      Committed by Heikki Linnakangas
      We had changed this in GPDB, to print fewer parens. That's fine and
      dandy, but it hardly seems worth it to carry a diff vs. upstream for
      this. Which format is better is a matter of taste. The extra parens make
      some expressions clearer, but OTOH, they are unnecessarily verbose for
      simple expressions. Let's follow upstream on this.
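      A sketch of the difference in output (exact plans vary):

      ```
      EXPLAIN SELECT * FROM t WHERE a > 1 AND b < 2;
      -- old GPDB style:  Filter: a > 1 AND b < 2
      -- upstream style:  Filter: ((a > 1) AND (b < 2))
      ```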
      
      These changes were made to GPDB back in 2006, as part of backporting
      EXPLAIN-related patches from PostgreSQL 8.2. But I didn't see any
      explanation for this particular change in output in that commit message.
      
      It's nice to match upstream, to make merging easier. However, this won't
      make much difference to that: almost all EXPLAIN plans in regression
      tests are different from upstream anyway, because GPDB needs Motion nodes
      for most queries. But every little helps.
  9. 18 Sep 2018 (2 commits)
  10. 11 Sep 2018 (1 commit)
    • Make planner generate redistribute-motion. · a4cbf586
      Committed by ZhangJackey
      When doing an inner join, we test whether we can use a redistribute
      motion, in the function cdbpath_partkeys_from_preds. But if a_partkey is
      NIL (it is NIL at the beginning of the function), we append nothing to
      it, so the function can only return false. This meant the planner could
      only generate a broadcast motion for the inner relation.

      We fix this with the same logic as for an outer join.

      A WTS node is immovable; this commit adds some code to handle that.
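      A hedged example (assuming a and b are distributed on columns other than
      the join keys):

      ```
      -- Before the fix the inner relation could only be broadcast; now the
      -- planner can also consider redistributing both sides on the join key.
      select * from a inner join b on a.i = b.j;
      ```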
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
  11. 07 Sep 2018 (1 commit)
  12. 31 Aug 2018 (1 commit)
    • Replace GPDB versions of some numeric aggregates with upstream's. · 325e6fcd
      Committed by Heikki Linnakangas
      Among other things, this fixes the inaccuracy of integer avg() and sum()
      functions. (i.e. fixes https://github.com/greenplum-db/gpdb/issues/5525)
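      A hedged sketch of the class of inaccuracy (values chosen so that a
      float-based accumulator would round; whether the old code used float8,
      and the exact old output, are assumptions based on the issue report):

      ```
      -- 2^53 + 1 is not exactly representable in float8, so a float-based
      -- transition state silently rounds; the upstream numeric/128-bit
      -- transition state returns the exact result.
      SELECT sum(x) FROM (VALUES (9007199254740993::int8), (1::int8)) t(x);
      ```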
      
      The upstream versions are from PostgreSQL 9.6, using the 128-bit math
      from the following commit:
      
      commit 959277a4
      Author: Andres Freund <andres@anarazel.de>
      Date:   Fri Mar 20 10:26:17 2015 +0100
      
          Use 128-bit math to accelerate some aggregation functions.
      
          On platforms where we support 128bit integers, use them to implement
          faster transition functions for sum(int8), avg(int8),
          var_*(int2/int4),stdev_*(int2/int4). Where not supported continue to use
          numeric as a transition type.
      
          In some synthetic benchmarks this has been shown to provide significant
          speedups.
      
          Bumps catversion.
      
          Discussion: 544BB5F1.50709@proxel.se
          Author: Andreas Karlsson
          Reviewed-By: Peter Geoghegan, Petr Jelinek, Andres Freund,
              Oskari Saarenmaa, David Rowley
  13. 18 Aug 2018 (1 commit)
  14. 03 Aug 2018 (1 commit)
  15. 02 Aug 2018 (1 commit)
    • Merge with PostgreSQL 9.2beta2. · 4750e1b6
      Committed by Richard Guo
      This is the final batch of commits from PostgreSQL 9.2 development,
      up to the point where the REL9_2_STABLE branch was created, and 9.3
      development started on the PostgreSQL master branch.
      
      Notable upstream changes:
      
      * Index-only scan was included in the batch of upstream commits. It
        allows queries to retrieve data only from indexes, avoiding heap access
        (see the sketch after this list).
      
      * Group commit was added to work effectively under heavy load. Previously,
        batching of commits became ineffective as the write workload increased,
        because of internal lock contention.
      
      * A new fast-path lock mechanism was added to reduce the overhead of
        taking and releasing certain types of locks which are taken and released
        very frequently but rarely conflict.
      
      * The new "parameterized path" mechanism was added. It allows inner index
        scans to use values from relations that are more than one join level up
        from the scan. This can greatly improve performance in situations where
        semantic restrictions (such as outer joins) limit the allowed join orderings.
      
      * SP-GiST (Space-Partitioned GiST) index access method was added to support
        unbalanced partitioned search structures. For suitable problems, SP-GiST can
        be faster than GiST in both index build time and search time.
      
      * Checkpoints are now performed by a dedicated background process. Formerly
        the background writer did both dirty-page writing and checkpointing.
        Separating this into two processes allows each goal to be accomplished
        more predictably.

      * Custom plans are now supported for specific parameter values, even when
        using prepared statements.

      * The FDW API was improved to let foreign data wrappers provide multiple
        access "paths" for their tables, allowing more flexibility in join
        planning.

      * The security_barrier option was added for views, to prevent
        optimizations that might allow view-protected data to be exposed to
        users.

      * Range data types were added, storing a lower and an upper bound
        belonging to a base data type.

      * CTAS (CREATE TABLE AS / SELECT INTO) is now treated as a utility
        statement. The SELECT query is planned during the execution of the
        utility statement. To conform to this change, GPDB executes the utility
        statement only on the QD and dispatches the plan of the SELECT query to
        the QEs.
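      A minimal sketch of index-only scan, the first item above (table and
      index names are illustrative):

      ```
      CREATE TABLE measurements (id int, reading float8);
      CREATE INDEX measurements_id_idx ON measurements (id);
      VACUUM measurements;  -- sets the visibility map that the scan relies on

      -- Can be answered from the index alone, without heap access:
      SELECT id FROM measurements WHERE id BETWEEN 10 AND 20;
      ```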
      Co-authored-by: Adam Lee <ali@pivotal.io>
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Co-authored-by: Asim R P <apraveen@pivotal.io>
      Co-authored-by: Daniel Gustafsson <dgustafsson@pivotal.io>
      Co-authored-by: Gang Xiong <gxiong@pivotal.io>
      Co-authored-by: Haozhou Wang <hawang@pivotal.io>
      Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
      Co-authored-by: Jinbao Chen <jinchen@pivotal.io>
      Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
      Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
      Co-authored-by: Paul Guo <paulguo@gmail.com>
      Co-authored-by: Richard Guo <guofenglinux@gmail.com>
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
  16. 09 Jul 2018 (1 commit)
  17. 11 Apr 2018 (1 commit)
    • Address GPDB_84_MERGE_FIXME in simplify_EXISTS_query() · 99450728
      Committed by Abhijit Subramanya
      This FIXME is two-fold:
      - Handling LIMIT 0
        The LIMIT is already handled in the caller,
        convert_EXISTS_sublink_to_join(): when an existential sublink contains
        an aggregate without GROUP BY or HAVING, we can safely replace it by a
        one-time TRUE/FALSE filter, based on the type of sublink, since the
        result of an aggregate is always going to be one row even if its input
        has zero rows. However, this assumption is incorrect when the sublink
        contains LIMIT/OFFSET, for example if the final limit count after
        applying the offset is 0.
      
      - Rules for demoting HAVING to WHERE
        Previously, simplify_EXISTS_query() allowed demoting HAVING quals to
        WHERE only if the query did not contain any aggregates. To determine
        that, it used query->hasAggs, which is incorrect, since hasAggs
        indicates that an aggregate is present either in the targetlist or in
        HAVING. This penalized queries wherein HAVING did not contain the agg
        but the targetlist did (as demonstrated in the newly added test). This
        check is now replaced by contain_aggs_of_level(). Also, do not demote
        if HAVING contains volatile functions, since they need to be evaluated
        once per group.
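      Hedged examples of the two parts (table names are illustrative):

      ```
      -- Part 1: the aggregate alone would make EXISTS always true, but
      -- LIMIT 0 makes it always false, so no TRUE filter may be assumed.
      select * from a where exists (select max(i) from b limit 0);

      -- Part 2: the aggregate is only in the target list, so the HAVING
      -- qual can still be demoted to WHERE; query->hasAggs was too coarse.
      select * from a where exists
        (select max(i) from b group by j having j > 10);
      ```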
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
  18. 22 Mar 2018 (1 commit)
    • Fix wrong results for queries of type NOT (select <>) · a0a38df9
      Committed by Dhanashree Kashid
      For queries of the form "NOT (subselect)", the planner lost the NOT
      operator during the initial pull-up in pull_up_sublinks_qual_recurse(),
      which resulted in an incorrect filter and hence wrong results. Inside
      pull_up_sublinks_qual_recurse(), when a qual contains NOT, we check if
      the sublink type is any one of EXISTS, ANY or ALL, and invoke the
      appropriate sublink pull-up routine. For a qual of the form
      "NOT (SELECT <>)" the sublink type is EXPR; we recursed into the
      argument of NOT, at which point we lost the information about the NOT
      operator.

      This commit fixes the issue by returning the node unmodified when the
      argument of NOT is an EXPR sublink. The EXPR sublink later gets pulled
      up by preprocess_qual_conditions(), wherein pull_up_sublinks() is
      invoked again to handle sublinks in an expression.
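      A hedged example of the query shape (tables and the boolean column are
      illustrative):

      ```
      -- The sublink is an EXPR sublink (not EXISTS/ANY/ALL), so the NOT
      -- must be kept; the sublink is pulled up later by
      -- preprocess_qual_conditions().
      select * from foo where not (select bar.flag from bar limit 1);
      ```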
      Signed-off-by: Sambitesh Dash <sdash@pivotal.io>
  19. 09 Feb 2018 (1 commit)
    • Refactor the way Semi-Join plans are constructed. · d4ce0921
      Committed by Heikki Linnakangas
      This removes much of the GPDB machinery to handle "deduplication paths"
      within the planner. We will now use the upstream code to build JOIN_SEMI
      paths, as well as paths where the outer side of the join is first
      deduplicated (JOIN_UNIQUE_OUTER/INNER).
      
      The old style "join first and deduplicate later" plans can be better in
      some cases, however. To still be able to generate such a plan, add a new
      JOIN_DEDUP_SEMI join type, which is transformed into JOIN_INNER followed
      by the deduplication step after the join, during planning.
      
      This new way of constructing these plans is simpler, and allows removing
      a bunch of code, and reverting some more code to the way it is in the
      upstream.
      
      I'm not sure if this can generate the same plans that the old code could,
      in all cases. In particular, I think the old "late deduplication"
      mechanism could delay the deduplication further, all the way to the top of
      the join tree. I'm not sure when that would be useful, though, and the
      regression suite doesn't seem to contain any such cases (with EXPLAIN). Or
      maybe I misunderstood the old code. In any case, I think this is good
      enough.
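      A hedged sketch of the strategies now weighed for one semi-join query
      (the query itself is illustrative):

      ```
      -- JOIN_SEMI:          stop probing the inner side after a match
      -- JOIN_UNIQUE_INNER:  deduplicate t2.j first, then plain inner join
      -- JOIN_DEDUP_SEMI:    inner join first, deduplicate rows afterwards
      select * from t1 where t1.i in (select t2.j from t2);
      ```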
  20. 12 Jan 2018 (1 commit)
    • Fix Filter required properties for correlated subqueries in ORCA · 59abec44
      Committed by Shreedhar Hardikar
      This commit brings in ORCA changes that ensure that a Materialize node is not
      added under a Filter when its child contains outer references.  Otherwise, the
      subplan is not rescanned (because it is under a Material), producing wrong
      results. A rescan is necessary because it re-evaluates the subplan for
      each of the outer-referenced values.
      
      For example:
      
      ```
      SELECT * FROM A,B WHERE EXISTS (
        SELECT * FROM E WHERE E.j = A.j and B.i NOT IN (
          SELECT E.i FROM E WHERE E.i != 10));
      ```
      
      For the above query ORCA produces a plan with two nested subplans:
      
      ```
      Result
        Filter: (SubPlan 2)
        ->  Gather Motion 3:1
              ->  Nested Loop
                    Join Filter: true
                    ->  Broadcast Motion 3:3
                          ->  Table Scan on a
                    ->  Table Scan on b
        SubPlan 2
          ->  Result
                Filter: public.c.j = $0
                ->  Materialize
                      ->  Result
                            Filter: (SubPlan 1)
                            ->  Materialize
                                  ->  Gather Motion 3:1
                                        ->  Table Scan on c
                            SubPlan 1
                              ->  Materialize
                                    ->  Gather Motion 3:1
                                          ->  Table Scan on c
                                                Filter: i <> 10
      ```
      
      The Materialize node (on top of Filter with Subplan 1) has cdb_strict = true.
      The cdb_strict semantics dictate that when the Materialize is rescanned,
      instead of destroying its tuplestore, it resets the accessor pointer to the
      beginning and the subtree is NOT rescanned.
      So the entries from the first scan are returned for all future calls; i.e. the
      results depend on the first row output by the cross join. This causes wrong and
      non-deterministic results.
      
      Also, this commit reinstates this test in qp_correlated_query.sql. It also
      fixes another wrong result caused by the same issue. Note that the changes in
      rangefuncs_optimizer.out are because ORCA now no longer falls back for those
      queries. Instead it produces a plan which is executed on the master
      (instead of on the segments, as with the planner), which changes the
      error messages.
      
      Also bump ORCA version to 2.53.8.
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
  21. 11 Nov 2017 (1 commit)
    • Align simplify_EXISTS_query with upstream · c823e7c6
      Committed by Dhanashree Kashid
      This function had diverged a lot from upstream after the subselect merge.
      One of the main reasons is that upstream has a lot of restrictive checks
      which prevent pull-up of EXISTS/NOT EXISTS. GPDB handles those cases
      differently, producing a join, an initplan, or a one-time filter.
      
      The cases that GPDB handles and for which we have not ported the checks
      from upstream are as follows:
      
      - AGG with limit count with/without offset
      - HAVING clause without AGG
      - AGG without HAVING clause
      
      For other conditions, we bail out as upstream does. Hence we have added
      the checks for HAVING and aggregates differently inside
      simplify_EXISTS_query. The rest of the code is similar to upstream.
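      Hedged examples of the three GPDB-handled cases listed above (table
      names are illustrative):

      ```
      -- AGG with a limit count, with/without offset:
      select * from a where exists (select max(i) from b limit 1);

      -- HAVING clause without an AGG:
      select * from a where exists
        (select j from b group by j having j > 0);

      -- AGG without a HAVING clause:
      select * from a where exists (select count(*) from b);
      ```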
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
  22. 27 Sep 2017 (8 commits)
    • Disable flattening of IN/EXISTS sublinks inside outer joins · fb1448d0
      Committed by Tom Lane
      commit 07b9936a
      Author: Tom Lane <tgl@sss.pgh.pa.us>
      Date:   Fri Feb 27 23:30:29 2009 +0000
      
          Temporarily (I hope) disable flattening of IN/EXISTS sublinks that are within
          the ON clause of an outer join.  Doing so is semantically correct but results
          in de-optimizing queries that were structured to take advantage of the sublink
          style of execution, as seen in recent complaint from Kevin Grittner.  Since
          the user can get the other behavior by reorganizing his query, having the
          flattening happen automatically is just a convenience, and that doesn't
          justify breaking existing applications.  Eventually it would be nice to
          re-enable this, but that seems to require a significantly different approach
          to outer joins in the executor.
      
      Added relevant test case.
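      A hedged example of the pattern that stays unflattened (tables are
      illustrative):

      ```
      -- The EXISTS is inside the ON clause of an outer join; flattening it
      -- is semantically correct but can de-optimize sublink-style plans,
      -- so it is left as a sublink.
      select * from a left join b
        on a.i = b.i and exists (select 1 from c where c.j = b.j);
      ```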
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
    • Misc test answer file changes · 8bd49b1b
      Committed by Ekta Khanna
      Signed-off-by: Jemish Patel <jpatel@pivotal.io>
    • Remove is_simple_subquery() check in simplify_EXISTS_query() · 77f804f5
      Committed by Shreedhar Hardikar
      GPDB handles a lot of the cases that are restricted by
      is_simple_subquery(); the restrictions that are not handled are checked
      for separately in convert_EXISTS_sublink_to_join().
      
      While resolving cascading ICG failures, we also fixed the following:
      
      - initialize all the members of IncrementVarSublevelsUp_context
        properly.
      - remove incorrect assertions brought in from upstream. In GPDB, these
        cases are handled.
      - improve plans for NOT EXISTS sub-queries containing an aggregation
        without limits by creating a "false" one-time filter.
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
    • Remove dead code around JoinExpr::subqfromlist. · f16deabd
      Committed by Shreedhar Hardikar
      This was used to keep information about the subquery join tree for
      pulled-up sublinks for use later in deconstruct_recurse().  With the
      upstream subselect merge, a JoinExpr is constructed at pull-up time
      itself, so this is no longer needed, since the subquery join tree
      information is available in the constructed JoinExpr.
      
      Also with the merge, deconstruct_recurse() handles JOIN_SEMI JoinExprs.
      However, since GPDB differs from upstream by treating SEMI joins as
      INNER joins for internal join planning, this commit also updates
      inner_join_rels correctly for SEMI joins (see regression test).
      
      Also remove unused function declaration for not_null_inner_vars().
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
    • Do not commute inner/outer rels of JOIN_ANTI and JOIN_LASJ_NOTIN · 70be2285
      Committed by Shreedhar Hardikar
      This issue was discovered during the subselect merge, wherein the planner
      incorrectly commuted anti joins.
      `cdb_add_subquery_join_paths()` creates join paths for (rel1, rel2) and
      (rel2, rel1) for all join types, including JOIN_ANTI and JOIN_LASJ_NOTIN.
      This produces wrong results, since these joins are order-sensitive w.r.t.
      the inner and outer relations (see the new regression tests). So, do not
      add (rel2, rel1) for JOIN_ANTI and JOIN_LASJ_NOTIN.
      
      This commit also refactors cdb_add_subquery_join_paths() and
      make_join_rel() to make it easier to control the commuting.
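      A hedged illustration of the order sensitivity:

      ```
      -- An anti join returns the OUTER side's rows with no inner match;
      -- "rows of a unmatched in b" is a different question from "rows of
      -- b unmatched in a", so (rel1, rel2) must not be swapped.
      select * from a where not exists (select 1 from b where b.i = a.i);
      ```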
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
    • Change EXPLAIN output so that subplans and initplans · acfd5f4d
      Committed by Tom Lane
      commit fbcce080
      Author: Tom Lane <tgl@sss.pgh.pa.us>
      Date:   Sun Apr 5 19:59:40 2009 +0000
      
          Change EXPLAIN output so that subplans and initplans (particularly CTEs)
          are individually labeled, rather than just grouped under an "InitPlan"
          or "SubPlan" heading.  This in turn makes it possible for decompilation of
          a subplan reference to usefully identify which subplan it's referencing.
          I also made InitPlans identify which parameter symbol(s) they compute,
          so that references to those parameters elsewhere in the plan tree can
          be connected to the initplan that will be executed.  Per a gripe from
          Robert Haas about EXPLAIN output of a WITH query being inadequate,
          plus some longstanding pet peeves of my own.
    • Updated test answer files after merging dc9cc88 · 745bcd2b
      Committed by Dhanashree Kashid
      After the introduction of the `placeholder` mechanism,
      we see some changes in the plans.
      These are all good changes, in that we remove the
      redundant `NotIn_SUBQUERY` subquery scan nodes:
      
      Before
      ```
      ...
      ->  Broadcast Motion 3:3  (slice1; segments: 3)
      	->  Subquery Scan "NotIn_SUBQUERY"
      		->  Seq Scan on subselect_tbl
      			Filter: f3 IS NOT NULL
      ```
      After
      ```
      ...
      ->  Broadcast Motion 3:3  (slice1; segments: 3)
      	 ->  Seq Scan on subselect_tbl
      	 	 Filter: f3 IS NOT NULL
      ```
      
      The new plans have a better cost, since we will not be executing a
      subquery scan for each tuple coming from its outer side.

      We are now consistent with the plans produced for the same queries
      with IN instead of NOT IN.
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
    • Update subselect EXPLAIN tests · e0415acc
      Committed by Shreedhar Hardikar
      This commit updates test answer files after merging e006a24a, 19e34b62 and
      e549722a. As a result of these merges, we produce inferior plans for
      which we have added FIX_ME's appropriately. They will be addressed in
      later commits.
  23. 12 Sep 2017 (1 commit)
  24. 30 Jun 2017 (1 commit)
  25. 14 Feb 2017 (1 commit)
    • Remove redundant queries from schema_topology. · b0690a83
      Committed by Heikki Linnakangas
      Most of these queries were already in the qp_correlated_query test. A few
      were not; those were moved instead of outright removed.
      
      This removed the last queries from schema_topology whose output was
      different with ORCA than without ORCA, so remove the ORCA-specific expected
      output file.
  26. 15 Dec 2016 (3 commits)
    • Cleanup ICG qp_correlated_query · 8db2ce37
      Committed by Xin Zhang
      - Remove RESET ALL so that it can actually test with optimizer=on
      - Remove duplicated population of tables A, B, C
      - Rename tables touched by DML to avoid duplicated setup
      - Fix subquery to verify the correct result rather than verifying the error message
      - Remove query with non-deterministic result

      Now, with GPORCA, all SQL statements pass without ERROR. The planner
      still has a few failures due to multi-level correlated subqueries, which
      is expected.
    • Revert "Cleanup ICG qp_correlated_query" · 3b4fd28c
      Committed by Xin Zhang
      This reverts commit 6f2e35bc.
      
      The pipeline detected a failure when codegen is enabled with GPORCA.
    • Cleanup ICG qp_correlated_query · 6f2e35bc
      Committed by Xin Zhang
      - Remove RESET ALL so that it can actually test with optimizer=on
      - Remove duplicated population of tables A, B, C
      - Rename tables touched by DML to avoid duplicated setup
      - Fix subquery to verify the correct result rather than verifying the error message

      Now, with GPORCA, all SQL statements pass without ERROR. The planner
      still has a few failures due to multi-level correlated subqueries, which
      is expected.
  27. 06 Dec 2016 (1 commit)
    • Fix for issue with EXISTS subquery with CTE · 5838bc1b
      Committed by Xin Zhang
      When we have a correlated subquery with EXISTS and a CTE, the planner
      produced a wrong plan:
      
      ```
      pivotal=# explain with t as (select 1) select * from foo where exists (select * from bar where foo.a = 'a' and foo.b = 'b');
                                                   QUERY PLAN
      ----------------------------------------------------------------------------------------------------
       Gather Motion 3:1  (slice3; segments: 3)  (cost=0.01..1522.03 rows=5055210000 width=16)
         ->  Nested Loop  (cost=0.01..1522.03 rows=1685070000 width=16)
               ->  Broadcast Motion 1:3  (slice2; segments: 1)  (cost=0.01..0.03 rows=1 width=0)
                     ->  Limit  (cost=0.01..0.01 rows=1 width=0)
                           ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.01..0.03 rows=1 width=0)
                                 ->  Limit  (cost=0.01..0.01 rows=1 width=0)
                                       ->  Result  (cost=0.01..811.00 rows=23700 width=0)
                                             One-Time Filter: $0 = 'a'::bpchar AND $1 = 'b'::bpchar
                                             ->  Seq Scan on bar  (cost=0.01..811.00 rows=23700 width=0)
               ->  Seq Scan on foo  (cost=0.00..811.00 rows=23700 width=16)
       Optimizer status: legacy query optimizer
      (11 rows)
      ```
      
      This failed during execution because $0 is referenced across the slices.
      
      Root Cause:
      That's because the planner produced a plan with `$0`, a.k.a. a `param`,
      but without a `subplan`. The `param` is created by `replace_outer_var()`,
      when the planner detects a query referring to relations from its
      outer/parent query. Such a `var` is created when removing the `sublink`
      in the `convert_EXISTS_to_join()` function. In that function, when
      handling the `EXISTS` query, we convert the `EXISTS_sublink` to a
      `subquery RTE` (and expect it to get pulled up later by
      `pull_up_subquery()`). However, the subquery cannot be pulled up by
      `pull_up_subquery()`, since it is not a simple subquery
      (`is_simple_subquery()` returns false because of the CTE in this case).
      So the `sublink` got removed, but it could not produce a `subplan`
      (which would have been a valid option), and the `var` was left behind as
      an outer reference and then converted to a `param`, which blew up during
      query execution. There is a mismatch between the conditions in
      `convert_EXISTS_to_join()` and `pull_up_subquery()` about whether this
      subquery can be pulled up.

      The fix is to reuse the `is_simple_subquery()` check in
      `convert_EXISTS_to_join()`, so that it is consistent with
      `pull_up_subquery()` on whether the subquery can be pulled up or not.

      The correct plan after the fix is:
      
      ```
      pivotal=# explain with t as (select 1) select * from foo where exists (select * from bar where foo.a = 'a' and foo.b = 'b');
                                                            QUERY PLAN
      ----------------------------------------------------------------------------------------------------------------------
       Gather Motion 3:1  (slice2; segments: 3)  (cost=0.00..1977.50 rows=35550 width=16)
         ->  Seq Scan on foo  (cost=0.00..1977.50 rows=11850 width=16)
               Filter: (subplan)
               SubPlan 1
                 ->  Result  (cost=0.01..811.00 rows=23700 width=16)
                       One-Time Filter: $0 = 'a'::bpchar AND $1 = 'b'::bpchar
                       ->  Result  (cost=882.11..1593.11 rows=23700 width=16)
                             ->  Materialize  (cost=882.11..1593.11 rows=23700 width=16)
                                   ->  Broadcast Motion 3:3  (slice1; segments: 3)  (cost=0.01..811.00 rows=23700 width=16)
                                         ->  Seq Scan on bar  (cost=0.01..811.00 rows=23700 width=16)
       Optimizer status: legacy query optimizer
      (11 rows)
      ```
      Signed-off-by: Dhanashree Kashid <dkashid@pivotal.io>
  28. 18 Nov 2016 (2 commits)
    • Move some correlated test queries from bugbuster to qp_correlated_query. · b1af58aa
      Committed by Heikki Linnakangas
      We have similar tests there, and have the same test tables set up
      already.
    • Use proper error code for errors. · 0bf31cd6
      Committed by Heikki Linnakangas
      Attach a suitable error code to many errors that were previously reported
      as "internal errors". GPDB's elog.c prints a source file name and line
      number for any internal error, which is a bit ugly for errors that are
      not in fact unexpected internal errors, but user-facing errors that
      happen as a result of e.g. an invalid query.
      
      To make sure we don't accumulate more of these, adjust the regression tests
      to not ignore the source file and line number in error messages. There are
      a few exceptions, which are listed explicitly.