Relax assertions in setop planning to accept execution on a particular QE.

In setop planning, we had assertions that checked that FLOW_SINGLETON flows had segindex=0. I'm not sure what segindex 0 means; is it "any"? In any case, it's possible to have an input that resides on a single QE other than segment 0, as evidenced by the new test query. Fixes https://github.com/greenplum-db/gpdb/issues/3807
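
For illustration only, here is a minimal standalone sketch of the difference between the old and the relaxed condition. The SketchFlow type and both helper functions are hypothetical, simplified stand-ins invented for this example; they are not the planner's real Flow struct or any function in cdbsetop.c.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the planner's flow types. */
typedef enum
{
	SKETCH_FLOW_SINGLETON,
	SKETCH_FLOW_PARTITIONED
} SketchFlowType;

/* Hypothetical stand-in for a plan's flow: its type and, for a
 * singleton flow, the segment the data is focused on. */
typedef struct SketchFlow
{
	SketchFlowType flotype;
	int			segindex;
} SketchFlow;

/* Old condition: a singleton flow was accepted only on segindex 0. */
static bool
old_condition_holds(const SketchFlow *flow)
{
	return flow->flotype == SKETCH_FLOW_SINGLETON && flow->segindex == 0;
}

/* Relaxed condition: any singleton flow is accepted, whichever
 * segment it is focused on. */
static bool
relaxed_condition_holds(const SketchFlow *flow)
{
	return flow->flotype == SKETCH_FLOW_SINGLETON;
}
```

With a singleton flow focused on, say, segindex 2, the old condition is false (so the old Assert would have tripped in an assert-enabled build), while the relaxed condition is true.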

73fd01ca · Heikki Linnakangas · 20f2c007
3 changed files
--- a/src/backend/cdb/cdbsetop.c
+++ b/src/backend/cdb/cdbsetop.c
@@ -169,7 +169,15 @@ adjust_setop_arguments(PlannerInfo *root, List *planlist, GpSetOpType setop_type
 						break;

 					case CdbLocusType_SingleQE:
-						Assert(subplanflow->flotype == FLOW_SINGLETON && subplanflow->segindex == 0);
+						Assert(subplanflow->flotype == FLOW_SINGLETON);
+
+						/*
+						 * The input was focused on a single QE, but we need it in the QD.
+						 * It's bit silly to add a Motion to just move the whole result from
+						 * single QE to QD, it would be better to produce the result in the
+						 * QD in the first place, and avoid the Motion. But it's too late
+						 * to modify the subplan.
+						 */
 						adjusted_plan = (Plan *) make_motion_gather_to_QD(root, subplan, NULL);
 						break;

@@ -328,7 +336,7 @@ make_motion_gather(PlannerInfo *root, Plan *subplan, int segindex, List *sortPat

 	Assert(subplan->flow != NULL);
 	Assert(subplan->flow->flotype == FLOW_PARTITIONED ||
-		   (subplan->flow->flotype == FLOW_SINGLETON && subplan->flow->segindex == 0));
+		   subplan->flow->flotype == FLOW_SINGLETON);

 	if (sortPathKeys)
 	{

--- a/src/test/regress/expected/union_gp.out
+++ b/src/test/regress/expected/union_gp.out
@@ -211,6 +211,48 @@ select distinct a from (select  distinct 'A' from (select 'C' from (select disti
 B
 (2 rows)

+-- Test case where input to one branch of UNION resides on a single segment, and another on the QE.
+-- The external table resides on QD, and the LIMIT on the test1 table forces the plan to be focused
+-- on a single QE.
+--
+CREATE TABLE test1 (id int);
+NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
+HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
+insert into test1 values (1);
+CREATE EXTERNAL WEB TABLE test2 (id int) EXECUTE 'echo 2' ON MASTER FORMAT 'csv';
+(SELECT 'test1' as branch, id FROM test1 LIMIT 1)
+union
+(SELECT 'test2' as branch, id FROM test2);
+ branch | id 
+--------+----
+ test1  |  1
+ test2  |  2
+(2 rows)
+
+-- The plan you currently get for this has a Motion to move the data from the single QE to
+-- QD. That's a bit silly, it would probably make more sense to pull all the data to the QD
+-- in the first place, and execute the Limit in the QD, to avoid the extra Motion. But this
+-- is hopefully a pretty rare case.
+explain (SELECT 'test1' as branch, id FROM test1 LIMIT 1)
+union
+(SELECT 'test2' as branch, id FROM test2);
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
+ Unique  (cost=1.06..1.07 rows=2 width=4)
+   Group Key: "outer".branch, "*SELECT* 1".id
+   ->  Sort  (cost=1.06..1.06 rows=2 width=4)
+         Sort Key (Distinct): "outer".branch, "*SELECT* 1".id
+         ->  Append  (cost=0.00..1.05 rows=2 width=4)
+               ->  Gather Motion 1:1  (slice2; segments: 1)  (cost=0.00..1.04 rows=1 width=4)
+                     ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..1.04 rows=1 width=4)
+                           ->  Limit  (cost=0.00..1.03 rows=1 width=4)
+                                 ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1.03 rows=1 width=4)
+                                       ->  Limit  (cost=0.00..1.01 rows=1 width=4)
+                                             ->  Seq Scan on test1  (cost=0.00..1.01 rows=1 width=4)
+               ->  External Scan on test2  (cost=0.00..0.00 rows=1 width=4)
+ Optimizer: legacy query optimizer
+(13 rows)
+
 --
 -- Setup
 --

--- a/src/test/regress/sql/union_gp.sql
+++ b/src/test/regress/sql/union_gp.sql
@@ -61,6 +61,26 @@ select distinct a from (select  'A' from (select distinct 'C' ) as bar union sel
 select distinct a from (select  distinct 'A' from (select distinct 'C' ) as bar union select distinct 'B') as foo(a);
 select distinct a from (select  distinct 'A' from (select 'C' from (select distinct 'D') as bar1 ) as bar union select distinct 'B') as foo(a);

+-- Test case where input to one branch of UNION resides on a single segment, and another on the QE.
+-- The external table resides on QD, and the LIMIT on the test1 table forces the plan to be focused
+-- on a single QE.
+--
+CREATE TABLE test1 (id int);
+insert into test1 values (1);
+CREATE EXTERNAL WEB TABLE test2 (id int) EXECUTE 'echo 2' ON MASTER FORMAT 'csv';
+
+(SELECT 'test1' as branch, id FROM test1 LIMIT 1)
+union
+(SELECT 'test2' as branch, id FROM test2);
+
+-- The plan you currently get for this has a Motion to move the data from the single QE to
+-- QD. That's a bit silly, it would probably make more sense to pull all the data to the QD
+-- in the first place, and execute the Limit in the QD, to avoid the extra Motion. But this
+-- is hopefully a pretty rare case.
+explain (SELECT 'test1' as branch, id FROM test1 LIMIT 1)
+union
+(SELECT 'test2' as branch, id FROM test2);
+
 --
 -- Setup
 --