• H
    Fix bug that partition selector may generate incomplete results for NLJ · befc063b
    Haisheng Yuan 提交于
    By setting material->cdb_strict to true if the outer child of material is a
    partition selector.  Before this patch, the following query return incomplete
    results quite often:
    
    create table t(id int, a int);
    create table pt(id int, b int) DISTRIBUTED BY (id)
    PARTITION BY RANGE(b) (START (0) END (5) EVERY (1));
    insert into t select i, i from generate_series(0,4) i;
    insert into pt select i, i from generate_series(0,4) i;
    analyze t;
    analyze pt;
    set enable_hashjoin=off;
    set enable_nestloop=on;
    select * from t, pt where a = b;
    
    In 3 segments cluster, it may return different result shown below:
    hyuan=# select * from t, pt where a = b;
     id | a | id | b
    ----+---+----+---
      0 | 0 |  0 | 0
      1 | 1 |  1 | 1
      2 | 2 |  2 | 2
    (3 rows)
    
    hyuan=# select * from t, pt where a = b;
     id | a | id | b
    ----+---+----+---
      3 | 3 |  3 | 3
      4 | 4 |  4 | 4
    (2 rows)
    
    hyuan=# select * from t, pt where a = b;
     id | a | id | b
    ----+---+----+---
      3 | 3 |  3 | 3
      4 | 4 |  4 | 4
      0 | 0 |  0 | 0
      1 | 1 |  1 | 1
      2 | 2 |  2 | 2
    (5 rows)
    
    But only the last one is correct result.
    
    The plan for above query is:
    -------------------------------------------------------------------
     Gather Motion 3:1  (slice2; segments: 3)
       ->  Nested Loop  (cost=2.27..9.00 rows=2 width=16)
             Join Filter: t.a = public.pt.b
             ->  Append  (cost=0.00..5.05 rows=2 width=8)
                   ->  Result  (cost=0.00..1.01 rows=1 width=8)
                         One-Time Filter: PartSelected
                         ->  Seq Scan on pt_1_prt_1 pt
                   ->  Result  (cost=0.00..1.01 rows=1 width=8)
                         One-Time Filter: PartSelected
                         ->  Seq Scan on pt_1_prt_2 pt
                   ->  Result  (cost=0.00..1.01 rows=1 width=8)
                         One-Time Filter: PartSelected
                         ->  Seq Scan on pt_1_prt_3 pt
                   ->  Result  (cost=0.00..1.01 rows=1 width=8)
                         One-Time Filter: PartSelected
                         ->  Seq Scan on pt_1_prt_4 pt
                   ->  Result  (cost=0.00..1.01 rows=1 width=8)
                         One-Time Filter: PartSelected
                         ->  Seq Scan on pt_1_prt_5 pt
             ->  Materialize  (cost=2.27..2.42 rows=5 width=8)
                   ->  Partition Selector for pt (dynamic scan id: 1)
                         Filter: t.a
                         ->  Broadcast Motion 3:3  (slice1; segments: 3)
                               ->  Seq Scan on t
     Settings:  enable_hashjoin=off; enable_nestloop=on
     Optimizer status: legacy query optimizer
    
    The data distribution for table t and pt in 3 segments environment is:
    hyuan=# select gp_segment_id, * from t;
     gp_segment_id | id | a
    ---------------+----+---
                 1 |  3 | 3
                 1 |  4 | 4
                 0 |  0 | 0
                 0 |  1 | 1
                 0 |  2 | 2
    (5 rows)
    
    hyuan=# select gp_segment_id, * from pt;
     gp_segment_id | id | b
    ---------------+----+---
                 0 |  0 | 0
                 0 |  1 | 1
                 0 |  2 | 2
                 1 |  3 | 3
                 1 |  4 | 4
    (5 rows)
    
    Tuples {0,1,2} of t and pt are in segment 0, tuples {3,4} of t and pt are in
    segment 1. Segment 2 has no data for t and pt.
    
    In this query, planner decides to prefetch inner child to avoid deadlock hazard
    and the cdb_strict of Material is set to false. Let's see how the query goes in
    segment 0.
    
    1. The inner child of nestloop join, material fetch one tuple from partition
    selector and then output it. Let's assume the output order of partition
    selector/broadcast motion is {0,1,2,3,4}. So the 1st tuple output by partition
    selector and material is 0.
    
    2. The partition selector decides that the selected partition for table pt is
    pt_1_prt_1, because t.a = pt.b = 0 in this partition. The outer child of
    nestloop join, Append, fetches 1 tuple from that partition, with pt.b=0.
    
    3. Nestloop join continues to execute Material of inner child to fetch other
    tuples, 1,2,3,4, but all these tuples from t don't match the join condition,
    because pt=0. No more tuples output by nestloop join for this round of loop.
    But all the partition of pt are matched and selected.
    
    4. Nestloop join fetch another tuple from pt_1_prt_2, which is 1, which can
    match with a tuple from inner child, output 1. And then fetch tuple from
    pt_1_prt_3, which is 2, matched, output 2. But pt_1_prt_4 and pt_1_prt_5 have
    no data in this segment, so the output ends with {0,1,2} in segment 0.
    
    But in segment 1, let's still assume the tuple output order of partition
    selector/broadcast motion is {0,1,2,3,4}. Since the first output tuple from
    inner child is 0, only pt_1_prt_1 is selected. But when nestloop join tries to
    fetch tuple from outer child, which in fact fetch from pt_1_prt_1 in this case,
    it returns no tuple, because pt_1_prt_1 is empty in this segment. So nestloop
    join decides that since it can't fetch any tuple from outer child, it must be
    empty, no need to execute the join, return NULL and finish it directly.
    
    Segment 2 has no data for t and pt, no tuple is output. So the final result
    gathered on master segment is {0,1,2} in this case. But if the broadcast motion
    output tuple order is {3,4,0,1,2}, the final result may be {3,4}. If the
    braodcast motion output tuple order on segment 0 is {0,1,2,3,4}, and on segment
    1 is {3,4,0,1,2}, then the final result on master is {0,1,2,3,4}, which is
    correct.
    
    The bug is fixed by setting cdb_strict of material to true when planner
    generates partition selector for the inner child of nestloop join, the material
    will fetch all the tuples from child and materialize them before emitting any
    tuple.  Thus we can make sure the partitions of pt are selected correctly.
    
    RCA by Lirong Jian <jianlirong@gmail.com>
    befc063b
dpe_optimizer.out 116.1 KB