    Fix confusion with distribution keys of queries with FULL JOINs. · a25e2cd6
    Committed by Heikki Linnakangas
    There was some confusion about how NULLs are distributed when CdbPathLocus
    is of Hashed or HashedOJ type. The comment in cdbpathlocus.h suggested
    that NULLs can be on any segment. But the rest of the code assumed that
    that's true only for HashedOJ, and that for Hashed, all NULLs are stored
    on a particular segment. There was a comment in cdbgroup.c that said "Or
    would HashedOJ ok, too?"; the answer to that is "No!". Given the comment
    in cdbpathlocus.h, I'm not surprised that the author was not very sure
    about that. Clarify the comments in cdbpathlocus.h and cdbgroup.c on that.
    
    There were a few cases where we got that actively wrong. The repartitionPlan()
    function is used to inject a Redistribute Motion into queries used for
    CREATE TABLE AS and INSERT, if the "current" locus doesn't match the target
    table's policy. It did not check for HashedOJ. Because of that, if the
    query contained FULL JOINs, NULL values might end up on all segments. Code
    elsewhere, particularly in cdbgroup.c, assumes that all NULLs in a table
    are stored on a single segment, identified by the cdbhash value of a NULL
    datum. Fix that, by adding a check for HashedOJ in repartitionPlan(), and
    forcing a Redistribute Motion.
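
    For illustration only, the decision described above can be sketched as a
    small standalone model. This is not the actual repartitionPlan() code: the
    LocusKind enum, the Locus struct, its keys_match_target field and the
    needs_redistribute_motion() function are all hypothetical names invented
    for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the locus kinds discussed above. */
typedef enum
{
    LOCUS_HASHED,    /* hashed on the key; all NULL keys live on one segment */
    LOCUS_HASHED_OJ, /* outer-join result; NULL keys may be on any segment */
    LOCUS_OTHER      /* anything else */
} LocusKind;

/* Hypothetical summary of the "current" locus of the query. */
typedef struct
{
    LocusKind kind;
    bool      keys_match_target; /* keys agree with the target table's policy */
} Locus;

/*
 * Returns true if a Redistribute Motion must be added before the rows are
 * written to the target table.
 */
bool
needs_redistribute_motion(const Locus *locus)
{
    assert(locus != NULL);

    /*
     * A HashedOJ locus never qualifies, even if the keys match: NULLs could
     * be on any segment, but the table must keep them on a single one.
     */
    if (locus->kind == LOCUS_HASHED_OJ)
        return true;

    /* A Hashed locus whose keys match the policy is already in place. */
    if (locus->kind == LOCUS_HASHED && locus->keys_match_target)
        return false;

    return true;
}
```

    In this model, a HashedOJ locus forces the motion even when its keys match
    the target policy, which is the case that was previously missed.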
    
    CREATE TABLE AS had a similar problem, in the code to decide which
    distribution key to use, if the user didn't specify DISTRIBUTED BY
    explicitly. The default behaviour is to choose a distribution key that
    matches the distribution of the query, so that we can avoid adding an
    extra Redistribute Motion. With repartitionPlan() fixed, this is no longer a
    correctness problem, but choosing the key based on a HashedOJ locus brings
    no performance benefit, because we'd need a Redistribute Motion
    anyway. So modify the code that chooses the CTAS distribution key to
    ignore HashedOJ.
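
    A rough sketch of that choice, again as a standalone model rather than the
    real code: QueryLocusKind, CtasKeyChoice and choose_ctas_key() are
    hypothetical names, and the fallback case only stands in for whatever
    default the server applies when the query's locus cannot be reused.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical locus kinds of the query feeding CREATE TABLE AS. */
typedef enum
{
    QLOCUS_HASHED,
    QLOCUS_HASHED_OJ,
    QLOCUS_OTHER
} QueryLocusKind;

/* Hypothetical outcomes of the distribution key decision. */
typedef enum
{
    CTAS_KEY_USER,       /* user wrote DISTRIBUTED BY explicitly */
    CTAS_KEY_FROM_QUERY, /* reuse the query's keys, no extra motion needed */
    CTAS_KEY_FALLBACK    /* do not derive the key from the query's locus */
} CtasKeyChoice;

CtasKeyChoice
choose_ctas_key(bool user_gave_distributed_by, QueryLocusKind kind)
{
    /* Defensive check that the caller passed a known locus kind. */
    assert(kind == QLOCUS_HASHED || kind == QLOCUS_HASHED_OJ ||
           kind == QLOCUS_OTHER);

    /* An explicit DISTRIBUTED BY always wins. */
    if (user_gave_distributed_by)
        return CTAS_KEY_USER;

    /* Only a plain Hashed locus lets us skip the Redistribute Motion. */
    if (kind == QLOCUS_HASHED)
        return CTAS_KEY_FROM_QUERY;

    /*
     * HashedOJ is deliberately ignored: a Redistribute Motion is needed
     * anyway, so matching its keys would gain nothing.
     */
    return CTAS_KEY_FALLBACK;
}
```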
    
    While we're at it, refactor the code to choose the CTAS distribution key,
    by moving it to a separate function. It had become ridiculously deeply
    indented.
    
    Fixes https://github.com/greenplum-db/gpdb/issues/6154, and adds tests.
    Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
gpdist.sql 52.7 KB