1. 12 May, 2020 1 commit
    • Limit DPE stats to groups with unresolved partition selectors (#9988) · dddd8366
      Hans Zeller committed
      DPE stats are computed when we have a dynamic partition selector that's
      applied on another child of a join. The current code continues to use
      DPE stats even for the common ancestor join and nodes above it, but
      those nodes aren't affected by the partition selector.
      
      Regular Memo groups pick the best expression among several to compute
      stats, which makes row count estimates more reliable. We don't have
      that luxury with DPE stats, therefore they are often less reliable.
      
      By minimizing the places where we use DPE stats, we should overall get
      more reliable row count estimates with DPE stats enabled.
      
      The fix also ignores DPE stats with row counts greater than the group
      stats. Partition selectors eliminate certain partitions, therefore
      it is impossible for them to increase the row count.
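
      The selection rule described above can be sketched as follows. This is
      a minimal illustration, not ORCA's code: the `Stats` struct, the
      `ChooseStats` function, and the `has_unresolved_part_selector` flag are
      hypothetical names introduced here.

      ```cpp
      #include <cassert>

      // Hypothetical, simplified stand-in for a statistics object.
      struct Stats {
          double rows;
      };

      // Picks the stats to use for a Memo group. DPE stats are considered only
      // while the group still has an unresolved partition selector, and they
      // are ignored when they claim more rows than the regular group stats,
      // because a partition selector can only eliminate rows.
      inline Stats ChooseStats(const Stats &group_stats, const Stats &dpe_stats,
                               bool has_unresolved_part_selector) {
          assert(group_stats.rows >= 0 && dpe_stats.rows >= 0);
          if (!has_unresolved_part_selector) {
              return group_stats;  // common ancestor join and the nodes above it
          }
          if (dpe_stats.rows > group_stats.rows) {
              return group_stats;  // implausible DPE estimate
          }
          return dpe_stats;
      }
      ```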
  2. 11 April, 2020 1 commit
    • Handle opfamilies/opclasses for distribution in ORCA · 5e04eb14
      Shreedhar Hardikar committed
      GPDB 6 introduced a mechanism to distribute tables on columns
      using a custom hash opclass, instead of using cdbhash. Before this
      commit, ORCA would ignore the distribution opclass, relying on the
      translator to allow only queries in which all tables were distributed
      by either their default or default "legacy" opclasses.
      
      However, in case of tables distributed by legacy or default opclasses,
      but joined using a non-default opclass operator, ORCA would produce an
      incorrect plan, giving wrong results.
      
      This commit fixes that bug by introducing support for distributed tables
      using non-default opfamilies/opclasses. But, even though the support is
      implemented, it is not fully enabled at this time. The logic to fall back
      to planner when the plan contains tables distributed with non-default
      non-legacy opclasses remains. Our intention is to support it fully in
      the future.
      
      How does this work?
      For hash joins, capture the opfamily of each hash joinable operator. Use
      that to create hash distribution spec requests for either side of the
      join.  Scan operators derive a distribution spec based on opfamily
      (corresponding to the opclass) of each distribution column.  If there is
      a mismatch between distribution spec requested/derived, add a Motion
      Redistribute node using the distribution function from the requested
      hash opfamily.
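
      The matching step can be illustrated with a simplified sketch. It
      assumes a hashed distribution spec is just an ordered list of
      (column, opfamily) pairs; `HashKey`, `HashedSpec`, `Satisfies` and
      `NeedsRedistribute` are hypothetical names, and the real
      CDistributionSpecHashed logic is more involved.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>
      #include <vector>

      using Oid = unsigned int;  // stand-in for a catalog object id

      // One hash distribution key: a column plus the hash opfamily used for it.
      struct HashKey {
          std::string column;
          Oid opfamily;
      };

      using HashedSpec = std::vector<HashKey>;

      // A derived spec satisfies a request only if columns AND opfamilies match.
      inline bool Satisfies(const HashedSpec &derived, const HashedSpec &required) {
          if (derived.size() != required.size()) return false;
          for (std::size_t i = 0; i < derived.size(); ++i) {
              if (derived[i].column != required[i].column) return false;
              if (derived[i].opfamily != required[i].opfamily) return false;
          }
          return true;
      }

      // On a mismatch, a Redistribute Motion hashing with the requested
      // opfamily would have to be placed on top of the child.
      inline bool NeedsRedistribute(const HashedSpec &derived,
                                    const HashedSpec &required) {
          assert(!required.empty());  // a hashed request names at least one key
          return !Satisfies(derived, required);
      }
      ```

      In this model, a scan distributed on column `a` with one opfamily does
      not satisfy a join request on `a` with a different opfamily, even
      though the column is the same.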
      
      The commit consists of several sub-sections:
      - Capture distr opfamilies in CMDRelation and related classes
      
        For each distribution column of the relation, track the opfamily of
        the opclass used in the DISTRIBUTED BY clause. This information is then
        relayed to CTableDescriptor & CPhysicalScan.
      
        Also support this in other CMDRelation subclasses: CMDRelationCTAS
        (via CLogicalCTAS) & CMDRelationExternalGPDB.
      
      - Capture hash opfamily of CMDScalarOp using gpdb::GetCompatibleHashOpFamily()
        This is needed to determine distribution spec requests from joins.
      
      - Track hash opfamilies of join predicates
      
        This commit extends the caching of join keys in Hash/Merge joins by
        also caching the corresponding hash opfamilies of the '=' operators
        used in those predicates.
      
      - Track opfamily in CDistributionSpecHashed.
      
        This commit also constructs CDistributionSpecHashed with opfamily
        information that was previously cached in CScalarGroup in the case of
        HashJoins.
        It also includes the compatibility checks that reject distribution
        specs with mismatched opfamilies in order to produce Redistribute
        motions.
      
      - Capture default distribution (hash) opfamily in CMDType
      - Handle legacy opfamilies in CMDScalarOp & CMDType
      - Handle opfamilies in HashExprList Expr->DXL translation
      
      ORCA-side notes:
      1. To ensure correctness, equivalence classes can only be determined over
         a specific opfamily. For example, the expression `a = b` implies a &
         b belong to an equiv class only for the opfamily `=` belongs to.
         Otherwise expression `b |=| c` can be used to imply a & c belong to
         the same equiv class, which is incorrect, as the opfamilies of `=` and
         `|=|` differ.
         For this commit, determine equiv classes only for default opfamilies.
         This will ensure correct behavior for the majority of cases.
      2. This commit does *not* implement similar features for merge joins.
         That is left for future work.
      3. This commit introduces two traceflags:
         - EopttraceConsiderOpfamiliesForDistribution: If this is off,
           opfamilies are ignored and set to NULL. This mimics behavior before
           this PR. Ctest MDPs are run this way.
         - EopttraceUseLegacyOpfamilies: Set if ANY distribution col in the
           query uses a legacy opfamily/opclass. MDCache getters will then
           return legacy opfamilies instead of the default opfamilies for all
           queries.
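
      Note 1 can be illustrated with a small union-find sketch that merges
      columns only for predicates whose operator belongs to the default
      opfamily. The `EquivClasses` class and the numeric opfamily ids used
      with it are hypothetical and do not reflect ORCA's actual data
      structures.

      ```cpp
      #include <cassert>
      #include <map>
      #include <string>

      using Oid = unsigned int;  // stand-in for a catalog object id

      class EquivClasses {
      public:
          explicit EquivClasses(Oid default_opfamily)
              : default_opfamily_(default_opfamily) {
              assert(default_opfamily != 0);  // 0 plays the role of an invalid OID
          }

          // Records `lhs <op> rhs`. The columns are merged only when the
          // operator belongs to the default opfamily; others are skipped.
          void AddEquality(const std::string &lhs, const std::string &rhs,
                           Oid opfamily) {
              if (opfamily != default_opfamily_) return;
              const std::string root_l = Find(lhs);
              const std::string root_r = Find(rhs);
              parent_[root_l] = root_r;
          }

          bool SameClass(const std::string &a, const std::string &b) {
              return Find(a) == Find(b);
          }

      private:
          // Returns the representative of col, registering col if unseen.
          std::string Find(const std::string &col) {
              auto it = parent_.find(col);
              if (it == parent_.end()) {
                  parent_[col] = col;
                  return col;
              }
              if (it->second == col) return col;
              const std::string next = it->second;
              const std::string root = Find(next);
              parent_[col] = root;  // path compression
              return root;
          }

          Oid default_opfamily_;
          std::map<std::string, std::string> parent_;
      };
      ```

      With opfamily 1 as the default, adding `a = b` (opfamily 1) and
      `b |=| c` (opfamily 2) leaves a and b in one class while c stays on
      its own.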
      
      What new information is captured from GPDB?
      1. Opfamily of each distribution column in CMDRelation,
         CMDRelationCtasGPDB & CMDRelationExternalGPDB
      2. Compatible hash opfamily of each CMDScalarOp using
         gpdb::GetCompatibleHashOpFamily()
      3. Default distribution (hash) opfamily of every type.
         This may be NULL for some types. Needed for certain operators (e.g.
         HashAgg) that request distribution spec that cannot be inferred in
         any other way: cannot derive it, cannot get it from any scalar op
         etc. See GetDefaultDistributionOpfamilyForType()
      4. Legacy opfamilies for types & scalar operators.
         Needed for supporting tables distributed by legacy opclasses.
      
      Other GPDB side changes:
      
      1. HashExprList no longer carries the type of the expression (it is
         inferred from the expr instead). However, it now carries the hash
         opfamily to use when deriving the distribution hash function. To
         maintain compatibility with older versions, the opfamily is used only
         if EopttraceConsiderOpfamiliesForDistribution is set; otherwise, the
         default hash distribution function of the type of the expr is used.
      2. Don't worry about left & right types in get_compatible_hash_opfamily()
      3. Consider COERCION_PATH_RELABELTYPE as binary coercible for ORCA.
      4. EopttraceUseLegacyOpfamilies is set if any table is distributed by a
         legacy opclass.
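
      The hash function choice in item 1 can be sketched as below. The
      `Catalog` lookup tables, the `HashExpr` struct and `PickHashFunction`
      are hypothetical stand-ins for the real catalog lookups, and the
      boolean parameter models whether
      EopttraceConsiderOpfamiliesForDistribution is set.

      ```cpp
      #include <cassert>
      #include <map>
      #include <utility>

      using Oid = unsigned int;  // stand-in for a catalog object id

      // Hypothetical in-memory catalog: maps to hash function ids.
      struct Catalog {
          std::map<std::pair<Oid, Oid>, Oid> hash_func_by_opfamily_and_type;
          std::map<Oid, Oid> default_hash_func_by_type;
      };

      // A hash expr carries its opfamily; the type comes from the expression.
      struct HashExpr {
          Oid expr_type;
          Oid opfamily;
      };

      inline Oid PickHashFunction(const Catalog &cat, const HashExpr &he,
                                  bool consider_opfamilies) {
          if (consider_opfamilies) {
              // Flag set: hash with the function of the carried opfamily.
              auto it = cat.hash_func_by_opfamily_and_type.find(
                  std::make_pair(he.opfamily, he.expr_type));
              assert(it != cat.hash_func_by_opfamily_and_type.end());
              return it->second;
          }
          // Flag off: fall back to the type's default hash function.
          auto it = cat.default_hash_func_by_type.find(he.expr_type);
          assert(it != cat.default_hash_func_by_type.end());
          return it->second;
      }
      ```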