1. 05 Jun 2020, 1 commit
    • Support "NDV-preserving" function and op property (#10247) · a4362cba
      Authored by Hans Zeller
      Orca uses this property for cardinality estimation of joins.
      For example, a join predicate foo join bar on foo.a = upper(bar.b)
      will have a cardinality estimate similar to foo join bar on foo.a = bar.b.
      
      Other functions, like foo join bar on foo.a = substring(bar.b, 1, 1)
      won't be treated that way, since they are more likely to have a greater
      effect on join cardinalities.
      
      Since this is specific to ORCA, we use logic in the translator to determine
      whether a function or operator is NDV-preserving. Right now we consider
      only a very limited set of operators; we may add more at a later time.
      
      Let's assume that we join tables R and S and that f is a function or
      expression that refers to a single column and does not preserve
      NDVs. Let's also assume that p is a function or expression that also
      refers to a single column and that does preserve NDVs:
      
      join predicate       card. estimate                         comment
      -------------------  -------------------------------------  -----------------------------
      col1 = col2          |R| * |S| / max(NDV(col1), NDV(col2))  build an equi-join histogram
      f(col1) = p(col2)    |R| * |S| / NDV(col2)                  use NDV-based estimation
      f(col1) = col2       |R| * |S| / NDV(col2)                  use NDV-based estimation
      p(col1) = col2       |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
      p(col1) = p(col2)    |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
      otherwise            |R| * |S| * 0.4                        this is an unsupported pred
      Note that adding casts to these expressions is ok, as well as switching left and right side.
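      The selectivity rules in the table above can be sketched as follows. This is an illustrative Python sketch, not ORCA's API; the side classifications ("col", "p", "f") and the function name are hypothetical.

```python
def join_card(card_r, card_s, ndv1, ndv2, lhs, rhs):
    """Sketch of the table above. lhs/rhs classify each side of the
    equality predicate: 'col' (bare column), 'p' (NDV-preserving expr),
    or 'f' (other expr)."""
    if {lhs, rhs} <= {"col", "p"}:
        # both sides preserve NDVs: divide by the larger NDV
        # (col1 = col2 would additionally use an equi-join histogram)
        return card_r * card_s / max(ndv1, ndv2)
    if "f" in (lhs, rhs) and ({lhs, rhs} & {"col", "p"}):
        # one side destroys NDVs: divide by the preserved side's NDV only
        ndv_preserved = ndv2 if lhs == "f" else ndv1
        return card_r * card_s / ndv_preserved
    # otherwise: unsupported predicate, fixed default selectivity
    return card_r * card_s * 0.4
```

      Switching the left and right sides of a predicate gives symmetric results, matching the note above.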
      
      Here is a list of expressions that we currently treat as NDV-preserving:
      
      coalesce(col, const)
      col || const
      lower(col)
      trim(col)
      upper(col)
      
      One more note: We need the NDVs of the inner side of Semi and
      Anti-joins for cardinality estimation, so only normal columns and
      NDV-preserving functions are allowed in that case.
      
      This is a port of these GPDB 5X and GPOrca PRs:
      https://github.com/greenplum-db/gporca/pull/585
      https://github.com/greenplum-db/gpdb/pull/10090
      
      This is take 2, after reverting the first attempt due to a merge conflict that
      caused a test to fail.
  2. 04 Jun 2020, 2 commits
    • Revert "Support "NDV-preserving" function and op property (#10225)" · 898e66b8
      Authored by Jesse Zhang
      Regression test "gporca" started failing after merging d565edac.
      
      This reverts commit d565edac.
    • Support "NDV-preserving" function and op property (#10225) · d565edac
      Authored by Hans Zeller
      Orca uses this property for cardinality estimation of joins.
      For example, a join predicate foo join bar on foo.a = upper(bar.b)
      will have a cardinality estimate similar to foo join bar on foo.a = bar.b.
      
      Other functions, like foo join bar on foo.a = substring(bar.b, 1, 1)
      won't be treated that way, since they are more likely to have a greater
      effect on join cardinalities.
      
      Since this is specific to ORCA, we use logic in the translator to determine
      whether a function or operator is NDV-preserving. Right now we consider
      only a very limited set of operators; we may add more at a later time.
      
      Let's assume that we join tables R and S and that f is a function or
      expression that refers to a single column and does not preserve
      NDVs. Let's also assume that p is a function or expression that also
      refers to a single column and that does preserve NDVs:
      
      join predicate       card. estimate                         comment
      -------------------  -------------------------------------  -----------------------------
      col1 = col2          |R| * |S| / max(NDV(col1), NDV(col2))  build an equi-join histogram
      f(col1) = p(col2)    |R| * |S| / NDV(col2)                  use NDV-based estimation
      f(col1) = col2       |R| * |S| / NDV(col2)                  use NDV-based estimation
      p(col1) = col2       |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
      p(col1) = p(col2)    |R| * |S| / max(NDV(col1), NDV(col2))  use NDV-based estimation
      otherwise            |R| * |S| * 0.4                        this is an unsupported pred
      Note that adding casts to these expressions is ok, as well as switching left and right side.
      
      Here is a list of expressions that we currently treat as NDV-preserving:
      
      coalesce(col, const)
      col || const
      lower(col)
      trim(col)
      upper(col)
      
      One more note: We need the NDVs of the inner side of Semi and
      Anti-joins for cardinality estimation, so only normal columns and
      NDV-preserving functions are allowed in that case.
      
      This is a port of these GPDB 5X and GPOrca PRs:
      https://github.com/greenplum-db/gporca/pull/585
      https://github.com/greenplum-db/gpdb/pull/10090
  3. 03 Jun 2020, 1 commit
    • Refactoring the DbgPrint and OsPrint methods (#10149) · b3fdede6
      Authored by Hans Zeller
      * Make DbgPrint and OsPrint methods on CRefCount
      
      Create a single DbgPrint() method on the CRefCount class. Also create
      a virtual OsPrint() method, making some objects derived from CRefCount
      easier to print from the debugger.
      
      Note that not all the OsPrint methods had the same signature; some
      additional OsPrintxxx() methods were generated to accommodate that.
      
      * Making print output easier to read, print some stuff on demand
      
      Required columns in required plan properties are always the same
      for a given group. Also, equivalent expressions in required distribution
      properties are important in certain cases, but in most cases they
      disrupt the display and make it harder to read.
      
      Added two traceflags, EopttracePrintRequiredColumns and
      EopttracePrintEquivDistrSpecs that have to be set to print this
      information. If you want to go back to the old display, use these
      options when running gporca_test: -T 101016 -T 101017
      
      * Add support for printing alternative plans
      
      A new method, CEngine::DbgPrintExpr() can be called from
      COptimizer::PexprOptimize, to allow printing of the best plan
      for different contexts. This is only enabled in debug builds.
      
      To use this:
      
      - run an MDP using gporca_test, using a debug build
      - print out memo after optimization (-T 101006 -T 101010)
      - set a breakpoint near the end of COptimizer::PexprOptimize()
      - if, after looking at the contents of memo, you want to see
        the optimal plan for context c of group g, do the following:
        p eng.DbgPrintExpr(g, c)
      
      You could also get the same info from the memo printout, but it
      would take a lot longer.
  4. 30 May 2020, 1 commit
    • Penalize cross products in Orca's DPv2 algorithm more accurately (#10029) · 457bb928
      Authored by Chris Hajas
      Previously, the DPv2 transform (exhaustive2) penalized cross joins for
      the later joins chosen in the greedy phase, but not for the first join,
      so in some cases a cross join was selected first. This often produced a
      poor join order and went against the intent of the alternative being
      generated, which is to minimize cross joins.
      
      We also increase the default cross-join penalty from 5 to 1024, the
      value we use in the cost model during the optimization stage.
      
      The greedy alternative also wasn't kept in the heap, so we include that now too.
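      A minimal sketch of the corrected behavior. The names below are illustrative, not the actual DPv2 code; the point is that the cross-join penalty now applies to the first greedy pick as well as the later ones.

```python
CROSS_JOIN_PENALTY = 1024  # raised from 5 per this commit

def pair_cost(card_left, card_right, has_join_pred):
    # penalize a pair with no connecting join predicate (a cross product)
    cost = card_left * card_right
    return cost if has_join_pred else cost * CROSS_JOIN_PENALTY

def greedy_first_pick(pairs):
    # pairs: (card_left, card_right, has_join_pred) tuples.
    # Apply the penalty when choosing the FIRST join, not only later ones.
    return min(pairs, key=lambda p: pair_cost(*p))
```

      With the penalty applied uniformly, a small cross product no longer beats a larger but predicate-connected join for the first pick.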
  5. 28 May 2020, 1 commit
    • Log fewer errors (#10100) · fba77702
      Authored by Sambitesh Dash
      This is a continuation of commit 456b2b31 in GPORCA, adding more errors
      to the list of those that are not written to the log file. We also remove
      the code that wrote to std::cerr, which produced an unsightly log
      message; instead, we record whether the error was unexpected in another
      log message that we already generate.
  6. 22 May 2020, 1 commit
    • Let configure set C++14 mode (#10147) · b371e592
      Authored by Chris Hajas
      Commit 649ee57d "Build ORCA with C++14: Take Two (#10068)" left
      behind a major FIXME: a hard-coded CXXFLAGS in gporca.mk. At the very
      least this looks completely out of place aesthetically. But more
      importantly, this is problematic in several ways:
      
      1. It leaves the language mode for part of the code base
      (src/backend/gpopt "ORCA translator") unspecified. The ORCA translator
      closely collaborates with ORCA and directly uses much of the interfaces
      from ORCA. There is a non-hypothetical risk of non-subtle
      incompatibilities. This is obscured by the fact that GCC and upstream
      Clang both default to gnu++14 after their respective 6.0 releases.
      Apple Clang from Xcode 11, however, reacts as if the default were still
      gnu++98:
      
      > In file included from CConfigParamMapping.cpp:20:
      > In file included from ../../../../src/include/gpopt/config/CConfigParamMapping.h:19:
      > In file included from ../../../../src/backend/gporca/libgpos/include/gpos/common/CBitSet.h:15:
      > In file included from ../../../../src/backend/gporca/libgpos/include/gpos/common/CDynamicPtrArray.h:15:
      > ../../../../src/backend/gporca/libgpos/include/gpos/common/CRefCount.h:68:24: error: expected ';' at end of declaration list
      >                         virtual ~CRefCount() noexcept(false)
      >                                             ^
      >                                             ;
      
      2. It potentially conflicts with other parts of the code base. Namely,
      when configured with gpcloud, we might have -std=c++11 and -std=gnu++14
      in the same breath, which is highly undesirable or an outright error.
      
      3. Idiomatically, language-standard selection should modify CXX, not
      CXXFLAGS, in the same vein as how AC_PROG_CC_C99 modifies CC.
      
      We already had a precedent for setting the compiler up in C++11 mode
      (albeit for gpcloud, a less-used component). This patch leverages the
      same mechanism to set up CXX in C++14 mode.
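      A hypothetical configure.ac fragment showing this mechanism. AX_CXX_COMPILE_STDCXX is the autoconf-archive macro; its use here is an assumption for illustration, not necessarily the exact macro this patch uses.

```m4
dnl Select C++14 by modifying CXX rather than CXXFLAGS, in the same
dnl vein as AC_PROG_CC_C99 modifies CC.
AC_PROG_CXX
dnl [ext] requests the GNU dialect (-std=gnu++14); [mandatory] makes
dnl configure fail if the compiler cannot provide it. The chosen switch
dnl is appended to CXX, so every C++ compile line sees the same mode.
AX_CXX_COMPILE_STDCXX([14], [ext], [mandatory])
```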
      Authored-by: Chris Hajas <chajas@pivotal.io>
  7. 19 May 2020, 1 commit
    • Build ORCA with C++14: Take Two (#10068) · 649ee57d
      Authored by Jesse Zhang
      This patch makes the minimal changes to build ORCA with C++14. This
      should address the grievance that ORCA cannot build with the default
      Xerces C++ (3.2 or newer, which is built with GCC 8.3 in the default
      C++14 mode) headers from Debian. I've kept the CMake build system in
      sync with the main Makefile. I've also made sure that all ORCA tests
      pass.
      
      This patch set also enables ORCA in Travis so the community gets
      compilation coverage.
      
      === FIXME / near-term TODOs:
      
      What's _not_ included in this patch, but would be nice to have soon (in
      descending order of importance):
      
      1. -std=gnu++14 ought to be done in "configure", not in a Makefile. This
      is not a pedantic aesthetic issue; sooner or later we'll run into this
      problem, especially if we're mixing multiple things built in C++.
      
      2. Clean up the Makefiles and move most CXXFLAGS override into autoconf.
      
      3. Those noexcept(false) seem excessive; we should benefit from
      conditionally marking more code "noexcept", at least in production.
      
      4. Detecting whether Xerces was generated (either by autoconf or CMake)
      with a compiler that's effectively running post-C++11
      
      5. Work around a GCC 9.2 bug that crashes the loading of minidumps (I've
      tested with GCC 6 to 10). Last I checked, the bug has been fixed in GCC
      releases 10.1 and 9.3.
      
      [resolves #9923]
      [resolves #10047]
      Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
      Reviewed-by: Hans Zeller <hzeller@pivotal.io>
      Reviewed-by: Ashuka Xue <axue@pivotal.io>
      Reviewed-by: David Kimura <dkimura@pivotal.io>
  8. 18 May 2020, 1 commit
    • VPATH fix for ORCA-related Makefile. · afd31921
      Authored by Jesse Zhang
      This commit fixes up a host of top_builddir vs top_srcdir confusion,
      uncovered by running a VPATH build (with ORCA enabled, of course).
      
      I've also taken this opportunity to slightly eliminate some duplication,
      using Makefile inclusion.
      
      After this commit, a VPATH build should compile.
      
      This resolves #10071.
  9. 14 May 2020, 2 commits
    • Address PR Feedback · d90ceb45
      Authored by Ashuka Xue
    • Allow stats estimation for text-like types only for histograms containing singleton buckets · ecefcc1c
      Authored by Ashuka Xue
      In commit `Improve statistics calculation for exprs like "var = ANY
      (ARRAY[...])"`, we improved the performance of cardinality estimation
      for ArrayCmp. However, that change caused ArrayCmp expressions with
      text-like types to fall back to NDV-based cardinality estimation even
      when valid histograms were present.
      
      This commit re-enables using histograms for text-like types provided it
      is safe to do so.
      
      The following are removed because non-singleton buckets are not valid for text:
      - src/backend/gporca/data/dxl/minidump/CTE-12.mdp
      - src/backend/gporca/data/dxl/statistics/Join-Statistics-Text-Input.xml
      - src/backend/gporca/data/dxl/statistics/Join-Statistics-Text-Output.xml
      Co-authored-by: Ashuka Xue <axue@pivotal.io>
      Co-authored-by: Shreedhar Hardikar <shardikar@pivotal.io>
  10. 13 May 2020, 1 commit
    • Removing xerces patch (#10091) · 2448be9b
      Authored by Hans Zeller
      The scripts we use in Concourse pipelines download Apache xerces-c-3.1.2 and then apply a patch that is part of our source code tree. Abhijit has pointed out that this is no longer necessary. This commit removes the patch and uses the vanilla xerces-c-3.1.2 source code instead.
      
      Eventually, we want to stop including xerces into our releases and rely on the natively installed xerces. See also https://github.com/greenplum-db/gpdb/pull/10068.
  11. 12 May 2020, 1 commit
    • Limit DPE stats to groups with unresolved partition selectors (#9988) · cfc83810
      Authored by Hans Zeller
      DPE stats are computed when we have a dynamic partition selector that's
      applied on another child of a join. The current code continues to use
      DPE stats even for the common ancestor join and nodes above it, but
      those nodes aren't affected by the partition selector.
      
      Regular Memo groups pick the best expression among several to compute
      stats, which makes row count estimates more reliable. We don't have
      that luxury with DPE stats, therefore they are often less reliable.
      
      By minimizing the places where we use DPE stats, we should overall get
      more reliable row count estimates with DPE stats enabled.
      
      The fix also ignores DPE stats with row counts greater than the group
      stats. Partition selectors eliminate certain partitions, therefore
      it is impossible for them to increase the row count.
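      The two rules above can be sketched as follows (illustrative names, not ORCA's actual functions):

```python
def effective_row_estimate(dpe_rows, group_rows, has_unresolved_part_selector):
    # Above the common ancestor join there is no unresolved partition
    # selector, so fall back to the regular group stats.
    if not has_unresolved_part_selector:
        return group_rows
    # A partition selector can only eliminate partitions, so DPE stats
    # must never exceed the group stats; ignore them when they do.
    return min(dpe_rows, group_rows)
```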
  12. 09 May 2020, 13 commits
  13. 06 May 2020, 3 commits
    • Drop trailing slashes in "subdir". · f90dd34f
      Authored by Jesse Zhang
      While functionally harmless, the trailing slashes contradict our coding
      conventions. They also lead to double slashes when we compute the
      linking command.
      
      While we're at it, also correct the trailing slashes in subdir for 3
      pre-existing Makefiles (fsync, heap_checksum, and walrep regress tests).
    • Use CXXFLAGS instead of CPPFLAGS for -std and -W*. · d51ecabe
      Authored by Jesse Zhang
      Semantically, those are CXXFLAGS, not CPPFLAGS (which typically carry
      -D and -I options). Practically, this becomes a problem when we try to
      turn off certain warnings: CPPFLAGS come after CXXFLAGS, and the
      order-sensitive -Werror and -Wno-* flags won't mix well.
    • Remove repetition in gpos Makefile · 20719067
      Authored by Jesse Zhang
      In a forthcoming commit we're going to tweak some of these flags. To
      prevent duplication, consolidate them through Makefile inclusion first.
      This also addresses a FIXME left from commit 3f3f4a57 "Fix configure
      and cmake to build ORCA with debug".
  14. 29 Apr 2020, 2 commits
    • Improve statistics calculation for exprs like "var = ANY (ARRAY[...])" · e25bcf4e
      Authored by Shreedhar Hardikar
      Implements an algorithm in MakeHistArrayCmpAnyFilter() using CStatsPredArrayCmp:
      1. Construct a histogram with the same bucket boundaries as present in the
         base_histogram.
         This is better than using a singleton bucket per point because, in
         that case, the frequency of each bucket is so small that it is often
         less than CStatistics::Epsilon and may be treated as 0, leading to
         cardinality misestimation. Using the same buckets as base_histogram
         also aids in joining histograms later.
      2. Compute the frequency for each bucket based on the number of points (NDV)
         present within each bucket boundary. NB: the points must be de-duplicated
         beforehand to prevent double counting.
      3. Join this "dummy_histogram" with the base_histogram to determine the buckets
         from base_histogram that should be selected (using MakeJoinHistogram)
      4. Compute and adjust the resultant scale factor for the filter.
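      Steps 1 and 2 might look roughly like this. This is an illustrative sketch; the bucket and histogram representations are simplified lists, not ORCA's CHistogram/CBucket classes.

```python
from bisect import bisect_right

def dummy_histogram_freqs(boundaries, constants):
    # Step 1: reuse the base histogram's bucket boundaries;
    # boundaries[i] and boundaries[i+1] delimit bucket i.
    # Step 2: weight each bucket by the number of distinct constants
    # falling into it; de-duplicate first to prevent double counting.
    points = sorted(set(constants))
    counts = [0] * (len(boundaries) - 1)
    for p in points:
        i = bisect_right(boundaries, p) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    total = len(points)
    return [c / total for c in counts]  # per-bucket frequency
```

      Step 3 would then join this dummy histogram with base_histogram (MakeJoinHistogram) to pick out the matching buckets, and step 4 adjusts the resulting scale factor.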
      Co-authored-by: Ashuka Xue <axue@pivotal.io>
      Co-authored-by: Shreedhar Hardikar <shardikar@pivotal.io>
    • [Refactor] Rename functions for clarity and add DbgPrints · bce754a1
      Authored by Ashuka Xue
      Functions renamed:
      - CHistogram::Buckets -> GetNumBuckets
      - CHistogram::ParseDXLToBucketsArray -> GetBuckets
      
      Implemented DbgPrint for:
      - CBucket
      - CHistogram
      Co-authored-by: Ashuka Xue <axue@pivotal.io>
      Co-authored-by: Shreedhar Hardikar <shardikar@pivotal.io>
  15. 20 Apr 2020, 1 commit
    • Do not push Volatile funcs below aggs · 885ca8a9
      Authored by Sambitesh Dash
      Consider the scenario below
      
      ```
      create table tenk1 (c1 int, ten int);
      create temp sequence ts1;
      explain select * from (select distinct ten from tenk1) ss where ten < 10 + nextval('ts1') order by 1;
      ```
      
      The filter outside the subquery is a candidate to be pushed below the
      'distinct' in the subquery. But since 'nextval' is a volatile function,
      we should not push it.
      
      Volatile functions give different results with each execution. We don't
      want aggs to use the result of a volatile function before it is
      necessary. We do this for all aggs, both DISTINCT and GROUP BY.
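      A sketch of the guard. The volatility classification below is illustrative; in GPDB it comes from the function catalog (pg_proc.provolatile), not a hard-coded set.

```python
VOLATILE_FUNCS = {"nextval", "random", "timeofday"}  # illustrative subset

def can_push_below_agg(funcs_in_predicate):
    # A filter may be pushed below a DISTINCT or GROUP BY only if it
    # contains no volatile function: a volatile function may return a
    # different result on every call, so evaluating it earlier (and a
    # different number of times) changes the query's behavior.
    return not any(f in VOLATILE_FUNCS for f in funcs_in_predicate)
```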
      
      Also see commit 6327f25d.
  16. 15 Apr 2020, 2 commits
  17. 10 Apr 2020, 2 commits
    • Handle opfamilies/opclasses for distribution in ORCA · e7ec9f11
      Authored by Shreedhar Hardikar
      GPDB 6 introduced a mechanism to distribute tables on columns using a
      custom hash opclass, instead of using cdbhash. Before this commit, ORCA
      ignored the distribution opclass, while the translator ensured that
      only queries in which all tables were distributed by either their
      default or default "legacy" opclasses were allowed.
      
      However, in case of tables distributed by legacy or default opclasses,
      but joined using a non-default opclass operator, ORCA would produce an
      incorrect plan, giving wrong results.
      
      This commit fixes that bug by introducing support for distributed tables
      using non-default opfamilies/opclasses. But, even though the support is
      implemented, it is not fully enabled at this time. The logic to fallback
      to planner when the plan contains tables distributed with non-default
      non-legacy opclasses remains. Our intention is to support it fully in
      the future.
      
      How does this work?
      For hash joins, capture the opfamily of each hash joinable operator. Use
      that to create hash distribution spec requests for either side of the
      join.  Scan operators derive a distribution spec based on opfamily
      (corresponding to the opclass) of each distribution column.  If there is
      a mismatch between distribution spec requested/derived, add a Motion
      Redistribute node using the distribution function from the requested
      hash opfamily.
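      The matching rule can be sketched as follows (an illustrative sketch, simplified to a single distribution column; the names are not ORCA's):

```python
def motion_needed(derived_col, derived_opfamily,
                  requested_col, requested_opfamily):
    # A scan derives its hash distribution spec from the opclass of its
    # distribution column; a hash join requests a spec based on the
    # opfamily of its '=' operator. Any mismatch, in column or in
    # opfamily, means the rows are not already distributed the way the
    # join needs them, so a Motion Redistribute (using the requested
    # opfamily's hash function) must be added.
    return (derived_col, derived_opfamily) != (requested_col,
                                               requested_opfamily)
```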
      
      The commit consists of several sub-sections:
      - Capture distr opfamilies in CMDRelation and related classes
      
        For each distribution column of the relation, track the opfamily of
        "opclass" used in the DISTRIBUTED BY clause. This information is then
        relayed to CTableDescriptor & CPhysicalScan.
      
        Also support this in other CMDRelation subclasses: CMDRelationCTAS
        (via CLogicalCTAS) & CMDRelationExternalGPDB.
      
      - Capture hash opfamily of CMDScalarOp using gpdb::GetCompatibleHashOpFamily().
        This is needed to determine distribution spec requests from joins.
      
      - Track hash opfamilies of join predicates
      
        This commit extends the caching of join keys in Hash/Merge joins by
        also caching the corresponding hash opfamilies of the '=' operators
        used in those predicates.
      
      - Track opfamily in CDistributionSpecHashed.
      
        This commit also constructs CDistributionSpecHashed with opfamily
        information that was previously cached in CScalarGroup in the case of
        HashJoins.
        It also includes the compatibility checks that reject distributions
        specs with mismatched opfamilies in order to produce Redistribute
        motions.
      
      - Capture default distribution (hash) opfamily in CMDType
      - Handle legacy opfamilies in CMDScalarOp & CMDType
      - Handle opfamilies in HashExprList Expr->DXL translation
      
      ORCA-side notes:
      1. To ensure correctness, equivalent classes can only be determined over
         a specific opfamily. For example, the expression `a = b` implies a &
         b belong to an equiv classes only for the opfamily `=` belongs to.
         Otherwise expression `b |=| c` can be used to imply a & c belong to
         the same equiv class, which is incorrect, as the opfamily of `=` and
         `|=|` differ.
         For this commit, determine equiv classes only for default opfamilies.
         This will ensure correct behavior for the majority of cases.
      2. This commit does *not* implement similar features for merge joins.
         That is left for future work.
      3. This commit introduces two traceflags:
         - EopttraceConsiderOpfamiliesForDistribution: If this is off,
           opfamilies is ignored and set to NULL. This mimics behavior before
           this PR. Ctest MDPs are run this way.
         - EopttraceUseLegacyOpfamilies: Set if ANY distribution col in the
           query uses a legacy opfamily/opclass. MDCache getters will then
           return legacy opfamilies instead of the default opfamilies for all
           queries.
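      Note 1 above can be sketched with a union-find keyed by (opfamily, column), so that `a = b` merges a and b only within the opfamily that `=` belongs to; the `|=|` example then cannot leak equivalences across opfamilies. This is an illustrative sketch, not ORCA's equivalence-class machinery.

```python
class EquivClasses:
    def __init__(self):
        self.parent = {}

    def find(self, key):
        # union-find without path compression, for brevity
        self.parent.setdefault(key, key)
        while self.parent[key] != key:
            key = self.parent[key]
        return key

    def add_equality(self, opfamily, col1, col2):
        # 'col1 = col2' merges the columns only within this opfamily
        r1 = self.find((opfamily, col1))
        r2 = self.find((opfamily, col2))
        if r1 != r2:
            self.parent[r2] = r1

    def equivalent(self, opfamily, col1, col2):
        return self.find((opfamily, col1)) == self.find((opfamily, col2))
```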
      
      What new information is captured from GPDB?
      1. Opfamily of each distribution column in CMDRelation,
         CMDRelationCtasGPDB & CMDRelationExternalGPDB
      2. Compatible hash opfamily of each CMDScalarOp using
         gpdb::GetCompatibleHashOpFamily()
      3. Default distribution (hash) opfamily of every type.
         This may be NULL for some types. It is needed for certain operators
         (e.g., HashAgg) that request a distribution spec that cannot be
         inferred in any other way: we cannot derive it, cannot get it from
         any scalar op, etc. See GetDefaultDistributionOpfamilyForType()
      4. Legacy opfamilies for types & scalar operators.
         Needed for supporting tables distributed by legacy opclasses.
      
      Other GPDB side changes:
      
      1. HashExprList no longer carries the type of the expression (it is
         inferred from the expr instead). However, it now carries the hash
         opfamily to use when deriving the distribution hash function. To
         maintain compatibility with older versions, the opfamily is used only
         if EopttraceConsiderOpfamiliesForDistribution is set, otherwise,
         default hash distribution function of the type of the expr is used.
      2. Don't worry about left & right types in get_compatible_hash_opfamily()
      3. Consider COERCION_PATH_RELABELTYPE as binary coercible for ORCA.
      4. EopttraceUseLegacyOpfamilies is set if any table is distributed by a
         legacy opclass.
    • Revert "Fallback when citext op non-citext join predicate is present" · 0410b7ba
      Authored by Shreedhar Hardikar
      This reverts commit 3e45f064.
  18. 09 Apr 2020, 1 commit
  19. 08 Apr 2020, 3 commits
    • Merging Orca .editorconfig into gpdb file · 1093ef02
      Authored by Hans Zeller
    • Fix a couple of ORCA assertions · 5a658c09
      Authored by Chris Hajas
      These were exposed when running ICW with ORCA asserts enabled.
      
      In DeriveJoinStats, EopLogicalFullOuterJoin is also a valid logical join
      operator. In IDatum, we need to check that doubles are within some
      epsilon as we're not passing in the full 64 bit IEEE value to ORCA.
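      An epsilon comparison of that kind might look like this. The tolerance below is illustrative, not ORCA's actual constant.

```python
EPSILON = 1e-10  # illustrative tolerance

def doubles_equal(a, b):
    # The value crossing into ORCA is not the full 64-bit IEEE double,
    # so exact equality can fail spuriously; compare within a relative
    # epsilon instead (scaled by the magnitudes of the operands).
    return abs(a - b) <= EPSILON * max(1.0, abs(a), abs(b))
```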
      
      After fixing the assertion, we would need to regenerate the mdp for
      MinCardinalityNaryJoin. However, there is no DDL/query for this test,
      so it is difficult to update. Since it also didn't seem to provide
      much value, we're removing it.
    • C
      45719ddd