1. 14 Mar 2019, 1 commit
  2. 11 Mar 2019, 1 commit
    • Retire the reshuffle method for table data expansion (#7091) · 1c262c6e
      Committed by Ning Yu
      This method was introduced to improve the data redistribution
      performance during gpexpand phase 2; however, per benchmark results the
      effect does not meet our expectations.  For example, when expanding a
      table from 7 segments to 8 segments the reshuffle method is only 30%
      faster than the traditional CTAS method, and when expanding from 4 to 8
      segments reshuffle is even 10% slower than CTAS.  When there are indexes
      on the table the reshuffle performance can be worse, and an extra VACUUM
      is needed to actually free the disk space.  According to our experiments,
      the bottleneck of the reshuffle method is the tuple deletion operation,
      which is much slower than the insertion operation used by CTAS.
      
      The reshuffle method does have some benefits: it requires less extra
      disk space and less network bandwidth (similar to the CTAS method with
      the new JCH reduce method, but less than CTAS + MOD).  It can also be
      faster in some cases; however, as we cannot automatically determine when
      it is faster, it is not easy to benefit from it in practice.
      
      On the other hand, the reshuffle method is less tested; it may have bugs
      in corner cases, so it is not production ready yet.
      
      Given that, we decided to retire it entirely for now; we might add it
      back in the future if we can get rid of the slow deletion or find a
      reliable way to automatically choose between the reshuffle and CTAS
      methods.
      
      Discussion: https://groups.google.com/a/greenplum.org/d/msg/gpdb-dev/8xknWag-SkI/5OsIhZWdDgAJ
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
      1c262c6e
  3. 27 Feb 2019, 1 commit
    • Refactor NUMSEGMENTS related macros (#7028) · d28b7057
      Committed by Jialun
      - Retire GP_POLICY_ALL_NUMSEGMENTS and GP_POLICY_ENTRY_NUMSEGMENTS,
        unifying them to getgpsegmentCount
      - Retire GP_POLICY_MINIMAL_NUMSEGMENTS and GP_POLICY_RANDOM_NUMSEGMENTS
      - Change the NUMSEGMENTS-related macros from variable-style macros to
        function-style macros
      - Change the default return value of getgpsegmentCount to 1, which
        represents a single PostgreSQL instance in utility mode
      - Change __GP_POLICY_INVALID_NUMSEGMENTS to GP_POLICY_INVALID_NUMSEGMENTS
      d28b7057
  4. 06 Feb 2019, 1 commit
  5. 01 Feb 2019, 1 commit
    • Use normal hash operator classes for data distribution. · 242783ae
      Committed by Heikki Linnakangas
      Replace the use of the built-in hashing support for built-in datatypes, in
      cdbhash.c, with the normal PostgreSQL hash functions. Now is a good time
      to do this, since we've already made the change to use jump consistent
      hashing in GPDB 6, so we'll need to deal with the upgrade problems
      associated with changing the hash functions, anyway.
      
      It is no longer enough to track which columns/expressions are used to
      distribute data. You also need to know the hash function used. For that,
      a new field is added to gp_distribution_policy, to record the hash
      operator class used for each distribution key column. In the planner,
      a new opfamily field is added to DistributionKey, to track that throughout
      the planning.
      
      Normally, if you do "CREATE TABLE ... DISTRIBUTED BY (column)", the
      default hash operator class for the datatype is used. But this patch
      extends the syntax so that you can specify the operator class explicitly,
      like "... DISTRIBUTED BY (column opclass)". This is similar to how an
      operator class can be specified for each column in CREATE INDEX.
      
      To support upgrade, the old hash functions have been converted to special
      (non-default) operator classes, named cdbhash_*_ops. For example, if you
      want to use the old hash function for an integer column, you could do
      "DISTRIBUTED BY (intcol cdbhash_int4_ops)". The old hard-coded whitelist
      of operators that have "compatible" cdbhash functions has been replaced
      by putting the compatible hash opclasses in the same operator family. For
      example, the legacy integer operator classes cdbhash_int2_ops,
      cdbhash_int4_ops and cdbhash_int8_ops are all part of the
      cdbhash_integer_ops operator family.
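      
      A short SQL sketch of the syntax described above (table and column names
      are illustrative):
      
      ```sql
      -- Uses the default hash opclass for the datatype:
      CREATE TABLE t_new (c1 int, c2 text) DISTRIBUTED BY (c1);
      
      -- Explicitly requests the legacy (pre-jump-consistent-hash) hashing:
      CREATE TABLE t_legacy (c1 int, c2 text) DISTRIBUTED BY (c1 cdbhash_int4_ops);
      ```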
      
      This removes the pg_database.hashmethod field. The hash method is now
      tracked on a per-table and per-column basis, using the opclasses, so it's
      not needed anymore.
      
      To help with upgrade from GPDB 5, this introduces a new GUC called
      'gp_use_legacy_hashops'. If it's set, CREATE TABLE uses the legacy hash
      opclasses, instead of the default hash opclasses, if the opclass is not
      specified explicitly. pg_upgrade will set the new GUC, to force the use of
      legacy hashops, when restoring the schema dump. It will also set the GUC
      on all upgraded databases, as a per-database option, so any new tables
      created after upgrade will also use the legacy opclasses. It seems better
      to be consistent after upgrade, so that collocation between old and new
      tables works, for example. The idea is that some time after the upgrade, the
      admin can reorganize all tables to use the default opclasses instead. At
      that point, he should also clear the GUC on the converted databases. (Or
      rather, the automated tool that hasn't been written yet, should do that.)
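      
      For illustration only, a sketch of how the GUC can be applied (database
      and table names here are placeholders, not taken from the patch itself):
      
      ```sql
      -- Session level: subsequent CREATE TABLE picks the legacy hash
      -- opclasses when none is specified explicitly.
      SET gp_use_legacy_hashops = on;
      CREATE TABLE t_upgraded (c1 int) DISTRIBUTED BY (c1);
      
      -- Per-database, the way pg_upgrade marks upgraded databases:
      ALTER DATABASE mydb SET gp_use_legacy_hashops = on;
      ```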
      
      ORCA doesn't know about hash operator classes, or the possibility that we
      might need to use a different hash function for two columns with the same
      datatype. Therefore, it cannot produce correct plans for queries that mix
      different distribution hash opclasses for the same datatype, in the same
      query. There are checks in the Query->DXL translation, to detect that
      case, and fall back to planner. As long as you stick to the default
      opclasses in all tables, we let ORCA create the plan without any regard
      to them, and use the default opclasses when translating the DXL plan to a
      Plan tree. We also allow the case that all tables in the query use the
      "legacy" opclasses, so that ORCA works after pg_upgrade. But a mix of the
      two, or using any non-default opclasses, forces ORCA to fall back.
      
      One curiosity with this is the "int2vector" and "aclitem" datatypes. They
      have a hash opclass, but no b-tree operators. GPDB 4 used to allow them
      as DISTRIBUTED BY columns, but we forbid that in GPDB 5, in commit
      56e7c16b. Now they are allowed again, so you can specify an int2vector
      or aclitem column in DISTRIBUTED BY, but it's still pretty useless,
      because the planner still can't form EquivalenceClasses on it, and will
      treat it as "strewn" distribution, and won't co-locate joins.
      
      Abstime, reltime, tinterval datatypes don't have default hash opclasses.
      They are being removed completely on PostgreSQL v12, and users shouldn't
      be using them in the first place, so instead of adding hash opclasses for
      them now, we accept that they can't be used as distribution key columns
      anymore. Add a check to pg_upgrade, to refuse upgrade if they are used
      as distribution keys in the old cluster. Do the same for 'money' datatype
      as well, although that's not being removed in upstream.
      
      The legacy hashing code for anyarray in GPDB 5 was actually broken. It
      could produce a different hash value for two arrays that are considered
      equal, according to the = operator, if there were differences in e.g.
      whether the null bitmap was stored or not. Add a check to pg_upgrade, to
      reject the upgrade if array types were used as distribution keys. The
      upstream hash opclass for anyarray works, though, so it is OK to use
      arrays as distribution keys in new tables. We just don't support binary
      upgrading them from GPDB 5. (See github issue
      https://github.com/greenplum-db/gpdb/issues/5467). The legacy hashing of
      'anyrange' had the same problem, but that was new in GPDB 6, so we don't
      need a pg_upgrade check for that.
      
      This also tightens the checks in ALTER TABLE ALTER COLUMN and CREATE UNIQUE
      INDEX, so that you can no longer create a situation where a non-hashable
      column becomes the distribution key. (Fixes github issue
      https://github.com/greenplum-db/gpdb/issues/6317)
      
      Discussion: https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/4fZVeOpXllQ
      Co-authored-by: Mel Kiyama <mkiyama@pivotal.io>
      Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
      Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
      Co-authored-by: Chris Hajas <chajas@pivotal.io>
      Reviewed-by: Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
      Reviewed-by: Ning Yu <nyu@pivotal.io>
      Reviewed-by: Simon Gao <sgao@pivotal.io>
      Reviewed-by: Jesse Zhang <jzhang@pivotal.io>
      Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      Reviewed-by: Yandong Yao <yyao@pivotal.io>
      242783ae
  6. 25 Jan 2019, 1 commit
    • Remove GPDB_92_MERGE_FIXME from prepunion.c · 669893be
      Committed by Alexandra Wang
      The GPDB_92_MERGE_FIXME asking whether a deep copy of the subroot is
      needed or a memcpy suffices can be removed: all we care about from the
      subroot is `parse->rtable`, so creating a deep copy of it is unnecessary.
      
      This commit also removes the `Assert()`, which is valid in upstream but
      not in GPDB, since we create a new copy of the subplan if two SubPlans
      refer to the same initplan. Therefore, when we set references for
      subquery scans in plans with copies of subplans referring to the same
      initplan, we cannot directly Assert that the RelOptInfo's subplan is the
      same as the subquery scan's subplan.
      
      Added a test case for this, which will ensure we do not merge the Assert
      back from upstream in future merges.
      Co-authored-by: Ekta Khanna <ekhanna@pivotal.io>
      669893be
  7. 15 Jan 2019, 1 commit
    • Assign plan_node_id for ModifyTable, MergeAppend (#6695) · 82fb3b5a
      Committed by Wang Hao
      Some plan node types, such as ModifyTable and MergeAppend, were not
      covered by assign_plannode_id(), so their child nodes were not assigned a
      proper plan_node_id. The plan_node_id is required by gpmon and the
      instrumentation code for monitoring purposes; without a proper
      plan_node_id, the consistency of the monitoring data is broken.
      
      This commit refactors assign_plannode_id() to use plan_tree_walker. As a
      result, ModifyTable, MergeAppend, and potentially Sequence are covered.
      Another advantage of using plan_tree_walker is that when new node types
      are introduced we no longer need to touch assign_plannode_id();
      plan_tree_walker takes care of them.
      
      Fixes https://github.com/greenplum-db/gpdb/issues/5247
      Reviewed-by: Ning Yu <nyu@pivotal.io>
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      82fb3b5a
  8. 19 Dec 2018, 1 commit
    • Fix split node's flow type. · 104c2b1c
      Committed by Zhenghua Lyu
      Split Update is used for UPDATE statements that touch a hash-distributed
      table's distribution columns. A Redistribute Motion has to be added above
      the split node in the plan, and this was achieved by marking the split
      node's flow as strewn. However, if the subplan's flow is entry, we should
      not mark it strewn.
      104c2b1c
  9. 14 Dec 2018, 1 commit
  10. 13 Dec 2018, 1 commit
    • Reporting cleanup for GPDB specific errors/messages · 56540f11
      Committed by Daniel Gustafsson
      The Greenplum-specific error handling via ereport()/elog() calls was in
      need of a unification effort, as some parts of the code were using a
      different messaging style than others (and than upstream). This aims at
      bringing many of the GPDB error calls in line with the upstream error
      message writing guidelines and thus makes the user experience of
      Greenplum more consistent.
      
      The main contributions of this patch are:
      
      * errmsg() messages shall start with a lowercase letter, and not end
        with a period. errhint() and errdetail() shall be complete sentences
        starting with capital letter and ending with a period. This attempts
        to fix this on as many ereport() calls as possible, with too detailed
        errmsg() content broken up into details and hints where possible.
      
      * Reindent ereport() calls to be more consistent with the common style
        used in upstream and most parts of Greenplum:
      
      	ereport(ERROR,
      			(errcode(<CODE>),
      			 errmsg("short message describing error"),
      			 errhint("Longer message as a complete sentence.")));
      
      * Avoid breaking messages due to long lines since it makes grepping
        for error messages harder when debugging. This is also the de facto
        standard in upstream code.
      
      * Convert a few internal error ereport() calls to elog(). There are
        no doubt more that can be converted, but the low hanging fruit has
        been dealt with. Also convert a few elog() calls which are user
        facing to ereport().
      
      * Update the testfiles to match the new messages.
      
      Spelling and wording is mostly left for a follow-up commit, as this was
      getting big enough as it was. The most obvious cases have been handled
      but there is work left to be done here.
      
      Discussion: https://github.com/greenplum-db/gpdb/pull/6378
      Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      56540f11
  11. 03 Dec 2018, 2 commits
    • Change representation of hash filter in Result from List to array. · 302a2aa8
      Committed by Heikki Linnakangas
      For consistency: this is how we represent column indexes e.g. in Sort,
      Unique, MergeAppend and many other plan types.
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      302a2aa8
    • Stop abusing Result's hash filter for running a plan on arbitrary segment. · 3c89b2b4
      Committed by Heikki Linnakangas
      ORCA generated plans where the "hash filter" in the Result node was set
      to an empty set of columns. That meant "discard all the rows, on all
      segments, except one segment". This is used at least with set-returning
      functions, where we don't care where the function is executed, but it only
      needs to be executed once. (The planner creates a one-to-many Redistribute
      Motion plan in that scenario, which makes a lot more sense to me, but
      doing the same in ORCA would require more invasive surgery than what I'm
      capable of.)
      
      Instead of executing the subplan and throwing away the result one row at
      a time, use a Result plan with a One-Off Filter. That's more efficient.
      Also, it allows removing the Result.hashFilter boolean flag, because the
      weird case of a hashFilter with zero columns is gone. You can check
      "hashList != NIL" directly now.
      
      The old method would always choose the same segment, which seems bad for
      load distribution. The way it was chosen seemed totally accidental too:
      we initialized the cdbhash object to the initial constant value, and
      then reduced that into the target segment number, using the jump
      consistent hash algorithm. We computed that for every row, but the result
      was always the same. On a three-node cluster, the target was always
      segment 1. Now, we pick a segment at random when generating the plan.
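      
      As a concrete, illustrative example of the kind of query affected:
      
      ```sql
      -- A set-returning function with no table access only needs to run once
      -- somewhere in the cluster; under ORCA this is now planned as a Result
      -- with a One-Off Filter on a randomly chosen segment, instead of a
      -- zero-column hash filter that always picked the same segment.
      SELECT * FROM generate_series(1, 3);
      ```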
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      3c89b2b4
  12. 27 Nov 2018, 1 commit
  13. 23 Nov 2018, 3 commits
    • Remove unnecessary #includes. · f30355fa
      Committed by Heikki Linnakangas
      f30355fa
    • 3e18f878
    • Fix a bug with replicated tables · 0e461e16
      Committed by Pengzhou Tang
      Previously, when creating a join path between a CdbLocusType_SingleQE
      path and a CdbLocusType_SegmentGeneral path, we always added a motion on
      top of the CdbLocusType_SegmentGeneral path, so that even if the join
      path is promoted to execute on the QD, the CdbLocusType_SegmentGeneral
      path can still be executed on the segments:
      
                     join (CdbLocusType_SingleQE)
                       /                \
                      /                  \
          CdbLocusType_SingleQE     Gather Motion
                                          \
                                    CdbLocusType_SegmentGeneral
      
      For example, joining replicated_table with the subquery
      (select * from partitioned_table limit 1) as t1:
      
      Nested Loop
        ->  Gather Motion 1:1
              ->  Seq Scan on replicated_table
        ->  Materialize
              ->  Subquery Scan on t1
                    ->  Limit
                          ->  Gather Motion 3:1
                                ->  Limit
                                      ->  Seq Scan on partitioned_table
      
      replicated_table only stores tuples on the segments, so without the
      gather motion the seq scan of replicated_table would not provide any
      tuples when the join is executed on the QD.
      
      There is another problem: if the join path is not promoted to the QD,
      the gather motion might be redundant. For example, with the join above
      wrapped as a subquery:
      
        (select * from replicated_table, (select * from
        partitioned_table limit 1) t1) sub1;
      
      Gather Motion 3:1
        ->  Nested Loop
              ->  Seq Scan on partitioned_table_2
              ->  Materialize
                    ->  Broadcast Motion 1:3
                          ->  Nested Loop
                                ->  Gather Motion 1:1 (redundant motion)
                                      ->  Seq Scan on replicated_table
                                ->  Materialize
                                      ->  Subquery Scan on t1
                                            ->  Limit
                                                  ->  Gather Motion 3:1
                                                        ->  Limit
                                                              ->  Seq Scan on partitioned_table
      
      So in apply_motion_mutator() we omit such a redundant motion if it is
      not gathered to the top slice (QD). sliceDepth == 0 means the top slice;
      however, sliceDepth is shared by both init plans and the main plan, so
      if the main plan has increased sliceDepth, an init plan may omit its
      gather motion unexpectedly, which produces wrong results.
      
      The fix is simply to reset sliceDepth for init plans.
      0e461e16
  14. 22 Nov 2018, 3 commits
    • Fix confusion with distribution keys of queries with FULL JOINs. · a25e2cd6
      Committed by Heikki Linnakangas
      There was some confusion on how NULLs are distributed, when CdbPathLocus
      is of Hashed or HashedOJ type. The comment in cdbpathlocus.h suggested
      that NULLs can be on any segment. But the rest of the code assumed that
      that's true only for HashedOJ, and that for Hashed, all NULLs are stored
      on a particular segment. There was a comment in cdbgroup.c that said "Or
      would HashedOJ ok, too?"; the answer to that is "No!". Given the comment
      in cdbpathlocus.h, I'm not surprised that the author was not very sure
      about that. Clarify the comments in cdbpathlocus.h and cdbgroup.c on that.
      
      There were a few cases where we got that actively wrong. repartitionPlan()
      function is used to inject a Redistribute Motion into queries used for
      CREATE TABLE AS and INSERT, if the "current" locus didn't match the target
      table's policy. It did not check for HashedOJ. Because of that, if the
      query contained FULL JOINs, NULL values might end up on all segments. Code
      elsewhere, particularly in cdbgroup.c, assumes that all NULLs in a table
      are stored on a single segment, identified by the cdbhash value of a NULL
      datum. Fix that, by adding a check for HashedOJ in repartitionPlan(), and
      forcing a Redistribute Motion.
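      
      A sketch of the kind of statement affected (table names made up for
      illustration):
      
      ```sql
      CREATE TABLE fj_a (id int) DISTRIBUTED BY (id);
      CREATE TABLE fj_b (id int) DISTRIBUTED BY (id);
      
      -- The FULL JOIN result has a HashedOJ locus: NULL-extended rows may end
      -- up on any segment, so inserting it into a hash-distributed table must
      -- add a Redistribute Motion rather than assume the Hashed convention
      -- that all NULLs live on one particular segment.
      CREATE TABLE fj_out AS
      SELECT a.id AS a_id, b.id AS b_id
      FROM fj_a a FULL JOIN fj_b b USING (id);
      ```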
      
      CREATE TABLE AS had a similar problem, in the code to decide which
      distribution key to use, if the user didn't specify DISTRIBUTED BY
      explicitly. The default behaviour is to choose a distribution key that
      matches the distribution of the query, so that we can avoid adding an
      extra Redistribute Motion. After fixing repartitionPlan, there was no
      correctness problem, but if we chose the key based on a HashedOJ locus,
      there is no performance benefit because we'd need a Redistribute Motion
      anyway. So modify the code that chooses the CTAS distribution key to
      ignore HashedOJ.
      
      While we're at it, refactor the code to choose the CTAS distribution key,
      by moving it to a separate function. It had become ridiculously deeply
      indented.
      
      Fixes https://github.com/greenplum-db/gpdb/issues/6154, and adds tests.
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      a25e2cd6
    • Cosmetic fixes in the code to determine distribution key for CTAS. · a5fa3110
      Committed by Heikki Linnakangas
      Fix indentation. In the code to generate a NOTICE, remove an if() for a
      condition that we had already checked earlier in the function, and use a
      StringInfo for building the string.
      a5fa3110
    • New extension to debug partially distributed tables · 3119009a
      Committed by Ning Yu
      Introduced a new debugging extension gp_debug_numsegments to get / set
      the default numsegments when creating tables.
      
      gp_debug_get_create_table_default_numsegments() gets the default
      numsegments.
      
      gp_debug_set_create_table_default_numsegments(text) sets the default
      numsegments in text format; valid values are:
      - 'FULL': all the segments;
      - 'RANDOM': pick a random set of segments each time;
      - 'MINIMAL': the minimal set of segments;
      
      gp_debug_set_create_table_default_numsegments(integer) sets the default
      numsegments directly; the valid range is [1, gp_num_contents_in_cluster].
      
      gp_debug_reset_create_table_default_numsegments(text) or
      gp_debug_reset_create_table_default_numsegments(integer) resets the
      default numsegments to the specified value, and that value can be
      reused later.
      
      gp_debug_reset_create_table_default_numsegments() resets the default
      numsegments to the value passed last time; if there has been no previous
      call, the value is 'FULL'.
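      
      An illustrative usage sketch, using the function names listed above:
      
      ```sql
      CREATE EXTENSION gp_debug_numsegments;
      
      SELECT gp_debug_get_create_table_default_numsegments();
      SELECT gp_debug_set_create_table_default_numsegments('MINIMAL');
      CREATE TABLE t_partial (c1 int) DISTRIBUTED BY (c1); -- created on the minimal segment set
      SELECT gp_debug_reset_create_table_default_numsegments();
      ```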
      
      Refactored ICG test partial_table.sql to create partial tables with this
      extension.
      3119009a
  15. 13 Nov 2018, 1 commit
    • Support 'copy (select statement) to file on segment' (#6077) · bad6cebc
      Committed by Jinbao Chen
      In 'copy (select statement) to file', we generate a query plan, set its
      dest receiver to copy_dest_receiver, and run the dest receiver on the QD.
      In 'copy (select statement) to file on segment', we modify the query
      plan, delete the Gather Motion, and let the dest receiver run on the QEs.
      
      Change 'isCtas' in Query to 'parentStmtType' to be able to mark the type
      of the upper utility statement. Add a CopyIntoClause node to store the
      copy information, and add copyIntoClause to PlannedStmt.
      
      In PostgreSQL we don't need to make a different query plan for a query
      contained in a utility statement, but in Greenplum we do. So we use a
      field to indicate whether the query is contained in a utility statement,
      and the type of that utility statement.
      
      The behavior of 'copy (select statement) to file on segment' is actually
      very similar to 'SELECT ... INTO ...' and 'CREATE TABLE ... AS SELECT ...'.
      We use the distribution policy inherent in the query result as the final
      data distribution policy; if there is none, we use the first column in
      the target list as the key and redistribute. The only difference is that
      we use 'copy_dest_receiver' instead of 'intorel_dest_receiver'.
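      
      A sketch of the new syntax (the path is a placeholder; with ON SEGMENT
      each QE writes its own output file, and in GPDB the file name
      conventionally embeds a <SEGID> token):
      
      ```sql
      -- The dest receiver runs on the QEs instead of the QD.
      COPY (SELECT c1, c2 FROM sales WHERE c2 > 0)
      TO '/tmp/sales_<SEGID>.out' ON SEGMENT;
      ```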
      bad6cebc
  16. 07 Nov 2018, 1 commit
    • Adjust GANG size according to numsegments · 6dd2759a
      Committed by ZhangJackey
      Now we have partial tables and a flexible gang API, so we can allocate
      gangs according to numsegments.
      
      With commit 4eb65a53, GPDB supports tables distributed on a subset of
      segments, and with the series of commits (a3ddac06, 576690f2) GPDB
      supports a flexible gang API. Now is a good time to combine both new
      features: the goal is to create gangs only on the necessary segments for
      each slice. This commit also improves singleQE gang scheduling and does
      some code cleanup. However, if ORCA is enabled, the behavior is the same
      as before.
      
      The outline of this commit is:
      
        * Modify the FillSliceGangInfo API so that gang_size is truly flexible.
        * Remove the numOutputSegs and outputSegIdx fields in the motion node.
           Add a new field isBroadcast to mark whether the motion is a
           broadcast motion.
        * Remove the global variable gp_singleton_segindex and make the
           singleQE segment_id random (derived from gp_sess_id).
        * Remove the field numGangMembersToBeActive in Slice because it is now
           exactly slice->gangsize.
        * Modify the message printed if the GUC Test_print_direct_dispatch_info
           is set.
        * An explicit BEGIN now creates a full gang.
        * Format code and remove destSegIndex.
        * The isReshuffle flag in ModifyTable is useless, because it is only
           used when we want to insert a tuple into a segment that is outside
           the range of numsegments.
      
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
      6dd2759a
  17. 06 Nov 2018, 1 commit
  18. 29 Oct 2018, 2 commits
    • Remove memory context argument from GpPolicyFetch and friends. · 6d17d31f
      Committed by Heikki Linnakangas
      Most callers were passing CurrentMemoryContext, so this makes most callers
      slightly simpler. The few places that needed to pass a different context
      now switch to the correct one before calling the GpPolicy*() function.
      Reviewed-by: Daniel Gustafsson <dgustafsson@pivotal.io>
      6d17d31f
    • Allow reshuffling tables with update triggers · 15ee1437
      Committed by Pengzhou Tang
      Previously, when updating a table with update triggers on its
      distribution column, GPDB reported an error like "ERROR: UPDATE on
      distributed key column not allowed on relation with update triggers".
      The current GPDB executor does not support statement-level update
      triggers and also skips row-level update triggers, because a split
      update actually consists of a delete and an insert; so if the result
      relation has update triggers, GPDB rejects the command and errors out,
      since the triggers would not be functional.
      
      There is an exception for 'ALTER TABLE SET WITH (RESHUFFLE)': RESHUFFLE
      also uses a split-update node internally to rebalance/expand the table.
      However, from the user's point of view ALTER TABLE should not fire any
      kind of trigger, so we don't need to error out the way the UPDATE
      command does.
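      
      A sketch of the distinction (table and column names are illustrative):
      
      ```sql
      -- Still rejected: the relation has an update trigger and the UPDATE
      -- touches the distribution key, which would require a split update.
      UPDATE t_with_trigger SET dist_col = dist_col + 1;
      
      -- Allowed after this commit: RESHUFFLE uses a split update internally,
      -- but ALTER TABLE is not expected to fire any triggers.
      ALTER TABLE t_with_trigger SET WITH (RESHUFFLE);
      ```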
      15ee1437
  19. 23 Oct 2018, 1 commit
    • Table data should be reshuffled to new segments · f4f4bdcc
      Committed by ZhangJackey
      Each table has a `numsegments` attribute in the
      GP_DISTRIBUTION_POLICY table; it indicates that the table's
      data is distributed on the first N segments. In the common case,
      `numsegments` equals the total segment count of the cluster.
      
      When we add new segments to the cluster, `numsegments` no
      longer equals the actual segment count in the cluster, so we
      need to reshuffle the table data to all segments in two steps:
      
        * Reshuffle the table data to all segments
        * Update `numsegments`
      
      It is easy to update `numsegments`, so we focus on how to
      reshuffle the table data. There are three types of tables in
      Greenplum, and they are reshuffled in different ways.
      For a hash-distributed table, we reshuffle data based on an
      UPDATE statement. Updating the hash keys of the table
      will generate a plan like:
      
      	Update
      		->Redistribute Motion
      			->SplitUpdate
      				->SeqScan
      We cannot use this plan to reshuffle table data directly.
      The problem is that we need to know the segment count
      when the Motion node computes the destination segment. When
      we compute the destination segment of the deleted tuple, we
      need the old segment count, which equals `numsegments`;
      on the other hand, we need to use the new segment count to
      compute the destination segment for the inserted tuple.
      So we add a new operator, Reshuffle, to compute the
      destination segment. It records O and N (O is the count
      of old segments and N is the count of new segments), and
      the plan is adjusted like this:
      
      	Update
      		->Explicit Motion
      			->Reshuffle
      				->SplitUpdate
      					->SeqScan
      
      It can compute the destination segments directly with O and
      N. At the same time we change the Motion type to Explicit,
      so it can send a tuple to the destination segment which we
      computed in the Reshuffle node.
      
      With the hash method changed to the jump consistent hash, not
      all the table data needs to be reshuffled, so we add a new
      ReshuffleExpr to filter the tuples that need to be
      reshuffled. This expression computes the destination
      segment ahead of time; if the destination segment is the
      current segment, the tuple does not need to be reshuffled.
      With the ReshuffleExpr the plan is adjusted like this:
      
      	Update
      		->Explicit Motion
      			->Reshuffle
      				->SplitUpdate
      					->SeqScan
      						|-ReshuffleExpr
      
      When we want to reshuffle a table, we use the SQL `ALTER
      TABLE xxx SET WITH (RESHUFFLE)`. Internally it generates a
      new UpdateStmt parse tree, similar to the parse tree
      generated by the SQL `UPDATE xxx SET
      xxx.aaa = COALESCE(xxx.aaa...) WHERE ReshuffleExpr`. We set
      a reshuffle flag in the UpdateStmt, so we can distinguish
      a common update from the reshuffling.
      
      In conclusion, we reshuffle a hash-distributed table with the
      Reshuffle node and the ReshuffleExpr: the ReshuffleExpr filters
      the tuples that need to be reshuffled and the Reshuffle node
      does the real reshuffling work. We can use the same framework
      to implement reshuffling of randomly distributed tables and
      replicated tables.
      
      For a randomly distributed table there are no hash keys; each
      old segment needs to reshuffle (N - O) / N of its data to the
      new segments. In the ReshuffleExpr we generate a random value
      in [0, N); if the value falls in [O, N), the tuple needs to be
      reshuffled, so the SeqScan node returns this tuple to the
      Reshuffle node. The Reshuffle node then generates a random
      value in [O, N), which determines which new segment the tuple
      is inserted into.
      
      For a replicated table, the table data is the same on all the old
      segments, so there is no need to delete any tuples; we only
      need to copy the tuples that are on the old segments to the new
      segments. Therefore the ReshuffleExpr does not filter any tuples,
      and in the Reshuffle node we discard the tuple that is generated
      for deleting and only return the inserting tuple to the motion.
      Let me illustrate this with an example:
      
      If there are 3 old segments in the cluster and we add 4 new
      segments, the segment IDs of the old segments are (0,1,2) and
      the segment IDs of the new segments are (3,4,5,6). When
      reshuffling the replicated table, seg#0 is responsible for
      copying data to seg#3 and seg#6, seg#1 is responsible for
      copying data to seg#4, and seg#2 is responsible for copying
      data to seg#5.
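      
      Putting it together, a minimal usage sketch after expanding the
      cluster (the table name is illustrative):
      
      ```sql
      -- Reshuffle one table's data onto the newly added segments and update
      -- its numsegments, using the machinery described above.
      ALTER TABLE sales SET WITH (RESHUFFLE);
      
      -- numsegments should now match the cluster's segment count.
      SELECT localoid::regclass, numsegments
      FROM gp_distribution_policy
      WHERE localoid = 'sales'::regclass;
      ```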
      
      
      Co-authored-by: Ning Yu <nyu@pivotal.io>
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      f4f4bdcc
  20. 20 Oct 2018, 1 commit
    • Refactor the way Split Update nodes are constructed in the planner. · faf0ec6b
      Committed by Heikki Linnakangas
      One specialty in a Split Update is that the node needs the *old* values
      for all the distribution key columns, to compute the distribution hash for
      each old row, so that they can be deleted. That was previously handled at
      the time when the SplitUpdate node was created, by adding any missing
      Vars for the old values to the subplan's target list, pushing them down
      through joins and any other plan nodes, all the way down to the Scan
      node for that relation. That seemed complicated and fragile.
      
      The reason to tackle this right now is that we were seeing failures related
      to this, while working on the PostgreSQL 9.4 merge. It added a test case,
      where a Split Update was done through a security barrier view. The security
      barrier view added a SubqueryScan to the plan tree, and the mechanism to
      push through the old attributes couldn't cope with that. I'm sure we
      could've hacked that to make it work, but this refactoring seems like a
      better long term fix.
      
      This patch makes it the responsibility of preprocess_targetlist(), to ensure
      that the old values are made available to the top of the tree, if a Split
      Update is needed. preprocess_targetlist() seems like the appropriate place,
      because it already does that for columns that are not modified by the
      UPDATE.
      
      Now that we are making the decision on whether to do a split update in
      preprocess_targetlist() already, add a flag to PlannerInfo to remember that
      decision, until the point where the ModifyTable node is added to the top of
      the plan tree.
      
      Also add a test case, for an inherited table where some children have a
      different distribution key, and an UPDATE on some of the children require a
      Split Update, and others don't. That was causing me trouble at one point
      during the development, and I'm not sure if there was any existing test to
      cover that.
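      
      A sketch of the inheritance scenario mentioned above (names are
      illustrative; the actual test may differ):
      
      ```sql
      CREATE TABLE upd_parent (a int, b int) DISTRIBUTED BY (a);
      CREATE TABLE upd_child () INHERITS (upd_parent) DISTRIBUTED BY (b);
      
      -- For upd_child the UPDATE modifies its distribution key (b), so a
      -- Split Update is needed for that child; for upd_parent it is not.
      UPDATE upd_parent SET b = b + 1;
      ```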
      faf0ec6b
  21. 28 Sep 2018, 1 commit
    • Allow tables to be distributed on a subset of segments · 4eb65a53
      Committed by ZhangJackey
      There was an assumption in GPDB that a table's data is always
      distributed on all segments; however, this is not always true. For
      example, when a cluster is expanded from M segments to N (N > M), all
      the tables are still on M segments. To work around the problem we used
      to have to alter all the hash-distributed tables to randomly distributed
      to get correct query results, at the cost of bad performance.
      
      Now we support table data to be distributed on a subset of segments.
      
      A new column `numsegments` is added to the catalog table
      `gp_distribution_policy` to record how many segments a table's data is
      distributed on.  By doing so we can allow DML on M-segment tables;
      joins between M-segment and N-segment tables are also supported.
      
      ```sql
      -- t1 and t2 are both distributed on (c1, c2),
      -- one on 1 segment, the other on 2 segments
      select localoid::regclass, attrnums, policytype, numsegments
          from gp_distribution_policy;
       localoid | attrnums | policytype | numsegments
      ----------+----------+------------+-------------
       t1       | {1,2}    | p          |           1
       t2       | {1,2}    | p          |           2
      (2 rows)
      
      -- t1 and t1 have exactly the same distribution policy,
      -- join locally
      explain select * from t1 a join t1 b using (c1, c2);
                         QUERY PLAN
      ------------------------------------------------
       Gather Motion 1:1  (slice1; segments: 1)
         ->  Hash Join
               Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
               ->  Seq Scan on t1 a
               ->  Hash
                     ->  Seq Scan on t1 b
       Optimizer: legacy query optimizer
      
      -- t1 and t2 are both distributed on (c1, c2),
      -- but as they have different numsegments,
      -- one has to be redistributed
      explain select * from t1 a join t2 b using (c1, c2);
                                QUERY PLAN
      ------------------------------------------------------------------
       Gather Motion 1:1  (slice2; segments: 1)
         ->  Hash Join
               Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
               ->  Seq Scan on t1 a
               ->  Hash
                     ->  Redistribute Motion 2:1  (slice1; segments: 2)
                           Hash Key: b.c1, b.c2
                           ->  Seq Scan on t2 b
       Optimizer: legacy query optimizer
      ```
      4eb65a53
  22. 27 Sep 2018, 1 commit
  23. 23 Sep 2018, 1 commit
  24. 19 Sep 2018, 1 commit
    • Fix "could not find pathkey item to sort" error with MergeAppend plans. · 1722adb8
      Committed by Heikki Linnakangas
      When building a Sort node to represent the ordering that is preserved
      by a Motion node, in make_motion(), the call to make_sort_from_pathkeys()
      would sometimes fail with "could not find pathkey item to sort". This
      happened when the ordering was over a UNION ALL operation. When building
      Motion nodes for MergeAppend subpaths, the path keys that represented the
      ordering referred to the items in the append rel's target list, not the
      subpaths. In create_merge_append_plan(), where we do a similar thing for
      each subpath, we correctly passed the 'relids' argument to
      prepare_sort_from_pathkeys(), so that prepare_sort_from_pathkeys() can
      match the target list entries of the append relation with the entries of
      the subpaths. But when creating the Motion nodes for each subpath, we
      were passing NULL as 'relids' (via make_sort_from_pathkeys()).
      
      At a high level, the fix is straightforward: we need to pass the correct
      'relids' argument to prepare_sort_from_pathkeys(), in
      cdbpathtoplan_create_motion_plan(). However, the current code structure
      makes that not so straightforward, so this required some refactoring of
      the make_motion() and related functions:
      
      Previously, make_motion() and make_sorted_union_motion() would take a path
      key list as argument, to represent the ordering, and it called
      make_sort_from_pathkeys() to extract the sort columns, operators etc.
      After this patch, those functions take arrays of sort columns, operators,
      etc. directly as arguments, and the caller is expected to do the call to
      make_sort_from_pathkeys() to get them, or build them through some other
      means. In cdbpathtoplan_create_motion_plan(), call
      prepare_sort_from_pathkeys() directly, rather than the
      make_sort_from_pathkeys() wrapper, so that we can pass the 'relids'
      argument. Because prepare_sort_from_pathkeys() is marked as 'static', move
      cdbpathtoplan_create_motion_plan() from cdbpathtoplan.c to createplan.c,
      so that it can call it.
      
      Add test case. It's a slightly reduced version of a query that we already
      had in 'olap_group' test, but seems better to be explicit. Revert the
      change in expected output of 'olap_group', made in commit 28087f4e,
      which memorized the error in the expected output.
      
      Fixes https://github.com/greenplum-db/gpdb/issues/5695.
      Reviewed-by: Pengzhou Tang <ptang@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      1722adb8
  25. 18 Sep 2018, 1 commit
  26. 15 Sep 2018, 1 commit
  27. 10 Sep 2018, 1 commit
  28. 03 Sep 2018, 1 commit
  29. 21 Aug 2018, 1 commit
    • Do not create split update for relations excluded by constraints · 9b8dd4f4
      Committed by Taylor Vesely
      When the query_planner determines that a relation does not need
      scanning due to constraint exclusion, it will create a 'dummy' plan for
      that relation. When we plan a split update, it does not understand this
      'dummy' plan shape and will fail with an assertion.
      
      Instead, because an excluded relation will never return tuples, do not
      attempt to create a split update for it at all.
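      
      An illustrative sketch of the situation (assuming constraint exclusion
      applies to the relation):
      
      ```sql
      CREATE TABLE excl_t (a int, b int CHECK (b < 0)) DISTRIBUTED BY (b);
      
      -- The WHERE clause contradicts the CHECK constraint, so the planner
      -- treats the relation as a dummy rel; no split update is built for it
      -- and the statement simply updates zero rows instead of asserting.
      UPDATE excl_t SET b = b + 1 WHERE b > 10;
      ```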
      9b8dd4f4
  30. 03 Aug 2018, 1 commit
  31. 02 Aug 2018, 1 commit
    • Merge with PostgreSQL 9.2beta2. · 4750e1b6
      Committed by Richard Guo
      This is the final batch of commits from PostgreSQL 9.2 development,
      up to the point where the REL9_2_STABLE branch was created, and 9.3
      development started on the PostgreSQL master branch.
      
      Notable upstream changes:
      
      * Index-only scan was included in the batch of upstream commits. It
        allows queries to retrieve data only from indexes, avoiding heap access.
      
      * Group commit was added to work effectively under heavy load. Previously,
        batching of commits became ineffective as the write workload increased,
        because of internal lock contention.
      
      * A new fast-path lock mechanism was added to reduce the overhead of
        taking and releasing certain types of locks which are taken and released
        very frequently but rarely conflict.
      
      * The new "parameterized path" mechanism was added. It allows inner index
        scans to use values from relations that are more than one join level up
        from the scan. This can greatly improve performance in situations where
        semantic restrictions (such as outer joins) limit the allowed join orderings.
      
      * SP-GiST (Space-Partitioned GiST) index access method was added to support
        unbalanced partitioned search structures. For suitable problems, SP-GiST can
        be faster than GiST in both index build time and search time.
      
      * Checkpoints now are performed by a dedicated background process. Formerly
        the background writer did both dirty-page writing and checkpointing. Separating
        this into two processes allows each goal to be accomplished more predictably.
      
      * Custom plan was supported for specific parameter values even when using
        prepared statements.
      
      * API for FDW was improved to provide multiple access "paths" for their tables,
        allowing more flexibility in join planning.
      
      * The security_barrier option was added for views to prevent optimizations
        that might allow view-protected data to be exposed to users.
      
      * Range data type was added to store a lower and upper bound belonging to its
        base data type.
      
      * CTAS (CREATE TABLE AS / SELECT INTO) is now treated as a utility statement.
        The SELECT query is planned during the execution of the utility. To conform
        to this change, GPDB executes the utility statement only on the QD and
        dispatches the plan of the SELECT query to the QEs.
      Co-authored-by: Adam Lee <ali@pivotal.io>
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
      Co-authored-by: Asim R P <apraveen@pivotal.io>
      Co-authored-by: Daniel Gustafsson <dgustafsson@pivotal.io>
      Co-authored-by: Gang Xiong <gxiong@pivotal.io>
      Co-authored-by: Haozhou Wang <hawang@pivotal.io>
      Co-authored-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
      Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
      Co-authored-by: Jinbao Chen <jinchen@pivotal.io>
      Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
      Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
      Co-authored-by: Paul Guo <paulguo@gmail.com>
      Co-authored-by: Richard Guo <guofenglinux@gmail.com>
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
      Co-authored-by: Zhenghua Lyu <zlv@pivotal.io>
      4750e1b6
  32. 23 Jul 2018, 1 commit
    • Enable update on distribution column in legacy planner. · 6be0a32a
      Committed by Zhenghua Lyu
      Previously, we could not update a distribution column with the legacy
      planner, because the OLD tuple and the NEW tuple may belong to different
      segments. We enable this by borrowing ORCA's logic, namely splitting each
      update operation into a delete and an insert (sketched below). The delete
      is hashed by the OLD tuple's attributes, and the insert is hashed by the
      NEW tuple's attributes. This change includes the following items:
      * We need to push missing OLD attributes down to the subplan tree so that
        they can be passed up to the top Motion.
      * In addition, if the result relation has OIDs, we also need to put the
        oid in the targetlist.
      * If the result relation is partitioned, we need special treatment
        because resultRelations contains the partition tables instead of the
        root table, as is also true for a normal Insert.
      * Special treatment for update triggers, because triggers cannot be
        executed across segments.
      * Special treatment in nodeModifyTable, so that it can process the
        Insert/Delete pair that implements the update.
      * Proper initialization of SplitUpdate.
      
      There are still TODOs:
      * We don't handle cost gracefully, because we add the SplitUpdate node
        after the plan is generated. A FIXME has been added for this.
      * For deletion, we could optimize by sending just the distribution
        columns instead of all columns.
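      
      A minimal SQL sketch of the user-visible change (the table name is
      illustrative):
      
      ```sql
      CREATE TABLE accounts (id int, region int) DISTRIBUTED BY (region);
      
      -- Previously rejected by the legacy planner; now planned as a
      -- SplitUpdate (delete hashed by the OLD region, insert hashed by the
      -- NEW region) underneath a motion that routes each half to the right
      -- segment.
      UPDATE accounts SET region = region + 1 WHERE id = 42;
      ```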
      
      
      Author: Xiaoran Wang <xiwang@pivotal.io>
      Author: Max Yang <myang@pivotal.io>
      Author: Shujie Zhang <shzhang@pivotal.io>
      Author: Zhenghua Lyu <zlv@pivotal.io>
      6be0a32a
  33. 11 Jul 2018, 1 commit
  34. 29 May 2018, 1 commit
    • Support RETURNING for replicated tables. · fb7247b9
      Committed by Ning Yu
      * rpt: reorganize data when ALTER from/to replicated.
      
      There was a bug where altering a table from/to replicated had no effect;
      the root cause is that we neither changed gp_distribution_policy nor
      reorganized the data.
      
      Now we perform the data reorganization by creating a temp table with the
      new distribution policy and transferring all the data to it.
      
      * rpt: support RETURNING for replicated tables.
      
      This is to support the syntax below (suppose foo is a replicated table):
      
      	INSERT INTO foo VALUES(1) RETURNING *;
      	UPDATE foo SET c2=c2+1 RETURNING *;
      	DELETE FROM foo RETURNING *;
      
      A new motion type, EXPLICIT GATHER MOTION, is introduced in the EXPLAIN
      output; in this motion type data is received from one explicit sender.
      
      * rpt: fix motion type under explicit gather motion.
      
      Consider below query:
      
      	INSERT INTO foo SELECT f1+10, f2, f3+99 FROM foo
      	  RETURNING *, f1+112 IN (SELECT q1 FROM int8_tbl) AS subplan;
      
      We used to generate a plan like this:
      
      	Explicit Gather Motion 3:1  (slice2; segments: 3)
      	  ->  Insert
      	        ->  Seq Scan on foo
      	        SubPlan 1  (slice2; segments: 3)
      	          ->  Gather Motion 3:1  (slice1; segments: 1)
      	                ->  Seq Scan on int8_tbl
      
      A gather motion is used for the subplan, which is wrong and will cause a
      runtime error.
      
      A correct plan is like below:
      
      	Explicit Gather Motion 3:1  (slice2; segments: 3)
      	  ->  Insert
      	        ->  Seq Scan on foo
      	        SubPlan 1  (slice2; segments: 3)
      	          ->  Materialize
      	                ->  Broadcast Motion 3:3  (slice1; segments: 3)
      	                      ->  Seq Scan on int8_tbl
      
      * rpt: add test case with both PRIMARY KEY and UNIQUE.
      
      On a replicated table we can set both PRIMARY KEY and UNIQUE
      constraints; test cases are added to protect this feature during future
      development.
      
      (cherry picked from commit 72af4af8)
      fb7247b9