1. 24 May 2019, 2 commits
    • Control operator memory by absolute values in resource group mode (#7743) · f67c4298
      Committed by Hubert Zhang
      The old memory_spill_ratio only supported a percentage format,
      which made it hard for users to configure operator (join, sort, etc.)
      memory. Add an absolute-value format so that memory_spill_ratio
      behaves much like the GUC statement_mem in resource queue mode.
      
      Also change the default values for resource groups, e.g. use 128MB
      as memory_spill_ratio, to make the migration from resource queue
      to resource group smoother.
      
      Also remove the proposed column from pg_resgroupcapability.
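      
      A hedged sketch of the two formats (rg_sample is a hypothetical group name, and the
      quoted-size syntax for the absolute form is an assumption based on this description,
      not verified syntax):
      
        -- percentage format, relative to the group's memory quota
        ALTER RESOURCE GROUP rg_sample SET MEMORY_SPILL_RATIO 20;
        -- absolute-value format, similar in spirit to statement_mem
        ALTER RESOURCE GROUP rg_sample SET MEMORY_SPILL_RATIO '128 MB';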
      Co-authored-by: Ning Yu <nyu@pivotal.io>
      Co-authored-by: Hubert Zhang <hzhang@pivotal.io>
      Reviewed-by: Zhenghua Lyu <kainwen@gmail.com>
      f67c4298
    • Fix locking clause issue · 6ebce733
      Committed by Zhenghua Lyu
      This commit corrects the behaviour of SELECT statements with a locking clause and
      adds an optimization for a very simple case.
      
      There are four kinds of locking clause:
        1. for update
        2. for no key update
        3. for share
        4. for key share
      
      The key steps to implement the locking clause semantics in Postgres are:
         1. Lock the table in RowShareLock mode during the parsing stage (this is the same
             for each type of locking clause)
         2. Generate a LockRows node in the plan
         3. While executing the LockRows node, lock each tuple coming from the plan nodes below
      
      In Greenplum, things get more complicated.
      
      If the Global Deadlock Detector is disabled, we cannot simply lock tuples on segments without
      holding a high-level lock on the QD, because this may lead to global deadlocks. Even with the
      Global Deadlock Detector enabled, tuples may have been motioned across segments in the MPP
      environment, and locking remote tuples is not currently possible.
      
      But for the very simple case that involves only one table, we can behave like upstream when
      the Global Deadlock Detector is enabled, and almost every SELECT ... FOR xxx query in an OLTP
      scenario is such a simple case. For these cases we behave just like Postgres: hold RowShareLock
      on the range table and lock the tuples during execution. This improves OLTP concurrency
      by not locking the table in Exclusive mode.
      
      In summary:
          * With GDD disabled, Greenplum locks the entire table in ExclusiveLock mode for a select statement
             with a locking clause (for update|for no key update|for share|for key share) so that it conflicts with DML
          * With GDD enabled, the behaviour is the same as above, except for the very simple cases
             described below (for these simple cases, hold RowShareLock on the table and generate a LockRows plan):
      
      A simple case must satisfy all of the following conditions (a minimal example follows the list):
          1. GDD is enabled
          2. Top-level select statement
          3. No set operations (union, intersect, ...)
          4. The FROM list contains one and only one RangeVar (and it is not a view)
          5. No sublinks or subqueries
          6. Has a locking clause
          7. The table in the locking clause is a heap table (AO tables cannot lock tuples yet)
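      
      A minimal illustration of such a simple case with GDD enabled (table name t is hypothetical):
      
        BEGIN;
        SELECT * FROM t WHERE id = 1 FOR UPDATE;  -- RowShareLock on t; a LockRows node locks the tuple
        END;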
      6ebce733
  2. 22 May 2019, 2 commits
    • Test that commits block when standby has caught up close enough · 46c74111
      Committed by Asim R P
      Here, "close enough" is defined in terms of a number of WAL segments, using the GUC
      repl_catchup_within_range.  The test validates this Greenplum-specific
      addition, which applies only to master/standby WAL replication.
      
      Reviewed by: Jimmy Yih <jyih@pivotal.io>
      46c74111
    • Test that commits block until standby flushes commit LSN · a9414861
      Committed by Asim R P
      Master/standby WAL replication differs from primary/mirror mainly in
      that there is no third party, such as FTS, to monitor the
      replication state.  This test validates that commits on master are
      blocked until the standby confirms that WAL up to the commit LSN has been
      received and flushed.  The test injects a fault on the standby.
      
      Reviewed by: Jimmy Yih <jyih@pivotal.io>
      a9414861
  3. 21 May 2019, 5 commits
    • Wait for first fault before injecting second · a42e8b91
      Committed by Asim R P
      The dtm_recovery_on_standby test starts two sessions.  Both
      sessions are expected to hit different faults.  The test used to start the
      first session, inject faults for the second session, and wait for all
      the faults to be hit.  This led to spurious failures in CI because the
      first session would incorrectly hit faults intended for the second
      session.  This commit fixes that: the test now waits for the
      first session to hit the right fault before injecting faults for the
      second session.
      a42e8b91
    • Revert "Fix the compile issue and recommit 0630a9c7" · cc09cb57
      Committed by Gang Xiong
      This reverts commit d97a7f6c.
      Some behave tests failed.
      cc09cb57
    • Fix the compile issue and recommit 0630a9c7 · d97a7f6c
      Committed by Gang Xiong
      d97a7f6c
    • Revert "Optimize explicit transactions" · 17c0e455
      Committed by Gang Xiong
      This reverts commit 0630a9c7.
      17c0e455
    • Optimize explicit transactions · 0630a9c7
      Committed by xiong-gang
      Currently, an explicit 'BEGIN' creates a full-size writer gang and starts a transaction
      on it, and the following 'END' commits the transaction in a two-phase way. This can be
      optimized for some cases:
      case 1:
      BEGIN;
      SELECT * FROM pg_class;
      END;
      
      case 2:
      BEGIN;
      SELECT * FROM foo;
      SELECT * FROM bar;
      END;
      
      case 3:
      BEGIN;
      INSERT INTO foo VALUES(1);
      INSERT INTO bar VALUES(2);
      END;
      
      For case 1, there is no need to create a gang or to use two-phase commit.
      For case 2, two-phase commit is unnecessary because the executors don't write
      any XLOG.
      For case 3, there is no need to create a full-size writer gang or to run two-phase
      commit on a full-size gang.
      Co-authored-by: Jialun Du <jdu@pivotal.io>
      0630a9c7
  4. 13 May 2019, 1 commit
  5. 11 May 2019, 1 commit
    • isolation2: Refactor tablespace related tests · 697ecb5e
      Committed by Taylor Vesely
      There was a race condition where replay of a DROP TABLESPACE command on
      a replication mirror could occur after the tests had already physically
      removed the directories. It should be safe if the tablespace directories
      continue to exist after the test is complete, so don't clean them up.
      697ecb5e
  6. 09 May 2019, 2 commits
    • Make pg_terminate_backend test more reliable · 63bed5eb
      Committed by Asim R P
      This test has failed at least once because the terminate query was executed
      before the to-be-terminated 'create table' statement.  This was evident from the
      master logs.  The commit makes the test more reliable by injecting a fault and
      waiting for the fault to be triggered before executing pg_terminate_backend().
      As a side benefit, we no longer need to create any additional table.
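      
      A hedged sketch of the pattern (the fault name, table name and session layout are
      illustrative, not the actual test):
      
        -- session 1: arm a fault so the backend suspends at a known point
        SELECT gp_inject_fault('create_table_fault', 'suspend', dbid)
          FROM gp_segment_configuration WHERE content = -1 AND role = 'p';
        -- session 2 (async): CREATE TABLE t_terminate(a int);
        -- session 1: wait until the fault has actually been hit, then terminate
        SELECT gp_wait_until_triggered_fault('create_table_fault', 1, dbid)
          FROM gp_segment_configuration WHERE content = -1 AND role = 'p';
        SELECT pg_terminate_backend(pid) FROM pg_stat_activity
          WHERE query LIKE 'CREATE TABLE t_terminate%';
        SELECT gp_inject_fault('create_table_fault', 'reset', dbid)
          FROM gp_segment_configuration WHERE content = -1 AND role = 'p';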
      63bed5eb
    • Create gp_inject_fault extension as part of pg_regress and isolation2 · b95b5425
      Committed by Asim R P
      The extension is created by the test harness after creating the regression and
      isolation2 test databases.  Tests should directly start using the extension and
      not attempt to create it.  In addition to simplifying tests a little bit, this
      change avoids an error (duplicate key value violates unique constraint on
      pg_extension) when two or more tests execute the create extension command
      concurrently.  The error is not a problem in practice; it is expected because of
      the way the create extension DDL works.  It is, however, unacceptable for
      regress-style tests that expect a deterministic response to each SQL command.
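      
      In other words, a test now only needs the function calls themselves; a small illustrative
      sketch (the fault name is hypothetical):
      
        -- no longer needed in each test: CREATE EXTENSION gp_inject_fault;
        SELECT gp_inject_fault('some_fault', 'reset', dbid)
          FROM gp_segment_configuration WHERE content = 0 AND role = 'p';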
      b95b5425
  7. 08 May 2019, 2 commits
    • Fix lockmode issues · 54551db4
      Committed by Zhenghua Lyu
      The lockmode of an UPDATE|DELETE or SELECT FOR UPDATE statement
      is controlled by whether the table is AO or heap and by the
      GUC gp_enable_global_deadlock_detector.
      
      The logic for the lockmode is (see the illustration after this list):
        1. SELECT FOR UPDATE always holds ExclusiveLock
        2. UPDATE|DELETE on AO tables always holds ExclusiveLock
        3. UPDATE|DELETE on heap tables holds ExclusiveLock when
           gp_enable_global_deadlock_detector is off, otherwise
           holds RowExclusiveLock
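      
      A hedged SQL illustration of rule 3 (the table name heap_t is hypothetical):
      
        BEGIN;
        DELETE FROM heap_t WHERE c = 1;
        -- with gp_enable_global_deadlock_detector = on: RowExclusiveLock on heap_t
        -- with it off, or if heap_t were an AO table: ExclusiveLock
        END;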
      
      We take locks at the parser stage and in InitPlan before executing, and
      the lockmode should be the same at the two stages.
      
      This commit fixes the lockmode issues to make the behaviour correct.
      Co-authored-by: Shujie Zhang <shzhang@pivotal.io>
      54551db4
    • isolation2: Fix race condition in mirror_promotion · b946f18d
      Committed by Taylor Vesely
      Force the mirror to create a restartpoint, and as a side-effect replay
      the DROP TABLESPACE DDL before removing the tablespace directory.
      Co-authored-by: Soumyadeep Chakraborty <sochakraborty@pivotal.io>
      b946f18d
  8. 02 May 2019, 1 commit
    • Make dtm_recovery_on_standby test more deterministic · c5eb8f25
      Committed by Asim R P
      The test should wait for the transactions to be in the right state
      before promoting the standby.  This commit adds a wait step to ensure just
      that.  One of the ICW jobs in CI failed because the test promoted the
      standby before the transactions were prepared on master.  This should
      no longer happen now.
      c5eb8f25
  9. 01 May 2019, 1 commit
    • Test to verify that standby performs DTM recovery after promotion · be10a1bb
      Committed by Asim R P
      Transactions that are in the middle of two-phase commit are suspended on
      master.  The standby is promoted while they are suspended.  Based on the XLOG
      records emitted by master, the standby is expected to perform DTM recovery
      and complete the transactions upon promotion.
      be10a1bb
  10. 17 April 2019, 1 commit
    • Fix gprecoverseg crash · d80501fa
      Committed by David Kimura
      The issue is encountered because XLogReaderState makes no guarantee that
      the XLogRecord it returns is preserved between calls to
      ReadRecord. In this particular scenario we read the checkpoint and redo
      records from the backup label. After reading the latter record we have
      no guarantee that the former record still points to unchanged
      memory.
      d80501fa
  11. 04 April 2019, 1 commit
  12. 22 March 2019, 1 commit
    • gprecoverseg: Add --no-progress flag. · eb064718
      Committed by Shoaib Lari
      For some areas of the ICW test framework -- isolation2 in particular --
      the additional data written to stdout by gprecoverseg's progress output
      increased the load on the system significantly. (Some tests buffer
      stdout without bound, for instance.)  Additionally, the
      updates were coming ten times a second, which is an order of
      magnitude more often than the update interval we get from pg_basebackup
      itself.
      
      To help with this, we have added a --no-progress flag that
      suppresses the output of pg_basebackup.  We have also changed the
      pg_basebackup progress update rate to once per second to minimize I/O.
      
      The impacted regression/isolation2 tests utilizing gprecoverseg have
      also been modified to use the --no-progress flag.
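      
      For instance, such tests can now invoke it as `gprecoverseg -a --no-progress` (the -a
      flag here is only an assumption about how the tests call it; --no-progress is what this
      commit adds).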
      Co-authored-by: Jamie McAtamney <jmcatamney@pivotal.io>
      Co-authored-by: Jacob Champion <pchampion@pivotal.io>
      eb064718
  13. 15 March 2019, 1 commit
  14. 14 March 2019, 2 commits
  15. 13 March 2019, 1 commit
  16. 11 March 2019, 1 commit
    • Only get necessary relids for partition table in InitPlan · a71447db
      Committed by Zhenghua Lyu
      Previously, when initializing ResultRelations in InitPlan on the
      QD, we always built relids as all the relation oids in a
      partition table (the root and all of its inheritors).
      Sometimes we do not need all of those relids.
      
      A typical case is an AO partition table. When we directly
      insert into a specific child partition, the plan's ResultRelation
      only contains the child partition. If we still build relids
      as the root and all its inheritors, `assignPerRelSegno`
      may lock each aoseg file in AccessShare mode on the QEs. This
      causes confusion: the insert statement targets only a child
      partition but holds locks on the other partitions.
      
      This commit changes the relids building logic as follows (see the example after this list):
        - if the ResultRelations contain the root partition, then
          relids is the root and all of its inheritors
        - otherwise, relids is built by mapping over the ResultRelations and
          collecting each element's relation oid
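      
      A hedged illustration (the table and partition names are hypothetical; assume an AO
      partition table "sales" with a child partition "sales_1_prt_jan"):
      
        -- insert through the root: relids covers the root and every child
        INSERT INTO sales VALUES (1, date '2019-01-15');
        -- insert directly into the child: relids now covers only that child,
        -- so only its aoseg file is locked on the QEs
        INSERT INTO sales_1_prt_jan VALUES (2, date '2019-01-20');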
      a71447db
  17. 05 March 2019, 1 commit
    • Fix assertion when creating unique index on tables created in utility mode · 38d40bcc
      Committed by Pengzhou Tang
      checkPolicyForUniqueIndex() checks whether the distribution key conflicts with a unique/primary
      key: for example, a unique index is not allowed on a randomly distributed table but is
      allowed on a replicated table, and for a normally (hash) distributed table the set of
      columns being indexed must be a superset of the table's distribution key.
      
      What about an entry-distributed table (e.g. a table created in utility mode; it has
      no record in gp_distribution_policy, and GpPolicyFetch translates it to
      entry-distributed)? Such tables are local to a single database instance, so adding a unique
      index should also be allowed.
      
      This was spotted via the assertion in checkPolicyForUniqueIndex() that checks
      the conflict for normally distributed tables.
      This fixes #5880
      38d40bcc
  18. 27 February 2019, 1 commit
    • resgroup: allow memory overuse for hashagg spill meta data · f053e6cd
      Committed by Ning Yu
      Each hashagg spill batch file needs about 16KB of memory to store its metadata;
      when there are many batch files, the overall metadata size might
      exceed the assigned operator memory.  In resource group mode we can allow
      this overuse, as there is better memory control at the transaction level
      and the resource group shared memory is designed to serve these kinds of
      overuses.
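      
      To put a rough number on it (illustrative figures only): 1,000 spill batch files need
      about 1,000 × 16KB ≈ 16MB just for metadata, which can already exceed a small operator
      memory assignment of a few megabytes.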
      f053e6cd
  19. 23 February 2019, 1 commit
    • Make `fts_errors` test deterministic · 316a322b
      Committed by Alexandra Wang
      Reset the fault injector on dbid 2 after re-verifying the segment is
      down.
      
      Both gp_request_fts_probe_scan() and gp_inject_fault() call
      getCdbComponentInfo() in order to dispatch to QEs, which triggers
      the fault `GetDnsCachedAddress` on dbid 2, and
      gp_request_fts_probe_scan() returns true even before the probe finishes.
      Therefore, there is a race condition between the FTS probes and the
      reset of the fault injector: when the reset triggers the fault before
      the FTS probe completes, the primary is taken down without the fault being
      removed, which caused all the following tests to fail after a
      `gprecoverseg -ar` with a double fault detected.
      Co-authored-by: Ekta Khanna <ekhanna@pivotal.io>
      316a322b
  20. 20 February 2019, 1 commit
  21. 16 February 2019, 2 commits
  22. 15 February 2019, 1 commit
    • Recursively create partitioned indexes · f27b2a50
      Committed by Taylor Vesely
      Pull from upstream Postgres to make DefineIndex recursively create partitioned
      indexes. Instead of creating an individual IndexStmt for every partition,
      create indexes by recursing on the partition children.  This aligns index
      creation with upstream in preparation for adding INTERNAL_AUTO relationships
      between partition indexes.
      
       * The QD will now choose the same name for partition indexes as Postgres.
       * Update tests to reflect the partition index name changes.
       * The changes to DefineIndex are mostly cherry-picked from Postgres commit:
         8b08f7d4
       * transformIndexStmt and its callers have been aligned with Postgres
         REL9_4_STABLE
      Co-authored-by: Kalen Krempely <kkrempely@pivotal.io>
      f27b2a50
  23. 13 February 2019, 2 commits
    • Make crash_recovery_dtm test stable. · 69ebd66c
      Committed by Ashwin Agrawal
      The crash_recovery_dtm test has a scenario which intends to verify that, if a QE
      undergoes crash recovery after writing the prepare record but before
      responding to the QD, the abort processing still completes fine. For that, the
      test used the GUC `debug_abort_after_segment_prepared` to PANIC all the QEs
      at that specific point for a DELETE. The next step executes a SELECT query to
      validate that the DELETE was aborted. But flakiness arises if this SELECT
      query gets executed while PANIC processing is still underway, as the test
      had no way to wait until the PANIC and restart completed before running the
      SELECT.
      
      Now the test instead uses the fault injector to sleep at the intended point
      and uses pg_ctl restart -w to make sure recovery has completed; only
      after that is the SELECT query executed.
      
      As a result, remove the test-only GUC
      `debug_abort_after_segment_prepared` and the code related to it.
      69ebd66c
    • Increase the fts_errors gang retry timer (#6918) · 8d2c05cd
      Committed by David Kimura
      We noticed a case in CI where it seemed like it took longer than 30
      seconds to promote the mirror during recovery.
      Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
      8d2c05cd
  24. 12 February 2019, 2 commits
    • Reduce time and stabilize uao_crash_compaction_column test. · eb193431
      Committed by Ashwin Agrawal
      The uao_crash_compaction_column test needs to wait for records to be
      replayed on the mirror before checking. Depending on the workload generated by
      the tests, this can take a long time. Hence, move the test to run at the
      start of the schedule to reduce that waiting time.
      
      Also, to reduce the flakiness seen in failures, increase the number of
      retries the test uses to check whether replay_location = flush_location on the mirror
      from 1000 to 5000.
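      
      The retried check is essentially of this form (a hedged sketch; the actual test SQL may
      differ):
      
        SELECT flush_location = replay_location AS caught_up FROM pg_stat_replication;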
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      eb193431
    • AO/CO avoids recording in invalid hashtable for truncate replay offset=0 · e993c328
      Committed by Ashwin Agrawal
      This commit fixes a PANIC on the mirror with "WAL contains references to
      invalid pages" during xlog replay of a truncate record with offset 0.
      
      For AO tables, similar to heap, the primary creates the file first and then
      writes the xlog record for the creation.  Hence, the file can get created
      on the primary without the xlog record being written, if a failure happens on
      the primary just after creating the file. This creates a situation where VACUUM can
      generate a truncate record based on an aoseg entry with EOF 0 while the file is
      present on the primary. During replay, the mirror may not have the file,
      as it was never created on the mirror. So, avoid adding the entry to the invalid
      page hash table for a truncate at offset zero (EOF=0).  This avoids the mirror
      PANIC, since truncating to zero is anyway the same as the file not being present.
      Co-authored-by: Alexandra Wang <lewang@pivotal.io>
      e993c328
  25. 11 February 2019, 3 commits
  26. 01 February 2019, 1 commit
    • Use normal hash operator classes for data distribution. · 242783ae
      Committed by Heikki Linnakangas
      Replace the use of the built-in hashing support for built-in datatypes, in
      cdbhash.c, with the normal PostgreSQL hash functions. Now is a good time
      to do this, since we've already made the change to use jump consistent
      hashing in GPDB 6, so we'll need to deal with the upgrade problems
      associated with changing the hash functions, anyway.
      
      It is no longer enough to track which columns/expressions are used to
      distribute data. You also need to know the hash function used. For that,
      a new field is added to gp_distribution_policy, to record the hash
      operator class used for each distribution key column. In the planner,
      a new opfamily field is added to DistributionKey, to track that throughout
      the planning.
      
      Normally, if you do "CREATE TABLE ... DISTRIBUTED BY (column)", the
      default hash operator class for the datatype is used. But this patch
      extends the syntax so that you can specify the operator class explicitly,
      like "... DISTRIBUTED BY (column opclass)". This is similar to how an
      operator class can be specified for each column in CREATE INDEX.
      
      To support upgrade, the old hash functions have been converted to special
      (non-default) operator classes, named cdbhash_*_ops. For example, if you
      want to use the old hash function for an integer column, you could do
      "DISTRIBUTED BY (intcol cdbhash_int4_ops)". The old hard-coded whitelist
      of operators that have "compatible" cdbhash functions has been replaced
      by putting the compatible hash opclasses in the same operator family. For
      example, all the legacy integer operator classes (cdbhash_int2_ops,
      cdbhash_int4_ops and cdbhash_int8_ops) are part of the
      cdbhash_integer_ops operator family.
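      
      A small illustration of the syntax described above (table and column names are
      hypothetical):
      
        -- default hash operator class for the column's datatype
        CREATE TABLE t_default (c int) DISTRIBUTED BY (c);
        -- legacy (non-default) operator class specified explicitly
        CREATE TABLE t_legacy (c int) DISTRIBUTED BY (c cdbhash_int4_ops);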
      
      This removes the pg_database.hashmethod field. The hash method is now
      tracked on a per-table and per-column basis, using the opclasses, so it's
      not needed anymore.
      
      To help with upgrade from GPDB 5, this introduces a new GUC called
      'gp_use_legacy_hashops'. If it's set, CREATE TABLE uses the legacy hash
      opclasses, instead of the default hash opclasses, if the opclass is not
      specified explicitly. pg_upgrade will set the new GUC, to force the use of
      legacy hashops, when restoring the schema dump. It will also set the GUC
      on all upgraded databases, as a per-database option, so any new tables
      created after upgrade will also use the legacy opclasses. It seems better
      to be consistent after upgrade, so that collocation between old and new
      tables works, for example. The idea is that some time after the upgrade, the
      admin can reorganize all tables to use the default opclasses instead. At
      that point, they should also clear the GUC on the converted databases. (Or
      rather, the automated tool that hasn't been written yet should do that.)
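      
      A hedged sketch of the GUC described above (assumes gp_use_legacy_hashops can be set at
      the session level; the table name is hypothetical):
      
        SET gp_use_legacy_hashops = on;
        -- with no explicit opclass, CREATE TABLE now picks the legacy cdbhash_* opclass
        CREATE TABLE t_after_upgrade (c int) DISTRIBUTED BY (c);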
      
      ORCA doesn't know about hash operator classes, or the possibility that we
      might need to use a different hash function for two columns with the same
      datatype. Therefore, it cannot produce correct plans for queries that mix
      different distribution hash opclasses for the same datatype, in the same
      query. There are checks in the Query->DXL translation, to detect that
      case, and fall back to planner. As long as you stick to the default
      opclasses in all tables, we let ORCA to create the plan without any regard
      to them, and use the default opclasses when translating the DXL plan to a
      Plan tree. We also allow the case that all tables in the query use the
      "legacy" opclasses, so that ORCA works after pg_upgrade. But a mix of the
      two, or using any non-default opclasses, forces ORCA to fall back.
      
      One curiosity with this is the "int2vector" and "aclitem" datatypes. They
      have a hash opclass, but no b-tree operators. GPDB 4 used to allow them
      as DISTRIBUTED BY columns, but we forbid that in GPDB 5, in commit
      56e7c16b. Now they are allowed again, so you can specify an int2vector
      or aclitem column in DISTRIBUTED BY, but it's still pretty useless,
      because the planner still can't form EquivalenceClasses on it, and will
      treat it as "strewn" distribution, and won't co-locate joins.
      
      Abstime, reltime and tinterval datatypes don't have default hash opclasses.
      They are being removed completely in PostgreSQL v12, and users shouldn't
      be using them in the first place, so instead of adding hash opclasses for
      them now, we accept that they can't be used as distribution key columns
      anymore. Add a check to pg_upgrade, to refuse upgrade if they are used
      as distribution keys in the old cluster. Do the same for 'money' datatype
      as well, although that's not being removed in upstream.
      
      The legacy hashing code for anyarray in GPDB 5 was actually broken. It
      could produce a different hash value for two arrays that are considered
      equal, according to the = operator, if there were differences in e.g.
      whether the null bitmap was stored or not. Add a check to pg_upgrade, to
      reject the upgrade if array types were used as distribution keys. The
      upstream hash opclass for anyarray works, though, so it is OK to use
      arrays as distribution keys in new tables. We just don't support binary
      upgrading them from GPDB 5. (See github issue
      https://github.com/greenplum-db/gpdb/issues/5467). The legacy hashing of
      'anyrange' had the same problem, but that was new in GPDB 6, so we don't
      need a pg_upgrade check for that.
      
      This also tightens the checks in ALTER TABLE ALTER COLUMN and CREATE UNIQUE
      INDEX, so that you can no longer create a situation where a non-hashable
      column becomes the distribution key. (Fixes github issue
      https://github.com/greenplum-db/gpdb/issues/6317)
      
      Discussion: https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/4fZVeOpXllQ
      Co-authored-by: Mel Kiyama <mkiyama@pivotal.io>
      Co-authored-by: Abhijit Subramanya <asubramanya@pivotal.io>
      Co-authored-by: Pengzhou Tang <ptang@pivotal.io>
      Co-authored-by: Chris Hajas <chajas@pivotal.io>
      Reviewed-by: Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
      Reviewed-by: Ning Yu <nyu@pivotal.io>
      Reviewed-by: Simon Gao <sgao@pivotal.io>
      Reviewed-by: Jesse Zhang <jzhang@pivotal.io>
      Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
      Reviewed-by: Melanie Plageman <mplageman@pivotal.io>
      Reviewed-by: Yandong Yao <yyao@pivotal.io>
      242783ae