1. 29 Oct, 2018 1 commit
    • Simplify direct dispatch related code (#6080) · 576690f2
      Tang Pengzhou committed
      * Simplify direct dispatch related code
      This commit includes two parts:
      * simplify direct-dispatch dispatching code
      * simplify direct-dispatch DTM related code
      Previously, cdbdisp_dispatchToGang needed a CdbDispatchDirectDesc;
      now a gang only contains in-use segments, so the direct-dispatch info is useless.
      Another thing is that we needed to decide within dtmPreCommand whether DTM is
      available for direct dispatch. The logic was complex: you needed to know whether
      the main plan is direct-dispatched and whether the init plan contains direct dispatch.
      One example is:
      "update foo set foo.c2 = 2
      where foo.c1 = 1 and exists (select * from bar where bar.c1=4)"
      The main plan can be direct-dispatched to segment 1 and the init plan can be
      direct-dispatched to segment 2. With the old logic, a DTM command like PREPARE
      needed to be dispatched to all segments, so dtmPreCommand needed to dispatch a
      DTM command named 'DTX_PROTOCOL_COMMAND_STAY_AT_OR_BECOME_IMPLIED' to all
      segments so that segments like segment 3, which didn't receive the plan, could
      be ready for two-phase commit.
      With the new gang API we can simplify this process: we add a list in
      currentGxact to record which segments are actually involved in a two-phase
      commit, then we can dispatch DTM commands to them directly. This is also very
      useful for queries on tables that are not fully expanded yet.
      * support direct dispatch to more than one segment
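The participant list described above can be pictured with a minimal sketch. The type and function names here (TwoPhaseParticipants, participants_add, participants_contains) and the MAX_SEGMENTS bound are hypothetical and do not come from the commit; the real list lives in currentGxact and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SEGMENTS 64     /* assumed upper bound for this sketch only */

/* Hypothetical stand-in for the list kept in currentGxact. */
typedef struct TwoPhaseParticipants
{
    int segments[MAX_SEGMENTS];
    int count;
} TwoPhaseParticipants;

/* Record a segment the first time something is dispatched to it. */
static void
participants_add(TwoPhaseParticipants *p, int segindex)
{
    assert(segindex >= 0);
    for (int i = 0; i < p->count; i++)
        if (p->segments[i] == segindex)
            return;                     /* already recorded */
    assert(p->count < MAX_SEGMENTS);
    p->segments[p->count++] = segindex;
}

/* True if a DTM command such as PREPARE has to reach this segment. */
static bool
participants_contains(const TwoPhaseParticipants *p, int segindex)
{
    for (int i = 0; i < p->count; i++)
        if (p->segments[i] == segindex)
            return true;
    return false;
}
```

For the example query, the main plan would add segment 1 and the init plan would add segment 2, so segment 3 would never be asked to take part in the two-phase commit.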
  2. 15 Oct, 2018 1 commit
    • Retire threaded dispatcher · 87394a7b
      Ning Yu committed
      Now there is only the async dispatcher. The dispatcher API interface is
      kept so we might add a new backend in the future.
      The GUC gp_connections_per_thread, which was used to switch between the
      async and threaded backends, is also retired.
  3. 27 Sep, 2018 1 commit
    • Dispatcher can create flexible size gang (#5701) · a3ddac06
      Tang Pengzhou committed
      * change type of db_descriptors to SegmentDatabaseDescriptor **
      A new gang definition may consist of cached segdbDescs and newly
      created segdbDescs, so there is no need to palloc all segdbDesc structs
      as new.
      * Remove unnecessary allocate gang unit test
      * Manage idle segment dbs using CdbComponentDatabases instead of available* lists.
      To support variable-size gangs, we now need to manage segment dbs at a finer
      granularity. Previously, idle QEs were managed by a bunch of lists like
      availablePrimaryWriterGang and availableReaderGangsN, which restricted the
      dispatcher to creating only N-size (N = number of segments) or 1-size gangs.
      CdbComponentDatabases is a snapshot of the segment components within the current
      cluster; it now maintains a freelist for each segment component. When
      creating a gang, the dispatcher makes up the gang from each segment
      component (taking a segment db from the freelist or creating a new one). When
      cleaning up a gang, the dispatcher returns idle segment dbs to each segment
      component.
      CdbComponentDatabases provides a few functions to manipulate segment dbs
      (SegmentDatabaseDescriptor *):
      * cdbcomponent_getCdbComponents
      * cdbcomponent_destroyCdbComponents
      * cdbcomponent_allocateIdleSegdb
      * cdbcomponent_recycleIdleSegdb
      * cdbcomponent_cleanupIdleSegdbs
      CdbComponentDatabases is also FTS version sensitive, so once the FTS
      version changes, CdbComponentDatabases destroys all idle segment dbs
      and allocates QEs on the newly promoted segments. This makes mirror
      failover transparent to users.
      Since segment dbs (SegmentDatabaseDescriptor *) are now managed by
      CdbComponentDatabases, we can simplify memory context
      management by replacing GangContext & perGangContext with
      DispatcherContext & CdbComponentsContext.
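The per-component freelist described above can be illustrated with a rough sketch. SegdbSketch, ComponentSketch and the three helper functions are hypothetical simplifications; they are not the real SegmentDatabaseDescriptor or the cdbcomponent_* signatures, and plain malloc/free stands in for memory-context allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified segment db descriptor. */
typedef struct SegdbSketch
{
    int                 segindex;
    struct SegdbSketch *next;       /* freelist link */
} SegdbSketch;

/* Hypothetical segment component holding its own freelist. */
typedef struct ComponentSketch
{
    int          segindex;
    SegdbSketch *freelist;
    int          numIdle;
} ComponentSketch;

/* Take a cached idle segdb if one exists, otherwise create a new one. */
static SegdbSketch *
component_allocate_idle(ComponentSketch *c)
{
    SegdbSketch *db = c->freelist;

    if (db != NULL)
    {
        c->freelist = db->next;
        c->numIdle--;
    }
    else
    {
        db = malloc(sizeof(SegdbSketch));
        if (db == NULL)
            return NULL;
        db->segindex = c->segindex;
    }
    db->next = NULL;
    return db;
}

/* Return an idle segdb to the freelist of its own component. */
static void
component_recycle_idle(ComponentSketch *c, SegdbSketch *db)
{
    assert(db->segindex == c->segindex);
    db->next = c->freelist;
    c->freelist = db;
    c->numIdle++;
}

/* Destroy every cached segdb, as would happen after an FTS version change. */
static void
component_cleanup_idle(ComponentSketch *c)
{
    while (c->freelist != NULL)
    {
        SegdbSketch *db = c->freelist;

        c->freelist = db->next;
        free(db);
    }
    c->numIdle = 0;
}
```

Because every component keeps its own list, a gang of any size can be assembled by asking only the components it needs.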
      * Postpone the error handling when creating gang
      Now that we have AtAbort_DispatcherState, one advantage is that
      we can postpone gang error handling to this function and make
      the code cleaner.
      * Handle FTS version change correctly
      In some cases, when the FTS version changes, we can't update the current
      snapshot of segment components; more specifically, we can't
      destroy the current writer segment dbs and create new segment dbs.
      These cases include:
      * session has temp table created.
      * query needs two-phase commit and gxid has been dispatched to segments.
      * Replace <gangId, sliceId> map with <qeIdentifier, sliceId> map
      We used to dispatch a <gangId, sliceId> map along with the query to
      segment dbs so that segment dbs knew which slice they should
      execute. Now gangId is useless for a segment db because a segment db can
      be reused by different gangs, so we need a new way to pass this
      info to segment dbs. To resolve this, CdbComponentDatabases
      assigns a unique identifier to each segment db and builds a
      bitmap set consisting of segment identifiers for each slice;
      segment dbs can then go through the slice table and find the
      right slice to execute.
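The lookup a segment db performs can be sketched as follows. This is an illustration under stated assumptions, not the actual slice table: SliceTableSketch, MAX_SLICES, the helper names and the use of a single 64-bit word per slice (so identifiers must be below 64) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SLICES 8    /* assumed bound for this sketch only */

/* Hypothetical slice table: one bitmap of QE identifiers per slice. */
typedef struct SliceTableSketch
{
    uint64_t qeBitmap[MAX_SLICES];
    int      numSlices;
} SliceTableSketch;

/* QD side: mark that the QE with this identifier executes the slice. */
static void
slice_add_qe(SliceTableSketch *t, int sliceId, int qeIdentifier)
{
    assert(sliceId >= 0 && sliceId < t->numSlices);
    assert(qeIdentifier >= 0 && qeIdentifier < 64);
    t->qeBitmap[sliceId] |= (UINT64_C(1) << qeIdentifier);
}

/* QE side: walk the slice table and return the matching slice, or -1. */
static int
slice_find_for_qe(const SliceTableSketch *t, int qeIdentifier)
{
    assert(qeIdentifier >= 0 && qeIdentifier < 64);
    for (int i = 0; i < t->numSlices; i++)
        if (t->qeBitmap[i] & (UINT64_C(1) << qeIdentifier))
            return i;
    return -1;
}
```

The identifier belongs to the segment db rather than to a gang, so the lookup keeps working when the same segment db is reused by a different gang.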
      * Allow dispatcher to create vary size gang and refine AssignGangs()
      Previously, the dispatcher could only create N-size gangs, which
      restricted the dispatcher in many ways. One example is direct
      dispatch: it always created an N-size gang even if it only
      dispatched the command to one segment. Another example is that
      some operations may be able to use an N+ size gang, like
      hash join: if both the inner and outer plans are redistributed,
      the hash join node can be associated with an N+ size gang to
      execute. This commit changes the API of createGang() so the
      caller can specify a list of segments (partial or even
      duplicate segments); CdbComponentDatabases will guarantee that
      each segment has only one writer in a session. With this
      it also resolves another pain point of AssignGangs(), so
      the caller doesn't need to promote a GANGTYPE_PRIMARY_READER
      _READER to GANGTYPE_PRIMARY_WRITER for replicated table
      (see FinalizeSliceTree()).
      With this commit, AssignGangs() is very clear now.
  4. 06 Sep, 2018 1 commit
    • Integrate Gang management from portal to Dispatcher and simplify AssignGangs for init plans (#5555) · 78a4890a
      Tang Pengzhou committed
      * Simplify the AssignGangs() logic for init plans
      Previously, AssignGangs() assigned gangs for both the main plan and
      init plans in one shot. Because init plans and the main plan are
      executed sequentially, gangs can be reused between the main
      plan and init plans; the function AccumSliceReq() was designed for
      this. This process can be simplified: we already know the root slice
      index id will be adjusted according to the init plan id, so init plans
      only need to assign their own slices.
      * Integrate Gang management from portal to Dispatcher
      Previously, gangs were managed by the portal, and freeGangsForPortal()
      was used to clean up gang resources. DTM related commands also
      needed a gang to dispatch commands outside of a portal and
      used freeGangsForPortal() too. There might be multiple
      commands/plans/utilities executed within one portal, and all of them
      relied on a dispatcher routine like CdbDispatchCommand /
      CdbDispatchPlan / CdbDispatchUtility... to dispatch. Gangs were
      created by each dispatcher routine, but were not recycled or
      destroyed when a routine finished, except for the primary writer
      gang. One defect of this is that gang resources could not be reused
      between dispatcher routines. GPDB already had an optimization
      for init plans: if a plan contained init plans, AssignGangs
      was called before the execution of any of them; it went through
      the whole slice tree and created the maximum gang that both the
      main plan and init plans needed. This was doable because init
      plans and the main plan were executed sequentially, but it also
      made the AssignGangs logic complex; meanwhile, reusing an unclean
      gang was not safe.
      Another confusing thing was that the gang and dispatcher were
      managed separately, which caused inconsistent context, for example:
      when a dispatcher state was destroyed, the gang was not recycled;
      when a gang was destroyed by the portal, the dispatcher state was
      still in use and might refer to the context of a destroyed gang.
      As described above, this commit integrates gang management
      with the dispatcher: a dispatcher state is responsible for creating
      and tracking gangs as needed and destroys them when the dispatcher
      state is destroyed.
      * Handle the case when primary writer gang has gone
      When members of the primary writer gang are gone, the writer gang
      is destroyed immediately (primaryWriterGang is set to NULL)
      when a dispatcher routine (e.g. CdbDispatchCommand) finishes.
      So when dispatching a two-phase DTM/DTX related command, the QD
      doesn't know the writer gang is gone, and it may get unexpected
      errors like 'savepoint not exist', 'subtransaction level not
      match', 'temp file not exist'.
      Previously, primaryWriterGang was not reset when DTM/DTX
      commands started even if it was pointing to invalid segments, so
      those DTM/DTX commands were not actually sent to segments;
      a normal error reported on the QD looked like 'could not
      connect to segment: initialization of segworker'.
      So we need a way to inform the global transaction that its writer
      gang has been lost, so that when aborting the transaction, the QD can:
      1. disconnect all reader gangs; this is useful to skip
      dispatching "ABORT_NO_PREPARE"
      2. reset the session and drop temp files, because temp files on
      the segment are gone.
      3. report an error when dispatching the "rollback savepoint" DTX,
      because the savepoint on the segment is gone.
      4. report an error when dispatching the "abort subtransaction" DTX,
      because the subtransaction is rolled back when the writer segment is down.
  5. 14 Aug, 2018 2 commits
    • Remove cdbdisp_finishCommand · 957629d1
      Pengzhou Tang committed
      Previously, cdbdisp_finishCommand did three things:
      1. cdbdisp_checkDispatchResult
      2. cdbdisp_getDispatchResult
      3. cdbdisp_destroyDispatcherState
      However, cdbdisp_finishCommand didn't make the code cleaner or more
      convenient to use; on the contrary, it made error handling more
      difficult and made the code more complicated and inconsistent.
      This commit also resets estate->dispatcherState to NULL to avoid
      re-entry of cdbdisp_* functions.
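As a rough illustration of a call site that performs the three steps itself and clears the state pointer, consider the sketch below. Every type and function in it is a hypothetical stand-in: the real cdbdisp_* signatures are not shown in this commit message and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the dispatcher state and the executor state. */
typedef struct DispatcherStateSketch
{
    bool qeReportedError;
} DispatcherStateSketch;

typedef struct EStateSketch
{
    DispatcherStateSketch *dispatcherState;
} EStateSketch;

/* Stand-in for step 1: wait until every QE has reported back. */
static void
sketch_check_dispatch_result(DispatcherStateSketch *ds)
{
    (void) ds;
}

/* Stand-in for step 2: true if no QE reported an error. */
static bool
sketch_get_dispatch_result(const DispatcherStateSketch *ds)
{
    return !ds->qeReportedError;
}

/* Stand-in for step 3: release the dispatcher state. */
static void
sketch_destroy_dispatcher_state(DispatcherStateSketch *ds)
{
    free(ds);
}

/* A call site doing the three steps explicitly. */
static bool
call_site_after_dispatch(EStateSketch *estate)
{
    DispatcherStateSketch *ds;
    bool ok;

    assert(estate != NULL);
    ds = estate->dispatcherState;
    if (ds == NULL)
        return true;            /* nothing dispatched, or already cleaned up */

    sketch_check_dispatch_result(ds);
    ok = sketch_get_dispatch_result(ds);

    /* Clear the pointer before destroying so a later call cannot re-enter. */
    estate->dispatcherState = NULL;
    sketch_destroy_dispatcher_state(ds);
    return ok;
}
```

Keeping the steps visible at the call site lets each caller decide what to do between them, for example how to report an error before the state is destroyed.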
    • Rename CdbCheckDispatchResult for name convention · 60bd3ab2
      Pengzhou Tang committed
      Use cdbdisp_checkDispatchResult instead of CdbCheckDispatchResult
      to be consistent with the cdbdisp_* functions.
  6. 09 May, 2018 1 commit
  7. 01 Mar, 2018 1 commit
    • Give a better error message, if preparing an xact fails. · b3c50e40
      Heikki Linnakangas committed
      If an error happens in the prepare phase of two-phase commit, relay the
      original error back to the client, instead of the fairly opaque
      "Abort [Prepared]' broadcast failed to one or more segments" message you
      got previously. A lot of things happen during the prepare phase that
      can legitimately fail, like checking deferred constraints, like in the
      'constraints' regression test. But even without that, there can be
      triggers, ON COMMIT actions, etc., any of which can fail.
      This commit consists of several parts:
      * Pass 'true' for the 'raiseError' argument when dispatching the prepare
        dtx command in doPrepareTransaction(), so that the error is emitted to
        the client.
      * Bubble up an ErrorData struct, with as many fields intact as possible,
        to the caller,  when dispatching a dtx command. (Instead of constructing
        a message in a StringInfo). So that we can re-throw the message to
        the client, with its original formatting.
      * Don't throw an error in performDtxProtocolCommand(), if we try to abort
        a prepared transaction that doesn't exist. That is business-as-usual,
        if a transaction throws an error before finishing the prepare phase.
      * Suppress the "NOTICE: Releasing segworker groups to retry broadcast."
        message, when aborting a prepared transaction.
      Put together, the effect is that if an error happens during the prepare phase, the
      client receives a message that is largely indistinguishable from the
      message you'd get if the same failure happened while running a normal
      statement.
      Fixes github issue #4530.
  8. 25 Jan, 2018 1 commit
    • Propagate segment errcodes to dispatcher · 58003bc7
      Daniel Gustafsson committed
      The errcode thrown in an ereport() on a segment was passed back to
      the dispatcher, but then dropped and replaced with a default errcode
      of ERRCODE_DATA_EXCEPTION. This works for most situations, but when
      trapping errors the exact errcode must be propagated. This extends
      the API to extract the errcode as well. The below case illustrates
      the previous issue:
        CREATE TABLE test1(id int primary key);
        CREATE TABLE test2(id int primary key);
        INSERT INTO test1 VALUES(1);
        INSERT INTO test2 VALUES(1);
        CREATE OR REPLACE FUNCTION merge_table() RETURNS void AS $$
        DECLARE
      	v_insert_sql varchar;
        BEGIN
      	v_insert_sql :='INSERT INTO test1 SELECT * FROM test2';
      	EXECUTE v_insert_sql;
      	EXCEPTION WHEN unique_violation THEN
      		RAISE NOTICE 'unique_violation';
        END;
        $$ LANGUAGE plpgsql volatile;
        SELECT merge_table();
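Why the exact errcode matters for the EXCEPTION clause can be shown with a small sketch. The struct and function names are hypothetical and error codes are kept as five-character SQLSTATE strings for readability; only the two codes themselves are standard PostgreSQL values (22000 for data_exception, 23505 for unique_violation).

```c
#include <assert.h>
#include <string.h>

#define SQLSTATE_LEN 5

/* Hypothetical, simplified error report travelling from a segment to the QD. */
typedef struct SegmentErrorSketch
{
    char        sqlstate[SQLSTATE_LEN + 1];
    const char *message;
} SegmentErrorSketch;

/* Old behaviour: the segment's code is dropped in favour of data_exception. */
static void
relay_error_old(const SegmentErrorSketch *seg, SegmentErrorSketch *out)
{
    assert(seg != NULL && out != NULL);
    strcpy(out->sqlstate, "22000");
    out->message = seg->message;
}

/* New behaviour: the segment's code is carried through unchanged. */
static void
relay_error_new(const SegmentErrorSketch *seg, SegmentErrorSketch *out)
{
    assert(seg != NULL && out != NULL);
    memcpy(out->sqlstate, seg->sqlstate, sizeof(out->sqlstate));
    out->message = seg->message;
}

/* What an exception handler effectively checks: does the code match? */
static int
matches_condition(const SegmentErrorSketch *err, const char *sqlstate)
{
    return strcmp(err->sqlstate, sqlstate) == 0;
}
```

With the old relay, a handler looking for unique_violation would never match, so merge_table() would fail with the error instead of raising the notice.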
  9. 02 Nov, 2017 1 commit
    • Wake up faster, if a segment returns an error. · 3bbedbe9
      Heikki Linnakangas committed
      Previously, if a segment reported an error after starting up the
      interconnect, it would take up to 250 ms for the main thread in the QD
      process to wake up and poll the dispatcher connections, and to see that
      there was an error. Shorten that time, by waking up immediately if the
      QD->QE libpq socket becomes readable while we're waiting for data to
      arrive in a Motion node.
      This isn't a complete solution, because this will only wake up if one
      arbitrarily chosen connection becomes readable, and we still rely on
      polling for the others. But this greatly speeds up many common scenarios.
      In particular, the "qp_functions_in_select" test now runs in under 5 s
      on my laptop, when it took about 60 seconds before.
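The idea of waking up as soon as one QD->QE socket becomes readable can be sketched with POSIX poll(). This is not the interconnect code: the function name, the WakeReason enum and the use of two plain file descriptors are assumptions made for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

typedef enum WakeReason
{
    WAKE_TIMEOUT,       /* nothing happened within the polling interval */
    WAKE_MOTION_DATA,   /* data arrived for the Motion node */
    WAKE_QE_SOCKET,     /* the watched QD->QE socket became readable */
    WAKE_ERROR          /* poll() itself failed (EINTR is not retried here) */
} WakeReason;

/* Wait for Motion data, but also return early if the QE socket is readable. */
static WakeReason
wait_for_motion_or_qe(int motion_fd, int qe_fd, int timeout_ms)
{
    struct pollfd fds[2];
    int           rc;

    assert(motion_fd >= 0 && qe_fd >= 0);
    fds[0].fd = motion_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = qe_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    rc = poll(fds, 2, timeout_ms);
    if (rc < 0)
        return WAKE_ERROR;
    if (rc == 0)
        return WAKE_TIMEOUT;
    /* Check the QE socket first: a pending error report takes priority. */
    if (fds[1].revents & (POLLIN | POLLERR | POLLHUP))
        return WAKE_QE_SOCKET;
    if (fds[0].revents & (POLLIN | POLLERR | POLLHUP))
        return WAKE_MOTION_DATA;
    return WAKE_ERROR;
}
```

Only one QE descriptor is watched here, which mirrors the limitation stated in the commit message: the remaining connections are still found by periodic polling.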
  10. 01 Sep, 2017 1 commit
  11. 14 Nov, 2016 1 commit
    • Use nonblocking mechanism to send data in async dispatcher. · 2516eac6
      xiong-gang committed
      pqFlush sends data synchronously even though the socket is set to
      O_NONBLOCK, which degrades performance. This commit uses
      pqFlushNonBlocking instead, and synchronizes the completion of
      dispatching to all gangs before query execution.
      Signed-off-by: Kenan Yao<kyao@pivotal.io>
  12. 04 Nov, 2016 1 commit
  13. 13 Sep, 2016 1 commit
    • Speed up QE cancel when one or more QEs got errors · 39ed6031
      Pengzhou Tang committed
      The QD needs to cancel QEs when
      1) the QD gets an error
      2) one or more QEs get an error and cancelOnError is set to true.
      We want to cancel QEs as soon as possible once either condition is met, but because
      cancelling QEs is expensive, we want to process as many QEs that are about to finish
      as possible before actually cancelling. The original interval before cancelling was
      2 seconds, which is so long that users see an obvious delay before errors are reported;
      this commit lowers the interval to 100 ms to speed up the cancelling process.
  14. 25 Jul, 2016 1 commit
    • Refactor utility statement dispatch interfaces · 01769ada
      Pengzhou Tang committed
      Refactor CdbDispatchUtilityStatement() to make it flexible enough for cdbCopyStart()
      and dispatchVacuum() to call directly. Introduce flags like DF_NEED_TWO_SNAPSHOT,
      DF_WITH_SNAPSHOT, and DF_CANCEL_ON_ERROR to make function calls much clearer.
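How such flags make a call clearer than a row of boolean arguments can be sketched as below. The flag names are taken from the commit message, but the numeric values and the helper dispatch_flag_is_set are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag names as listed in the commit message; the bit values are assumed. */
#define DF_CANCEL_ON_ERROR    0x1
#define DF_NEED_TWO_SNAPSHOT  0x2
#define DF_WITH_SNAPSHOT      0x4

/* Hypothetical helper: test whether one flag is present in a flag set. */
static bool
dispatch_flag_is_set(int flags, int flag)
{
    assert(flag != 0);
    return (flags & flag) != 0;
}
```

A caller would then pass something like DF_CANCEL_ON_ERROR | DF_WITH_SNAPSHOT, which documents its intent at the call site in a way that positional true/false arguments do not.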
  15. 17 Jul, 2016 2 commits
  16. 22 Jun, 2016 1 commit
  17. 19 May, 2016 1 commit
  18. 06 May, 2016 1 commit
  19. 12 Feb, 2016 1 commit
  20. 21 Dec, 2015 1 commit
    • Fix Errors caused by SET command if cursor is declared · d2725929
      Pengzhou Tang committed
      The SET command takes effect for the whole session, so all existing idle gangs should be set
      for later reuse, but busy gangs held by declared cursors raise errors if they receive a SET
      command. The fix is to mark busy gangs as not reusable so they can be destroyed after the
      cursors have been closed.
  21. 28 Oct, 2015 1 commit