Commits · d1ba4da5ae27ffb7af80a464c0e3d62a5c540bca · Greenplum / Gpdb

15 Jul, 2020 1 commit

Cleanup idle reader gang after utility statements · d1ba4da5

Authored by Hubert Zhang on Jul 15, 2020

Reader gangs use a local snapshot to access the catalog, so they do not
synchronize with the sharedSnapshot of the writer gang. This leads to
inconsistent visibility of catalog tables on idle reader gangs.
Consider this case:

select * from t, t t1; -- create a reader gang.
begin;
create role r1;
set role r1;  -- the set command is also dispatched to the idle reader gang

When the set role command is dispatched to the idle reader gang, the reader
gang cannot see the new tuple for r1 in the catalog table pg_authid.
To fix this issue, drop the idle reader gangs after each utility statement
that may modify a catalog table.

Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
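
The fix described above can be pictured with a small toy model. This is only a sketch: the class, its fields, and the may_modify_catalog argument are hypothetical names chosen for illustration and are not taken from the Greenplum source.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy model of a QD session; every name here is hypothetical."""
    idle_reader_gangs: list = field(default_factory=list)
    dispatched: list = field(default_factory=list)

    def run_utility(self, statement: str, may_modify_catalog: bool) -> None:
        # Stand-in for dispatching the statement to the writer gang.
        self.dispatched.append(statement)
        if may_modify_catalog:
            # Idle readers read the catalog through their own local snapshot
            # and would miss the writer's uncommitted changes, so drop them.
            self.idle_reader_gangs.clear()
```

With this model, running "create role r1" with may_modify_catalog=True leaves no idle reader gang behind, so the later "set role r1" cannot reach a reader with a stale view of the catalog.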

28 Oct, 2019 1 commit

Fix various issues in Gang management (#8893) · 797065c5

Authored by Paul Guo on Oct 28, 2019

1. Do not call elog(FATAL) in cleanupQE(), since it can be called from
cdbdisp_destroyDispatcherState() while that function is destroying a
CdbDispatcherState. The FATAL error re-enters cdbdisp_destroyDispatcherState(),
which is not supported. Change the code to return false instead, and add a
sanity check against reentrance. Returning false should be safe because it leads
to the gang being destroyed, and the QE resources are destroyed along with it.
Here is a typical reentrance stack.

0x0000000000b8ffeb in cdbdisp_destroyDispatcherState (ds=0x2eff168) at cdbdisp.c:345
0x0000000000b90385 in cleanup_dispatcher_handle (h=0x2eff0d8) at cdbdisp.c:488
0x0000000000b904c0 in cdbdisp_cleanupDispatcherHandle (owner=0x2e80de0) at cdbdisp.c:555
0x0000000000b27fb7 in CdbResourceOwnerWalker (owner=0x2e80de0, callback=0xb90479 <cdbdisp_cleanupDispatcherHandle>) at resowner.c:1375
0x0000000000b27fd8 in CdbResourceOwnerWalker (owner=0x2f30358, callback=0xb90479 <cdbdisp_cleanupDispatcherHandle>) at resowner.c:1379
0x0000000000b903d9 in AtAbort_DispatcherState () at cdbdisp.c:511
0x000000000053b8ab in AbortTransaction () at xact.c:3319
0x000000000053e057 in AbortOutOfAnyTransaction () at xact.c:5248
0x00000000005c6869 in RemoveTempRelationsCallback (code=1, arg=0) at namespace.c:4088
0x000000000093c193 in shmem_exit (code=1) at ipc.c:257
0x000000000093c088 in proc_exit_prepare (code=1) at ipc.c:214
0x000000000093bf86 in proc_exit (code=1) at ipc.c:104
0x0000000000adb6e2 in errfinish (dummy=0) at elog.c:754
0x0000000000ade465 in elog_finish (elevel=21, fmt=0xe847c0 "cleanup called when a segworker is still busy") at elog.c:1735
0x0000000000beca81 in cleanupQE (segdbDesc=0x2ee9048) at cdbutil.c:846
0x0000000000becbc8 in cdbcomponent_recycleIdleQE (segdbDesc=0x2ee9048, forceDestroy=0 '\000') at cdbutil.c:871
0x0000000000b9815a in RecycleGang (gp=0x2eff7f0, forceDestroy=0 '\000') at cdbgang.c:861
0x0000000000b9009e in cdbdisp_destroyDispatcherState (ds=0x2eff168) at cdbdisp.c:372
0x0000000000b96957 in CdbDispatchCopyStart (cdbCopy=0x2f23828, stmt=0x2e364d0, flags=5) at cdbdisp_query.c:1442

2. Force the reader gang of a named portal to be dropped if a set command was
run earlier, since that setting was not dispatched to the gang and so the gang
should not be reused.

3. Now that DispatcherState is destroyed in the resource owner callback when a
transaction aborts, it no longer needs to be destroyed in some of the
dispatcher code.
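
The idea behind item 1 (report failure instead of raising a fatal error, and guard the destroy routine against re-entry) can be sketched as a toy model. The class and method names below are hypothetical stand-ins, not the actual C functions.

```python
class DispatcherState:
    """Hypothetical stand-in for CdbDispatcherState."""

    def __init__(self, qes):
        self.qes = qes             # each QE is a dict such as {"busy": True}
        self.destroying = False    # sanity check against reentrance
        self.destroyed_qes = []

    def cleanup_qe(self, qe) -> bool:
        # Return False for a busy QE rather than raising a fatal error.
        # A fatal error here would start abort processing, which would
        # call destroy() again while destroy() is still running.
        return not qe["busy"]

    def destroy(self) -> None:
        if self.destroying:
            raise RuntimeError("destroy() re-entered")
        self.destroying = True
        try:
            for qe in self.qes:
                if not self.cleanup_qe(qe):
                    # The QE cannot be recycled, so it is destroyed instead.
                    self.destroyed_qes.append(qe)
            self.qes = []
        finally:
            self.destroying = False
```

A busy QE ends up in destroyed_qes instead of aborting the whole process, and the destroying flag turns an accidental nested call into an immediate, visible error.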

The added test cases and some existing test cases cover almost all of the code
changes except the change in cdbdisp_dispatchX() (I could not find a way to test
it, and I will keep looking for a way to test that or similar code).

Reviewed-by: Pengzhou Tang
Reviewed-by: Asim R P

14 Dec, 2018 2 commits

Remove misc unused code. · aa4466ca

Authored by Heikki Linnakangas on Dec 14, 2018

I don't know what all of these were used for originally, but it's dead
code now.

Remove a bunch of unnecessary #includes. · bb803c72

Authored by Heikki Linnakangas on Dec 14, 2018

29 Oct, 2018 1 commit

Simplify direct dispatch related code (#6080) · 576690f2

Authored by Tang Pengzhou on Oct 29, 2018

* Simplify direct dispatch related code

This commit includes two parts:
* simplify direct-dispatch dispatching code
* simplify direct-dispatch DTM related code

Previously, cdbdisp_dispatchToGang needed a CdbDispatchDirectDesc. Now a gang
only contains the segments in use, so the direct-dispatch info is no longer needed.

In addition, dtmPreCommand had to decide whether DTM is available for direct
dispatch. That logic is complex: you need to know whether the main plan is
direct-dispatched and whether the init plans contain direct dispatch.

One example is:
"update foo set foo.c2 = 2
where foo.c1 = 1 and exists (select * from bar where bar.c1=4)"

The main plan can be direct-dispatched to segment 1 and the init plan can be
direct-dispatched to segment 2. With the old logic, DTM commands such as PREPARE
had to be dispatched to all segments, so dtmPreCommand had to dispatch a DTM
command named 'DTX_PROTOCOL_COMMAND_STAY_AT_OR_BECOME_IMPLIED' to all segments
so that segments that did not receive the plan, such as segment 3, were ready
for two-phase commit.

With the new gang API we can simplify this process: we add a list to
currentGxact that records which segments are actually involved in a two-phase
commit, and then dispatch DTM commands to those segments directly. This is also
very useful for queries on tables that are not fully expanded yet.
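
A minimal sketch of the bookkeeping described above, assuming a toy transaction object: the class and method names are hypothetical, and only the idea of recording the participating segments comes from the commit message.

```python
class GlobalTransaction:
    """Toy stand-in for currentGxact; all names are illustrative."""

    def __init__(self):
        self.participants = set()   # segments that actually received work

    def record_dispatch(self, segment_ids):
        # Called whenever a plan or command is dispatched to some segments.
        self.participants.update(segment_ids)

    def dtm_targets(self):
        # DTM commands such as PREPARE go only to the recorded segments.
        return sorted(self.participants)
```

For the update example, the main plan is recorded for segment 1 and the init plan for segment 2, so dtm_targets() yields [1, 2] and segment 3 never needs to be contacted.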

* support direct dispatch to more than one segment

15 Oct, 2018 1 commit

Retire threaded dispatcher · 87394a7b

Authored by Ning Yu on Oct 15, 2018

Now there is only the async dispatcher.  The dispatcher API interface is
kept so that a new backend can be added in the future.

The GUC gp_connections_per_thread, which was used to switch between the
async and threaded backends, is also retired.

27 Sep, 2018 1 commit

Dispatcher can create flexible size gang (#5701) · a3ddac06

Authored by Tang Pengzhou on Sep 27, 2018

* change type of db_descriptors to SegmentDatabaseDescriptor **

A new gang definition may consist of cached segdbDescs and newly
created segdbDescs, so there is no need to palloc every segdbDesc struct
anew.

* Remove unnecessary allocate gang unit test

* Manage idle segment dbs using CdbComponentDatabases instead of available* lists.

To support variable-size gangs, we now need to manage segment dbs at a finer
granularity. Previously, idle QEs were managed by a set of lists such as
availablePrimaryWriterGang and availableReaderGangsN, which restricted the
dispatcher to creating only N-size (N = number of segments) or 1-size
gangs.

CdbComponentDatabases is a snapshot of the segment components within the
current cluster, and it now maintains a freelist for each segment component.
When creating a gang, the dispatcher builds it from the segment components
(taking a segment db from the freelist or creating a new one). When cleaning
up a gang, the dispatcher returns the idle segment dbs to their segment
components.

CdbComponentDatabases provides a few functions to manipulate segment dbs
(SegmentDatabaseDescriptor *):
* cdbcomponent_getCdbComponents
* cdbcomponent_destroyCdbComponents
* cdbcomponent_allocateIdleSegdb
* cdbcomponent_recycleIdleSegdb
* cdbcomponent_cleanupIdleSegdbs

CdbComponentDatabases is also sensitive to the FTS version: once the FTS
version changes, CdbComponentDatabases destroys all idle segment dbs
and allocates QEs on the newly promoted segment. This makes mirror
failover transparent to users.

Since segment dbs (SegmentDatabaseDescriptor *) are now managed by
CdbComponentDatabases, we can simplify memory context
management by replacing GangContext & perGangContext with
DispatcherContext & CdbComponentsContext.
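
The per-component freelist described in this section can be sketched as follows. This is a toy model under stated assumptions: the class, the tuple used to represent a segment db, and the helper functions are hypothetical and do not mirror the real cdbcomponent_* signatures.

```python
class SegmentComponent:
    """Toy model of one segment component with its own freelist."""

    def __init__(self, segindex):
        self.segindex = segindex
        self.freelist = []     # idle segment dbs ready for reuse
        self.created = 0       # how many segment dbs were ever created

    def allocate(self):
        # Prefer an idle segment db; create a new one only when none is cached.
        if self.freelist:
            return self.freelist.pop()
        self.created += 1
        return (self.segindex, self.created)

    def recycle(self, segdb):
        self.freelist.append(segdb)


def create_gang(components, segments):
    """Build a gang of any size from the listed segment indexes."""
    return [components[seg].allocate() for seg in segments]


def recycle_gang(components, gang):
    """Return every member of a gang to its component's freelist."""
    for segdb in gang:
        components[segdb[0]].recycle(segdb)
```

Because each component owns its freelist, a one-member gang for direct dispatch can reuse a segment db that previously belonged to a full-size gang.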

* Postpone the error handling when creating gang

Now that we have AtAbort_DispatcherState, we can postpone gang error
handling to that function, which makes the code cleaner.

* Handle FTS version change correctly

In some cases, when the FTS version changes, we cannot update the current
snapshot of segment components. More specifically, we cannot
destroy the current writer segment dbs and create new segment dbs.

These cases include:
* the session has created a temp table.
* the query needs two-phase commit and the gxid has been dispatched to
  segments.

* Replace <gangId, sliceId> map with <qeIdentifier, sliceId> map

We used to dispatch a <gangId, sliceId> map along with the query to
segment dbs so that each segment db knew which slice it should
execute.

Now gangId is useless to a segment db because a segment db can
be reused by different gangs, so we need a new way to pass this
information to segment dbs. To resolve this, CdbComponentDatabases
assigns a unique identifier to each segment db and builds, for each
slice, a bitmap set of the segment identifiers. Segment dbs
can then go through the slice table and find the
right slice to execute.
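
The lookup described above can be sketched with plain integers used as bitmap sets. The function names and the dictionary layout of the slice table are assumptions made for illustration only.

```python
def make_slice_table(slices):
    """Map each slice id to a bitmap whose set bits are QE identifiers."""
    table = {}
    for slice_id, qe_ids in slices.items():
        bitmap = 0
        for qe_id in qe_ids:
            bitmap |= 1 << qe_id
        table[slice_id] = bitmap
    return table


def find_slice(table, qe_id):
    """Return the slice whose bitmap contains qe_id, or None if there is none."""
    for slice_id, bitmap in table.items():
        if bitmap & (1 << qe_id):
            return slice_id
    return None
```

A segment db only needs its own identifier to walk the table and pick its slice; it does not need to know which gang it currently belongs to.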

* Allow dispatcher to create vary size gang and refine AssignGangs()

Previously, the dispatcher could only create N-size gangs for
GANGTYPE_PRIMARY_WRITER or GANGTYPE_PRIMARY_READER. This
restricted the dispatcher in many ways. One example is direct
dispatch: it always created an N-size gang even when it only
dispatched the command to one segment. Another example is that
some operations may be able to use an N+ size gang, such as
hash join: if both the inner and outer plans are redistributed,
the hash join node can be associated with an N+ size gang to
execute. This commit changes the API of createGang() so that the
caller can specify a list of segments (partial or even
duplicate segments); CdbComponentDatabases guarantees that
each segment has only one writer in a session. This
also resolves another pain point of AssignGangs(): the caller
no longer needs to promote a GANGTYPE_PRIMARY_READER
to GANGTYPE_PRIMARY_WRITER, or promote a GANGTYPE_SINGLETON_READER
to GANGTYPE_PRIMARY_WRITER for replicated tables
(see FinalizeSliceTree()).

With this commit, AssignGangs() is much clearer.

06 Sep, 2018 1 commit

Integrate Gang management from portal to Dispatcher and simplify AssignGangs for init plans (#5555) · 78a4890a

Authored by Tang Pengzhou on Sep 06, 2018

* Simplify the AssignGangs() logic for init plans

Previously, AssignGangs() assigned gangs for both the main plan and
the init plans in one shot. Because init plans and the main plan are
executed sequentially, gangs can be reused between them; the
function AccumSliceReq() was designed for this.

This process can be simplified: since we already know that the root slice
index id will be adjusted according to the init plan id, each init plan
only needs to assign its own slices.

* Integrate Gang management from portal to Dispatcher

Previously, gangs were managed by the portal, and freeGangsForPortal()
was used to clean up gang resources. DTM related commands also
needed a gang to dispatch commands outside of a portal and
used freeGangsForPortal() too. Multiple commands/plans/utilities
might be executed within one portal, and all of them
relied on a dispatcher routine such as CdbDispatchCommand /
CdbDispatchPlan / CdbDispatchUtility... to dispatch. Gangs were
created by each dispatcher routine but were not recycled or
destroyed when the routine finished, except for the primary writer
gang. One defect of this was that gang resources could not be reused
between dispatcher routines. GPDB already had an optimization
for init plans: if a plan contained init plans, AssignGangs
was called before any of them was executed. It went through
the whole slice tree and created the maximum gang that both the
main plan and the init plans needed. This was doable because init
plans and the main plan were executed sequentially, but it also
made the AssignGangs logic complex; moreover, reusing an unclean
gang was not safe.

Another confusing thing was that the gang and the dispatcher were
managed separately, which caused inconsistent contexts: for example,
when a dispatcher state was destroyed, the gang was not recycled, and
when a gang was destroyed by the portal, the dispatcher state was
still in use and might refer to the context of a destroyed gang.

To address the problems described above, this commit integrates gang
management with the dispatcher: a dispatcher state is responsible for
creating and tracking gangs as needed and destroys them when the
dispatcher state is destroyed.

* Handle the case when primary writer gang has gone

When members of the primary writer gang are gone, the writer gang
is destroyed immediately (primaryWriterGang is set to NULL)
when a dispatcher routine (e.g. CdbDispatchCommand) finishes.
So when dispatching two-phase DTM/DTX related commands, the QD
doesn't know that the writer gang is gone, and it may get unexpected
errors like 'savepoint not exist', 'subtransaction level not
match', 'temp file not exist'.

Previously, primaryWriterGang was not reset when DTM/DTX
commands started even if it pointed to invalid segments, so
those DTM/DTX commands were not actually sent to segments;
a typical error reported on the QD looked like 'could not
connect to segment: initialization of segworker'.

So we need a way to inform the global transaction that its writer
gang has been lost, so that when aborting the transaction, the QD can:
1. disconnect all reader gangs; this is useful to skip
dispatching "ABORT_NO_PREPARE"
2. reset the session and drop temp files, because the temp files on
the segment are gone.
3. report an error when dispatching the "rollback savepoint" DTX,
because the savepoint on the segment is gone.
4. report an error when dispatching the "abort subtransaction" DTX,
because the subtransaction is rolled back when the writer segment is down.

14 Aug, 2018 2 commits

Remove cdbdisp_finishCommand · 957629d1

Authored by Pengzhou Tang on Jul 31, 2018

Previously, cdbdisp_finishCommand did three things:
1. cdbdisp_checkDispatchResult
2. cdbdisp_getDispatchResult
3. cdbdisp_destroyDispatcherState

However, cdbdisp_finishCommand didn't make the code cleaner or more
convenient to use; on the contrary, it made error handling more
difficult and made the code more complicated and inconsistent.

This commit also resets estate->dispatcherState to NULL to avoid
re-entry of cdbdisp_* functions.

Rename CdbCheckDispatchResult for name convention · 60bd3ab2

Authored by Pengzhou Tang on Jul 31, 2018

Use cdbdisp_checkDispatchResult instead of CdbCheckDispatchResult
to be consistent with the other cdbdisp_* functions.

09 May, 2018 1 commit

Refactor interconnect and dispatcher resource cleanup · b0353e0a

Authored by xiong-gang on May 09, 2018

Use resource owner to do the cleanup of dispatcher and interconnect (#4761)

01 Mar, 2018 1 commit

Give a better error message, if preparing an xact fails. · b3c50e40

Authored by Heikki Linnakangas on Mar 01, 2018

If an error happens in the prepare phase of two-phase commit, relay the
original error back to the client, instead of the fairly opaque
"'Abort [Prepared]' broadcast failed to one or more segments" message you
got previously. A lot of things happen during the prepare phase that
can legitimately fail, like checking deferred constraints, like in the
'constraints' regression test. But even without that, there can be
triggers, ON COMMIT actions, etc., any of which can fail.

This commit consists of several parts:

* Pass 'true' for the 'raiseError' argument when dispatching the prepare
  dtx command in doPrepareTransaction(), so that the error is emitted to
  the client.

* Bubble up an ErrorData struct, with as many fields intact as possible,
  to the caller when dispatching a dtx command, instead of constructing
  a message in a StringInfo, so that we can re-throw the message to
  the client with its original formatting.

* Don't throw an error in performDtxProtocolCommand(), if we try to abort
  a prepared transaction that doesn't exist. That is business-as-usual,
  if a transaction throws an error before finishing the prepare phase.

* Suppress the "NOTICE: Releasing segworker groups to retry broadcast."
  message, when aborting a prepared transaction.

Put together, the effect is that if an error happens during the prepare phase,
the client receives a message that is largely indistinguishable from the
message you'd get if the same failure happened while running a normal
statement.

Fixes GitHub issue #4530.

25 Jan, 2018 1 commit

Propagate segment errcodes to dispatcher · 58003bc7

Authored by Daniel Gustafsson on Jan 25, 2018

The errcode thrown in an ereport() on a segment was passed back to
the dispatcher, but then dropped and replaced with a default errcode
of ERRCODE_DATA_EXCEPTION. This works for most situations, but when
trapping errors the exact errcode must be propagated. This extends
the API to extract the errcode as well. The below case illustrates
the previous issue:

  CREATE TABLE test1(id int primary key);
  CREATE TABLE test2(id int primary key);
  INSERT INTO test1 VALUES(1);
  INSERT INTO test2 VALUES(1);
  CREATE OR REPLACE FUNCTION merge_table() RETURNS void AS $$
  DECLARE
    v_insert_sql varchar;
  BEGIN
    v_insert_sql := 'INSERT INTO test1 SELECT * FROM test2';
    EXECUTE v_insert_sql;
  EXCEPTION WHEN unique_violation THEN
    RAISE NOTICE 'unique_violation';
  END;
  $$ LANGUAGE plpgsql volatile;
  SELECT merge_table();
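
Why the handler in this case could not fire before the fix can be shown with a small model. The SQLSTATE values 22000 (data_exception) and 23505 (unique_violation) are the standard PostgreSQL codes; the class and function names are hypothetical and only illustrate the relay step.

```python
ERRCODE_DATA_EXCEPTION = "22000"     # default code used before the fix
ERRCODE_UNIQUE_VIOLATION = "23505"   # code raised on the segment


class SegmentError(Exception):
    """Toy error object carrying a SQLSTATE, as reported by a segment."""

    def __init__(self, sqlstate, message):
        super().__init__(message)
        self.sqlstate = sqlstate


def relay_error(segment_error, propagate_errcode):
    """Model how the dispatcher re-raises a segment error."""
    if propagate_errcode:
        sqlstate = segment_error.sqlstate      # behavior after the fix
    else:
        sqlstate = ERRCODE_DATA_EXCEPTION      # behavior before the fix
    return SegmentError(sqlstate, str(segment_error))


def handler_matches(error, condition_sqlstate):
    """An EXCEPTION WHEN clause fires only when the SQLSTATE matches."""
    return error.sqlstate == condition_sqlstate
```

With propagate_errcode=False the relayed error carries 22000, so a WHEN unique_violation clause never matches; with propagation enabled the original 23505 survives and the handler runs.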

02 Nov, 2017 1 commit

Wake up faster, if a segment returns an error. · 3bbedbe9

Authored by Heikki Linnakangas on Nov 02, 2017

Previously, if a segment reported an error after starting up the
interconnect, it would take up to 250 ms for the main thread in the QD
process to wake up and poll the dispatcher connections, and to see that
there was an error. Shorten that time, by waking up immediately if the
QD->QE libpq socket becomes readable while we're waiting for data to
arrive in a Motion node.

This isn't a complete solution, because this will only wake up if one
arbitrarily chosen connection becomes readable, and we still rely on
polling for the others. But this greatly speeds up many common scenarios.
In particular, the "qp_functions_in_select" test now runs in under 5 s
on my laptop, when it took about 60 seconds before.
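
The wake-up idea can be illustrated with a generic select() call that watches a data socket and a dispatcher socket together. This is a sketch of the general technique, not the Greenplum code: the function name, the return values, and the use of one dispatcher socket are assumptions.

```python
import select
import socket
import time


def wait_for_motion_data(motion_sock, dispatch_sock, poll_interval=0.25):
    """Block until data arrives, the dispatcher socket is readable, or timeout.

    Returns "dispatcher", "data" or "timeout".
    """
    readable, _, _ = select.select(
        [motion_sock, dispatch_sock], [], [], poll_interval
    )
    if dispatch_sock in readable:
        # A QE sent something on its libpq connection (possibly an error),
        # so the caller should check the dispatch results right away.
        return "dispatcher"
    if motion_sock in readable:
        return "data"
    return "timeout"
```

Including the dispatcher socket in the same wait means an error report ends the wait at once instead of being noticed only on the next polling cycle.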

01 Sep, 2017 1 commit

Fix Copyright and file headers across the tree · ed7414ee

Authored by Daniel Gustafsson on Sep 01, 2017

This bumps the copyright years to the appropriate years after not
having been updated for some time. Also reformats existing code
headers to match the upstream style to ensure consistency.

14 Nov, 2016 1 commit

Use nonblocking mechanism to send data in async dispatcher. · 2516eac6

Authored by xiong-gang on Nov 14, 2016

pqFlush sends data synchronously even though the socket is set to
O_NONBLOCK, which degrades performance. This commit uses
pqFlushNonBlocking instead, and waits for dispatching to all gangs
to complete before query execution.

Signed-off-by: Kenan Yao <kyao@pivotal.io>
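
The general technique (interleaved non-blocking sends, followed by a wait until every connection has been fully flushed) can be sketched with plain sockets. This is not libpq code; dispatch_to_all and its arguments are hypothetical.

```python
import select
import socket


def dispatch_to_all(socks, payload):
    """Send payload on every socket without blocking on any single one.

    Returns only after all sockets have accepted the whole payload.
    """
    pending = {s: memoryview(payload) for s in socks}
    for s in socks:
        s.setblocking(False)
    while pending:
        # Wait until at least one socket can accept more bytes.
        _, writable, _ = select.select([], list(pending), [])
        for s in writable:
            try:
                sent = s.send(pending[s])
            except BlockingIOError:
                continue                # buffer filled up again; retry later
            rest = pending[s][sent:]
            if len(rest):
                pending[s] = rest       # partial write: keep the remainder
            else:
                del pending[s]          # this connection is fully flushed
```

A slow receiver only delays its own connection; the other connections keep making progress while the loop waits for all of them to finish.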

04 Nov, 2016 1 commit

Rename the interface routines of dispatcher, and add a README for illustration · be13fd00

Authored by xiong-gang on Nov 04, 2016

Signed-off-by: Kenan Yao <kyao@pivotal.io>

13 Sep, 2016 1 commit

Speed up QE cancel when one or more QEs got errors · 39ed6031

Authored by Pengzhou Tang on Sep 05, 2016

The QD needs to cancel QEs when
1) the QD gets an error, or
2) one or more QEs get an error and cancelOnError is set to true.

We want to cancel QEs as soon as possible once either condition is met, but since
cancelling QEs is expensive, we want to process as many QEs that are about to finish
as possible before actually cancelling. The original interval before cancelling was
2 seconds, which is long enough that users see an obvious delay before errors are
reported. This commit lowers the interval to 100 ms to speed up the cancelling process.
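
The timing rule can be written as a tiny predicate. Only the 2 second and 100 ms values come from the commit message; the function and constant names are hypothetical.

```python
OLD_GRACE_SECONDS = 2.0   # interval before this commit
NEW_GRACE_SECONDS = 0.1   # interval after this commit (100 ms)


def should_cancel(error_seen_at, now, grace=NEW_GRACE_SECONDS):
    """Return True once the grace interval since the first error has passed.

    error_seen_at is None while no error has been observed.
    """
    if error_seen_at is None:
        return False
    return now - error_seen_at >= grace
```

During the grace interval the QD keeps collecting results from QEs that are about to finish, and only the QEs still running afterwards are cancelled.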

25 Jul, 2016 1 commit

Refactor utility statement dispatch interfaces · 01769ada

Authored by Pengzhou Tang on Jul 08, 2016

Refactor CdbDispatchUtilityStatement() to make it flexible enough for cdbCopyStart() and
dispatchVacuum() to call directly. Introduce flags like DF_NEED_TWO_SNAPSHOT,
DF_WITH_SNAPSHOT, DF_CANCEL_ON_ERROR to make function calls much clearer.
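
How a flag argument makes call sites self-describing can be sketched with a bit-flag type. The flag names are the ones listed in the commit message; the bit values, the DF_NONE member, and the dispatch_utility function are assumptions made for this illustration.

```python
from enum import IntFlag


class DispatchFlags(IntFlag):
    # Names follow the commit message; the numeric values are made up here.
    DF_NONE = 0
    DF_CANCEL_ON_ERROR = 1
    DF_NEED_TWO_SNAPSHOT = 2
    DF_WITH_SNAPSHOT = 4


def dispatch_utility(statement, flags):
    """Hypothetical dispatcher entry point that decodes the flag bits."""
    return {
        "statement": statement,
        "cancel_on_error": bool(flags & DispatchFlags.DF_CANCEL_ON_ERROR),
        "need_two_snapshot": bool(flags & DispatchFlags.DF_NEED_TWO_SNAPSHOT),
        "with_snapshot": bool(flags & DispatchFlags.DF_WITH_SNAPSHOT),
    }
```

A call such as dispatch_utility("VACUUM t", DispatchFlags.DF_CANCEL_ON_ERROR | DispatchFlags.DF_WITH_SNAPSHOT) states its options by name, which is easier to read than a row of positional booleans.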

17 Jul, 2016 2 commits

Add asynchronous implementation of dispatcher · c0fa7236

Authored by Gang Xiong on Jun 23, 2016

Refactor interface between cdbdisp.c and cdbdisp_thread.c · 5cc587ff

Authored by Gang Xiong on Jun 22, 2016

22 Jun, 2016 1 commit

Remove unnecessary argument of cdbdisp_makeDispatcherState · a2ecd1fa

Authored by Gang Xiong on Jun 22, 2016

19 May, 2016 1 commit

Split cdbdisp.c into several files, and put them into a new dispatcher/ directory · 895b7d50

Authored by Pengzhou Tang on May 12, 2016

This commit has no logic change; it only moves code across
files to make the dispatcher code clearer and easier to unit test.

Signed-off-by: Kenan Yao

06 May, 2016 1 commit

refactor dispatcher code · 7363a6d9

Authored by Gang Xiong on Apr 23, 2016

refactor cdbdisp_dispatchToGang interface.
refactor memory management in dispatch.

12 Feb, 2016 1 commit

Misc header file cleanup · 442c105e

Authored by Heikki Linnakangas on Feb 11, 2016

Remove unnecessary #includes, add #includes that are actually needed by
some headers.

21 Dec, 2015 1 commit

Fix Errors caused by SET command if cursor is declared · d2725929

Authored by Pengzhou Tang on Dec 11, 2015

The SET command takes effect for the whole session, so all existing idle gangs must
receive the setting before later reuse. Busy gangs held by cursors, however, raise
errors if they receive a SET command. The fix is to mark busy gangs as not reusable
so that they are destroyed after their cursors are closed.
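
The fix can be pictured with a toy model of gangs and a SET dispatch. All names below (Gang, no_reuse, dispatch_set, close_cursor) are hypothetical and only illustrate the rule described above.

```python
class Gang:
    """Toy gang: busy while a cursor uses it, idle otherwise."""

    def __init__(self, name, busy):
        self.name = name
        self.busy = busy
        self.no_reuse = False
        self.settings = {}


def dispatch_set(gangs, key, value):
    """Apply a session setting to idle gangs and flag busy ones."""
    for gang in gangs:
        if gang.busy:
            # A busy gang cannot accept the SET command, so it must not
            # be reused later with a stale setting.
            gang.no_reuse = True
        else:
            gang.settings[key] = value


def close_cursor(gangs, gang):
    """Release a gang when its cursor closes; drop it if flagged."""
    gang.busy = False
    if gang.no_reuse:
        gangs.remove(gang)
```

After the cursor closes, the flagged gang is dropped rather than returned to the idle pool, so every gang that remains has seen the SET command.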

28 Oct, 2015 1 commit

Import Greenplum source code. · 6b0e52be

Authored by Initial Greenplum code dump on Oct 23, 2015