- 29 Sep, 2018 3 commits
-
-
Committed by Paul Guo
PostgreSQL 9.4 introduced the WITH syntax for specifying options in CREATE TABLESPACE. Greenplum previously used the OPTIONS syntax to support per-segment locations. Unify the two so that only the WITH syntax is used, following upstream. Note that the Greenplum-specific OPTIONS syntax exists in gpdb master only.
-
Committed by Taylor Vesely
Because the optimizer has its own memory management system and does not use AllocationSets, the only way to know how much memory the optimizer is using is to intercept its calls to malloc() and free(). When the GUC 'optimizer_use_gpdb_allocators' is set to 'true', ORCA replaces its native alloc() and free() methods with Ext_OptimizerAlloc() and Ext_OptimizerFree(). These calls track the total memory usage in the active 'Optimizer' memory account, and the total outstanding memory between queries in OptimizerOutstandingMemoryBalance.

This is a problem when accounting for ORCA memory in the 'X_NestedExecutor' account: unless OptimizerOutstandingMemoryBalance is added, ORCA can free memory that was allocated in a previous query and underflow the account.

Both to get an accurate idea of how much memory the optimizer is using and to prevent problems with the 'X_NestedExecutor' account, it makes sense to track ORCA's memory usage in a single account. Therefore, create only one 'Optimizer' account per query, no matter how many times the optimizer is called.

Co-authored-by: David Kimura <dkimura@pivotal.io>
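The underflow described above can be illustrated with a toy model. This is only a sketch: the `Account` class, the byte counts, and the idea of "seeding" an account with the outstanding balance are illustrative assumptions, not the actual GPDB memory accounting code.

```python
class Account:
    """Toy memory account: bytes allocated minus bytes freed (hypothetical)."""

    def __init__(self, name, starting_balance=0):
        self.name = name
        self.balance = starting_balance

    def alloc(self, nbytes):
        self.balance += nbytes

    def free(self, nbytes):
        self.balance -= nbytes


# Assumed figure: memory ORCA still holds from a previous query.
OUTSTANDING_FROM_PREVIOUS_QUERY = 4096


def unseeded_account_balance():
    """An account that ignores the outstanding balance can go negative."""
    acct = Account("X_NestedExecutor")
    acct.alloc(1024)
    # ORCA releases memory it allocated during the previous query.
    acct.free(OUTSTANDING_FROM_PREVIOUS_QUERY)
    return acct.balance


def single_optimizer_account_balance():
    """A single 'Optimizer' account that includes the outstanding balance."""
    acct = Account("Optimizer", starting_balance=OUTSTANDING_FROM_PREVIOUS_QUERY)
    acct.alloc(1024)
    acct.free(OUTSTANDING_FROM_PREVIOUS_QUERY)
    return acct.balance
```

In this model the unseeded account ends up below zero, while the single account that carries the outstanding balance stays non-negative.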
-
Committed by Taylor Vesely
The memory accounting system generates a new memory account for every execution node initialized in ExecInitNode. The addresses of these memory accounts are stored in the shortLivingMemoryAccountArray. If the memory allocated for shortLivingMemoryAccountArray is full, we repalloc the array with double the number of available entries. After creating approximately 67000000 memory accounts, the array needs more than 1GB of memory to grow again, and the allocation throws an ERROR, canceling the running query.

PL/pgSQL and SQL functions create new executors/plan nodes that must be tracked by the memory accounting system. This level of detail is not necessary for tracking memory leaks, and creating a separate memory account for every executor uses a large amount of memory just to track these memory accounts.

Instead of tracking millions of individual memory accounts, we consolidate any child executor account into a special 'X_NestedExecutor' account. If explain_memory_verbosity is set to 'detailed' or below, all child executors are consolidated into this account. If more detail is needed for debugging, set explain_memory_verbosity to 'debug', where, as in the previous behavior, every executor is assigned its own MemoryAccountId.

Originally we tried to remove nested execution accounts after they finish executing, but rolling over those accounts into an 'X_NestedExecutor' account was impractical to accomplish without the possibility of a future regression. If any accounts are created between nested executors that are not rolled over to an 'X_NestedExecutor' account, the record of which accounts are rolled over can grow in the same way that the shortLivingMemoryAccountArray grows today, and would also grow too large to reasonably fit in memory. Iterating through the SharedHeaders every time a nested executor finishes is not likely to perform well either.

While we were at it, convert some of the convenience macros dealing with memory accounting for executor / planner nodes into functions, and move them out of the memory accounting header files into the sole callers' compilation units.

Co-authored-by: Ashwin Agrawal <aagrawal@pivotal.io>
Co-authored-by: Ekta Khanna <ekhanna@pivotal.io>
Co-authored-by: Adam Berlin <aberlin@pivotal.io>
Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
Co-authored-by: Melanie Plageman <mplageman@pivotal.io>
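The "approximately 67000000" figure can be reproduced with a little arithmetic. The sketch below assumes 8-byte pointers (a 64-bit build) and a hypothetical initial capacity of 64 entries; the allocation limit is PostgreSQL's MaxAllocSize (1 GB minus one byte).

```python
MAX_ALLOC_SIZE = 0x3FFFFFFF  # PostgreSQL's MaxAllocSize: 1 GB - 1 byte
POINTER_SIZE = 8             # assumption: one 8-byte pointer per account
INITIAL_ENTRIES = 64         # assumption: hypothetical starting capacity


def entries_before_failure(initial_entries=INITIAL_ENTRIES,
                           pointer_size=POINTER_SIZE):
    """Return the array capacity at which the next doubling would fail.

    The array doubles whenever it is full; the doubling fails once the
    requested size would exceed MAX_ALLOC_SIZE.
    """
    capacity = initial_entries
    while capacity * 2 * pointer_size <= MAX_ALLOC_SIZE:
        capacity *= 2
    return capacity
```

Under these assumptions the array tops out at 2^26 = 67,108,864 entries (512 MB); the next doubling would request exactly 1 GB, one byte over the limit.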
-
- 28 Sep, 2018 13 commits
-
-
Committed by Daniel Gustafsson
This is a backport of the commit below from postgres 12dev, which in turn is a patch that was influenced by an optimization from the previous version of the Greenplum Window code. The idea is to order the Sort nodes based on sort prefixes, such that sorts can be reused by subsequent nodes. As this uses EXPLAIN in the test output, a new expected file is added for ORCA output even though the patch only touches the postgres planner.

    commit 728202b6
    Author: Andrew Gierth <rhodiumtoad@postgresql.org>
    Date: Fri Sep 14 17:35:42 2018 +0100

    Order active window clauses for greater reuse of Sort nodes.

    By sorting the active window list lexicographically by the sort clause list but putting longer clauses before shorter prefixes, we generate more chances to elide Sort nodes when building the path.

    Author: Daniel Gustafsson (with some editorialization by me)
    Reviewed-by: Alexander Kuzmenkov, Masahiko Sawada, Tom Lane
    Discussion: https://postgr.es/m/124A7F69-84CD-435B-BA0E-2695BE21E5C2%40yesql.se
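The ordering rule in the upstream message can be sketched as follows. This is an illustration only: each window clause is reduced to a tuple of column names, whereas the planner compares real sort clause structures, and `count_sorts` is a simplified stand-in for the path-building logic.

```python
from functools import cmp_to_key


def compare_sort_lists(a, b):
    """Lexicographic comparison, except that a longer list sorts before
    any of its own prefixes."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # One list is a prefix of the other (or they are equal): longer first.
    return len(b) - len(a)


def order_window_clauses(clauses):
    """Order window clauses so that prefixes follow the lists they extend."""
    return sorted(clauses, key=cmp_to_key(compare_sort_lists))


def count_sorts(ordered):
    """Count the Sort steps needed if a sort is reused whenever the next
    clause's sort list is a prefix of the current sort order."""
    sorts = 0
    current = None
    for clause in ordered:
        if current is None or current[:len(clause)] != clause:
            sorts += 1
            current = clause
    return sorts


clauses = [("a",), ("a", "b"), ("c",), ("a", "b", "c")]
ordered = order_window_clauses(clauses)
```

With the sample clauses, the original order needs four sorts in this model, while the reordered list needs two because `("a", "b")` and `("a",)` reuse the sort on `("a", "b", "c")`.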
-
Committed by Daniel Gustafsson
There were a few broken queries in the test suites that were not broken on purpose to test the parser/grammar. This fixes the ones that stood out, but there are likely more in ignore blocks that slipped through the cracks.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
-
Committed by Heikki Linnakangas
The purpose of this code was to treat the first ORDER BY column in a window agg like "ROW_NUMBER() OVER (ORDER BY x RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING)" the same way as volatile expressions, and add it to the target list as is. That was to ensure that it would be available for computing the window bounds. But upstream commit a2099360, merged as part of the 9.3 merge, got rid of the distinction between volatile and non-volatile expressions, so we no longer need to treat the first ORDER BY column any differently either.
-
Committed by Heikki Linnakangas
These were swapped. It's been wrong ever since we merged the operator family patch, during the 8.3 merge. But apparently it wasn't causing any ill effect, or at least I was not able to find a case that would fail because of it. This was caught by new sanity checks in the 'opr_sanity' regression test, introduced in the upcoming 9.4 merge.
-
Committed by Heikki Linnakangas
It's not cool to use the raw xmax value as part of the cache key. If the raw xmax represents a multi-xid, the real deleter XID would be something else. We could get fooled, if we cached a multi-XID value, and later saw a tuple with a regular xmax, with the same numerical value as the cached multi-XID. I think this was actually broken before the 9.3 merge already. If a transaction locked a tuple, and deleted another tuple, and a concurrent scan sees the locked tuple first, it might think that the deleted tuple is also visible to it, because it has the same xmin+xmax combination as the locked tuple.
-
Committed by Heikki Linnakangas
make_windowInputTargetList() seems like a better place for this code, as suggested by the FIXME comment that was left here in the 9.3 merge.
-
Committed by Heikki Linnakangas
The new regression tests revealed that it doesn't work. With an assertion-enabled ORCA build, I got an assertion failure like this:

    2018-08-23 11:20:08:371479 EEST,THD000,ERROR,""/home/heikki/gpdb/optimizer-main/libgpos/include/gpos/common/CDynamicPtrArray.h:300: Failed assertion: pos < m_size && ""Out of bounds access""
    Stack trace:
    1 0x00007f363fb3e78a gpos::CException::Raise + 252
    2 0x00007f3640be0970 gpos::CDynamicPtrArray + 84
    3 0x00007f3640c93dac gpopt::CWindowPreprocessor::SplitPrjList + 1162
    4 0x00007f3640c9404b gpopt::CWindowPreprocessor::SplitSeqPrj + 303
    5 0x00007f3640c94b61 gpopt::CWindowPreprocessor::PexprSeqPrj2Join + 357
    6 0x00007f3640c95276 gpopt::CWindowPreprocessor::PexprPreprocess + 316
    7 0x00007f3640c240a2 gpopt::CExpressionPreprocessor::PexprPreprocess + 1098
    8 0x00007f3640bc2d62 gpopt::CQueryContext::CQueryContext + 696
    9 0x00007f3640bc36df gpopt::CQueryContext::PqcGenerate + 1413
    10 0x00007f3640c95d86 gpopt::COptimizer::PdxlnOptimize + 1042
    11 0x000055b2e8252f26 COptTasks::OptimizeTask + 1488
    12 0x00007f363fb58a0d gpos::CTask::Execute + 183
    13 0x00007f363fb5d447 gpos::CWorker::Execute + 199
    14 0x00007f363fb56d77 gpos::CAutoTaskProxy::Execute + 287
    15 0x00007f363fb3479b gpos_exec + 800
    "",",,,,,,"explain select dt, pn, sum(distinct pn) over (partition by dt), sum(pn) over (partition by dt) from sale;",0,,"COptTasks.cpp",545,
    2018-08-23 11:20:08.372392 EEST,"heikki","postgres",p19807,th1163394560,"[local]",,2018-08-23 11:19:53 EEST,0,con4,cmd7,seg-1,,dx6,,sx1,"LOG","00000","Planner produced plan :1",,,,,,"explain select dt, pn, sum(distinct pn) over (partition by dt), sum(pn) over (partition by dt) from sale;",0,,"orca.c",61,

This caused the query to fall back to the planner, which worked. But with assertions disabled, it crashed instead.

We should fix ORCA to deal with that. One option is to rip out all the special code to plan DISTINCT-qualified aggregates in ORCA, and just pass the windistinct flag through to the executor. That's basically what the Postgres planner does, and the executor will deal with deduplicating the input. But for now, let's just stop the crashing.
-
Committed by Heikki Linnakangas
GPDB 5 supported DISTINCT in window aggregates, e.g.:

    COUNT(DISTINCT x) OVER (PARTITION BY y)

However, PostgreSQL does not support that, and as a result, GPDB lost that capability as part of the window functions rewrite, too. Upstream has an explicit check for that, but the check was lost in the window function rewrite. So the parser accepted the syntax, but the aggregate was executed just as if there was no DISTINCT. There were also no tests that would return a different result with DISTINCT than without, which is why no one noticed it.

To fix, implement the DISTINCT support, to the same extent that the old implementation supported it. The new implementation adds a little sort + deduplicate step for each DISTINCT aggregate. I'm not sure how this compares with the old implementation performance-wise, but at least it works now. Also, add the missing tests.
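The "sort + deduplicate" idea can be sketched for `COUNT(DISTINCT x) OVER (PARTITION BY y)`. This is a plain Python model, not the executor code; the function name and the dictionary row format are assumptions made for illustration.

```python
from itertools import groupby


def count_distinct_over_partition(rows, partition_key, value_key):
    """Emulate COUNT(DISTINCT value) OVER (PARTITION BY partition).

    For each partition, the input values are sorted and adjacent
    duplicates are collapsed before counting. NULLs (None) are ignored,
    as COUNT does in SQL.
    """
    by_partition = {}
    for row in rows:
        by_partition.setdefault(row[partition_key], []).append(row[value_key])

    counts = {}
    for part, values in by_partition.items():
        ordered = sorted(v for v in values if v is not None)  # sort step
        counts[part] = sum(1 for _ in groupby(ordered))        # deduplicate step

    # A window aggregate returns one result per input row.
    return [counts[row[partition_key]] for row in rows]


rows = [
    {"y": 1, "x": 10},
    {"y": 1, "x": 10},
    {"y": 1, "x": 20},
    {"y": 2, "x": None},
    {"y": 2, "x": 5},
]
result = count_distinct_over_partition(rows, "y", "x")
```

Without the deduplicate step, the first partition would report 3 instead of 2, which is the kind of difference the missing tests would have caught.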
-
Committed by Pengzhou Tang
When TCP connections cannot be set up for a long time, we check whether some segments have already failed. The check is a high-cost operation, so we set its interval to 2 seconds. We used to record the interval with a counter, which is not reliable because a loop cycle (500ms) may be interrupted early by EINTR/EAGAIN from select(). To avoid affecting the setup performance of the TCP interconnect, we need to make the interval mechanism more reliable.
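One way to make such an interval independent of how many loop iterations have run is to measure elapsed time directly. The sketch below is not the interconnect code; `IntervalChecker` and its injectable clock are hypothetical names used to show the idea.

```python
import time


class IntervalChecker:
    """Report when at least `interval_seconds` have elapsed since the
    last time the check was due, regardless of loop iteration count."""

    def __init__(self, interval_seconds=2.0, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock          # injectable so it can be tested
        self._last = clock()

    def due(self):
        now = self._clock()
        if now - self._last >= self._interval:
            self._last = now
            return True
        return False
```

A wait loop would call `due()` after each `select()` return; an early wakeup simply yields `False` until two seconds of wall-clock time have really passed, whereas a counter of 500ms cycles would have advanced anyway.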
-
Committed by ZhangJackey
There was an assumption in gpdb that a table's data is always distributed on all segments. However, this is not always true: for example, when a cluster is expanded from M segments to N (N > M), all the tables are still on M segments. To work around the problem, we used to have to alter all the hash distributed tables to randomly distributed to get correct query results, at the cost of bad performance.

Now we support table data being distributed on a subset of segments.

A new column `numsegments` is added to the catalog table `gp_distribution_policy` to record how many segments a table's data is distributed on. By doing so we can allow DMLs on M tables; joins between M and N tables are also supported.

```sql
-- t1 and t2 are both distributed on (c1, c2),
-- one on 1 segment, the other on 2 segments
select localoid::regclass, attrnums, policytype, numsegments
    from gp_distribution_policy;
 localoid | attrnums | policytype | numsegments
----------+----------+------------+-------------
 t1       | {1,2}    | p          |           1
 t2       | {1,2}    | p          |           2
(2 rows)

-- t1 and t1 have exactly the same distribution policy,
-- join locally
explain select * from t1 a join t1 b using (c1, c2);
                   QUERY PLAN
------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)
   ->  Hash Join
         Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
         ->  Seq Scan on t1 a
         ->  Hash
               ->  Seq Scan on t1 b
 Optimizer: legacy query optimizer

-- t1 and t2 are both distributed on (c1, c2),
-- but as they have different numsegments,
-- one has to be redistributed
explain select * from t1 a join t2 b using (c1, c2);
                            QUERY PLAN
------------------------------------------------------------------
 Gather Motion 1:1  (slice2; segments: 1)
   ->  Hash Join
         Hash Cond: a.c1 = b.c1 AND a.c2 = b.c2
         ->  Seq Scan on t1 a
         ->  Hash
               ->  Redistribute Motion 2:1  (slice1; segments: 2)
                     Hash Key: b.c1, b.c2
                     ->  Seq Scan on t2 b
 Optimizer: legacy query optimizer
```
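Why tables with different `numsegments` cannot be joined locally can be seen with a toy placement function. The hash below and the modulo reduction are illustrative assumptions, not Greenplum's actual hashing; the point is only that the same key can land on different segments when the segment count differs.

```python
def segment_for(key, numsegments):
    """Toy placement: hash the distribution key (c1, c2) and reduce it
    to a segment index. The hash formula is a stand-in."""
    c1, c2 = key
    return (c1 * 31 + c2) % numsegments


def colocated(key, left_numsegments, right_numsegments):
    """True if a row with this key lives on the same segment in both tables."""
    return (segment_for(key, left_numsegments)
            == segment_for(key, right_numsegments))
```

In this model, key `(1, 2)` is on segment 0 of a 1-segment table but on segment 1 of a 2-segment table, so matching rows are not guaranteed to be on the same segment and one side has to be moved, as in the second plan above.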
-
Committed by mkiyama
-
Committed by Mel Kiyama
- How to find the boostfs config. guide
- How to find the boostfs RPM
-
Committed by Lisa Owen
* docs - sql and catalog ref page updates for i/o conversion casts
* address comments from heikki
-
- 27 Sep, 2018 12 commits
-
-
Committed by Joao Pereira
The commit 7605710c did not update the yml file with the pipeline configuration for master.
-
Committed by David Kimura
- Due to changes in the structure of gpaddon we can no longer use the resource gpaddon_src to compile 5X for Binary Swap Jobs.
- From this point on we should use the 5X_RELEASE tag on gpaddon to compile Greenplum for these jobs.
- Change the expected quicklz error message while building OSS Greenplum
- Explicitly add Greenplum bin folder to the path
- Add back the rsync of the quicklz addon folder

  This was added back to ensure that the enterprise build still works correctly
- Use the correct branch to compile Binary Swap version

Ensure that quicklz is not built for Windows

We will not support, at this time, the compilation and installation of quicklz for our Windows build

Signed-off-by: Joao Pereira <jdealmeidapereira@pivotal.io>
-
Committed by Heikki Linnakangas
The proprietary build can install them as normal C language functions, with CREATE FUNCTION, instead. In passing, remove some unused QuickLZ debugging GUCs.

This doesn't yet get rid of all references to QuickLZ, unfortunately. The GUC and reloption validation code still needs to know about it, so that it can validate the options read from postgresql.conf when starting up postmaster. For the same reason, you cannot yet add custom compression algorithms, besides quicklz, as an extension. But this is another step in the right direction, anyway.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
Co-authored-by: Joao Pereira <jdealmeidapereira@pivotal.io>
-
Committed by Daniel Gustafsson
The comment states that "small" might be defined by socket.h, and while that's not true for all versions of sys/socket.h, it's still not a good name to use as it's common in Windows headers (should we ever revive a Windows port). Renaming to a non-colliding name is a small price to pay to avoid subtle bugs, so rename and remove the preprocessor dance.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
-
Committed by Daniel Gustafsson
The test suite, which was ported over from TINC, was ignoring so much of the memorized output that it more or less didn't test anything (and the ignored blocks were as full of outdated output as one would imagine). The code was also formatted in weird ways and had needless NOTICEs thrown during execution. This refactors the test suite to remove all ignore blocks, removes some utterly pointless tests (there are many more of them left), formats the code to be readable, fixes the output to work and removes some duplicate tests. The remaining bits of the suite are by no means terribly interesting, but it runs fast enough that it's worth keeping the leftovers for now.

Reviewed-by: Heikki Linnakangas <hlinnakangas@pivotal.io>
-
Committed by Peifeng Qiu
Upstream has upgraded the Windows compile script and uses a newer version of Perl. This may block the current merge effort. We plan to do native Windows compilation for gpdb 6, so this job is no longer necessary for gpdb_master.
-
Committed by Tang Pengzhou
* Change the type of db_descriptors to SegmentDatabaseDescriptor **

  A new gang definition may consist of cached segdbDesc and newly created segdbDesc, so there is no need to palloc every segdbDesc struct as new.

* Remove the unnecessary allocate gang unit test

* Manage idle segment dbs using CdbComponentDatabases instead of the available* lists

  To support variable-size gangs, we now need to manage segment dbs at a finer granularity. Previously, idle QEs were managed by a set of lists such as availablePrimaryWriterGang and availableReaderGangsN, which restricted the dispatcher to creating only N-size (N = number of segments) or 1-size gangs.

  CdbComponentDatabases is a snapshot of the segment components within the current cluster; it now maintains a freelist for each segment component. When creating a gang, the dispatcher makes up the gang from each segment component (taking a segment db from the freelist or creating a new one). When cleaning up a gang, the dispatcher returns idle segment dbs to each segment component.

  CdbComponentDatabases provides a few functions to manipulate segment dbs (SegmentDatabaseDescriptor *):
  * cdbcomponent_getCdbComponents
  * cdbcomponent_destroyCdbComponents
  * cdbcomponent_allocateIdleSegdb
  * cdbcomponent_recycleIdleSegdb
  * cdbcomponent_cleanupIdleSegdbs

  CdbComponentDatabases is also FTS version sensitive: once the FTS version changes, CdbComponentDatabases destroys all idle segment dbs and allocates QEs in the newly promoted segment. This makes mirror failover transparent to users.

  Since segment dbs (SegmentDatabaseDescriptor *) are now managed by CdbComponentDatabases, we can simplify memory context management by replacing GangContext & perGangContext with DispatcherContext & CdbComponentsContext.

* Postpone the error handling when creating a gang

  Now that we have AtAbort_DispatcherState, one advantage is that we can postpone gang error handling to this function and make the code cleaner.

* Handle FTS version changes correctly

  In some cases, when the FTS version changes, we can't update the current snapshot of segment components; more specifically, we can't destroy the current writer segment dbs and create new segment dbs. These cases include:
  * the session has a temp table created.
  * the query needs two-phase commit and the gxid has been dispatched to segments.

* Replace the <gangId, sliceId> map with a <qeIdentifier, sliceId> map

  We used to dispatch a <gangId, sliceId> map along with the query to segment dbs so that segment dbs knew which slice they should execute. Now gangId is useless for a segment db because a segment db can be reused by different gangs, so we need a new way to pass this information to segment dbs. To resolve this, CdbComponentDatabases assigns a unique identifier to each segment db and builds, for each slice, a bitmap set consisting of segment identifiers; segment dbs can then go through the slice table and find the right slice to execute.

* Allow the dispatcher to create variable-size gangs and refine AssignGangs()

  Previously, the dispatcher could only create an N-size gang for GANGTYPE_PRIMARY_WRITER or GANGTYPE_PRIMARY_READER. This restricted the dispatcher in many ways. One example is direct dispatch: it always created an N-size gang even when it only dispatched the command to one segment. Another example is that some operations may be able to use an N+ size gang, like hash join: if both the inner and outer plans are redistributed, the hash join node can be associated with an N+ size gang to execute. This commit changes the API of createGang() so the caller can specify a list of segments (partial or even duplicate segments); CdbComponentDatabases guarantees that each segment has only one writer in a session. This also resolves another pain point of AssignGangs(): the caller doesn't need to promote a GANGTYPE_PRIMARY_READER to GANGTYPE_PRIMARY_WRITER, or promote a GANGTYPE_SINGLETON_READER to GANGTYPE_PRIMARY_WRITER for a replicated table (see FinalizeSliceTree()).

  With this commit, AssignGangs() is much clearer.
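The per-segment freelist described above can be modeled in a few lines. This is a simplified sketch, not the dispatcher code: `SegmentComponent`, `create_gang`, and `cleanup_gang` are hypothetical names, and a segment db is represented by a `(segindex, serial)` tuple instead of a real connection.

```python
class SegmentComponent:
    """Toy model of one segment component with its own freelist."""

    def __init__(self, segindex):
        self.segindex = segindex
        self.freelist = []   # idle segment dbs available for reuse
        self.created = 0     # how many segment dbs were ever created

    def allocate_idle_segdb(self):
        """Reuse an idle segment db if one exists, else create a new one."""
        if self.freelist:
            return self.freelist.pop()
        self.created += 1
        return (self.segindex, self.created)

    def recycle_idle_segdb(self, segdb):
        """Return a segment db to this component's freelist."""
        self.freelist.append(segdb)


def create_gang(components, segments):
    """Build a gang from an arbitrary (partial or duplicate) segment list."""
    return [components[s].allocate_idle_segdb() for s in segments]


def cleanup_gang(components, gang):
    """Hand every segment db of the gang back to its segment component."""
    for segdb in gang:
        components[segdb[0]].recycle_idle_segdb(segdb)


components = {i: SegmentComponent(i) for i in range(3)}
full_gang = create_gang(components, [0, 1, 2])
cleanup_gang(components, full_gang)
single_gang = create_gang(components, [1])       # reuses the idle segdb
duplicate_gang = create_gang(components, [0, 0])  # one reused, one new
```

Because allocation happens per segment rather than per gang, a 1-size gang reuses the idle segment db left behind by the full gang, and a gang that lists segment 0 twice gets one cached and one newly created segment db.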
-
Committed by Paul Guo
As the comment said, this was useful. However, now that we have the upstream add_rte_to_flat_rtable() to handle that, let's remove this call.
-
Committed by Divya Bhargov
Co-authored-by: Divya Bhargov <dbhargov@pivotal.io>
Co-authored-by: Lav Jain <ljain@pivotal.io>
-
Committed by Daniel Gustafsson
Fixes a clang (and probably gcc) compiler warning about an unused variable.

Reviewed-by: Paul Guo <pguo@pivotal.io>
Reviewed-by: Venkatesh Raghavan <vraghavan@pivotal.io>
-
Committed by David Kimura
Until we have replication slots, this will keep enough xlog segments around so that mirrors have an opportunity to reconnect when a checkpoint removes a segment while the mirror is not streaming.

Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
-
Committed by Heikki Linnakangas
As far as I can see, the 'is_internal' flag is passed through to a possible object access hook, but it has no other effect. Mark the LOV index and heap created for bitmap indexes, as well as constraints created for exchanged partitions, as 'internal'.
-
- 26 Sep, 2018 12 commits
-
-
Committed by Heikki Linnakangas
I'm not entirely sure what was going on here before. I suspect we had backported some fixes from later upstream versions, and they caused merge conflicts and confusion now. But in any case, I see no reason to deviate from upstream now, so just remove the FIXME.
-
Committed by Heikki Linnakangas
We had backported upstream commits 425bef6ee7 and 2cd72ba42d earlier, but those got partially reverted in the 9.3 merge. Or earlier, or we hadn't backported them completely to begin with - I didn't investigate the exact path of how we got here. In any case, a partial backport is confusing, so take the code around this from the tip of 9.3 stable, so that we have both of those commits fully backported.
-
Committed by Adam Berlin
-
Committed by Adam Berlin
-
Committed by Adam Berlin
-
Committed by Adam Berlin
-
Committed by Asim R P
The functions allow obtaining or removing entries from the shared hash table maintained on the QD. The default size of this hash table is 1000, and entries are removed only after it is filled to capacity. The two functions should be helpful for testing as well as for troubleshooting issues with appendonly tables in production deployments.

Co-authored-by: Jimmy Yih <jyih@pivotal.io>
-
Committed by Asim R P
A segment file that is compacted by vacuum is left in awaiting drop state on QEs. Such a segment file should not be chosen for new inserts because it will never be considered for reading during scans. This patch fixes a bug in the logic that determines whether a segment file is in awaiting drop state. The precondition for the bug includes a specific interleaving of vacuum and insert transactions on the same appendonly table, manifested in the accompanying test.

The fix is to use SnapshotNow instead of an MVCC snapshot. A segment file whose state is updated to awaiting drop by a vacuum compaction transaction may still be seen as available for inserts through an MVCC snapshot. When a vacuum compaction transaction is in progress, the aoentry for the relation in the appendonly hash cannot be evicted, and the need for obtaining state from QEs does not arise.
-
Committed by Asim R P
Spotted while reading.
-
Committed by Asim R P
This commit promotes a few assertions into elog(ERROR) so as to avoid new data being appended to a segment file that is not in available state. Scans on an AO table do not read segment files that are awaiting drop. New data, if inserted in such a segment file, will be lost forever.

The accompanying isolation2 test demonstrates a bug that hits these errors. The test uses a newly added UDF to evict an entry from the appendonly hash table. In production, an entry is evicted when the appendonly hash table is filled (default capacity of 1000 entries).

Note: the bug will be fixed in a separate patch.

Co-authored-by: Adam Berlin <aberlin@pivotal.io>
-
Committed by David Yozie
-
Committed by Ekta Khanna
-