1. 13 Jan 2018, 3 commits
    • Fix deletion of AO and AOCS tables, to remove all segments. · 175c25e8
      Committed by Heikki Linnakangas
      This hopefully fixes the gp_replica_check failures we're seeing in the
      pipeline.
      175c25e8
    • Remove a lot of persistent table and mirroring stuff. · 5c158ff3
      Committed by Heikki Linnakangas
      * Revert almost all the changes in smgr.c / md.c, to not go through
        the Mirrored* APIs.
      
      * Remove mmxlog stuff. Use upstream "pending relation deletion" code
        instead.
      
      * Get rid of multiple startup passes. Now it's just a single pass like
        in the upstream.
      
      * Revert the way database drop/create are handled to the way it is in
        upstream. Doesn't use PT anymore, but accesses file system directly,
        and WAL-logs a single CREATE/DROP DATABASE WAL record.
      
      * Get rid of MirroredLock
      
      * Remove a few tests that were specific to persistent tables.
      
      * Plus a lot of little removals and reverts to upstream code.
      5c158ff3
    • Remove cdbfilerepprimary.c. · 1d38b8e1
      Committed by Ashwin Agrawal
      This was a painful one to untangle, but it seems done now. Though if any
      shake-up happens, this should be the primary suspect.
      1d38b8e1
  2. 21 Nov 2017, 1 commit
    • Move some GPDB-specific code out of smgr.c and md.c. · 306b189d
      Committed by Heikki Linnakangas
      For clarity, and to make merging easier.
      
      The code to manage the hash table of "pending resync EOFs" for append-only
      tables is moved to smgr_ao.c. One notable change here is that the
      pendingDeletesPerformed flag is removed. It was used to track whether there
      are any pending deletes, or any pending AO table resyncs, but we might as
      well check the pending delete list and the pending syncs hash table
      directly; it's hardly any slower than checking a separate boolean.
      
      There are still plenty of GPDB changes in smgr.c, but this is a good step
      forward.
      306b189d
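The flag removal described in this commit can be sketched as follows. This is an illustrative stand-in, not the actual GPDB code: `smgrHavePendingWork`, `pendingDeletes`, and `pendingSyncEntries` are hypothetical names for the pending-delete list and the pending-syncs hash table the message refers to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins for the real structures (not actual GPDB symbols). */
typedef struct PendingDelete
{
    struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pendingDeletes = NULL; /* list of pending relation deletes */
static int pendingSyncEntries = 0;           /* entries in the AO resync hash */

/*
 * Instead of maintaining a separate pendingDeletesPerformed boolean in every
 * code path that touches either structure, derive "is there pending work?"
 * directly. Both checks are O(1), hardly slower than reading a flag.
 */
static bool
smgrHavePendingWork(void)
{
    return pendingDeletes != NULL || pendingSyncEntries > 0;
}
```

The win is that no code path can forget to update the flag, since there is no flag to update.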
  3. 24 Jun 2017, 1 commit
  4. 07 Mar 2017, 3 commits
    • Checkpointer and BgWriter code closer to PG 9.2. · a453e7a3
      Committed by Ashwin Agrawal
      Rename checkpoint.c to checkpointer.c, move the code from bgwriter.c to
      checkpointer.c, and rename most of the corresponding data structures to
      reflect the clear ownership and association. This commit brings it as
      close as possible to PostgreSQL 9.2.
      
      Reference to PostgreSQL related commits:
      commit 806a2aee
          Split work of bgwriter between 2 processes: bgwriter and checkpointer.
      commit bf405ba8
          Add new file for checkpointer.c
      commit 8f28789b
          Rename BgWriterShmem/Request to CheckpointerShmem/Request
      commit d843589e5ab361dd4738dab5c9016e704faf4153
          Fix management of pendingOpsTable in auxiliary processes.
      a453e7a3
    • Correctly maintain pendingOpsTable in checkpoint process. · 0291ff60
      Committed by Ashwin Agrawal, Asim R P and Xin Zhang
      We had partially pulled the fix to separate checkpoint and bgwriter
      processes and introduced a bug where pendingOpsTable was maintained in
      both the processes.  The pendingOpsTable records pending fsync
      requests.  Only checkpoint process should keep it.  Bgwriter should
      only write out dirty pages to OS cache.  Apparently, upstream also had
      this same bug and it was fixed in
      d843589e5ab361dd4738dab5c9016e704faf4153
      
      Also ensure that background writer sweeps buffers even in the first run after
      checkpoint.  There is no reason to hold off until next run and this is how it
      works in upstream.
      
      Fixes issue discussed on mailing list:
      https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/PHKuQPNwWs0
      0291ff60
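The ownership rule this fix restores can be sketched as below. The names and the counters are illustrative stand-ins, not the real PostgreSQL code: the point is only that exactly one process may absorb fsync requests into its pendingOpsTable, while every other process forwards them.

```c
#include <assert.h>

/* Hypothetical process tags for this sketch. */
typedef enum { PROC_STARTUP, PROC_BGWRITER, PROC_CHECKPOINTER } ProcType;

static int pendingOps = 0;  /* stand-in for the checkpointer's hash table */
static int forwarded  = 0;  /* requests queued for the checkpointer */

/*
 * Only the checkpointer may remember a request locally; the bgwriter (which
 * now only writes out dirty pages) and everyone else must forward it, or the
 * request would sit in a table nobody ever fsyncs from.
 */
static void
remember_fsync_request(ProcType self)
{
    if (self == PROC_CHECKPOINTER)
        pendingOps++;       /* absorb into the local table */
    else
        forwarded++;        /* forward instead of remembering */
}
```

The bug was precisely that two processes took the `pendingOps++` branch, so each saw only half the requests.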
    • pg_regress test to validate fsync requests are not lost. · 85b7754d
      Committed by Ashwin Agrawal and Xin Zhang
      The commit includes a UDF to walk dirty shared buffers and a new fault
      `fault_counter` to count the number of files fsync'ed by checkpointer process.
      
      Also another new fault `bg_buffer_sync_default_logic` to flush all buffers for
      BgBufferSync() for the background writer process.
      85b7754d
  5. 03 Mar 2017, 1 commit
  6. 20 Dec 2016, 1 commit
    • Add support to pg_upgrade for upgrading Greenplum clusters · 675b2991
      Committed by Heikki Linnakangas
      This commit substantially rewrites pg_upgrade to handle upgrading a
      Greenplum cluster from 4.3 to 5.0. The Greenplum specifics of pg_upgrade
      are documented in contrib/pg_upgrade/README.gpdb. A summary of the
      changes is listed below:
      
       - Make pg_upgrade pass the pre-checks against GPDB 4.3.
       - Restore dumped schema in utility mode: pg_upgrade is executed on a
         single server in offline mode so ensure we are using utility mode.
       - Disable pg_upgrade checks that don't apply when upgrading to 8.3:
         When support for upgrading to Greenplum 6.0 is added, the checks that
         make sense to backport will need to be re-added.
       - Support AO/AOCS table: This bumps the AO table version number, and
         adds a conversion routine for numeric attributes. The on-disk format
         of numerics changed between PostgreSQL 8.3 and 8.4. With this commit,
         we can distinguish between AO segments created in the old format and
         the new, and read both formats. New AO segments are always created in
         the new format. Also performs a check for AO tables having NUMERIC
         attributes without free segfiles. Since AO table segments cannot be
         rewritten if there are no free segfiles, issue a warning if such a
         table is encountered during the upgrade.
       - Add code to convert heap pages offline: Bumps heap page format version
         number. While this isn't strictly necessary, when we're doing the
         conversion off-line, it reduces confusion if something goes wrong.
       - Add check for the money datatype: the upgrade doesn't support the money
         datatype, so check for its presence and abort the upgrade if found.
       - Create new Oid in QD and pass new Oids in dump for pg_upgrade on QE:
         When upgrading from GPDB4 to 5, we need to create new arraytypes for
         the base relation rowtypes in the QD, but we also need to dispatch
         these new OIDs to the QEs. Objects assigning InvalidOid in the Oid
         dispatcher will cause a new Oid to be assigned. Once the new cluster
         is restored, dump the new Oids into a separate dumpfile which isn't
         unlinked on exit. If this file is placed into the cwd of pg_upgrade
         on the QEs, it will be pulled into the db dump and used during
         restoring, thus "dispatching" the Oids from the QD even though they
         are offline. pg_upgrade doesn't at this point know if it's running
         at a QD or a QE so it will always dump this file and include the
         InvalidOid markers.
       - gp_relation_node is reset and rebuilt during upgrade once the data
         files from the old cluster are available to the new cluster. This
         change required altering how checkpoints are requested in the
         backend.
       - Mark indexes as invalid to ensure they are rebuilt in the new
         cluster.
       - Copy the pg_distributedlog from old to new during upgrade: We need
         the distributedlog in the new cluster to be able to start up once
         the upgrade has pulled over the clog.
       - Don't delete dumps when running with --debug: While not specific to
         Greenplum, this is a local addition which greatly helps testing
         and development of pg_upgrade.
      
      For testing purposes, a small test cluster created with Greenplum 4.3
      is included in contrib/pg_upgrade/test.
      
      Heikki Linnakangas, Daniel Gustafsson and Dave Cramer
      675b2991
  7. 10 May 2016, 1 commit
  8. 26 Nov 2015, 1 commit
  9. 28 Oct 2015, 1 commit
  10. 26 Jun 2012, 1 commit
    • Backport fsync queue compaction logic to all supported branches. · ef0f9dde
      Committed by Robert Haas
      This backports commit 7f242d88,
      except for the counter in pg_stat_bgwriter.  The underlying problem
      (namely, that a full fsync request queue causes terrible checkpoint
      behavior) continues to be reported in the wild, and this code seems
      to be safe and robust enough to risk back-porting the fix.
      ef0f9dde
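The compaction idea from the backported commit can be sketched as follows. This is a simplified illustration under stated assumptions: real queue entries key on (relfilenode, fork, segment number), modeled here as plain ints, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_MAX 8

/* Stand-in for the shared fsync request queue. */
static int queue[QUEUE_MAX];
static int queue_len = 0;

/*
 * When the queue fills up, drop duplicate entries (keeping the first of
 * each) instead of forcing the requesting backend to fsync the file itself,
 * which is what caused the terrible checkpoint behavior. Returns the new
 * queue length.
 */
static int
compact_fsync_request_queue(void)
{
    int n = 0;

    for (int i = 0; i < queue_len; i++)
    {
        bool seen = false;

        for (int j = 0; j < n; j++)
        {
            if (queue[j] == queue[i])
            {
                seen = true;
                break;
            }
        }
        if (!seen)
            queue[n++] = queue[i];
    }
    queue_len = n;
    return n;
}
```

Duplicates are common because every dirty write to the same segment enqueues another request, so compaction usually frees substantial space.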
  11. 27 Jun 2009, 1 commit
    • Cleanup and code review for the patch that made bgwriter active during · 2de48a83
      Committed by Tom Lane
      archive recovery.  Invent a separate state variable and inquiry function
      for XLogInsertAllowed() to clarify some tests and make the management of
      writing the end-of-recovery checkpoint less klugy.  Fix several places
      that were incorrectly testing InRecovery when they should be looking at
      RecoveryInProgress or XLogInsertAllowed (because they will now be executed
      in the bgwriter not startup process).  Clarify handling of bad LSNs passed
      to XLogFlush during recovery.  Use a spinlock for setting/testing
      SharedRecoveryInProgress.  Improve quite a lot of comments.
      
      Heikki and Tom
      2de48a83
  12. 26 Jun 2009, 1 commit
    • Fix some serious bugs in archive recovery, now that bgwriter is active · 7e48b77b
      Committed by Heikki Linnakangas
      during it:
      
      When bgwriter is active, the startup process can't perform mdsync() correctly
      because it won't see the fsync requests accumulated in bgwriter's private
      pendingOpsTable. Therefore make bgwriter responsible for the end-of-recovery
      checkpoint as well, when it's active.
      
      When bgwriter is active (= archive recovery), the startup process must not
      accumulate fsync requests to its own pendingOpsTable, since bgwriter won't
      see them there when it performs restartpoints. Make startup process drop its
      pendingOpsTable when bgwriter is launched to avoid that.
      
      Update minimum recovery point one last time when leaving archive recovery.
      It won't be updated by the end-of-recovery checkpoint because XLogFlush()
      sees us as out of recovery already.
      
      This fixes bug #4879 reported by Fujii Masao.
      7e48b77b
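The second fix, dropping the startup process's private table when bgwriter launches, can be sketched like this. All names are illustrative, not the actual functions; the counter stands in for the startup process's private pendingOpsTable.

```c
#include <assert.h>
#include <stdbool.h>

static int  startup_pending_ops = 0;  /* startup's private table (stand-in) */
static bool bgwriter_active = false;

/* Startup may accumulate requests only while bgwriter is not running. */
static void
startup_remember_fsync(void)
{
    if (!bgwriter_active)
        startup_pending_ops++;
}

/*
 * Once bgwriter is active it owns fsync bookkeeping; anything left in the
 * startup process's table would be invisible at restartpoints, so drop it.
 */
static void
launch_bgwriter(void)
{
    bgwriter_active = true;
    startup_pending_ops = 0;
}
```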
  13. 11 Jun 2009, 1 commit
  14. 12 Mar 2009, 1 commit
    • Code review for dtrace probes added (so far) to 8.4. Adjust placement of · e04810e8
      Committed by Tom Lane
      some bufmgr probes, take out redundant and memory-leak-inducing path arguments
      to smgr__md__read__done and smgr__md__write__done, fix bogus attempt to
      recalculate space used in sort__done, clean up formatting in places where
      I'm not sure pgindent will do a nice job by itself.
      e04810e8
  15. 12 Jan 2009, 1 commit
  16. 02 Jan 2009, 1 commit
  17. 17 Dec 2008, 1 commit
    • The attached patch contains a couple of fixes in the existing probes and · 5a90bc1f
      Committed by Bruce Momjian
      includes a few new ones.
      
      - Fixed compilation errors on OS X for probes that use typedefs
      - Fixed a number of probes to pass ForkNumber per the relation forks
      patch
      - The new probes are those that were taken out from the previous
      submitted patch and required simple fixes. Will submit the other probes
      that may require more discussion in a separate patch.
      
      Robert Lor
      5a90bc1f
  18. 14 Nov 2008, 1 commit
  19. 11 Nov 2008, 1 commit
  20. 11 Aug 2008, 1 commit
    • Introduce the concept of relation forks. An smgr relation can now consist · 3f0e808c
      Committed by Heikki Linnakangas
      of multiple forks, and each fork can be created and grown separately.
      
      The bulk of this patch is about changing the smgr API to include an extra
      ForkNumber argument in every smgr function. Also, smgrscheduleunlink and
      smgrdounlink no longer implicitly call smgrclose, because other forks might
      still exist after unlinking one. The callers of those functions have been
      modified to call smgrclose instead.
      
      This patch in itself doesn't have any user-visible effect, but provides the
      infrastructure needed for upcoming patches. The additional forks envisioned
      are a rewritten FSM implementation that doesn't rely on a fixed-size shared
      memory block, and a visibility map to allow skipping portions of a table in
      VACUUM that have no dead tuples.
      3f0e808c
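The fork concept can be sketched as below. This is a simplified illustration, not the exact upstream API: the enum values mirror the forks this commit envisions, while `fork_path` and the suffix strings are hypothetical helpers showing how each fork maps to its own file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Every smgr call now carries one of these. */
typedef enum ForkNumber
{
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,            /* free space map (an envisioned later patch) */
    VISIBILITYMAP_FORKNUM   /* visibility map (an envisioned later patch) */
} ForkNumber;

/* Non-main forks get a filename suffix; exact suffixes are illustrative. */
static const char *const forkSuffixes[] = {"", "_fsm", "_vm"};

/* Build the on-disk name for one fork of a relation, e.g. "16384_fsm". */
static const char *
fork_path(const char *relfilenode, ForkNumber forknum)
{
    static char buf[64];

    snprintf(buf, sizeof(buf), "%s%s", relfilenode, forkSuffixes[forknum]);
    return buf;
}
```

Because each fork is a separate file that grows independently, smgrdounlink can no longer implicitly close the relation: unlinking one fork must leave the others usable, which is why callers now call smgrclose explicitly.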
  21. 02 May 2008, 1 commit
    • Remove the recently added USE_SEGMENTED_FILES option, and indeed remove all · 3c6248a8
      Committed by Tom Lane
      support for a nonsegmented mode from md.c.  Per recent discussions, there
      doesn't seem to be much value in a "never segment" option as opposed to
      segmenting with a suitably large segment size.  So instead provide a
      configure-time switch to set the desired segment size in units of gigabytes.
      While at it, expose a configure switch for BLCKSZ as well.
      
      Zdenek Kotala
      3c6248a8
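With segmentation now unconditional, md.c's segment arithmetic reduces to the following sketch. The values are illustrative defaults: configure's segment-size switch sets the segment size in gigabytes and a separate switch sets BLCKSZ, so these constants vary per build.

```c
#include <assert.h>

#define BLCKSZ      8192
#define RELSEG_SIZE ((1024u * 1024u * 1024u) / BLCKSZ)  /* blocks per 1 GB segment */

/* Which segment file ("relfilenode.N") holds this block? */
static unsigned
seg_for_block(unsigned blkno)
{
    return blkno / RELSEG_SIZE;
}

/* Byte offset of the block within that segment file. */
static unsigned long
offset_in_seg(unsigned blkno)
{
    return (unsigned long) (blkno % RELSEG_SIZE) * BLCKSZ;
}
```

A "never segment" build is thus equivalent to configuring a segment size larger than any relation you expect, which is why the separate option carried little value.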
  22. 18 Apr 2008, 2 commits
    • Fix two race conditions between the pending unlink mechanism that was put in · b8c58230
      Committed by Heikki Linnakangas
      place to prevent reusing relation OIDs before next checkpoint, and DROP
      DATABASE. First, if a database was dropped, bgwriter would still try to unlink
      the files that the rmtree() call by the DROP DATABASE command has already
      deleted, or is just about to delete. Second, if a database is dropped, and
      another database is created with the same OID, bgwriter would in the worst
      case delete a relation in the new database that happened to get the same OID
      as a dropped relation in the old database.
      
      To fix these race conditions:
      - make rmtree() ignore ENOENT errors. This fixes the 1st race condition.
      - make ForgetDatabaseFsyncRequests forget unlink requests as well.
      - force checkpoint on in dropdb on all platforms
      
      Since ForgetDatabaseFsyncRequests() is asynchronous, the 2nd change isn't
      enough on its own to fix the problem of dropping and creating a database with
      same OID, but forcing a checkpoint on DROP DATABASE makes it sufficient.
      
      Per Tom Lane's bug report and proposal. Backpatch to 8.3.
      b8c58230
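The first fix, tolerating ENOENT, can be sketched as follows. This is illustrative: upstream puts the tolerance inside rmtree() and the bgwriter's unlink path, and the function name here is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * An ENOENT here just means the other side of the race, DROP DATABASE's
 * rmtree(), deleted the file first; that is success, not an error. Any
 * other errno is a genuine failure.
 */
static bool
unlink_ignoring_enoent(const char *path)
{
    if (unlink(path) < 0)
        return errno == ENOENT;
    return true;
}
```

The second race (a new database reusing the dropped database's OID) cannot be fixed this way, which is why the commit additionally forces a checkpoint in dropdb so stale unlink requests are flushed before the OID can be reused.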
    • Fix two race conditions between the pending unlink mechanism that was put in · 9cb91f90
      Committed by Heikki Linnakangas
      place to prevent reusing relation OIDs before next checkpoint, and DROP
      DATABASE. First, if a database was dropped, bgwriter would still try to unlink
      the files that the rmtree() call by the DROP DATABASE command has already
      deleted, or is just about to delete. Second, if a database is dropped, and
      another database is created with the same OID, bgwriter would in the worst
      case delete a relation in the new database that happened to get the same OID
      as a dropped relation in the old database.
      
      To fix these race conditions:
      - make rmtree() ignore ENOENT errors. This fixes the 1st race condition.
      - make ForgetDatabaseFsyncRequests forget unlink requests as well.
      - force checkpoint on in dropdb on all platforms
      
      Since ForgetDatabaseFsyncRequests() is asynchronous, the 2nd change isn't
      enough on its own to fix the problem of dropping and creating a database with
      same OID, but forcing a checkpoint on DROP DATABASE makes it sufficient.
      
      Per Tom Lane's bug report and proposal. Backpatch to 8.3.
      9cb91f90
  23. 11 Mar 2008, 1 commit
  24. 02 Jan 2008, 1 commit
  25. 16 Nov 2007, 5 commits
  26. 03 Jul 2007, 1 commit
    • Fix incorrect comment about the timing of AbsorbFsyncRequests() during · 83aaebba
      Committed by Tom Lane
      checkpoint.  The comment claimed that we could do this anytime after
      setting the checkpoint REDO point, but actually BufferSync is relying
      on the assumption that buffers dumped by other backends will be fsync'd
      too.  So we really could not do it any sooner than we are doing it.
      83aaebba
  27. 13 Apr 2007, 1 commit
    • Rearrange mdsync() looping logic to avoid the problem that a sufficiently · 995ba280
      Committed by Tom Lane
      fast flow of new fsync requests can prevent mdsync() from ever completing.
      This was an unforeseen consequence of a patch added in Mar 2006 to prevent
      the fsync request queue from overflowing.  Problem identified by Heikki
      Linnakangas and independently by ITAGAKI Takahiro; fix based on ideas from
      Takahiro-san, Heikki, and Tom.
      
      Back-patch as far as 8.1 because a previous back-patch introduced the problem
      into 8.1 ...
      995ba280
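The core of the rearranged loop can be sketched with a cycle counter, as below. This is a simplified illustration with hypothetical names: each queued request is stamped with the sync cycle in which it arrived, and a pass processes only requests from earlier cycles, so a steady stream of new requests can no longer keep mdsync() looping forever.

```c
#include <assert.h>

typedef struct
{
    int cycle;   /* sync cycle when the request was queued */
    int done;    /* has it been fsync'd? */
} FsyncEntry;

static int mdsync_cycle = 0;

/* One mdsync() pass: bump the cycle, then handle only older requests. */
static int
mdsync_pass(FsyncEntry *entries, int n)
{
    int processed = 0;

    mdsync_cycle++;  /* requests arriving from now on belong to a new cycle */
    for (int i = 0; i < n; i++)
    {
        if (!entries[i].done && entries[i].cycle < mdsync_cycle)
        {
            entries[i].done = 1;  /* the real code would fsync the file here */
            processed++;
        }
    }
    return processed;
}
```

Requests stamped with the current cycle simply wait for the next pass, bounding the work of any single pass.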
  28. 18 Jan 2007, 1 commit
  29. 17 Jan 2007, 1 commit
    • Revise bgwriter fsync-request mechanism to improve robustness when a table · 6d660587
      Committed by Tom Lane
      is deleted.  A backend about to unlink a file now sends a "revoke fsync"
      request to the bgwriter to make it clean out pending fsync requests.  There
      is still a race condition where the bgwriter may try to fsync after the unlink
      has happened, but we can resolve that by rechecking the fsync request queue
      to see if a revoke request arrived meanwhile.  This eliminates the former
      kluge of "just assuming" that an ENOENT failure is okay, and lets us handle
      the fact that on Windows it might be EACCES too without introducing any
      questionable assumptions.  After an idea of mine improved by Magnus.
      
      The HEAD patch doesn't apply cleanly to 8.2, but I'll see about a back-port
      later.  In the meantime this could do with some testing on Windows; I've been
      able to force it through the code path via ENOENT, but that doesn't prove that
      it actually fixes the Windows problem ...
      6d660587
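The revoke operation can be sketched as below. Names and the flat array are illustrative stand-ins: real entries key on (relfilenode, segment number) in a hash table, and the revoke message travels through the shared request queue.

```c
#include <assert.h>

#define MAXREQUESTS 8

/* Stand-in for the bgwriter's pending-fsync bookkeeping. */
static int pending[MAXREQUESTS];
static int npending = 0;

static void
remember_fsync(int key)
{
    pending[npending++] = key;
}

/*
 * The "revoke fsync" request: before a backend unlinks a relation file, it
 * asks the bgwriter to drop every pending fsync for that key, so a later
 * fsync cannot fail on the vanished file (ENOENT, or EACCES on Windows).
 */
static void
forget_fsync_requests(int key)
{
    int n = 0;

    for (int i = 0; i < npending; i++)
        if (pending[i] != key)
            pending[n++] = pending[i];
    npending = n;
}
```

The remaining race, where the bgwriter fsyncs after the unlink but before seeing the revoke, is resolved by rechecking the request queue for a revoke when the fsync fails.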
  30. 06 Jan 2007, 1 commit
  31. 04 Jan 2007, 1 commit
    • Clean up smgr.c/md.c APIs as per discussion a couple months ago. Instead of · ef072219
      Committed by Tom Lane
      having md.c return a success/failure boolean to smgr.c, which was just going
      to elog anyway, let md.c issue the elog messages itself.  This allows better
      error reporting, particularly in cases such as "short read" or "short write"
      which Peter was complaining of.  Also, remove the kluge of allowing mdread()
      to return zeroes from a read-beyond-EOF: this is now an error condition
      except when InRecovery or zero_damaged_pages = true.  (Hash indexes used to
      require that behavior, but no more.)  Also, enforce that mdwrite() is to be
      used for rewriting existing blocks while mdextend() is to be used for
      extending the relation EOF.  This restriction lets us get rid of the old
      ad-hoc defense against creating huge files by an accidental reference to
      a bogus block number: we'll only create new segments in mdextend() not
      mdwrite() or mdread().  (Again, when InRecovery we allow it anyway, since
      we need to allow updates of blocks that were later truncated away.)
      Also, clean up the original makeshift patch for bug #2737: move the
      responsibility for padding relation segments to full length into md.c.
      ef072219