Commits · 4da99ea4231e3d8bbf28b666748c1028e7b7d665 · Greenplum / Gpdb

27 Jun, 2011 1 commit

Avoid having two copies of the HOT-chain search logic. · 4da99ea4

Committed by Robert Haas on Jun 27, 2011

It's been like this since HOT was originally introduced, but the logic
is complex enough that this is a recipe for bugs, as we've already
found out with SSI.  So refactor heap_hot_search_buffer() so that it
can satisfy the needs of index_getnext(), and make index_getnext() use
that rather than duplicating the logic.

This change was originally proposed by Heikki Linnakangas as part of a
larger refactoring oriented towards allowing index-only scans.  I
extracted and adjusted this part, since it seems to have independent
merit.  Review by Jeff Davis.

4da99ea4

22 Jun, 2011 1 commit

Make the visibility map crash-safe. · 503c7305

Committed by Robert Haas on Jun 21, 2011

This involves two main changes from the previous behavior.  First,
when we set a bit in the visibility map, emit a new WAL record of type
XLOG_HEAP2_VISIBLE.  Replay sets the page-level PD_ALL_VISIBLE bit and
the visibility map bit.  Second, when inserting, updating, or deleting
a tuple, we can no longer get away with clearing the visibility map
bit after releasing the lock on the corresponding heap page, because
an intervening crash might leave the visibility map bit set and the
page-level bit clear.  Making this work requires a bit of interface
refactoring.

In passing, a few minor but related cleanups: change the test in
visibilitymap_set and visibilitymap_clear to throw an error if the
wrong page (or no page) is pinned, rather than silently doing nothing;
this case should never occur.  Also, remove duplicate definitions of
InvalidXLogRecPtr.

Patch by me, review by Noah Misch.

503c7305

15 Jun, 2011 1 commit

Make non-MVCC snapshots exempt from predicate locking. Scans with non-MVCC · 0a0e2b52

Committed by Heikki Linnakangas on Jun 15, 2011

snapshots, like in REINDEX, are basically non-transactional operations. The
DDL operation itself might participate in SSI, but there are separate
functions for that.

Kevin Grittner and Dan Ports, with some changes by me.

0a0e2b52

31 May, 2011 1 commit

The row-version chaining in Serializable Snapshot Isolation was still wrong. · 3103f9a7

Committed by Heikki Linnakangas on May 30, 2011

On further analysis, it turns out that it is not necessary to duplicate predicate
locks to the new row version at update; the lock on the version that the
transaction saw as visible is enough. However, there was a different bug in
the code that checks for dangerous structures when a new rw-conflict happens.
Fix that bug, and remove all the row-version chaining related code.

Kevin Grittner & Dan Ports, with some comment editorialization by me.

3103f9a7

25 Apr, 2011 1 commit

Fix SSI-related assertion failure. · b429519d

Committed by Robert Haas on Apr 25, 2011

Bug #5899, reported by Marko Tiikkaja.

Heikki Linnakangas, reviewed by Kevin Grittner and Dan Ports.

b429519d

10 Apr, 2011 1 commit

pgindent run before PG 9.1 beta 1. · bf50caf1

Committed by Bruce Momjian on Apr 10, 2011

bf50caf1

04 Mar, 2011 1 commit

You must hold a lock on the heap page when you call · ee3838b1

Committed by Heikki Linnakangas on Mar 04, 2011

CheckForSerializableConflictOut(), because it can set hint bits.

YAMAMOTO Takashi

ee3838b1

08 Feb, 2011 1 commit

Implement genuine serializable isolation level. · dafaa3ef

Committed by Heikki Linnakangas on Feb 07, 2011

Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, i.e., it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
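
As a rough illustration of the promotion described above, the following Python sketch models how one transaction's fine-grained locks could be replaced by coarser ones. It is a toy model, not the code in storage/lmgr/predicate.c: the class name `PredicateLockTable` and both thresholds are hypothetical, and the real promotion limits are derived differently.

```python
from collections import defaultdict


class PredicateLockTable:
    """Toy model of tuple -> page -> relation lock promotion for one transaction."""

    def __init__(self, max_tuple_locks_per_page=2, max_page_locks_per_relation=2):
        # Both thresholds are hypothetical values chosen for the example.
        self.max_tuple_locks_per_page = max_tuple_locks_per_page
        self.max_page_locks_per_relation = max_page_locks_per_relation
        self.tuple_locks = defaultdict(set)   # (rel, page) -> {tuple offsets}
        self.page_locks = defaultdict(set)    # rel -> {pages}
        self.relation_locks = set()           # {rel}

    def lock_tuple(self, rel, page, offset):
        # A coarser lock already covers this tuple: nothing to record.
        if rel in self.relation_locks or page in self.page_locks[rel]:
            return
        self.tuple_locks[(rel, page)].add(offset)
        if len(self.tuple_locks[(rel, page)]) > self.max_tuple_locks_per_page:
            self.lock_page(rel, page)

    def lock_page(self, rel, page):
        if rel in self.relation_locks:
            return
        # The page lock replaces all tuple locks held on that page.
        self.tuple_locks.pop((rel, page), None)
        self.page_locks[rel].add(page)
        if len(self.page_locks[rel]) > self.max_page_locks_per_relation:
            self.lock_relation(rel)

    def lock_relation(self, rel):
        # The relation lock replaces all finer-grained locks on that relation.
        for key in [k for k in self.tuple_locks if k[0] == rel]:
            del self.tuple_locks[key]
        self.page_locks.pop(rel, None)
        self.relation_locks.add(rel)

    def covers(self, rel, page, offset):
        """Return True if a read of this tuple is covered by some held lock."""
        return (
            rel in self.relation_locks
            or page in self.page_locks.get(rel, set())
            or offset in self.tuple_locks.get((rel, page), set())
        )
```

Promotion trades precision for bounded memory: after a page or relation lock is taken, reads of tuples the transaction never touched are also treated as covered, which is one source of the false positives mentioned earlier.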

A predicate lock doesn't conflict with any regular locks or with other
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with the committing transaction have
completed. That means that we need to remember an unbounded number of
predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen

dafaa3ef

02 Jan, 2011 2 commits

Basic foreign table support. · 0d692a0d

Committed by Robert Haas on Jan 01, 2011

Foreign tables are a core component of SQL/MED. This commit does
not provide a working SQL/MED infrastructure, because foreign tables
cannot yet be queried. Support for foreign table scans will need to
be added in a future patch. However, this patch creates the necessary
system catalog structure, syntax support, and support for ancillary
operations such as COMMENT and SECURITY LABEL.

Shigeru Hanada, heavily revised by Robert Haas

0d692a0d

Stamp copyrights for year 2011. · 5d950e3b

Committed by Bruce Momjian on Jan 01, 2011

5d950e3b

14 Dec, 2010 1 commit

Generalize concept of temporary relations to "relation persistence". · 5f7b58fa

Committed by Robert Haas on Dec 13, 2010

This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp. It also removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.

For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previously depended on
rd_istemp.
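
To make the intent of the three macros concrete, here is a small Python sketch of the checks they stand for. The macros themselves are C; the Python function names simply mirror them, and the single-character persistence codes and the condition behind each check are assumptions about the state of the code at the time of this commit, not a copy of the real definitions.

```python
# Assumed persistence codes stored in pg_class.relpersistence at this commit:
# 'p' for permanent relations and 't' for temporary ones.
RELPERSISTENCE_PERMANENT = "p"
RELPERSISTENCE_TEMP = "t"


def relation_needs_wal(relpersistence):
    """Assumed meaning: only permanent relations have their changes WAL-logged."""
    return relpersistence == RELPERSISTENCE_PERMANENT


def relation_uses_local_buffers(relpersistence):
    """Assumed meaning: temporary relations use backend-local buffers."""
    return relpersistence == RELPERSISTENCE_TEMP


def relation_uses_temp_namespace(relpersistence):
    """Assumed meaning: temporary relations live in the temp namespace."""
    return relpersistence == RELPERSISTENCE_TEMP
```

Naming each question separately means that a later persistence kind can answer the three checks differently, which a single boolean such as rd_istemp could not express.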

This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.

5f7b58fa

09 Dec, 2010 2 commits

Self review of previous patch. Fix assumption that xmax >= xmin. · 9975c683

Committed by Simon Riggs on Dec 09, 2010

9975c683

Reduce spurious Hot Standby conflicts from never-visible records. · b9075a6d

Committed by Simon Riggs on Dec 09, 2010

Hot Standby conflicts only with tuples that were visible at
some point. So ignore tuples from aborted transactions and
tuples updated/deleted during the inserting transaction when
generating the conflict transaction ids.

Follows detailed analysis and a test case by Noah Misch.
The original report covered btree delete records; Heikki Linnakangas
correctly observed that this applies to other cases also.
The fix covers all sources of cleanup records via common code.

b9075a6d

09 Nov, 2010 1 commit

In rewriteheap.c (used by VACUUM FULL and CLUSTER), calculate the tuple · 000efc3d

Committed by Heikki Linnakangas on Nov 09, 2010

length stored in the line pointer the same way it's calculated in the normal
heap_insert() codepath. As noted by Jeff Davis, the length stored by
raw_heap_insert() included padding but the one stored by the normal codepath
did not. While the mismatch seems to be harmless, inconsistency isn't good,
and the normal codepath has received a lot more testing over the years.
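
The mismatch can be illustrated with a short Python sketch of alignment padding. It assumes an 8-byte maximum alignment (the actual value is platform-dependent), and the helper names are hypothetical stand-ins for the two code paths rather than real functions.

```python
MAXIMUM_ALIGNOF = 8  # assumed alignment boundary; platform-dependent in practice


def maxalign(length):
    """Round length up to the next multiple of MAXIMUM_ALIGNOF."""
    return (length + MAXIMUM_ALIGNOF - 1) & ~(MAXIMUM_ALIGNOF - 1)


def stored_length_with_padding(tuple_len):
    """Hypothetical model of the old rewrite path: the padded length is stored."""
    return maxalign(tuple_len)


def stored_length_without_padding(tuple_len):
    """Hypothetical model of the normal insert path: the exact length is stored."""
    return tuple_len


# A 61-byte tuple would be recorded as 64 bytes by one path and 61 by the other.
padded = stored_length_with_padding(61)
exact = stored_length_without_padding(61)
```

Under these assumptions the two lengths differ only for tuples whose size is not already a multiple of the alignment, which fits the observation that the mismatch appeared harmless.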

Backpatch to 8.3 where the heap rewrite code was introduced.

000efc3d

08 Oct, 2010 1 commit

Improve logging in VACUUM FULL VERBOSE and CLUSTER VERBOSE. · 9cc8c84e

Committed by Tom Lane on Oct 07, 2010

This patch resurrects some of the information that could be logged by the
old, now-dead implementation of VACUUM FULL, in particular counts of live
and dead tuples and the time taken for the table rebuild proper.  There's
still no logging about the ensuing index rebuilds, though.

Itagaki Takahiro

9cc8c84e

21 Sep, 2010 1 commit

Remove cvs keywords from all files. · 9f2e2113

Committed by Magnus Hagander on Sep 20, 2010

9f2e2113

20 Sep, 2010 1 commit

Update HOT README about when single-page vacuums happen. · cecde975

Committed by Bruce Momjian on Sep 19, 2010

cecde975

12 Sep, 2010 1 commit

SERIALIZABLE transactions are actually implemented beneath the covers with · 5eb15c99

Committed by Joe Conway on Sep 11, 2010

transaction snapshots, i.e. a snapshot registered at the beginning of
a transaction. Change variable naming and comments to reflect this reality
in preparation for a future, truly serializable mode, e.g.
Serializable Snapshot Isolation (SSI).

For the moment transaction snapshots are still used to implement
SERIALIZABLE, but hopefully not for too much longer. Patch by Kevin
Grittner and Dan Ports with review and some minor wording changes by me.

5eb15c99

19 Aug, 2010 1 commit

Tidy up a few calls to smgrextend(). · d37781fa

Committed by Robert Haas on Aug 19, 2010

In the new API introduced by my patch to include the backend ID in
temprel filenames, the last argument to smgrextend() became skipFsync
rather than isTemp, but these calls didn't get the memo. It's not
really a problem to pass rel->rd_istemp rather than just plain false,
because smgrextend() now automatically skips the fsync for temprels
anyway, but this seems cleaner and saves some minute number of cycles.

d37781fa

14 Aug, 2010 1 commit

Include the backend ID in the relpath of temporary relations. · debcec7d

Committed by Robert Haas on Aug 13, 2010

This allows us to reliably remove all leftover temporary relation
files on cluster startup without reference to system catalogs or WAL;
therefore, we no longer include temporary relations in XLOG_XACT_COMMIT
and XLOG_XACT_ABORT WAL records.

Since these changes require including a backend ID in each
SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id
field has been reduced from two bytes to one, and the maximum number
of connections has been reduced from INT_MAX / 4 to 2^23-1.  It would
be possible to remove these restrictions by increasing the size of
SharedInvalidationMessage by 4 bytes, but right now that doesn't seem
like a good trade-off.
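
The 2^23-1 limit is what one would expect if the backend ID has to fit in a signed 24-bit value. The Python sketch below shows one way such an ID could be split into a signed high byte and an unsigned low 16 bits and then reassembled. The exact field layout, and the names `hi` and `lo`, are assumptions for illustration and are not taken from the commit.

```python
BACKEND_ID_MIN = -(1 << 23)      # smallest value of a signed 24-bit integer
BACKEND_ID_MAX = (1 << 23) - 1   # largest value: 2^23 - 1


def pack_backend_id(backend_id):
    """Split a backend ID into (signed high byte, unsigned low 16 bits)."""
    if not BACKEND_ID_MIN <= backend_id <= BACKEND_ID_MAX:
        raise ValueError("backend ID does not fit in 24 bits")
    hi = backend_id >> 16      # arithmetic shift keeps the sign; range -128..127
    lo = backend_id & 0xFFFF   # low 16 bits, always non-negative
    return hi, lo


def unpack_backend_id(hi, lo):
    """Reassemble the backend ID from its two parts."""
    return (hi << 16) | lo
```

Negative values survive the round trip as well, so a sentinel such as -1 for "no backend" could share the same encoding.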

Review by Jaime Casanova and Tom Lane.

debcec7d

30 Jul, 2010 1 commit

Fix possible page corruption by ALTER TABLE .. SET TABLESPACE. · 1a078629

Committed by Robert Haas on Jul 29, 2010

If a zeroed page is present in the heap, ALTER TABLE .. SET TABLESPACE will
set the LSN and TLI while copying it, which is wrong, and heap_xlog_newpage()
will do the same thing during replay, so the corruption propagates to any
standby.  Note, however, that the bug can't be demonstrated unless archiving
is enabled, since otherwise we skip WAL logging altogether, and the LSN/TLI
are not set.

Back-patch to 8.0; prior releases do not have tablespaces.

Analysis and patch by Jeff Davis.  Adjustments for back-branches and minor
wordsmithing by me.

1a078629

07 Jul, 2010 1 commit

pgindent run for 9.0, second run · 239d769e

Committed by Bruce Momjian on Jul 06, 2010

239d769e

03 May, 2010 2 commits

Improve printing of XLOG_HEAP_NEWPAGE records to include the forknum. · 609a63fd

Committed by Tom Lane on May 02, 2010

609a63fd

Fix replay of XLOG_HEAP_NEWPAGE WAL records to pay attention to the forknum · e55e6ecf

Committed by Tom Lane on May 02, 2010

field of the WAL record. The previous coding always wrote to the main fork,
resulting in data corruption if the page was meant to go into a non-default
fork.

At present, the only operation that can produce such WAL records is
ALTER TABLE/INDEX SET TABLESPACE when executed with archive_mode = on.
Data corruption would be observed on standby slaves, and could occur on the
master as well if a database crash and recovery occurred after committing
the ALTER and before the next checkpoint. Per report from Gordon Shannon.

Back-patch to 8.4; the problem doesn't exist in earlier branches because
we didn't have a concept of multiple relation forks then.

e55e6ecf

29 Apr, 2010 1 commit

Introduce wal_level GUC to explicitly control if information needed for · 9b8a7332

Committed by Heikki Linnakangas on Apr 28, 2010

archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.

Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_level setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment. They used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.

Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.

9b8a7332

24 Apr, 2010 1 commit

Fix various instances of "the the". · 33980a06

Committed by Robert Haas on Apr 23, 2010

Two of these were pointed out by Erik Rijkers; the rest I found.

33980a06

22 Apr, 2010 2 commits

Further reductions in Hot Standby conflict processing. These · 781ec6b7

Committed by Simon Riggs on Apr 22, 2010

come from the realisation that HEAP2_CLEAN records don't
always remove user visible data, so conflict processing for
them can be skipped. Confirm validity using Assert checks,
clarify circumstances under which we log heap_cleanup_info
records. Tuning arises from bug fixing of earlier safety
check failures.

781ec6b7

Fix oversight in collecting values for cleanup_info records. · bc2b85d9

Committed by Simon Riggs on Apr 21, 2010

vacuum_log_cleanup_info() now generates log records with a valid
latestRemovedXid set in all cases. Also be careful not to zero the
value when we do a round of vacuuming part-way through lazy_scan_heap().
Incidentally, this reduces frequency of conflicts in Hot Standby.

bc2b85d9

26 Feb, 2010 1 commit

pgindent run for 9.0 · 65e806cb

Committed by Bruce Momjian on Feb 26, 2010

65e806cb

15 Feb, 2010 1 commit

Wrap calls to SearchSysCache and related functions using macros. · e26c539e

Committed by Robert Haas on Feb 14, 2010

The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4).  This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.

Design and review by Tom Lane.

e26c539e

10 Feb, 2010 1 commit

Fix up rickety handling of relation-truncation interlocks. · cbe9d6be

Committed by Tom Lane on Feb 09, 2010

Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.

In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.

cbe9d6be

08 Feb, 2010 1 commit

Remove old-style VACUUM FULL (which was known for a little while as · 0a469c87

Committed by Tom Lane on Feb 08, 2010

VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
Per discussion, the use case for this method of vacuuming is no longer large
enough to justify maintaining it; not to mention that we don't wish to invest
the work that would be needed to make it play nicely with Hot Standby.

Aside from the code directly related to old-style VACUUM FULL, this commit
removes support for certain WAL record types that could only be generated
within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
nontransactional generation of cache invalidation sinval messages (the last
being the sticking point for Hot Standby).

We still have to retain all code that copes with finding HEAP_MOVED_OFF and
HEAP_MOVED_IN flag bits on existing tuples. This can't be removed as long
as we want to support in-place update from pre-9.0 databases.

0a469c87

04 Feb, 2010 1 commit

Restructure CLUSTER/newstyle VACUUM FULL/ALTER TABLE support so that swapping · 9727c583

Committed by Tom Lane on Feb 04, 2010

of old and new toast tables can be done either at the logical level (by
swapping the heaps' reltoastrelid links) or at the physical level (by swapping
the relfilenodes of the toast tables and their indexes). This is necessary
infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared
system catalogs, where we cannot change reltoastrelid. The physical swap
saves a few catalog updates too.

We unfortunately have to keep the logical-level swap logic because in some
cases we will be adding or deleting a toast table, so there's no possibility
of a physical swap. However, that only happens as a consequence of schema
changes in the table, which we do not need to support for system catalogs,
so such cases aren't an obstacle for that.

In passing, refactor the cluster support functions a little bit to eliminate
unnecessarily-duplicated code; and fix the problem that while CLUSTER had
been taught to rename the final toast table at need, ALTER TABLE had not.

9727c583

03 Feb, 2010 1 commit

Move the responsibility of writing an "unlogged WAL operation" record from · 9de778b2

Committed by Heikki Linnakangas on Feb 03, 2010

heap_sync() to the callers, because heap_sync() is sometimes called even
if the operation itself is WAL-logged. This eliminates the bogus unlogged
records from CLUSTER that Simon Riggs reported, patch by Fujii Masao.

9de778b2

30 Jan, 2010 1 commit

Filter recovery conflicts based upon dboid from relfilenode of WAL · 76be0c81

Committed by Simon Riggs on Jan 29, 2010

records for heap and btree. Minor change, mostly API changes to
pass through the required values. This is a simple change, though it
also provides the refactoring required for further enhancements
to conflict processing using the relOid. Changes only have effect
during Hot Standby.

76be0c81

21 Jan, 2010 1 commit

Write a WAL record whenever we perform an operation without WAL-logging · 09b115f7

Committed by Heikki Linnakangas on Jan 20, 2010

that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.

Original patch by Fujii Masao, with changes by me.

09b115f7

14 Jan, 2010 1 commit

First part of refactoring of code for ResolveRecoveryConflict. Purposes · e99767bc

Committed by Simon Riggs on Jan 14, 2010

of this are to centralise the conflict code to allow further change,
as well as to allow passing the full reason for the conflict
through to the conflicting backends. Backend state alters how we
can handle different types of conflict so this is now required.
As originally suggested by Heikki, no longer optional.

e99767bc

10 Jan, 2010 1 commit

Remove partial, broken support for NULL pointers when fetching attributes. · 84b6d5f3

Committed by Robert Haas on Jan 10, 2010

Previously, fastgetattr() and heap_getattr() tested their fourth argument
against a null pointer, but any attempt to use them with a literal-NULL
fourth argument evaluated to *(void *)0, resulting in a compiler error.
Remove these NULL tests to avoid leading future readers of this code to
believe that this has a chance of working. Also clean up related legacy
code in nocachegetattr(), heap_getsysattr(), and nocache_index_getattr().

The new coding standard is that any code which calls a getattr-type
function or macro which takes an isnull argument MUST pass a valid
boolean pointer. Per discussion with Bruce Momjian, Tom Lane, Alvaro
Herrera.

84b6d5f3

03 Jan, 2010 1 commit

Update copyright for the year 2010. · 02398008

Committed by Bruce Momjian on Jan 02, 2010

02398008

19 Dec, 2009 1 commit

Allow read only connections during recovery, known as Hot Standby. · efc16ea5

Committed by Simon Riggs on Dec 19, 2009

Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.

New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.

This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.

Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.

Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.

efc16ea5