Commits · 3b5548a3d524e3b37d49f79f707d2119ecdfa303 · Greenplum / Gpdb

14 May, 2012 1 commit

Update comments that became out-of-date with the PGXACT struct. · 9e4637bf

Authored by Heikki Linnakangas on May 14, 2012

When the "hot" members of PGPROC were split off to separate PGXACT structs,
many PGPROC fields referred to in comments were moved to PGXACT, but the
comments were neglected in the commit. Mostly this is just a search/replace
of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
prepared transactions changed more, making some of the comments totally
bogus.

Noah Misch

9e4637bf

09 May, 2012 1 commit

Rename BgWriterShmem/Request to CheckpointerShmem/Request · 8f28789b

Authored by Simon Riggs on May 09, 2012

8f28789b

02 May, 2012 2 commits

Further corrections from the department of redundancy department. · 1b4998fd

Authored by Robert Haas on May 02, 2012

Thom Brown

1b4998fd

Remove duplicate words in comments. · f291ccd4

Authored by Heikki Linnakangas on May 02, 2012

Found these with grep -r "for for ".

f291ccd4

24 April, 2012 1 commit

Lots of doc corrections. · 5d4b60f2

Authored by Robert Haas on April 23, 2012

Josh Kupershmidt

5d4b60f2

07 February, 2012 1 commit

Add locking around WAL-replay modification of shared-memory variables. · c6d76d7c

Authored by Tom Lane on February 06, 2012

Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed. That
assumption fails in Hot Standby. While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot. Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock. We were following that coding rule in some
places but not all. This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.

In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.
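
The wraparound problem can be sketched as follows. This is a minimal illustration, assuming a 32-bit unsigned OID counter; `ToyOid` and both function names are hypothetical and are not taken from the replay code.

```c
#include <stdint.h>

/* Illustrative stand-in for a 32-bit OID counter; not the real typedef. */
typedef uint32_t ToyOid;

/*
 * Old-style rule (hypothetical name): advance the counter only if it is
 * numerically behind the value carried by the WAL record.
 */
ToyOid
replay_next_oid_guarded(ToyOid current, ToyOid from_record)
{
	return (current < from_record) ? from_record : current;
}

/* New-style rule (hypothetical name): always adopt the record's value. */
ToyOid
replay_next_oid_unconditional(ToyOid current, ToyOid from_record)
{
	(void) current;				/* deliberately ignored */
	return from_record;
}
```

With `current = 4294967000` and a record value of `20000` (the counter wrapped on the primary in between), the guarded rule keeps the stale pre-wraparound value, while the unconditional rule adopts `20000`.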

The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it. The NEXTOID
fix will be back-patched separately.

c6d76d7c

05 February, 2012 1 commit

Add missing Assert and fix inaccurate elog message in standby_redo(). · 2af72cef

Authored by Tom Lane on February 04, 2012

All other WAL redo routines either call RestoreBkpBlocks() or Assert that
they haven't been passed any backup blocks. Make this one do likewise.
Also, fix incorrect routine name in its failure message.

2af72cef

30 January, 2012 1 commit

Assorted comment fixes, mostly just typos, but some obsolete statements. · ad10853b

Authored by Tom Lane on January 29, 2012

YAMAMOTO Takashi

ad10853b

24 January, 2012 1 commit

Resolve timing issue with logging locks for Hot Standby. · c172b7b0

Authored by Simon Riggs on January 23, 2012

We log AccessExclusiveLocks for replay onto standby nodes,
but because of timing issues on ProcArray it is possible to
log a lock that is still held by a just committed transaction
that is very soon to be removed. To avoid any timing issue we
avoid applying locks made by transactions with InvalidXid.

Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee

c172b7b0

02 January, 2012 1 commit

Update copyright notices for year 2012. · e126958c

Authored by Bruce Momjian on January 01, 2012

e126958c

28 December, 2011 1 commit

Remove support for on_exit() · d383c23f

Authored by Peter Eisentraut on December 27, 2011

All supported platforms support the C89 standard function atexit()
(SunOS 4 probably being the last one not to), and supporting both
makes the code clumsy.

d383c23f

17 December, 2011 1 commit

Various micro-optimizations for GetSnapshotData(). · 0d76b60d

Authored by Robert Haas on December 16, 2011

Heikki Linnakangas had the idea of rearranging GetSnapshotData to
avoid checking for sub-XIDs when no top-level XID is present. This
patch does that plus a bit of further, related rearrangement.
Benchmarking shows a significant improvement on unlogged tables at
higher concurrency levels, and mostly indifferent results on permanent
tables (which are presumably bottlenecked elsewhere). Most of the
benefit seems to come from using the new NormalTransactionIdPrecedes()
macro rather than the function call TransactionIdPrecedes().
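
The difference between a general comparison function and a macro restricted to normal XIDs can be sketched roughly as below. This is an illustration only: the `toy_`/`TOY_` names are hypothetical, and the assumption that XIDs below 3 are reserved special values is stated here rather than taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Assumption: XIDs below this value are reserved "special" XIDs. */
#define TOY_FIRST_NORMAL_XID ((ToyXid) 3)
#define TOY_XID_IS_NORMAL(x) ((x) >= TOY_FIRST_NORMAL_XID)

/*
 * General-purpose comparison: a real function call that must first check
 * for special XIDs before doing the wraparound-aware comparison.
 */
bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
	if (!TOY_XID_IS_NORMAL(a) || !TOY_XID_IS_NORMAL(b))
		return a < b;			/* special XIDs compare as plain integers */

	/*
	 * Modulo-2^32 comparison.  Converting an out-of-range value to int32_t
	 * is implementation-defined in C; this relies on the usual
	 * two's-complement behavior.
	 */
	return (int32_t) (a - b) < 0;
}

/*
 * Inline variant for callers that already know both XIDs are normal:
 * no function call and no special-case branch.
 */
#define TOY_NORMAL_XID_PRECEDES(a, b) ((int32_t) ((a) - (b)) < 0)
```

In a tight loop over many XIDs, the macro form avoids both the call overhead and the extra branch, which is consistent with where the commit message says most of the benefit came from.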

0d76b60d

25 November, 2011 1 commit

Move "hot" members of PGPROC into a separate PGXACT array. · ed0b409d

Authored by Robert Haas on November 25, 2011

This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order. These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas

ed0b409d

02 November, 2011 2 commits

Derive oldestActiveXid at correct time for Hot Standby. · 86e33648

Authored by Simon Riggs on November 02, 2011

There was a timing window between when oldestActiveXid was derived
and when it should have been derived that only shows itself under
heavy load. Move code around to ensure correct timing of derivation.
No change to StartupSUBTRANS() code, which is where this failed.

Bug report by Chris Redekop

86e33648

Start Hot Standby faster when initial snapshot is incomplete. · 10b7c686

Authored by Simon Riggs on November 02, 2011

If the initial snapshot had overflowed, then we can start whenever
the latest snapshot is empty or not overflowed or, as we did already,
when the xmin on the primary is higher than the xmax of our starting
snapshot, which proves we have full snapshot data.

Bug report by Chris Redekop

10b7c686

23 October, 2011 1 commit

Support synchronization of snapshots through an export/import procedure. · bb446b68

Authored by Tom Lane on October 22, 2011

A transaction can export a snapshot with pg_export_snapshot(), and then
others can import it with SET TRANSACTION SNAPSHOT.  The data does not
leave the server so there are no security issues.  A snapshot can only
be imported while the exporting transaction is still running, and there
are some other restrictions.

I'm not totally convinced that we've covered all the bases for SSI (true
serializable) mode, but it works fine for lesser isolation modes.

Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
by Tom Lane

bb446b68

21 October, 2011 1 commit

Simplify and improve ProcessStandbyHSFeedbackMessage logic. · b4a0223d

Authored by Tom Lane on October 20, 2011

There's no need to clamp the standby's xmin to be greater than
GetOldestXmin's result; if there were any such need this logic would be
hopelessly inadequate anyway, because it fails to account for
within-database versus cluster-wide values of GetOldestXmin. So get rid of
that, and just rely on sanity-checking that the xmin is not wrapped around
relative to the nextXid counter. Also, don't reset the walsender's xmin if
the current feedback xmin is indeed out of range; that just creates more
problems than we already had. Lastly, don't bother to take the
ProcArrayLock; there's no need to do that to set xmin.

Also improve the comments about this in GetOldestXmin itself.

b4a0223d

10 September, 2011 1 commit

Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. · a7801b62

Authored by Tom Lane on September 09, 2011

As per my recent proposal, this refactors things so that these typedefs and
macros are available in a header that can be included in frontend-ish code.
I also changed various headers that were undesirably including
utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
this showed that half the system was getting utils/timestamp.h by way of
xlog.h.

No actual code changes here, just header refactoring.

a7801b62

04 September, 2011 1 commit

Clean up the #include mess a little. · 1609797c

Authored by Tom Lane on September 04, 2011

walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.

This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.

1609797c

01 September, 2011 1 commit

Remove unnecessary #include references, per pgrminclude script. · 6416a82a

Authored by Bruce Momjian on September 01, 2011

6416a82a

18 August, 2011 1 commit

Remove obsolete README file. · 24bf1552

Authored by Robert Haas on August 18, 2011

Perhaps we ought to add some other kind of documentation here instead,
but for now let's get rid of this woefully obsolete description of the
sinval machinery.

24bf1552

05 August, 2011 1 commit

Create VXID locks "lazily" in the main lock table. · 84e37126

Authored by Robert Haas on August 04, 2011

Instead of entering them on transaction startup, we materialize them
only when someone wants to wait, which will occur only during CREATE
INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also
be able to probe for conflicting VXID locks, but the lock need never be
fully materialized, because the startup process does not use the normal
lock wait mechanism. Since most VXID locks never need to touch the
lock manager partition locks, this can significantly reduce blocking
contention on read-heavy workloads.

Patch by me. Review by Jeff Davis.

84e37126

03 August, 2011 1 commit

Move CheckRecoveryConflictDeadlock() call to a safer place. · ac36e6f7

Authored by Tom Lane on August 02, 2011

This kluge was inserted in a spot apparently chosen at random: the lock
manager's state is not yet fully set up for the wait, and in particular
LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
will not get cleaned up if the ereport is thrown.  This seems to not cause
any observable problem in trivial test cases, because LockReleaseAll will
silently clean up the debris; but I was able to cause failures with tests
involving subtransactions.

Fixes breakage induced by commit c85c9414.
Back-patch to all affected branches.

ac36e6f7

01 August, 2011 1 commit

Minor stylistic corrections. · 85b436f7

Authored by Robert Haas on August 01, 2011

85b436f7

30 July, 2011 1 commit

Reduce sinval synchronization overhead. · b4fbe392

Authored by Robert Haas on July 29, 2011

Testing shows that the overhead of acquiring and releasing
SInvalReadLock and msgNumLock on high-core-count boxes can waste a lot
of CPU time and hurt performance.  This patch adds a per-backend flag
that allows us to skip all that locking in most cases.  Further
testing shows that this improves performance even when sinval traffic
is very high.

Patch by me.  Review and testing by Noah Misch.

b4fbe392

09 July, 2011 1 commit

Try to acquire relation locks in RangeVarGetRelid. · 4240e429

Authored by Robert Haas on July 08, 2011

In the previous coding, we would look up a relation in RangeVarGetRelid,
lock the resulting OID, and then AcceptInvalidationMessages(). While
this was sufficient to ensure that we noticed any changes to the
relation definition before building the relcache entry, it didn't
handle the possibility that the name we looked up no longer referenced
the same OID. This was particularly problematic in the case where a
table had been dropped and recreated: we'd latch on to the entry for
the old relation and fail later on. Now, we acquire the relation lock
inside RangeVarGetRelid, and retry the name lookup if we notice that
invalidation messages have been processed meanwhile. Many operations
that would previously have failed with an error in the presence of
concurrent DDL will now succeed.
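
The lookup, lock, and retry pattern described above can be sketched with a toy model. Everything here is hypothetical: `toy_catalog`, `toy_lock`, and `toy_lookup_and_lock` are illustrative names, and the "concurrent" drop and recreate is simulated inside the lock call rather than performed by another session.

```c
typedef unsigned int ToyOid;

/* Toy stand-in for "what one relation name resolves to" plus lock state. */
struct toy_catalog
{
	ToyOid		current_oid;	/* OID the name resolves to right now */
	unsigned long inval_count;	/* bumped when invalidations are processed */
	ToyOid		pending_oid;	/* nonzero: a drop/recreate lands while we wait */
	ToyOid		locked_oid;		/* OID we hold a lock on, 0 if none */
};

/* Take the lock; waiting for it is where concurrent DDL becomes visible. */
static void
toy_lock(struct toy_catalog *cat, ToyOid oid)
{
	if (cat->pending_oid != 0)
	{
		cat->current_oid = cat->pending_oid;
		cat->pending_oid = 0;
		cat->inval_count++;		/* invalidation messages were processed */
	}
	cat->locked_oid = oid;
}

/* Look the name up, lock it, and retry if invalidations arrived meanwhile. */
ToyOid
toy_lookup_and_lock(struct toy_catalog *cat)
{
	ToyOid		old_oid = 0;
	int			retry = 0;

	for (;;)
	{
		unsigned long seen = cat->inval_count;
		ToyOid		oid = cat->current_oid;	/* the "name lookup" */

		if (retry)
		{
			if (oid == old_oid)
				return oid;		/* name still maps to what we locked */
			cat->locked_oid = 0;	/* release the lock on the stale OID */
		}
		toy_lock(cat, oid);
		if (seen == cat->inval_count)
			return oid;			/* nothing changed while we locked */
		retry = 1;
		old_oid = oid;
	}
}
```

Starting from a catalog where the name maps to OID 100 and a recreate to OID 200 is pending, the first pass locks 100, notices the counter moved, looks the name up again, and ends up holding the lock on 200 instead of latching on to the dropped relation.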

There is a good deal of work remaining to be done here: many callers
of RangeVarGetRelid still pass NoLock for one reason or another. In
addition, nothing in this patch guards against the possibility that
the meaning of an unqualified name might change due to the creation
of a relation in a schema earlier in the user's search path than the
one where it was previously found. Furthermore, there's nothing at
all here to guard against similar race conditions for non-relations.
For all that, it's a start.

Noah Misch and Robert Haas

4240e429

08 July, 2011 1 commit

Introduce a pipe between postmaster and each backend, which can be used to · 89fd72cb

Authored by Heikki Linnakangas on July 08, 2011

detect postmaster death. Postmaster keeps the write-end of the pipe open,
so when it dies, children get EOF in the read-end. That can conveniently
be waited for in select(), which allows eliminating some of the polling
loops that check for postmaster death. This patch doesn't yet change all
the loops to use the new mechanism; expect a follow-on patch to do that.

This changes the interface to WaitLatch, so that it takes as argument a
bitmask of events that it waits for. Possible events are latch set, timeout,
postmaster death, and socket becoming readable or writeable.
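
A bitmask interface of this kind might look roughly like the following. The `TOY_WL_*` names, their bit values, and the helper function are hypothetical and only illustrate the idea; they are not the definitions used by WaitLatch.

```c
/* Hypothetical event bits; names and values are illustrative only. */
#define TOY_WL_LATCH_SET        (1 << 0)
#define TOY_WL_SOCKET_READABLE  (1 << 1)
#define TOY_WL_SOCKET_WRITEABLE (1 << 2)
#define TOY_WL_TIMEOUT          (1 << 3)
#define TOY_WL_POSTMASTER_DEATH (1 << 4)

/*
 * Given the events a caller asked to wait for and the events that actually
 * occurred, report only the ones the caller is interested in.
 */
int
toy_filter_events(int wake_events, int occurred)
{
	return wake_events & occurred;
}
```

A caller interested in the latch and in postmaster death would pass `TOY_WL_LATCH_SET | TOY_WL_POSTMASTER_DEATH` and then test the returned mask bit by bit.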

The pipe method behaves slightly differently from the kill() method
previously used in PostmasterIsAlive() in the case that postmaster has died,
but its parent has not yet read its exit code with waitpid(). The pipe
returns EOF as soon as the process dies, but kill() continues to return
true until waitpid() has been called (IOW while the process is a zombie).
Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
PostmasterIsAlive() would claim it's still alive. That could easily lead to
busy-waiting while postmaster is in zombie state.
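
The pipe technique itself can be sketched with plain POSIX calls. This is a minimal sketch, assuming a non-blocking read end; the `toy_` function names are hypothetical and this is not the code from the patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Create the pipe and make the read end non-blocking.  fds[0] is the read
 * end (checked by children); fds[1] is the write end (kept open only by
 * the parent).  Returns 0 on success, -1 on error.
 */
int
toy_death_pipe_create(int fds[2])
{
	int			flags;

	if (pipe(fds) < 0)
		return -1;
	flags = fcntl(fds[0], F_GETFL);
	if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
	{
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	return 0;
}

/*
 * Returns 1 if some process still holds the write end open, 0 on EOF
 * (every holder has exited or closed it), and -1 on anything unexpected.
 */
int
toy_parent_is_alive(int read_fd)
{
	char		c;
	ssize_t		rc = read(read_fd, &c, 1);

	if (rc < 0)
		return (errno == EAGAIN || errno == EWOULDBLOCK) ? 1 : -1;
	return (rc == 0) ? 0 : -1;	/* nobody is supposed to write data */
}
```

In a real multi-process setup, each child has to close its inherited copy of the write end after `fork()`; otherwise the child itself keeps the pipe open and never sees EOF. The read end can also be added to a `select()` descriptor set, so that EOF wakes a sleeping process instead of being polled for.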

Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
Florian Pflug.

89fd72cb

19 June, 2011 1 commit

Capitalization fixes · 8a8fbe7e

Authored by Peter Eisentraut on June 19, 2011

8a8fbe7e

12 April, 2011 1 commit

Clean up most -Wunused-but-set-variable warnings from gcc 4.6 · 5caa3479

Authored by Peter Eisentraut on April 11, 2011

This warning is new in gcc 4.6 and part of -Wall.  This patch cleans
up most of the noise, but there are still some warnings that are
trickier to remove.

5caa3479

10 April, 2011 1 commit

pgindent run before PG 9.1 beta 1. · bf50caf1

Authored by Bruce Momjian on April 10, 2011

bf50caf1

23 March, 2011 1 commit

Prevent intermittent hang in recovery from bgwriter interaction. · b98ac467

Authored by Simon Riggs on March 23, 2011

Startup process waited for cleanup lock but when hot_standby = off
the pid was not registered, so that the bgwriter would not wake
the waiting process as intended.

b98ac467

11 March, 2011 1 commit

Minor sync rep corrections. · 64360987

Authored by Robert Haas on March 10, 2011

Fujii Masao, with a bit of additional wordsmithing by me.

64360987

09 March, 2011 1 commit

Don't throw a warning if vacuum sees PD_ALL_VISIBLE flag set on a page that · 93d88823

Authored by Heikki Linnakangas on March 08, 2011

contains newly-inserted tuples that according to our OldestXmin are not
yet visible to everyone. The value returned by GetOldestXmin() is conservative,
and it can move backwards on repeated calls, so if we see that contradiction
between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to
assume it's because an earlier vacuum calculated a higher OldestXmin value,
and all the tuples really are visible to everyone.

We have received several reports of this bug, with the "PD_ALL_VISIBLE flag
was incorrectly set in relation ..." warning appearing in logs. We were
finally able to hunt it down with David Gould's help to run extra diagnostics
in an environment where this happened frequently.

Also reword the warning, per Robert Haas' suggestion, to not imply that the
PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom
of corruption on a tuple header.

Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.

93d88823

07 March, 2011 1 commit

Efficient transaction-controlled synchronous replication. · a8a8a3e0

Authored by Simon Riggs on March 06, 2011

If a standby is broadcasting reply messages and we have named
one or more standbys in synchronous_standby_names then allow
users who set synchronous_replication to wait for commit, which
then provides strict data integrity guarantees. Design avoids
sending and receiving transaction state information so minimises
bookkeeping overheads. We synchronize with the highest priority
standby that is connected and ready to synchronize. Other standbys
can be defined to take over in case of standby failure.
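
Choosing the standby to synchronize with can be sketched as below. This is a toy model under stated assumptions: the struct, its fields, and the convention that a smaller positive number means higher priority (with 0 meaning "not a synchronous candidate") are all hypothetical and are not taken from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_standby
{
	int			priority;	/* assumption: 1 is highest, 0 is "not a candidate" */
	bool		connected;
	bool		ready;		/* ready to synchronize */
};

/* Return the index of the standby to wait for, or -1 if none qualifies. */
int
toy_pick_sync_standby(const struct toy_standby *s, size_t n)
{
	int			best = -1;

	for (size_t i = 0; i < n; i++)
	{
		if (s[i].priority <= 0 || !s[i].connected || !s[i].ready)
			continue;
		if (best < 0 || s[i].priority < s[best].priority)
			best = (int) i;
	}
	return best;
}
```

If the highest-priority standby disconnects, the next call picks the next eligible one, which is the takeover behavior the message describes.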

This version has very strict behaviour; more relaxed options
may be added at a later date.

Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
of many other design reviewers.

a8a8a3e0

17 February, 2011 1 commit

Hot Standby feedback for avoidance of cleanup conflicts on standby. · bca8b7f1

Authored by Simon Riggs on February 16, 2011

Standby optionally sends back information about oldestXmin of queries
which is then checked and applied to the WALSender's proc->xmin.
GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
so that all backends on primary include WALSender within their snapshots.
Note this does nothing to change the snapshot xmin on either master or
standby. Feedback piggybacks on the standby reply message.
vacuum_defer_cleanup_age is no longer used on standby, though parameter
still exists on primary, since some use cases still exist.

Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas

bca8b7f1

08 February, 2011 1 commit

Implement genuine serializable isolation level. · dafaa3ef

Authored by Heikki Linnakangas on February 07, 2011

Until now, our Serializable mode has in fact been what's called Snapshot
Isolation, which allows some anomalies that could not occur in any
serialized ordering of the transactions. This patch fixes that using a
method called Serializable Snapshot Isolation, based on research papers by
Michael J. Cahill (see README-SSI for full references). In Serializable
Snapshot Isolation, transactions run like they do in Snapshot Isolation,
but a predicate lock manager observes the reads and writes performed and
aborts transactions if it detects that an anomaly might occur. This method
produces some false positives, ie. it sometimes aborts transactions even
though there is no anomaly.

To track reads we implement predicate locking, see storage/lmgr/predicate.c.
Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
memory is finite, so when a transaction takes many tuple-level locks on a
page, the locks are promoted to a single page-level lock, and further to a
single relation level lock if necessary. To lock key values with no matching
tuple, a sequential scan always takes a relation-level lock, and an index
scan acquires a page-level lock that covers the search key, whether or not
there are any matching keys at the moment.
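
Granularity promotion can be illustrated with a toy counter. This is only a sketch: the threshold, the struct, and the function name are hypothetical, each call is assumed to read a distinct tuple, and the actual promotion policy in storage/lmgr/predicate.c is not described in this message.

```c
#include <stdbool.h>

/* Hypothetical threshold; the real promotion policy is not shown here. */
#define TOY_MAX_TUPLE_LOCKS_PER_PAGE 3

struct toy_page_locks
{
	int			tuple_locks;	/* tuple-level locks currently held on the page */
	bool		page_locked;	/* true once promoted to one page-level lock */
};

/*
 * Record a read of one (distinct) tuple on the page, promoting to a single
 * page-level lock once the threshold is exceeded.
 */
void
toy_lock_tuple(struct toy_page_locks *p)
{
	if (p->page_locked)
		return;					/* the page lock already covers every tuple */
	if (++p->tuple_locks > TOY_MAX_TUPLE_LOCKS_PER_PAGE)
	{
		p->tuple_locks = 0;		/* tuple locks are subsumed by the page lock */
		p->page_locked = true;
	}
}
```

The trade-off is that a coarser lock covers tuples that were never read, so it bounds memory use at the cost of more false positives.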

A predicate lock doesn't conflict with any regular locks or with other
predicate locks in the normal sense. They're only used by the predicate lock
manager to detect the danger of anomalies. Only serializable transactions
participate in predicate locking, so there should be no extra overhead for
other transactions.

Predicate locks can't be released at commit, but must be remembered until
all the transactions that overlapped with it have completed. That means that
we need to remember an unbounded number of predicate locks, so we apply a
lossy but conservative method of tracking locks for committed transactions.
If we run short of shared memory, we overflow to a new "pg_serial" SLRU
pool.

We don't currently allow Serializable transactions in Hot Standby mode.
That would be hard, because even read-only transactions can cause anomalies
that wouldn't otherwise occur.

Serializable isolation mode now means the new fully serializable level.
Repeatable Read gives you the old Snapshot Isolation level that we have
always had.

Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
Anssi Kääriäinen

dafaa3ef

01 February, 2011 1 commit

Fix error code for canceling statement due to conflict with recovery. · 8585ad36

Authored by Simon Riggs on January 31, 2011

All retryable conflict errors now have an error code that indicates that
a retry is possible, correcting my incomplete fix of 2010/05/12

Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug

8585ad36

18 January, 2011 1 commit

Fix thinko in comment. Spotted by Jim Nasby. · b1dc45c1

Authored by Heikki Linnakangas on January 18, 2011

b1dc45c1

15 January, 2011 1 commit

Treat a WAL sender process that hasn't started streaming yet as a regular · 8f5d65e9

Authored by Heikki Linnakangas on January 15, 2011

backend, as far as the postmaster shutdown logic is concerned. That means,
fast shutdown will wait for WAL sender processes to exit before signaling
bgwriter to finish. This avoids race conditions between a base backup stopping
or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
want e.g. the end-of-backup WAL record to be written after the shutdown
checkpoint.

8f5d65e9

02 January, 2011 1 commit

Stamp copyrights for year 2011. · 5d950e3b

Authored by Bruce Momjian on January 01, 2011

5d950e3b