1. 14 May, 2012 1 commit
    • Update comments that became out-of-date with the PGXACT struct. · 9e4637bf
      Heikki Linnakangas committed
      When the "hot" members of PGPROC were split off to separate PGXACT structs,
      many PGPROC fields referred to in comments were moved to PGXACT, but the
      comments were neglected in the commit. Mostly this is just a search/replace
      of PGPROC with PGXACT, but the way the dummy PGPROC entries are created for
      prepared transactions changed more, making some of the comments totally
      bogus.
      
      Noah Misch
  2. 09 May, 2012 1 commit
  3. 02 May, 2012 2 commits
  4. 24 April, 2012 1 commit
  5. 07 February, 2012 1 commit
    • Add locking around WAL-replay modification of shared-memory variables. · c6d76d7c
      Tom Lane committed
      Originally, most of this code assumed that no Postgres backends could be
      running concurrently with it, and so no locking could be needed.  That
      assumption fails in Hot Standby.  While it's still true that Hot Standby
      backends should never change values like nextXid, they can examine them,
      and consistency is important in some cases such as when computing a
      snapshot.  Therefore, prudence requires that WAL replay code obtain the
      relevant locks when modifying such variables, even though it can examine
      them without taking a lock.  We were following that coding rule in some
      places but not all.  This commit applies the coding rule uniformly to all
      updates of ShmemVariableCache and MultiXactState fields; a search of the
      replay routines did not find any other cases that seemed to be at risk.
      
      In addition, this commit fixes a longstanding thinko in replay of NEXTOID
      and checkpoint records: we tried to advance nextOid only if it was behind
      the value in the WAL record, but the comparison would draw the wrong
      conclusion if OID wraparound had occurred since the previous value.
      Better to just unconditionally assign the new value, since OID assignment
      shouldn't be happening during replay anyway.
      
      The additional locking seems to be more in the nature of future-proofing
      than fixing any live bug, so I am not going to back-patch it.  The NEXTOID
      fix will be back-patched separately.
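The NEXTOID wraparound problem described above can be illustrated with a small sketch. It is not the actual replay code: the function names are hypothetical, and the only assumption carried over is that OIDs are unsigned 32-bit counters that wrap around.

```c
#include <stdint.h>

typedef uint32_t Oid;   /* assumption: OIDs are unsigned 32-bit counters that wrap */

/* Hypothetical model of the old replay rule: advance only if we look "behind". */
Oid replay_nextoid_old(Oid current, Oid from_wal)
{
    if (current < from_wal)
        current = from_wal;
    return current;
}

/* Model of the rule the commit message settles on: always take the WAL value. */
Oid replay_nextoid_new(Oid current, Oid from_wal)
{
    (void) current;     /* the previous value is deliberately ignored */
    return from_wal;
}
```

If the in-memory counter is still near the top of the 32-bit range and the WAL record carries a small post-wraparound value, the plain `<` comparison in the old rule keeps the stale value, while the unconditional assignment does not.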
  6. 05 February, 2012 1 commit
  7. 30 January, 2012 1 commit
  8. 24 January, 2012 1 commit
    • Resolve timing issue with logging locks for Hot Standby. · c172b7b0
      Simon Riggs committed
      We log AccessExclusiveLocks for replay onto standby nodes,
      but because of timing issues on ProcArray it is possible to
      log a lock that is still held by a just committed transaction
      that is very soon to be removed. To avoid any timing issue we
      avoid applying locks made by transactions with InvalidXid.
      
      Simon Riggs, bug report Tom Lane, diagnosis Pavan Deolasee
  9. 02 January, 2012 1 commit
  10. 28 December, 2011 1 commit
    • Remove support for on_exit() · d383c23f
      Peter Eisentraut committed
      All supported platforms support the C89 standard function atexit()
      (SunOS 4 probably being the last one not to), and supporting both
      makes the code clumsy.
  11. 17 December, 2011 1 commit
    • Various micro-optimizations for GetSnapshotData(). · 0d76b60d
      Robert Haas committed
      Heikki Linnakangas had the idea of rearranging GetSnapshotData to
      avoid checking for sub-XIDs when no top-level XID is present.  This
      patch does that plus a bit of further, related rearrangement.
      Benchmarking shows a significant improvement on unlogged tables at
      higher concurrency levels, and mostly indifferent results on permanent
      tables (which are presumably bottlenecked elsewhere).  Most of the
      benefit seems to come from using the new NormalTransactionIdPrecedes()
      macro rather than the function call TransactionIdPrecedes().
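A rough sketch of why a macro restricted to normal XIDs can be cheaper than the general function: the general comparison has to special-case the reserved low XIDs, while the restricted form is a single wraparound-aware subtraction. The names below are hypothetical stand-ins rather than the definitions in the PostgreSQL headers, and the value 3 for the first normal XID is an assumption made for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumption for this sketch: XIDs below this value are reserved/special. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/* General comparison: special XIDs compare as plain unsigned integers,
 * normal XIDs compare modulo 2^32 so that wraparound is handled. */
bool xid_precedes(TransactionId a, TransactionId b)
{
    if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
        return a < b;
    return (int32_t) (a - b) < 0;
}

/* Restricted form: the caller guarantees both XIDs are normal, so the
 * branch above disappears and no function call is needed. */
#define NORMAL_XID_PRECEDES(a, b) \
    ((int32_t) ((TransactionId) (a) - (TransactionId) (b)) < 0)
```

The restricted macro gives the wrong answer if either argument is a special XID, which is why it is only safe where the caller has already established that both values are normal.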
  12. 25 November, 2011 1 commit
    • Move "hot" members of PGPROC into a separate PGXACT array. · ed0b409d
      Robert Haas committed
      This speeds up snapshot-taking and reduces ProcArrayLock contention.
      Also, the PGPROC (and PGXACT) structures used by two-phase commit are
      now allocated as part of the main array, rather than in a separate
      array, and we keep ProcArray sorted in pointer order.  These changes
      are intended to minimize the number of cache lines that must be pulled
      in to take a snapshot, and testing shows a substantial increase in
      performance on both read and write workloads at high concurrencies.
      
      Pavan Deolasee, Heikki Linnakangas, Robert Haas
  13. 02 November, 2011 2 commits
    • Derive oldestActiveXid at correct time for Hot Standby. · 86e33648
      Simon Riggs committed
      There was a timing window between when oldestActiveXid was derived
      and when it should have been derived that only shows itself under
      heavy load. Move code around to ensure correct timing of derivation.
      No change to StartupSUBTRANS() code, which is where this failed.
      
      Bug report by Chris Redekop
    • Start Hot Standby faster when initial snapshot is incomplete. · 10b7c686
      Simon Riggs committed
      If the initial snapshot had overflowed, we can now start as soon as
      the latest snapshot is empty or not overflowed. As before, we can
      also start when the xmin on the primary is higher than the xmax of
      our starting snapshot, which proves we have full snapshot data.
      
      Bug report by Chris Redekop
  14. 23 October, 2011 1 commit
    • Support synchronization of snapshots through an export/import procedure. · bb446b68
      Tom Lane committed
      A transaction can export a snapshot with pg_export_snapshot(), and then
      others can import it with SET TRANSACTION SNAPSHOT.  The data does not
      leave the server so there are no security issues.  A snapshot can only
      be imported while the exporting transaction is still running, and there
      are some other restrictions.
      
      I'm not totally convinced that we've covered all the bases for SSI (true
      serializable) mode, but it works fine for lesser isolation modes.
      
      Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified
      by Tom Lane
  15. 21 October, 2011 1 commit
    • Simplify and improve ProcessStandbyHSFeedbackMessage logic. · b4a0223d
      Tom Lane committed
      There's no need to clamp the standby's xmin to be greater than
      GetOldestXmin's result; if there were any such need this logic would be
      hopelessly inadequate anyway, because it fails to account for
      within-database versus cluster-wide values of GetOldestXmin.  So get rid of
      that, and just rely on sanity-checking that the xmin is not wrapped around
      relative to the nextXid counter.  Also, don't reset the walsender's xmin if
      the current feedback xmin is indeed out of range; that just creates more
      problems than we already had.  Lastly, don't bother to take the
      ProcArrayLock; there's no need to do that to set xmin.
      
      Also improve the comments about this in GetOldestXmin itself.
  16. 10 September, 2011 1 commit
    • Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. · a7801b62
      Tom Lane committed
      As per my recent proposal, this refactors things so that these typedefs and
      macros are available in a header that can be included in frontend-ish code.
      I also changed various headers that were undesirably including
      utils/timestamp.h to include datatype/timestamp.h instead.  Unsurprisingly,
      this showed that half the system was getting utils/timestamp.h by way of
      xlog.h.
      
      No actual code changes here, just header refactoring.
  17. 04 September, 2011 1 commit
    • Clean up the #include mess a little. · 1609797c
      Tom Lane committed
      walsender.h should depend on xlog.h, not vice versa.  (Actually, the
      inclusion was circular until a couple hours ago, which was even sillier;
      but Bruce broke it in the expedient rather than logically correct
      direction.)  Because of that poor decision, plus blind application of
      pgrminclude, we had a situation where half the system was depending on
      xlog.h to include such unrelated stuff as array.h and guc.h.  Clean up
      the header inclusion, and manually revert a lot of what pgrminclude had
      done so things build again.
      
      This episode reinforces my feeling that pgrminclude should not be run
      without adult supervision.  Inclusion changes in header files in particular
      need to be reviewed with great care.  More generally, it'd be good if we
      had a clearer notion of module layering to dictate which headers can sanely
      include which others ... but that's a big task for another day.
  18. 01 September, 2011 1 commit
  19. 18 August, 2011 1 commit
    • Remove obsolete README file. · 24bf1552
      Robert Haas committed
      Perhaps we ought to add some other kind of documentation here instead,
      but for now let's get rid of this woefully obsolete description of the
      sinval machinery.
  20. 05 August, 2011 1 commit
    • Create VXID locks "lazily" in the main lock table. · 84e37126
      Robert Haas committed
      Instead of entering them on transaction startup, we materialize them
      only when someone wants to wait, which will occur only during CREATE
      INDEX CONCURRENTLY.  In Hot Standby mode, the startup process must also
      be able to probe for conflicting VXID locks, but the lock need never be
      fully materialized, because the startup process does not use the normal
      lock wait mechanism.  Since most VXID locks never need to touch the
      lock manager partition locks, this can significantly reduce blocking
      contention on read-heavy workloads.
      
      Patch by me.  Review by Jeff Davis.
  21. 03 August, 2011 1 commit
    • Move CheckRecoveryConflictDeadlock() call to a safer place. · ac36e6f7
      Tom Lane committed
      This kluge was inserted in a spot apparently chosen at random: the lock
      manager's state is not yet fully set up for the wait, and in particular
      LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock
      will not get cleaned up if the ereport is thrown.  This seems to not cause
      any observable problem in trivial test cases, because LockReleaseAll will
      silently clean up the debris; but I was able to cause failures with tests
      involving subtransactions.
      
      Fixes breakage induced by commit c85c9414.
      Back-patch to all affected branches.
  22. 01 August, 2011 1 commit
  23. 30 July, 2011 1 commit
    • Reduce sinval synchronization overhead. · b4fbe392
      Robert Haas committed
      Testing shows that the overhead of acquiring and releasing
      SInvalReadLock and msgNumLock on high-core count boxes can waste a lot
      of CPU time and hurt performance.  This patch adds a per-backend flag
      that allows us to skip all that locking in most cases.  Further
      testing shows that this improves performance even when sinval traffic
      is very high.
      
      Patch by me.  Review and testing by Noah Misch.
  24. 09 July, 2011 1 commit
    • Try to acquire relation locks in RangeVarGetRelid. · 4240e429
      Robert Haas committed
      In the previous coding, we would look up a relation in RangeVarGetRelid,
      lock the resulting OID, and then AcceptInvalidationMessages().  While
      this was sufficient to ensure that we noticed any changes to the
      relation definition before building the relcache entry, it didn't
      handle the possibility that the name we looked up no longer referenced
      the same OID.  This was particularly problematic in the case where a
      table had been dropped and recreated: we'd latch on to the entry for
      the old relation and fail later on.  Now, we acquire the relation lock
      inside RangeVarGetRelid, and retry the name lookup if we notice that
      invalidation messages have been processed meanwhile.  Many operations
      that would previously have failed with an error in the presence of
      concurrent DDL will now succeed.
      
      There is a good deal of work remaining to be done here: many callers
      of RangeVarGetRelid still pass NoLock for one reason or another.  In
      addition, nothing in this patch guards against the possibility that
      the meaning of an unqualified name might change due to the creation
      of a relation in a schema earlier in the user's search path than the
      one where it was previously found.  Furthermore, there's nothing at
      all here to guard against similar race conditions for non-relations.
      For all that, it's a start.
      
      Noah Misch and Robert Haas
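The lookup-lock-recheck pattern described in this commit message can be sketched as a retry loop. This is a toy model under stated assumptions, not the real RangeVarGetRelid: the catalog, the invalidation counter, the simulated concurrent DDL, and every function name here are hypothetical, and the real code has more cases than this simplified loop.

```c
#include <stdint.h>

typedef uint32_t Oid;

/* Toy stand-ins for shared state (all hypothetical). */
static Oid      catalog_oid   = 100;  /* OID the name currently resolves to   */
static uint64_t inval_counter = 0;    /* bumped when invalidations are seen   */
static int      pending_ddl   = 1;    /* simulate one concurrent drop/recreate */

Oid lookup_name(void)
{
    return catalog_oid;
}

void lock_relation(Oid oid)
{
    (void) oid;
    /* While we wait for the lock, another session may drop and recreate
     * the table; acquiring the lock then processes the invalidations. */
    if (pending_ddl)
    {
        pending_ddl = 0;
        catalog_oid = 200;
        inval_counter++;
    }
}

void unlock_relation(Oid oid)
{
    (void) oid;
}

/* Look up the name, lock the result, and retry if invalidation messages
 * were processed in between, since the name may now mean something else. */
Oid lookup_and_lock(int *retries)
{
    *retries = 0;
    for (;;)
    {
        uint64_t seen = inval_counter;
        Oid      oid  = lookup_name();

        lock_relation(oid);
        if (seen == inval_counter)
            return oid;          /* nothing changed: lock is on the right OID */

        unlock_relation(oid);    /* stale answer: drop the lock and look again */
        (*retries)++;
    }
}
```

In this model the first pass locks the old OID, notices that invalidations arrived, and retries once, ending up with the lock on the recreated relation instead of failing later on the dropped one.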
  25. 08 July, 2011 1 commit
    • Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. · 89fd72cb
      Heikki Linnakangas committed
      Postmaster keeps the write-end of the pipe open,
      so when it dies, children get EOF in the read-end. That can conveniently
      be waited for in select(), which allows eliminating some of the polling
      loops that check for postmaster death. This patch doesn't yet change all
      the loops to use the new mechanism; expect a follow-on patch to do that.
      
      This changes the interface to WaitLatch, so that it takes as argument a
      bitmask of events that it waits for. Possible events are latch set, timeout,
      postmaster death, and socket becoming readable or writeable.
      
      The pipe method behaves slightly differently from the kill() method
      previously used in PostmasterIsAlive() in the case that postmaster has died,
      but its parent has not yet read its exit code with waitpid(). The pipe
      returns EOF as soon as the process dies, but kill() continues to return
      true until waitpid() has been called (IOW while the process is a zombie).
      Because of that, change PostmasterIsAlive() to use the pipe too, otherwise
      WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while
      PostmasterIsAlive() would claim it's still alive. That could easily lead to
      busy-waiting while postmaster is in zombie state.
      
      Peter Geoghegan with further changes by me, reviewed by Fujii Masao and
      Florian Pflug.
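The pipe-based liveness check described in this commit message can be sketched with plain POSIX calls. This is a minimal illustration, assuming a non-blocking read end; the helper names are hypothetical and error handling is reduced to return codes.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make the read end non-blocking so a liveness
 * probe never stalls while the writer is still alive. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Probe the read end of the pipe.
 * Returns 1 if some process still holds the write end open (parent alive),
 * 0 on EOF (every holder of the write end is gone),
 * -1 on anything unexpected (stray data or another error). */
int parent_is_alive(int watch_fd)
{
    char    c;
    ssize_t rc = read(watch_fd, &c, 1);

    if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;   /* nothing to read, but the writer still exists */
    if (rc == 0)
        return 0;   /* EOF: the write end has been closed everywhere */
    return -1;
}
```

Closing the write end in a single process is enough to simulate the parent's death: before the close the probe reports the parent as alive, afterwards it reports EOF. Because EOF appears as soon as the last write end is closed, this check does not depend on anyone having reaped the dead process, which is the difference from the kill() approach noted above.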
  26. 19 June, 2011 1 commit
  27. 12 April, 2011 1 commit
  28. 10 April, 2011 1 commit
  29. 23 March, 2011 1 commit
  30. 11 March, 2011 1 commit
  31. 09 March, 2011 1 commit
    • Don't throw a warning if vacuum sees PD_ALL_VISIBLE flag set on a page that contains newly-inserted tuples that according to our OldestXmin are not yet visible to everyone. · 93d88823
      Heikki Linnakangas committed
      The value returned by GetOldestXmin() is conservative,
      and it can move backwards on repeated calls, so if we see that contradiction
      between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to
      assume it's because an earlier vacuum calculated a higher OldestXmin value,
      and all the tuples really are visible to everyone.
      
      We have received several reports of this bug, with the "PD_ALL_VISIBLE flag
      was incorrectly set in relation ..." warning appearing in logs. We were
      finally able to hunt it down with David Gould's help to run extra diagnostics
      in an environment where this happened frequently.
      
      Also reword the warning, per Robert Haas' suggestion, to not imply that the
      PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom
      of corruption on a tuple header.
      
      Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.
  32. 07 March, 2011 1 commit
    • Efficient transaction-controlled synchronous replication. · a8a8a3e0
      Simon Riggs committed
      If a standby is broadcasting reply messages and we have named
      one or more standbys in synchronous_standby_names then allow
      users who set synchronous_replication to wait for commit, which
      then provides strict data integrity guarantees. Design avoids
      sending and receiving transaction state information so minimises
      bookkeeping overheads. We synchronize with the highest priority
      standby that is connected and ready to synchronize. Other standbys
      can be defined to take over in case of standby failure.
      
      This version has very strict behaviour; more relaxed options
      may be added at a later date.
      
      Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime
      Casanova, Heikki Linnakangas and Robert Haas, plus the assistance
      of many other design reviewers.
  33. 17 February, 2011 1 commit
    • Hot Standby feedback for avoidance of cleanup conflicts on standby. · bca8b7f1
      Simon Riggs committed
      Standby optionally sends back information about oldestXmin of queries
      which is then checked and applied to the WALSender's proc->xmin.
      GetOldestXmin() is modified slightly to agree with GetSnapshotData(),
      so that all backends on primary include WALSender within their snapshots.
      Note this does nothing to change the snapshot xmin on either master or
      standby. Feedback piggybacks on the standby reply message.
      vacuum_defer_cleanup_age is no longer used on standby, though the parameter
      still exists on primary, since some use cases still exist.
      
      Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas
  34. 08 February, 2011 1 commit
    • Implement genuine serializable isolation level. · dafaa3ef
      Heikki Linnakangas committed
      Until now, our Serializable mode has in fact been what's called Snapshot
      Isolation, which allows some anomalies that could not occur in any
      serialized ordering of the transactions. This patch fixes that using a
      method called Serializable Snapshot Isolation, based on research papers by
      Michael J. Cahill (see README-SSI for full references). In Serializable
      Snapshot Isolation, transactions run like they do in Snapshot Isolation,
      but a predicate lock manager observes the reads and writes performed and
      aborts transactions if it detects that an anomaly might occur. This method
      produces some false positives, i.e. it sometimes aborts transactions even
      though there is no anomaly.
      
      To track reads we implement predicate locking, see storage/lmgr/predicate.c.
      Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared
      memory is finite, so when a transaction takes many tuple-level locks on a
      page, the locks are promoted to a single page-level lock, and further to a
      single relation level lock if necessary. To lock key values with no matching
      tuple, a sequential scan always takes a relation-level lock, and an index
      scan acquires a page-level lock that covers the search key, whether or not
      there are any matching keys at the moment.
      
      A predicate lock doesn't conflict with any regular locks or with other
      predicate locks in the normal sense. They're only used by the predicate lock
      manager to detect the danger of anomalies. Only serializable transactions
      participate in predicate locking, so there should be no extra overhead
      for other transactions.
      
      Predicate locks can't be released at commit, but must be remembered until
      all the transactions that overlapped with the committing transaction have
      completed. That means that we need to remember an unbounded number of
      predicate locks, so we apply a
      lossy but conservative method of tracking locks for committed transactions.
      If we run short of shared memory, we overflow to a new "pg_serial" SLRU
      pool.
      
      We don't currently allow Serializable transactions in Hot Standby mode.
      That would be hard, because even read-only transactions can cause anomalies
      that wouldn't otherwise occur.
      
      Serializable isolation mode now means the new fully serializable level.
      Repeatable Read gives you the old Snapshot Isolation level that we have
      always had.
      
      Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and
      Anssi Kääriäinen
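The lock promotion described in this commit message (tuple-level locks collapsing into a page-level lock, and page-level locks into a relation-level lock) can be sketched with a toy bookkeeping structure. The thresholds, the fixed page count, and all names are hypothetical; the real logic in storage/lmgr/predicate.c is considerably more involved.

```c
#include <stdbool.h>

#define MAX_TUPLE_LOCKS_PER_PAGE 3   /* hypothetical promotion threshold */
#define MAX_PAGE_LOCKS_PER_REL   2   /* hypothetical promotion threshold */
#define NPAGES                   8   /* toy relation size; page must be 0..7 */

typedef struct
{
    int  tuple_locks[NPAGES];   /* tuple-level locks held on each page */
    bool page_locked[NPAGES];   /* true once a page-level lock covers it */
    int  page_locks;            /* number of page-level locks held */
    bool rel_locked;            /* true once one relation-level lock covers all */
} RelLocks;

/* Record a read of one tuple on the given page, promoting to a coarser
 * lock whenever a threshold is exceeded so memory use stays bounded. */
void lock_tuple(RelLocks *r, int page)
{
    if (r->rel_locked || r->page_locked[page])
        return;                         /* already covered by a coarser lock */

    if (++r->tuple_locks[page] > MAX_TUPLE_LOCKS_PER_PAGE)
    {
        /* Too many tuple locks: replace them with a single page lock. */
        r->tuple_locks[page] = 0;
        r->page_locked[page] = true;

        if (++r->page_locks > MAX_PAGE_LOCKS_PER_REL)
        {
            /* Too many page locks: replace them with one relation lock. */
            for (int i = 0; i < NPAGES; i++)
            {
                r->tuple_locks[i] = 0;
                r->page_locked[i] = false;
            }
            r->page_locks = 0;
            r->rel_locked = true;
        }
    }
}
```

Promotion trades precision for bounded memory: a coarser lock covers more than was actually read, which fits the commit message's note that the method can produce false positives.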
  35. 01 February, 2011 1 commit
  36. 18 January, 2011 1 commit
  37. 15 January, 2011 1 commit
    • Treat a WAL sender process that hasn't started streaming yet as a regular backend, as far as the postmaster shutdown logic is concerned. · 8f5d65e9
      Heikki Linnakangas committed
      That means,
      fast shutdown will wait for WAL sender processes to exit before signaling
      bgwriter to finish. This avoids race conditions between a base backup stopping
      or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't
      want e.g. the end-of-backup WAL record to be written after the shutdown
      checkpoint.
  38. 02 January, 2011 1 commit