1. 17 Sep, 2013 1 commit
  2. 04 Sep, 2013 2 commits
  3. 29 Aug, 2013 2 commits
    • H
      Use a non-locking initial test in TAS_SPIN on x86_64. · b03d196b
      Heikki Linnakangas committed
      Testing done in 2011 by Tom Lane concluded that this is a win on Intel Xeons
      and AMD Opterons, but it was not changed back then, because of an old
      comment in tas() that suggested that it's a huge loss on older Opterons.
      However, we didn't have separate TAS() and TAS_SPIN() macros back then, so the
      comment referred to doing a non-locked initial test even on the first
      access, in the uncontended case. I don't have access to older Opterons, but I'm
      pretty sure that doing an initial unlocked test is unlikely to be a loss
      while spinning, even though it might be for the first access.
      
      We probably should do the same on 32-bit x86, but I'm afraid of changing it
      without any testing. Hence just add a note to the x86 implementation
      suggesting that we probably should do the same there.
      b03d196b
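The difference between the first access and the spin loop can be sketched in portable C11 atomics. This is a minimal illustration only: the `demo_*` names are hypothetical, and the real TAS()/TAS_SPIN() macros in s_lock.h are written with platform-specific inline assembly rather than `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type for illustration; not PostgreSQL's slock_t. */
typedef struct { atomic_int locked; } demo_lock;

/* Locked exchange: returns true if the lock was already held. */
static bool demo_tas(demo_lock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) != 0;
}

/*
 * Spin variant: do a plain (non-locking) read first and attempt the
 * locked exchange only when the lock looks free.
 */
static bool demo_tas_spin(demo_lock *l)
{
    if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
        return true;            /* still held, skip the locked instruction */
    return demo_tas(l);
}

static void demo_acquire(demo_lock *l)
{
    if (!demo_tas(l))           /* first access: go straight to the locked op */
        return;
    while (demo_tas_spin(l))
        ;                       /* contended: spin using the cheap read */
}

static void demo_release(demo_lock *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The first attempt skips the unlocked read because, in the uncontended case, the read would only add latency before the exchange that is needed anyway.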
    • R
      Allow discovery of whether a dynamic background worker is running. · 090d0f20
      Robert Haas committed
      Using the infrastructure provided by this patch, it's possible either
      to wait for the startup of a dynamically-registered background worker,
      or to poll the status of such a worker without waiting.  In either
      case, the current PID of the worker process can also be obtained.
      As usual, worker_spi is updated to demonstrate the new functionality.
      
      Patch by me.  Review by Andres Freund.
      090d0f20
  4. 18 Jul, 2013 1 commit
    • T
      Fix a few problems in barrier.h. · 89779bf2
      Tom Lane committed
      On HPPA, implement pg_memory_barrier() as pg_compiler_barrier(), which
      should be correct since this arch doesn't do memory access reordering,
      and is anyway better than the completely-nonfunctional-on-this-arch
      dummy_spinlock code.  (But note this patch only fixes things for gcc,
      not for builds with HP's compiler.)
      
      Also, fix incorrect default definition of pg_memory_barrier as a macro
      requiring an argument.
      
      Also, fix incorrect spelling of "#elif" as "#else if" in icc code path
      (spotted by pgindent).
      
      This doesn't come close to fixing all of the functional and stylistic
      deficiencies in barrier.h, but at least it un-breaks my personal build.
      Now that we're actually using barriers in the code, this file is going
      to need some serious attention.
      89779bf2
  5. 17 Jul, 2013 1 commit
    • R
      Allow background workers to be started dynamically. · 7f7485a0
      Robert Haas committed
      There is a new API, RegisterDynamicBackgroundWorker, which allows
      an ordinary user backend to register a new background worker during
      normal running.  This means that it's no longer necessary for all
      background workers to be registered during processing of
      shared_preload_libraries, although the option of registering workers
      at that time remains available.
      
      When a background worker exits and will not be restarted, the
      slot previously used by that background worker is automatically
      released and becomes available for reuse.  Slots used by background
      workers that are configured for automatic restart can't (yet) be
      released without shutting down the system.
      
      This commit adds a new source file, bgworker.c, and moves some
      of the existing control logic for background workers there.
      Previously, there was little enough logic that it made sense to
      keep everything in postmaster.c, but not any more.
      
      This commit also makes the worker_spi contrib module into an
      extension and adds a new function, worker_spi_launch, which can
      be used to demonstrate the new facility.
      7f7485a0
  6. 09 Jul, 2013 2 commits
  7. 08 Jul, 2013 1 commit
    • H
      Improve scalability of WAL insertions. · 9a20a9b2
      Heikki Linnakangas committed
      This patch replaces WALInsertLock with a number of WAL insertion slots,
      allowing multiple backends to insert WAL records to the WAL buffers
      concurrently. This is particularly useful for parallel loading large amounts
      of data on a system with many CPUs.
      
      This has one user-visible change: switching to a new WAL segment with
      pg_switch_xlog() now fills the remaining unused portion of the segment with
      zeros. This potentially adds some overhead, but it has been a very common
      practice by DBAs to clear the "tail" of the segment with an external
      pg_clearxlogtail utility anyway, to make the WAL files compress better.
      With this patch, it's no longer necessary to do that.
      
      This patch adds a new GUC, xloginsert_slots, to tune the number of WAL
      insertion slots. Performance testing suggests that the default, 8, works
      pretty well for all kinds of workloads, but I left the GUC in place to allow
      others with different hardware to test that easily. We might want to remove
      that before release.
      
      Reviewed by Andres Freund.
      9a20a9b2
  8. 02 Jul, 2013 1 commit
    • R
      Use an MVCC snapshot, rather than SnapshotNow, for catalog scans. · 568d4138
      Robert Haas committed
      SnapshotNow scans have the undesirable property that, in the face of
      concurrent updates, the scan can fail to see either the old or the new
      versions of the row.  In many cases, we work around this by requiring
      DDL operations to hold AccessExclusiveLock on the object being
      modified; in some cases, the existing locking is inadequate and random
      failures occur as a result.  This commit doesn't change anything
      related to locking, but will hopefully pave the way to allowing lock
      strength reductions in the future.
      
      The major issue that has held us back from making this change in the past
      is that taking an MVCC snapshot is significantly more expensive than
      using a static special snapshot such as SnapshotNow.  However, testing
      of various worst-case scenarios reveals that this problem is not
      severe except under fairly extreme workloads.  To mitigate those
      problems, we avoid retaking the MVCC snapshot for each new scan;
      instead, we take a new snapshot only when invalidation messages have
      been processed.  The catcache machinery already requires that
      invalidation messages be sent before releasing the related heavyweight
      lock; else other backends might rely on locally-cached data rather
      than scanning the catalog at all.  Thus, making snapshot reuse
      dependent on the same guarantees shouldn't break anything that wasn't
      already subtly broken.
      
      Patch by me.  Review by Michael Paquier and Andres Freund.
      568d4138
  9. 30 Jun, 2013 1 commit
  10. 23 Jun, 2013 1 commit
    • S
      Ensure no xid gaps during Hot Standby startup · 1f09121b
      Simon Riggs committed
      In some cases with higher numbers of subtransactions
      it was possible for us to incorrectly initialize
      subtrans leading to complaints of missing pages.
      
      Bug report by Sergey Konoplev
      Analysis and fix by Andres Freund
      1f09121b
  11. 17 Jun, 2013 1 commit
    • J
      Add buffer_std flag to MarkBufferDirtyHint(). · b8fd1a09
      Jeff Davis committed
      MarkBufferDirtyHint() writes WAL, and should know if it's got a
      standard buffer or not. Currently, the only callers where buffer_std
      is false are related to the FSM.
      
      In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive.
      
      Back-patch to 9.3.
      b8fd1a09
  12. 14 Jun, 2013 1 commit
    • T
      Refactor checksumming code to make it easier to use externally. · f0421634
      Tom Lane committed
      pg_filedump and other external utility programs are likely to want to be
      able to check Postgres page checksums.  To avoid messy duplication of code,
      move the checksumming functionality into an exported header file, much as
      we did awhile back for the CRC code.
      
      In passing, get rid of an unportable assumption that a static char[] array
      will be word-aligned, and do some other minor code beautification.
      f0421634
  13. 05 Jun, 2013 1 commit
    • T
      Add ARM64 (aarch64) support to s_lock.h. · 5c7603c3
      Tom Lane committed
      Use the same gcc atomic functions as we do on newer ARM chips.
      (Basically this is a copy and paste of the __arm__ code block,
      but omitting the SWPB option since that definitely won't work.)
      
      Back-patch to 9.2.  The patch would work further back, but we'd also
      need to update config.guess/config.sub in older branches to make them
      build out-of-the-box, and there hasn't been demand for it.
      
      Mark Salter
      5c7603c3
  14. 30 May, 2013 1 commit
  15. 30 Apr, 2013 1 commit
  16. 29 Apr, 2013 1 commit
    • S
      Introduce new page checksum algorithm and module. · 43e7a668
      Simon Riggs committed
      Isolate checksum calculation to its own module, so that bufpage
      knows little if anything about the details of the calculation.
      
      This implementation is a modified FNV-1a hash checksum, details
      of which are given in the new checksum.c header comments.
      
      Basic implementation only, so we fix the output value.
      
      Later related commits will add version numbers to pg_control,
      compiler optimization flags and memory barriers.
      
      Ants Aasma, reviewed by Jeff Davis and Simon Riggs
      43e7a668
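For readers unfamiliar with the hash family mentioned above, here is a minimal sketch of the standard, unmodified 32-bit FNV-1a hash. It is background only: the commit describes a modified variant whose details live in the checksum.c header comments, so this is not PostgreSQL's page checksum. The `demo_*` names and the XOR-fold to 16 bits are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 32-bit FNV parameters. */
#define DEMO_FNV_OFFSET_BASIS 0x811c9dc5u
#define DEMO_FNV_PRIME        0x01000193u

/* Plain FNV-1a: XOR each byte into the hash, then multiply by the prime. */
static uint32_t demo_fnv1a32(const unsigned char *data, size_t len)
{
    uint32_t h = DEMO_FNV_OFFSET_BASIS;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
    {
        h ^= data[i];
        h *= DEMO_FNV_PRIME;    /* wraps modulo 2^32 */
    }
    return h;
}

/* Assumed reduction to 16 bits by XOR-folding the two halves. */
static uint16_t demo_fold16(uint32_t h)
{
    return (uint16_t) ((h >> 16) ^ (h & 0xffffu));
}
```

Hashing an empty input returns the offset basis unchanged, which makes a convenient sanity check for any implementation.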
  17. 22 Mar, 2013 1 commit
    • S
      Allow I/O reliability checks using 16-bit checksums · 96ef3b8f
      Simon Riggs committed
      Checksums are set immediately prior to flush out of shared buffers
      and checked when pages are read in again. Hint bit setting will
      require full page write when block is dirtied, which causes various
      infrastructure changes. Extensive comments, docs and README.
      
      WARNING message thrown if checksum fails on non-all zeroes page;
      ERROR thrown but can be disabled with ignore_checksum_failure = on.
      
      Feature enabled by an initdb option, since transition from option off
      to option on is long and complex and has not yet been implemented.
      Default is not to use checksums.
      
      Checksum used is WAL CRC-32 truncated to 16 bits.
      
      Simon Riggs, Jeff Davis, Greg Smith
      Wide input and assistance from many community members. Thank you.
      96ef3b8f
  18. 18 Mar, 2013 1 commit
    • S
      Remove PageSetTLI and rename pd_tli to pd_checksum · bb7cc262
      Simon Riggs committed
      Remove use of PageSetTLI() from all page manipulation functions
      and adjust README to indicate change in the way we make changes
      to pages. Repurpose those bytes into the pd_checksum field and
      explain how that works in comments about page header.
      
      Refactoring ahead of actual feature patch which would make use
      of the checksum field, arriving later.
      
      Jeff Davis, with comments and doc changes by Simon Riggs
      Direction suggested by Robert Haas; many others providing
      review comments.
      bb7cc262
  19. 17 Mar, 2013 1 commit
    • T
      Add lock_timeout configuration parameter. · d43837d0
      Tom Lane committed
      This GUC allows limiting the time spent waiting to acquire any one
      heavyweight lock.
      
      In support of this, improve the recently-added timeout infrastructure
      to permit efficiently enabling or disabling multiple timeouts at once.
      That reduces the performance hit from turning on lock_timeout, though
      it's still not zero.
      
      Zoltán Böszörményi, reviewed by Tom Lane,
      Stephen Frost, and Hari Babu
      d43837d0
  20. 28 Feb, 2013 1 commit
    • H
      Add support for piping COPY to/from an external program. · 3d009e45
      Heikki Linnakangas committed
      This includes backend "COPY TO/FROM PROGRAM '...'" syntax, and corresponding
      psql \copy syntax. Like with reading/writing files, the backend version is
      superuser-only, and in the psql version, the program is run in the client.
      
      In passing, the psql \copy STDIN/STDOUT syntax is subtly changed: if
      stdin/stdout is quoted, it's now interpreted as a filename. For example,
      "\copy foo from 'stdin'" now reads from a file called 'stdin', not from
      standard input. Before this, there was no way to specify a filename called
      stdin, stdout, pstdin or pstdout.
      
      This creates a new function in pgport, wait_result_to_str(), which can
      be used to convert the exit status of a process, as returned by wait(3),
      to a human-readable string.
      
      Etsuro Fujita, reviewed by Amit Kapila.
      3d009e45
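Turning a wait(3) status into a readable string can be sketched with the standard `<sys/wait.h>` macros. This is an illustrative stand-in, not the pgport wait_result_to_str() implementation: the function name `demo_wait_status_to_str`, its buffer-based signature, and the message wording are all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: describe a wait(3)-style status in `buf`.
 * The message wording is illustrative, not PostgreSQL's.
 */
static void demo_wait_status_to_str(int status, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (status == -1)
        snprintf(buf, len, "could not run child process");
    else if (WIFEXITED(status))
        snprintf(buf, len, "child exited with code %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "child terminated by signal %d", WTERMSIG(status));
    else
        snprintf(buf, len, "child ended with unrecognized status %d", status);
}
```

On POSIX systems, the value returned by system(3) or pclose(3) is a status of this kind, so it can be passed straight to such a helper.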
  21. 12 Feb, 2013 1 commit
  22. 23 Jan, 2013 1 commit
    • A
      Improve concurrency of foreign key locking · 0ac5ad51
      Alvaro Herrera committed
      This patch introduces two additional lock modes for tuples: "SELECT FOR
      KEY SHARE" and "SELECT FOR NO KEY UPDATE".  These don't block each
      other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
      FOR UPDATE".  UPDATE commands that do not modify the values stored in
      the columns that are part of the key of the tuple now grab a SELECT FOR
      NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
      with tuple locks of the FOR KEY SHARE variety.
      
      Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
      means the concurrency improvement applies to them, which is the whole
      point of this patch.
      
      The added tuple lock semantics require some rejiggering of the multixact
      module, so that the locking level that each transaction is holding can
      be stored alongside its Xid.  Also, multixacts now need to persist
      across server restarts and crashes, because they can now represent not
      only tuple locks, but also tuple updates.  This means we need more
      careful tracking of lifetime of pg_multixact SLRU files; since they now
      persist longer, we require more infrastructure to figure out when they
      can be removed.  pg_upgrade also needs to be careful to copy
      pg_multixact files over from the old server to the new, or at least part
      of multixact.c state, depending on the versions of the old and new
      servers.
      
      Tuple time qualification rules (HeapTupleSatisfies routines) need to be
      careful not to consider tuples with the "is multi" infomask bit set as
      being only locked; they might need to look up MultiXact values (i.e.
      possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
      whereas they previously were assured to only use information readily
      available from the tuple header.  This is considered acceptable, because
      the extra I/O would involve cases that would previously cause some
      commands to block waiting for concurrent transactions to finish.
      
      Another important change is the fact that locking tuples that have
      previously been updated causes the future versions to be marked as
      locked, too; this is essential for correctness of foreign key checks.
      This also causes additional WAL-logging (there was previously a single
      WAL record for a locked tuple; now there is one for each updated copy
      of the tuple).
      
      With all this in place, contention related to tuples being checked by
      foreign key rules should be much reduced.
      
      As a bonus, the old behavior that a subtransaction grabbing a stronger
      tuple lock than the parent (sub)transaction held on a given tuple and
      later aborting caused the weaker lock to be lost, has been fixed.
      
      Many new spec files were added for isolation tester framework, to ensure
      overall behavior is sane.  There's probably room for several more tests.
      
      There were several reviewers of this patch; in particular, Noah Misch
      and Andres Freund spent considerable time in it.  Original idea for the
      patch came from Simon Riggs, after a problem report by Joel Jacobson.
      Most code is from me, with contributions from Marti Raudsepp, Alexander
      Shulgin, Noah Misch and Andres Freund.
      
      This patch was discussed in several pgsql-hackers threads; the most
      important start at the following message-ids:
      	AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com
      	1290721684-sup-3951@alvh.no-ip.org
      	1294953201-sup-2099@alvh.no-ip.org
      	1320343602-sup-2290@alvh.no-ip.org
      	1339690386-sup-8927@alvh.no-ip.org
      	4FE5FF020200002500048A3D@gw.wicourts.gov
      	4FEAB90A0200002500048B7D@gw.wicourts.gov
      0ac5ad51
  23. 18 Jan, 2013 1 commit
    • A
      Accelerate end-of-transaction dropping of relations · 279628a0
      Alvaro Herrera committed
      When relations are dropped, at end of transaction we need to remove the
      files and clean the buffer pool of buffers containing pages of those
      relations.  Previously we would scan the buffer pool once per relation
      to clean up buffers.  When there are many relations to drop, the
      repeated scans make this process slow; so we now instead pass a list of
      relations to drop and scan the pool once, checking each buffer against
      the passed list.  When the number of relations is larger than a
      threshold (which as of this patch is being set to 20 relations) we sort
      the array before starting, and bsearch the array; when it's smaller, we
      simply scan the array linearly each time, because that's faster.  The
      exact optimal threshold value depends on many factors, but the
      difference is not likely to be significant enough to justify making it
      user-settable.
      
      This has been measured to be a significant win (a 15x win when dropping
      100,000 relations; an extreme case, but reportedly a real one).
      
      Author: Tomas Vondra, some tweaks by me
      Reviewed by: Robert Haas, Shigeru Hanada, Andres Freund, Álvaro Herrera
      279628a0
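The single-pass strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions: relations are reduced to a hypothetical integer `demo_relid`, buffers to an array of such tags, and "dropping" a buffer to counting a hit. Only the threshold of 20 comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_BSEARCH_THRESHOLD 20   /* value quoted in the commit message */

typedef unsigned int demo_relid;    /* hypothetical stand-in for a relation id */

static int demo_cmp(const void *a, const void *b)
{
    demo_relid x = *(const demo_relid *) a;
    demo_relid y = *(const demo_relid *) b;

    return (x > y) - (x < y);
}

/*
 * Scan the buffer tags once and count those belonging to any relation
 * in `rels`.  Large lists are sorted in place and binary-searched;
 * small lists are scanned linearly for each buffer.
 */
static size_t demo_count_matching(const demo_relid *buf_tags, size_t nbufs,
                                  demo_relid *rels, size_t nrels)
{
    bool   use_bsearch = nrels > DEMO_BSEARCH_THRESHOLD;
    size_t hits = 0;

    assert(buf_tags != NULL || nbufs == 0);
    if (use_bsearch)
        qsort(rels, nrels, sizeof(demo_relid), demo_cmp);

    for (size_t i = 0; i < nbufs; i++)
    {
        bool found = false;

        if (use_bsearch)
            found = bsearch(&buf_tags[i], rels, nrels,
                            sizeof(demo_relid), demo_cmp) != NULL;
        else
            for (size_t j = 0; j < nrels && !found; j++)
                found = (rels[j] == buf_tags[i]);

        if (found)
            hits++;             /* real code would invalidate the buffer here */
    }
    return hits;
}
```

The sort costs O(n log n) once, which only pays off when the list is long enough that a linear probe per buffer becomes the dominant cost.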
  24. 17 Jan, 2013 1 commit
    • H
      Make GiST indexes on-disk compatible with 9.2 again. · 9ee4d06f
      Heikki Linnakangas committed
      The patch that turned XLogRecPtr into a uint64 inadvertently changed the
      on-disk format of GiST indexes, because the NSN field in the GiST page
      opaque is an XLogRecPtr. That breaks pg_upgrade. Revert the format of that
      field back to the two-field struct that XLogRecPtr was before. This is the
      same thing we did to LSNs in the page header to avoid changing on-disk format.
      
      Bump catversion, as this invalidates any existing GiST indexes built on
      9.3devel.
      9ee4d06f
  25. 02 Jan, 2013 1 commit
  26. 12 Dec, 2012 1 commit
    • K
      Fix performance problems with autovacuum truncation in busy workloads. · b19e4250
      Kevin Grittner committed
      In situations where there are over 8MB of empty pages at the end of
      a table, the truncation work for trailing empty pages takes longer
      than deadlock_timeout, and there is frequent access to the table by
      processes other than autovacuum, there was a problem with the
      autovacuum worker process being canceled by the deadlock checking
      code. The truncation work done by autovacuum up to that point was
      lost, and the attempt was tried again by a later autovacuum worker. The
      attempts could continue indefinitely without making progress,
      consuming resources and blocking other processes for up to
      deadlock_timeout each time.
      
      This patch has the autovacuum worker checking whether it is
      blocking any other process at 20ms intervals. If such a condition
      develops, the autovacuum worker will persist the work it has done
      so far, release its lock on the table, and sleep in 50ms intervals
      for up to 5 seconds, hoping to be able to re-acquire the lock and
      try again. If it is unable to get the lock in that time, it moves
      on and a worker will try to continue later from the point this one
      left off.
      
      While this patch doesn't change the rules about when and what to
      truncate, it does cause the truncation to occur sooner, with less
      blocking, and with the consumption of fewer resources when there is
      contention for the table's lock.
      
      The only user-visible change other than improved performance is
      that the table size during truncation may change incrementally
      instead of just once.
      
      This problem exists in all supported versions but is infrequently
      reported, although some reports of performance problems when
      autovacuum runs might be caused by this. Initial commit is just the
      master branch, but this should probably be backpatched once the
      build farm and general developer usage confirm that there are no
      surprising effects.
      
      Jan Wieck
      b19e4250
  27. 07 Dec, 2012 1 commit
    • A
      Background worker processes · da07a1e8
      Alvaro Herrera committed
      Background workers are postmaster subprocesses that run arbitrary
      user-specified code.  They can request shared memory access as well as
      backend database connections; or they can just use plain libpq frontend
      database connections.
      
      Modules listed in shared_preload_libraries can register background
      workers in their _PG_init() function; this is early enough that it's not
      necessary to provide an extra GUC option, because the necessary extra
      resources can be allocated early on.  Modules can install more than one
      bgworker, if necessary.
      
      Care is taken that these extra processes do not interfere with other
      postmaster tasks: only one such process is started on each ServerLoop
      iteration.  This means a large number of them could be waiting to be
      started up and postmaster is still able to quickly service external
      connection requests.  Also, shutdown sequence should not be impacted by
      a worker process that's reasonably well behaved (i.e. promptly responds
      to termination signals.)
      
      The current implementation lets worker processes specify their start
      time, i.e. at what point in the server startup process they are to be
      started: right after postmaster start (in which case they mustn't ask
      for shared memory access), when consistent state has been reached
      (useful during recovery in a Hot Standby server), or when recovery has
      terminated (i.e. when normal backends are allowed).
      
      In case of a bgworker crash, actions to take depend on registration
      data: if shared memory was requested, then all other connections are
      taken down (as well as other bgworkers), just as if it were a regular
      backend crashing.  The bgworker itself is restarted, too, within a
      configurable timeframe (which can be configured to be never).
      
      More features to add to this framework can be imagined without much
      effort, and have been discussed, but this seems good enough as a useful
      unit already.
      
      An elementary sample module is supplied.
      
      Author: Álvaro Herrera
      
      This patch is loosely based on prior patches submitted by KaiGai Kohei,
      and unsubmitted code by Simon Riggs.
      
      Reviewed by: KaiGai Kohei, Markus Wanner, Andres Freund,
      Heikki Linnakangas, Simon Riggs, Amit Kapila
      da07a1e8
  28. 03 Dec, 2012 3 commits
    • S
      Refactor inCommit flag into generic delayChkpt flag. · f21bb9cf
      Simon Riggs committed
      Rename PGXACT->inCommit flag into delayChkpt flag,
      and generalise comments to allow use in other situations,
      such as the forthcoming potential use in checksum patch.
      Replace wait loop to look for VXIDs with delayChkpt set.
      No user-visible changes, no behaviour changes at present.
      
      Simon Riggs, reviewed and rebased by Jeff Davis
      f21bb9cf
    • T
      Don't advance checkPoint.nextXid near the end of a checkpoint sequence. · 3114cb60
      Tom Lane committed
      This reverts commit c1113069 in favor of
      actually fixing the problem: namely, that we should never have been
      modifying the checkpoint record's nextXid at this point to begin with.
      The nextXid should match the state as of the checkpoint's logical WAL
      position (ie the redo point), not the state as of its physical position.
      It's especially bogus to advance it in some wal_levels and not others.
      In any case there is no need for the checkpoint record to carry the
      same nextXid shown in the XLOG_RUNNING_XACTS record just emitted by
      LogStandbySnapshot, as any replay operation will already have adopted
      that value as current.
      
      This fixes bug #7710 from Tarvi Pillessaar, and probably also explains bug
      #6291 from Daniel Farina, in that if a checkpoint were in progress at the
      instant of XID wraparound, the epoch bump would be lost as reported.
      (And, of course, these days there's at least a 50-50 chance of a checkpoint
      being in progress at any given instant.)
      
      Diagnosed by me and independently by Andres Freund.  Back-patch to all
      branches supporting hot standby.
      3114cb60
    • S
      Rearrange storage of data in xl_running_xacts. · 5c117258
      Simon Riggs committed
      Previously we stored all xids mixed together.
      Now we store top-level xids first, followed
      by all subxids. Also skip logging any subxids
      if the snapshot is suboverflowed, since there
      are potentially large numbers of them and they
      are not useful in that case anyway. Has value
      in the envisaged design for decoding of WAL.
      No planned effect on Hot Standby.
      
      Andres Freund, reviewed by me
      5c117258
  29. 30 Nov, 2012 1 commit
  30. 27 Nov, 2012 1 commit
    • H
      Add OpenTransientFile, with automatic cleanup at end-of-xact. · 1f67078e
      Heikki Linnakangas committed
      Files opened with BasicOpenFile or PathNameOpenFile are not automatically
      cleaned up on error. That puts unnecessary burden on callers that only want
      to keep the file open for a short time. There is AllocateFile, but that
      returns a buffered FILE * stream, which in many cases is not the nicest API
      to work with. So add a function called OpenTransientFile, which returns an
      unbuffered fd that's cleaned up like the FILE* returned by AllocateFile().
      
      This plugs a few rare fd leaks in error cases:
      
      1. copy_file() - fixed by using OpenTransientFile instead of BasicOpenFile
      2. XLogFileInit() - fixed by adding close() calls to the error cases. Can't
         use OpenTransientFile here because the fd is supposed to persist over
         transaction boundaries.
      3. lo_import/lo_export - fixed by using OpenTransientFile instead of
         PathNameOpenFile.
      
      In addition to plugging those leaks, this replaces many BasicOpenFile() calls
      with OpenTransientFile() that were not leaking, because the code meticulously
      closed the file on error. That wasn't strictly necessary, but IMHO it's good
      for robustness.
      
      The same leaks exist in older versions, but given the rarity of the issues,
      I'm not backpatching this. Not yet, anyway - it might be good to backpatch
      later, after this mechanism has had some more testing in master branch.
      1f67078e
  31. 09 Nov, 2012 1 commit
    • T
      Fix WaitLatch() to return promptly when the requested timeout expires. · 3e7fdcff
      Tom Lane committed
      If the sleep is interrupted by a signal, we must recompute the remaining
      time to wait; otherwise, a steady stream of non-wait-terminating interrupts
      could delay return from WaitLatch indefinitely.  This has been shown to be
      a problem for the autovacuum launcher, and there may well be other places
      now or in the future with similar issues.  So we'd better make the function
      robust, even though this'll add at least one gettimeofday call per wait.
      
      Back-patch to 9.2.  We might eventually need to fix 9.1 as well, but the
      code is quite different there, and the usage of WaitLatch in 9.1 is so
      limited that it's not clearly important to do so.
      
      Reported and diagnosed by Jeff Janes, though I rewrote his patch rather
      heavily.
      3e7fdcff
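Recomputing the remaining time after an interrupted sleep can be sketched with poll(2) and a monotonic clock. This is a generic illustration, not the WaitLatch code: the `demo_*` names are hypothetical, and the commit message mentions gettimeofday whereas this sketch assumes clock_gettime(CLOCK_MONOTONIC) is available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>

/* Current monotonic time in milliseconds. */
static long demo_now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000L + ts.tv_nsec / 1000000L;
}

/*
 * Wait for events on `fds` for at most `timeout_ms`.  Returns poll()'s
 * result: >0 when descriptors are ready, 0 on timeout, -1 on an error
 * other than EINTR.
 */
static int demo_wait(struct pollfd *fds, nfds_t nfds, long timeout_ms)
{
    long deadline;
    long remaining = timeout_ms;

    assert(timeout_ms >= 0);
    deadline = demo_now_ms() + timeout_ms;

    for (;;)
    {
        int rc = poll(fds, nfds, (int) remaining);

        if (rc >= 0 || errno != EINTR)
            return rc;

        /* Interrupted by a signal: shrink the timeout instead of restarting it. */
        remaining = deadline - demo_now_ms();
        if (remaining <= 0)
            return 0;
    }
}
```

Without the recomputation, each signal would restart the full timeout, so a steady stream of signals could keep the caller asleep indefinitely.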
  32. 18 Oct, 2012 2 commits
    • T
      Close un-owned SMgrRelations at transaction end. · ff3f9c8d
      Tom Lane committed
      If an SMgrRelation is not "owned" by a relcache entry, don't allow it to
      live past transaction end.  This design allows the same SMgrRelation to be
      used for blind writes of multiple blocks during a transaction, but ensures
      that we don't hold onto such an SMgrRelation indefinitely.  Because an
      SMgrRelation typically corresponds to open file descriptors at the fd.c
      level, leaving it open when there's no corresponding relcache entry can
      mean that we prevent the kernel from reclaiming deleted disk space.
      (While CacheInvalidateSmgr messages usually fix that, there are cases
      where they're not issued, such as DROP DATABASE.  We might want to add
      some more sinval messaging for that, but I'd be inclined to keep this
      type of logic anyway, since allowing VFDs to accumulate indefinitely
      for blind-written relations doesn't seem like a good idea.)
      
      This code replaces a previous attempt towards the same goal that proved
      to be unreliable.  Back-patch to 9.1 where the previous patch was added.
      ff3f9c8d
    • T
      Revert "Use "transient" files for blind writes, take 2". · 9bacf0e3
      Tom Lane committed
      This reverts commit fba105b1.
      That approach had problems with the smgr-level state not tracking what
      we really want to happen, and with the VFD-level state not tracking the
      smgr-level state very well either.  In consequence, it was still possible
      to hold kernel file descriptors open for long-gone tables (as in recent
      report from Tore Halset), and yet there were also cases of FDs being closed
      undesirably soon.  A replacement implementation will follow.
      9bacf0e3
  33. 15 Oct, 2012 1 commit
    • T
      Split up process latch initialization for more-fail-soft behavior. · e81e8f93
      Tom Lane committed
      In the previous coding, new backend processes would attempt to create their
      self-pipe during the OwnLatch call in InitProcess.  However, pipe creation
      could fail if the kernel is short of resources; and the system does not
      recover gracefully from a FATAL error right there, since we have armed the
      dead-man switch for this process and not yet set up the on_shmem_exit
      callback that would disarm it.  The postmaster then forces an unnecessary
      database-wide crash and restart, as reported by Sean Chittenden.
      
      There are various ways we could rearrange the code to fix this, but the
      simplest and sanest seems to be to split out creation of the self-pipe into
      a new function InitializeLatchSupport, which must be called from a place
      where failure is allowed.  For most processes that gets called in
      InitProcess or InitAuxiliaryProcess, but processes that don't call either
      but still use latches need their own calls.
      
      Back-patch to 9.1, which has only a part of the latch logic that 9.2 and
      HEAD have, but nonetheless includes this bug.
      e81e8f93
  34. 10 Oct, 2012 1 commit
    • T
      Remove unnecessary overhead in backend's large-object operations. · 7e0cce02
      Tom Lane committed
      Do read/write permissions checks at most once per large object descriptor,
      not once per lo_read or lo_write call as before.  The repeated tests were
      quite useless in the read case since the snapshot-based tests were
      guaranteed to produce the same answer every time.  In the write case,
      the extra tests could in principle detect revocation of write privileges
      after a series of writes has started --- but there's a race condition there
      anyway, since we'd check privileges before performing and certainly before
      committing the write.  So there's no real advantage to checking every
      single time, and we might as well redefine it as "only check the first
      time".
      
      On the same reasoning, remove the LargeObjectExists checks in inv_write
      and inv_truncate.  We already checked existence when the descriptor was
      opened, and checking again doesn't provide any real increment of safety
      that would justify the cost.
      7e0cce02