1. 10 2月, 2010 1 次提交
    • T
      Fix up rickety handling of relation-truncation interlocks. · cbe9d6be
      Tom Lane 提交于
      Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
      relation entries, so that they will get reset to InvalidBlockNumber whenever
      an smgr-level flush happens.  Because we now send smgr invalidation messages
      immediately (not at end of transaction) when a relation truncation occurs,
      this ensures that other backends will reset their values before they next
      access the relation.  We no longer need the unreliable assumption that a
      VACUUM that's doing a truncation will hold its AccessExclusive lock until
      commit --- in fact, we can intentionally release that lock as soon as we've
      completed the truncation.  This patch therefore reverts (most of) Alvaro's
      patch of 2009-11-10, as well as my marginal hacking on it yesterday.  We can
      also get rid of assorted no-longer-needed relcache flushes, which are far more
      expensive than an smgr flush because they kill a lot more state.
      
      In passing this patch fixes smgr_redo's failure to perform visibility-map
      truncation, and cleans up some rather dubious assumptions in freespace.c and
      visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
      date.
      cbe9d6be
  2. 09 2月, 2010 1 次提交
  3. 08 2月, 2010 2 次提交
    • T
      Remove old-style VACUUM FULL (which was known for a little while as · 0a469c87
      Tom Lane 提交于
      VACUUM FULL INPLACE), along with a boatload of subsidiary code and complexity.
      Per discussion, the use case for this method of vacuuming is no longer large
      enough to justify maintaining it; not to mention that we don't wish to invest
      the work that would be needed to make it play nicely with Hot Standby.
      
      Aside from the code directly related to old-style VACUUM FULL, this commit
      removes support for certain WAL record types that could only be generated
      within VACUUM FULL, redirect-pointer removal in heap_page_prune, and
      nontransactional generation of cache invalidation sinval messages (the last
      being the sticking point for Hot Standby).
      
      We still have to retain all code that copes with finding HEAP_MOVED_OFF and
      HEAP_MOVED_IN flag bits on existing tuples.  This can't be removed as long
      as we want to support in-place update from pre-9.0 databases.
      0a469c87
    • T
      Create a "relation mapping" infrastructure to support changing the relfilenodes · b9b8831a
      Tom Lane 提交于
      of shared or nailed system catalogs.  This has two key benefits:
      
      * The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
      
      * We no longer have to use an unsafe reindex-in-place approach for reindexing
        shared catalogs.
      
      CLUSTER on nailed catalogs now works too, although I left it disabled on
      shared catalogs because the resulting pg_index.indisclustered update would
      only be visible in one database.
      
      Since reindexing shared system catalogs is now fully transactional and
      crash-safe, the former special cases in REINDEX behavior have been removed;
      shared catalogs are treated the same as non-shared.
      
      This commit does not do anything about the recently-discussed problem of
      deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
      concurrent queries; will address that in a separate patch.  As a stopgap,
      parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
      such failures during the regression tests.
      b9b8831a
  4. 06 1月, 2010 1 次提交
    • I
      Support rewritten-based full vacuum as VACUUM FULL. Traditional · 946cf229
      Itagaki Takahiro 提交于
      VACUUM FULL was renamed to VACUUM FULL INPLACE. Also added a new
      option -i, --inplace for vacuumdb to perform FULL INPLACE vacuuming.
      
      Since the new VACUUM FULL uses CLUSTER infrastructure, we cannot
      use it for system tables. VACUUM FULL for system tables always
      fall back into VACUUM FULL INPLACE silently.
      
      Itagaki Takahiro, reviewed by Jeff Davis and Simon Riggs.
      946cf229
  5. 03 1月, 2010 1 次提交
  6. 31 12月, 2009 1 次提交
    • T
      Revise pgstat's tracking of tuple changes to improve the reliability of · 48c192c1
      Tom Lane 提交于
      decisions about when to auto-analyze.
      
      The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
      where all three of these numbers could be bad estimates from ANALYZE itself.
      Even worse, in the presence of a steady flow of HOT updates and matching
      HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
      three numbers are exactly right, because n_dead_tuples could hold steady.
      
      To fix, replace last_anl_tuples with an accurately tracked count of the total
      number of committed tuple inserts + updates + deletes since the last ANALYZE
      on the table.  This can still be compared to the same threshold as before, but
      it's much more trustworthy than the old computation.  Tracking this requires
      one more intra-transaction counter per modified table within backends, but no
      additional memory space in the stats collector.  There probably isn't any
      measurable speed difference; if anything it might be a bit faster than before,
      since I was able to eliminate some per-tuple arithmetic operations in favor of
      adding sums once per (sub)transaction.
      
      Also, simplify the logic around pgstat vacuum and analyze reporting messages
      by not trying to fold VACUUM ANALYZE into a single pgstat message.
      
      The original thought behind this patch was to allow scheduling of analyzes
      on parent tables by artificially inflating their changes_since_analyze count.
      I've left that for a separate patch since this change seems to stand on its
      own merit.
      48c192c1
  7. 30 12月, 2009 1 次提交
  8. 19 12月, 2009 1 次提交
    • S
      Allow read only connections during recovery, known as Hot Standby. · efc16ea5
      Simon Riggs 提交于
      Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
      
      New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
      
      This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
      
      Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
      
      Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
      efc16ea5
  9. 10 12月, 2009 1 次提交
    • T
      Prevent indirect security attacks via changing session-local state within · 62aba765
      Tom Lane 提交于
      an allegedly immutable index function.  It was previously recognized that
      we had to prevent such a function from executing SET/RESET ROLE/SESSION
      AUTHORIZATION, or it could trivially obtain the privileges of the session
      user.  However, since there is in general no privilege checking for changes
      of session-local state, it is also possible for such a function to change
      settings in a way that might subvert later operations in the same session.
      Examples include changing search_path to cause an unexpected function to
      be called, or replacing an existing prepared statement with another one
      that will execute a function of the attacker's choosing.
      
      The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
      these threats, which are the same places previously deemed to need protection
      against the SET ROLE issue.  GUC changes are still allowed, since there are
      many useful cases for that, but we prevent security problems by forcing a
      rollback of any GUC change after completing the operation.  Other cases are
      handled by throwing an error if any change is attempted; these include temp
      table creation, closing a cursor, and creating or deleting a prepared
      statement.  (In 7.4, the infrastructure to roll back GUC changes doesn't
      exist, so we settle for rejecting changes of "search_path" in these contexts.)
      
      Original report and patch by Gurjeet Singh, additional analysis by
      Tom Lane.
      
      Security: CVE-2009-4136
      62aba765
  10. 07 12月, 2009 1 次提交
  11. 17 11月, 2009 1 次提交
  12. 11 11月, 2009 1 次提交
    • A
      Fix longstanding problems in VACUUM caused by untimely interruptions · e7ec0222
      Alvaro Herrera 提交于
      In VACUUM FULL, an interrupt after the initial transaction has been recorded
      as committed can cause postmaster to restart with the following error message:
      PANIC: cannot abort transaction NNNN, it was already committed
      This problem has been reported many times.
      
      In lazy VACUUM, an interrupt after the table has been truncated by
      lazy_truncate_heap causes other backends' relcache to still point to the
      removed pages; this can cause future INSERT and UPDATE queries to error out
      with the following error message:
      could not read block XX of relation 1663/NNN/MMMM: read only 0 of 8192 bytes
      The window to this race condition is extremely narrow, but it has been seen in
      the wild involving a cancelled autovacuum process.
      
      The solution for both problems is to inhibit interrupts in both operations
      until after the respective transactions have been committed.  It's not a
      complete solution, because the transaction could theoretically be aborted by
      some other error, but at least fixes the most common causes of both problems.
      e7ec0222
  13. 26 10月, 2009 1 次提交
    • T
      Re-implement EvalPlanQual processing to improve its performance and eliminate · 9f2ee8f2
      Tom Lane 提交于
      a lot of strange behaviors that occurred in join cases.  We now identify the
      "current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
      UPDATE/SHARE queries.  If an EvalPlanQual recheck is necessary, we jam the
      appropriate row into each scan node in the rechecking plan, forcing it to emit
      only that one row.  The former behavior could rescan the whole of each joined
      relation for each recheck, which was terrible for performance, and what's much
      worse could result in duplicated output tuples.
      
      Also, the original implementation of EvalPlanQual could not re-use the recheck
      execution tree --- it had to go through a full executor init and shutdown for
      every row to be tested.  To avoid this overhead, I've associated a special
      runtime Param with each LockRows or ModifyTable plan node, and arranged to
      make every scan node below such a node depend on that Param.  Thus, by
      signaling a change in that Param, the EPQ machinery can just rescan the
      already-built test plan.
      
      This patch also adds a prohibition on set-returning functions in the
      targetlist of SELECT FOR UPDATE/SHARE.  This is needed to avoid the
      duplicate-output-tuple problem.  It seems fairly reasonable since the
      other restrictions on SELECT FOR UPDATE are meant to ensure that there
      is a unique correspondence between source tuples and result tuples,
      which an output SRF destroys as much as anything else does.
      9f2ee8f2
  14. 01 9月, 2009 2 次提交
  15. 31 8月, 2009 1 次提交
    • T
      Track the current XID wrap limit (or more accurately, the oldest unfrozen · 25ec228e
      Tom Lane 提交于
      XID) in checkpoint records.  This eliminates the need to recompute the value
      from scratch during database startup, which is one of the two remaining
      reasons for the flatfile code to exist.  It should also simplify life for
      hot-standby operation.
      
      To avoid bloating the checkpoint records unreasonably, I switched from
      tracking the oldest database by name to tracking it by OID.  This turns
      out to save cycles in general (everywhere but the warning-generating
      paths, which we hardly care about) and also helps us deal with the case
      that the oldest database got dropped instead of being vacuumed.  The prior
      coding might go for a long time without updating the wrap limit in that case,
      which is bad because it might result in a lot of useless autovacuum activity.
      25ec228e
  16. 24 8月, 2009 1 次提交
    • T
      Fix a violation of WAL coding rules in the recent patch to include an · 7fc7a7c4
      Tom Lane 提交于
      "all tuples visible" flag in heap page headers.  The flag update *must*
      be applied before calling XLogInsert, but heap_update and the tuple
      moving routines in VACUUM FULL were ignoring this rule.  A crash and
      replay could therefore leave the flag incorrectly set, causing rows
      to appear visible in seqscans when they should not be.  This might explain
      recent reports of data corruption from Jeff Ross and others.
      
      In passing, do a bit of editorialization on comments in visibilitymap.c.
      7fc7a7c4
  17. 11 6月, 2009 1 次提交
  18. 07 6月, 2009 1 次提交
    • T
      Improve the IndexVacuumInfo/IndexBulkDeleteResult API to allow somewhat sane · 32ea2363
      Tom Lane 提交于
      behavior in cases where we don't know the heap tuple count accurately; in
      particular partial vacuum, but this also makes the API a bit more useful
      for ANALYZE.  This patch adds "estimated_count" flags to both structs so
      that an approximate count can be flagged as such, and adjusts the logic
      so that approximate counts are not used for updating pg_class.reltuples.
      
      This fixes my previous complaint that VACUUM was putting ridiculous values
      into pg_class.reltuples for indexes.  The actual impact of that bug is
      limited, because the planner only pays attention to reltuples for an index
      if the index is partial; which probably explains why beta testers hadn't
      noticed a degradation in plan quality from it.  But it needs to be fixed.
      
      The whole thing is a bit messy and should be redesigned in future, because
      reltuples now has the potential to drift quite far away from reality when
      a long period elapses with no non-partial vacuums.  But this is as good as
      it's going to get for 8.4.
      32ea2363
  19. 01 4月, 2009 1 次提交
    • T
      Modify the relcache to record the temp status of both local and nonlocal · 948d6ec9
      Tom Lane 提交于
      temp relations; this is no more expensive than before, now that we have
      pg_class.relistemp.  Insert tests into bufmgr.c to prevent attempting
      to fetch pages from nonlocal temp relations.  This provides a low-level
      defense against bugs-of-omission allowing temp pages to be loaded into shared
      buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
      While at it, tweak a bunch of places to use new relcache tests (instead of
      expensive probes into pg_namespace) to detect local or nonlocal temp tables.
      948d6ec9
  20. 25 3月, 2009 1 次提交
    • T
      Implement "fastupdate" support for GIN indexes, in which we try to accumulate · ff301d6e
      Tom Lane 提交于
      multiple index entries in a holding area before adding them to the main index
      structure.  This helps because bulk insert is (usually) significantly faster
      than retail insert for GIN.
      
      This patch also removes GIN support for amgettuple-style index scans.  The
      API defined for amgettuple is difficult to support with fastupdate, and
      the previously committed partial-match feature didn't really work with
      it either.  We might eventually figure a way to put back amgettuple
      support, but it won't happen for 8.4.
      
      catversion bumped because of change in GIN's pg_am entry, and because
      the format of GIN indexes changed on-disk (there's a metapage now,
      and possibly a pending list).
      
      Teodor Sigaev
      ff301d6e
  21. 16 1月, 2009 1 次提交
  22. 02 1月, 2009 1 次提交
  23. 17 12月, 2008 1 次提交
    • H
      Don't reset pg_class.reltuples and relpages in VACUUM, if any pages were · dcf84099
      Heikki Linnakangas 提交于
      skipped. We could update relpages anyway, but it seems better to only
      update it together with reltuples, because we use the reltuples/relpages
      ratio in the planner. Also don't update n_live_tuples in pgstat.
      
      ANALYZE in VACUUM ANALYZE now needs to update pg_class, if the
      VACUUM-phase didn't do so. Added some boolean-passing to let analyze_rel
      know if it should update pg_class or not.
      
      I also moved the relcache invalidation (to update rd_targblock) from
      vac_update_relstats to where RelationTruncate is called, because
      vac_update_relstats is not called for partial vacuums anymore. It's more
      obvious to send the invalidation close to the truncation that requires it.
      
      Per report by Ned T. Crigler.
      dcf84099
  24. 03 12月, 2008 1 次提交
    • H
      Introduce visibility map. The visibility map is a bitmap with one bit per · 608195a3
      Heikki Linnakangas 提交于
      heap page, where a set bit indicates that all tuples on the page are
      visible to all transactions, and the page therefore doesn't need
      vacuuming. It is stored in a new relation fork.
      
      Lazy vacuum uses the visibility map to skip pages that don't need
      vacuuming. Vacuum is also responsible for setting the bits in the map.
      In the future, this can hopefully be used to implement index-only-scans,
      but we can't currently guarantee that the visibility map is always 100%
      up-to-date.
      
      In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
      each heap page, also indicating that all tuples on the page are visible to
      all transactions. It's important that this flag is kept up-to-date. It
      is also used to skip visibility tests in sequential scans, which gives a
      small performance gain on seqscans.
      608195a3
  25. 19 11月, 2008 1 次提交
    • H
      Rethink the way FSM truncation works. Instead of WAL-logging FSM · 33960006
      Heikki Linnakangas 提交于
      truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
      make that cleaner from modularity point of view, move the WAL-logging one
      level up to RelationTruncate, and move RelationTruncate and all the
      related WAL-logging to new src/backend/catalog/storage.c file. Introduce
      new RelationCreateStorage and RelationDropStorage functions that are used
      instead of calling smgrcreate/smgrscheduleunlink directly. Move the
      pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
      functions. This leaves smgr.c as a thin wrapper around md.c; all the
      transactional stuff is now in storage.c.
      
      This will make it easier to add new forks with similar truncation logic,
      like the visibility map.
      33960006
  26. 10 11月, 2008 1 次提交
  27. 31 10月, 2008 1 次提交
    • H
      Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer · 19c8dc83
      Heikki Linnakangas 提交于
      functions into one ReadBufferExtended function, that takes the strategy
      and mode as argument. There's three modes, RBM_NORMAL which is the default
      used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
      a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
      without throwing an error. The FSM needs the new mode to recover from
      corrupt pages, which could happend if we crash after extending an FSM file,
      and the new page is "torn".
      
      Add fork number to some error messages in bufmgr.c, that still lacked it.
      19c8dc83
  28. 30 9月, 2008 1 次提交
    • H
      Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the · 15c121b3
      Heikki Linnakangas 提交于
      free space information is stored in a dedicated FSM relation fork, with each
      relation (except for hash indexes; they don't use FSM).
      
      This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
      trace of them from the backend, initdb, and documentation.
      
      Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
      introduce a new variant of the get_raw_page(regclass, int4, int4) function in
      contrib/pageinspect that let's you to return pages from any relation fork, and
      a new fsm_page_contents() function to inspect the new FSM pages.
      15c121b3
  29. 11 9月, 2008 1 次提交
    • A
      Initialize the minimum frozen Xid in vac_update_datfrozenxid using · d53a5668
      Alvaro Herrera 提交于
      GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not
      depend on the latter being correctly set elsewhere, and while it is more
      expensive, this code path is not performance-critical.  This is a real
      risk for autovacuum, because it can execute whole cycles without doing
      a single vacuum, which would mean that RecentGlobalXmin would stay at its
      initialization value, FirstNormalTransactionId, causing a bogus value to be
      inserted in pg_database.  This bug could explain some recent reports of
      failure to truncate pg_clog.
      
      At the same time, change the initialization of RecentGlobalXmin to
      InvalidTransactionId, and ensure that it's set to something else whenever
      it's going to be used.  Using it as FirstNormalTransactionId in HOT page
      pruning could incur in data loss.  InitPostgres takes care of setting it
      to a valid value, but the extra checks are there to prevent "special"
      backends from behaving in unusual ways.
      
      Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us
      d53a5668
  30. 13 8月, 2008 1 次提交
  31. 05 6月, 2008 1 次提交
  32. 15 5月, 2008 1 次提交
  33. 13 5月, 2008 1 次提交
    • A
      Improve snapshot manager by keeping explicit track of snapshots. · 5da9da71
      Alvaro Herrera 提交于
      There are two ways to track a snapshot: there's the "registered" list, which
      is used for arbitrary long-lived snapshots; and there's the "active stack",
      which is used for the snapshot that is considered "active" at any time.
      This also allows users of snapshots to stop worrying about snapshot memory
      allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
      assignment.  This is all done automatically now.
      
      As a consequence, this allows us to reset MyProc->xmin when there are no
      more snapshots registered in the current backend, reducing the impact that
      long-running transactions have on VACUUM.
      5da9da71
  34. 12 5月, 2008 1 次提交
    • A
      Restructure some header files a bit, in particular heapam.h, by removing some · f8c4d7db
      Alvaro Herrera 提交于
      unnecessary #include lines in it.  Also, move some tuple routine prototypes and
      macros to htup.h, which allows removal of heapam.h inclusion from some .c
      files.
      
      For this to work, a new header file access/sysattr.h needed to be created,
      initially containing attribute numbers of system columns, for pg_dump usage.
      
      While at it, make contrib ltree, intarray and hstore header files more
      consistent with our header style.
      f8c4d7db
  35. 27 3月, 2008 3 次提交
  36. 19 3月, 2008 1 次提交