1. 31 12月, 2009 1 次提交
    • T
      Revise pgstat's tracking of tuple changes to improve the reliability of · 48c192c1
      Tom Lane 提交于
      decisions about when to auto-analyze.
      
      The previous code depended on n_live_tuples + n_dead_tuples - last_anl_tuples,
      where all three of these numbers could be bad estimates from ANALYZE itself.
      Even worse, in the presence of a steady flow of HOT updates and matching
      HOT-tuple reclamations, auto-analyze might never trigger at all, even if all
      three numbers are exactly right, because n_dead_tuples could hold steady.
      
      To fix, replace last_anl_tuples with an accurately tracked count of the total
      number of committed tuple inserts + updates + deletes since the last ANALYZE
      on the table.  This can still be compared to the same threshold as before, but
      it's much more trustworthy than the old computation.  Tracking this requires
      one more intra-transaction counter per modified table within backends, but no
      additional memory space in the stats collector.  There probably isn't any
      measurable speed difference; if anything it might be a bit faster than before,
      since I was able to eliminate some per-tuple arithmetic operations in favor of
      adding sums once per (sub)transaction.
      
      Also, simplify the logic around pgstat vacuum and analyze reporting messages
      by not trying to fold VACUUM ANALYZE into a single pgstat message.
      
      The original thought behind this patch was to allow scheduling of analyzes
      on parent tables by artificially inflating their changes_since_analyze count.
      I've left that for a separate patch since this change seems to stand on its
      own merit.
      48c192c1
  2. 30 12月, 2009 1 次提交
  3. 19 12月, 2009 1 次提交
    • S
      Allow read only connections during recovery, known as Hot Standby. · efc16ea5
      Simon Riggs 提交于
      Enabled by recovery_connections = on (default) and forcing archive recovery using a recovery.conf. Recovery processing now emulates the original transactions as they are replayed, providing full locking and MVCC behaviour for read only queries. Recovery must enter consistent state before connections are allowed, so there is a delay, typically short, before connections succeed. Replay of recovering transactions can conflict and in some cases deadlock with queries during recovery; these result in query cancellation after max_standby_delay seconds have expired. Infrastructure changes have minor effects on normal running, though introduce four new types of WAL record.
      
      New test mode "make standbycheck" allows regression tests of static command behaviour on a standby server while in recovery. Typical and extreme dynamic behaviours have been checked via code inspection and manual testing. Few port specific behaviours have been utilised, though primary testing has been on Linux only so far.
      
      This commit is the basic patch. Additional changes will follow in this release to enhance some aspects of behaviour, notably improved handling of conflicts, deadlock detection and query cancellation. Changes to VACUUM FULL are also required.
      
      Simon Riggs, with significant and lengthy review by Heikki Linnakangas, including streamlined redesign of snapshot creation and two-phase commit.
      
      Important contributions from Florian Pflug, Mark Kirkwood, Merlin Moncure, Greg Stark, Gianni Ciolli, Gabriele Bartolini, Hannu Krosing, Robert Haas, Tatsuo Ishii, Hiroyuki Yamada plus support and feedback from many other community members.
      efc16ea5
  4. 10 12月, 2009 1 次提交
    • T
      Prevent indirect security attacks via changing session-local state within · 62aba765
      Tom Lane 提交于
      an allegedly immutable index function.  It was previously recognized that
      we had to prevent such a function from executing SET/RESET ROLE/SESSION
      AUTHORIZATION, or it could trivially obtain the privileges of the session
      user.  However, since there is in general no privilege checking for changes
      of session-local state, it is also possible for such a function to change
      settings in a way that might subvert later operations in the same session.
      Examples include changing search_path to cause an unexpected function to
      be called, or replacing an existing prepared statement with another one
      that will execute a function of the attacker's choosing.
      
      The present patch secures VACUUM, ANALYZE, and CREATE INDEX/REINDEX against
      these threats, which are the same places previously deemed to need protection
      against the SET ROLE issue.  GUC changes are still allowed, since there are
      many useful cases for that, but we prevent security problems by forcing a
      rollback of any GUC change after completing the operation.  Other cases are
      handled by throwing an error if any change is attempted; these include temp
      table creation, closing a cursor, and creating or deleting a prepared
      statement.  (In 7.4, the infrastructure to roll back GUC changes doesn't
      exist, so we settle for rejecting changes of "search_path" in these contexts.)
      
      Original report and patch by Gurjeet Singh, additional analysis by
      Tom Lane.
      
      Security: CVE-2009-4136
      62aba765
  5. 07 12月, 2009 1 次提交
  6. 17 11月, 2009 1 次提交
  7. 11 11月, 2009 1 次提交
    • A
      Fix longstanding problems in VACUUM caused by untimely interruptions · e7ec0222
      Alvaro Herrera 提交于
      In VACUUM FULL, an interrupt after the initial transaction has been recorded
      as committed can cause postmaster to restart with the following error message:
      PANIC: cannot abort transaction NNNN, it was already committed
      This problem has been reported many times.
      
      In lazy VACUUM, an interrupt after the table has been truncated by
      lazy_truncate_heap causes other backends' relcache to still point to the
      removed pages; this can cause future INSERT and UPDATE queries to error out
      with the following error message:
      could not read block XX of relation 1663/NNN/MMMM: read only 0 of 8192 bytes
      The window to this race condition is extremely narrow, but it has been seen in
      the wild involving a cancelled autovacuum process.
      
      The solution for both problems is to inhibit interrupts in both operations
      until after the respective transactions have been committed.  It's not a
      complete solution, because the transaction could theoretically be aborted by
      some other error, but at least fixes the most common causes of both problems.
      e7ec0222
  8. 26 10月, 2009 1 次提交
    • T
      Re-implement EvalPlanQual processing to improve its performance and eliminate · 9f2ee8f2
      Tom Lane 提交于
      a lot of strange behaviors that occurred in join cases.  We now identify the
      "current" row for every joined relation in UPDATE, DELETE, and SELECT FOR
      UPDATE/SHARE queries.  If an EvalPlanQual recheck is necessary, we jam the
      appropriate row into each scan node in the rechecking plan, forcing it to emit
      only that one row.  The former behavior could rescan the whole of each joined
      relation for each recheck, which was terrible for performance, and what's much
      worse could result in duplicated output tuples.
      
      Also, the original implementation of EvalPlanQual could not re-use the recheck
      execution tree --- it had to go through a full executor init and shutdown for
      every row to be tested.  To avoid this overhead, I've associated a special
      runtime Param with each LockRows or ModifyTable plan node, and arranged to
      make every scan node below such a node depend on that Param.  Thus, by
      signaling a change in that Param, the EPQ machinery can just rescan the
      already-built test plan.
      
      This patch also adds a prohibition on set-returning functions in the
      targetlist of SELECT FOR UPDATE/SHARE.  This is needed to avoid the
      duplicate-output-tuple problem.  It seems fairly reasonable since the
      other restrictions on SELECT FOR UPDATE are meant to ensure that there
      is a unique correspondence between source tuples and result tuples,
      which an output SRF destroys as much as anything else does.
      9f2ee8f2
  9. 01 9月, 2009 2 次提交
  10. 31 8月, 2009 1 次提交
    • T
      Track the current XID wrap limit (or more accurately, the oldest unfrozen · 25ec228e
      Tom Lane 提交于
      XID) in checkpoint records.  This eliminates the need to recompute the value
      from scratch during database startup, which is one of the two remaining
      reasons for the flatfile code to exist.  It should also simplify life for
      hot-standby operation.
      
      To avoid bloating the checkpoint records unreasonably, I switched from
      tracking the oldest database by name to tracking it by OID.  This turns
      out to save cycles in general (everywhere but the warning-generating
      paths, which we hardly care about) and also helps us deal with the case
      that the oldest database got dropped instead of being vacuumed.  The prior
      coding might go for a long time without updating the wrap limit in that case,
      which is bad because it might result in a lot of useless autovacuum activity.
      25ec228e
  11. 24 8月, 2009 1 次提交
    • T
      Fix a violation of WAL coding rules in the recent patch to include an · 7fc7a7c4
      Tom Lane 提交于
      "all tuples visible" flag in heap page headers.  The flag update *must*
      be applied before calling XLogInsert, but heap_update and the tuple
      moving routines in VACUUM FULL were ignoring this rule.  A crash and
      replay could therefore leave the flag incorrectly set, causing rows
      to appear visible in seqscans when they should not be.  This might explain
      recent reports of data corruption from Jeff Ross and others.
      
      In passing, do a bit of editorialization on comments in visibilitymap.c.
      7fc7a7c4
  12. 11 6月, 2009 1 次提交
  13. 07 6月, 2009 1 次提交
    • T
      Improve the IndexVacuumInfo/IndexBulkDeleteResult API to allow somewhat sane · 32ea2363
      Tom Lane 提交于
      behavior in cases where we don't know the heap tuple count accurately; in
      particular partial vacuum, but this also makes the API a bit more useful
      for ANALYZE.  This patch adds "estimated_count" flags to both structs so
      that an approximate count can be flagged as such, and adjusts the logic
      so that approximate counts are not used for updating pg_class.reltuples.
      
      This fixes my previous complaint that VACUUM was putting ridiculous values
      into pg_class.reltuples for indexes.  The actual impact of that bug is
      limited, because the planner only pays attention to reltuples for an index
      if the index is partial; which probably explains why beta testers hadn't
      noticed a degradation in plan quality from it.  But it needs to be fixed.
      
      The whole thing is a bit messy and should be redesigned in future, because
      reltuples now has the potential to drift quite far away from reality when
      a long period elapses with no non-partial vacuums.  But this is as good as
      it's going to get for 8.4.
      32ea2363
  14. 01 4月, 2009 1 次提交
    • T
      Modify the relcache to record the temp status of both local and nonlocal · 948d6ec9
      Tom Lane 提交于
      temp relations; this is no more expensive than before, now that we have
      pg_class.relistemp.  Insert tests into bufmgr.c to prevent attempting
      to fetch pages from nonlocal temp relations.  This provides a low-level
      defense against bugs-of-omission allowing temp pages to be loaded into shared
      buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop.
      While at it, tweak a bunch of places to use new relcache tests (instead of
      expensive probes into pg_namespace) to detect local or nonlocal temp tables.
      948d6ec9
  15. 25 3月, 2009 1 次提交
    • T
      Implement "fastupdate" support for GIN indexes, in which we try to accumulate · ff301d6e
      Tom Lane 提交于
      multiple index entries in a holding area before adding them to the main index
      structure.  This helps because bulk insert is (usually) significantly faster
      than retail insert for GIN.
      
      This patch also removes GIN support for amgettuple-style index scans.  The
      API defined for amgettuple is difficult to support with fastupdate, and
      the previously committed partial-match feature didn't really work with
      it either.  We might eventually figure a way to put back amgettuple
      support, but it won't happen for 8.4.
      
      catversion bumped because of change in GIN's pg_am entry, and because
      the format of GIN indexes changed on-disk (there's a metapage now,
      and possibly a pending list).
      
      Teodor Sigaev
      ff301d6e
  16. 16 1月, 2009 1 次提交
  17. 02 1月, 2009 1 次提交
  18. 17 12月, 2008 1 次提交
    • H
      Don't reset pg_class.reltuples and relpages in VACUUM, if any pages were · dcf84099
      Heikki Linnakangas 提交于
      skipped. We could update relpages anyway, but it seems better to only
      update it together with reltuples, because we use the reltuples/relpages
      ratio in the planner. Also don't update n_live_tuples in pgstat.
      
      ANALYZE in VACUUM ANALYZE now needs to update pg_class, if the
      VACUUM-phase didn't do so. Added some boolean-passing to let analyze_rel
      know if it should update pg_class or not.
      
      I also moved the relcache invalidation (to update rd_targblock) from
      vac_update_relstats to where RelationTruncate is called, because
      vac_update_relstats is not called for partial vacuums anymore. It's more
      obvious to send the invalidation close to the truncation that requires it.
      
      Per report by Ned T. Crigler.
      dcf84099
  19. 03 12月, 2008 1 次提交
    • H
      Introduce visibility map. The visibility map is a bitmap with one bit per · 608195a3
      Heikki Linnakangas 提交于
      heap page, where a set bit indicates that all tuples on the page are
      visible to all transactions, and the page therefore doesn't need
      vacuuming. It is stored in a new relation fork.
      
      Lazy vacuum uses the visibility map to skip pages that don't need
      vacuuming. Vacuum is also responsible for setting the bits in the map.
      In the future, this can hopefully be used to implement index-only-scans,
      but we can't currently guarantee that the visibility map is always 100%
      up-to-date.
      
      In addition to the visibility map, there's a new PD_ALL_VISIBLE flag on
      each heap page, also indicating that all tuples on the page are visible to
      all transactions. It's important that this flag is kept up-to-date. It
      is also used to skip visibility tests in sequential scans, which gives a
      small performance gain on seqscans.
      608195a3
  20. 19 11月, 2008 1 次提交
    • H
      Rethink the way FSM truncation works. Instead of WAL-logging FSM · 33960006
      Heikki Linnakangas 提交于
      truncations in FSM code, call FreeSpaceMapTruncateRel from smgr_redo. To
      make that cleaner from modularity point of view, move the WAL-logging one
      level up to RelationTruncate, and move RelationTruncate and all the
      related WAL-logging to new src/backend/catalog/storage.c file. Introduce
      new RelationCreateStorage and RelationDropStorage functions that are used
      instead of calling smgrcreate/smgrscheduleunlink directly. Move the
      pending rel deletion stuff from smgrcreate/smgrscheduleunlink to the new
      functions. This leaves smgr.c as a thin wrapper around md.c; all the
      transactional stuff is now in storage.c.
      
      This will make it easier to add new forks with similar truncation logic,
      like the visibility map.
      33960006
  21. 10 11月, 2008 1 次提交
  22. 31 10月, 2008 1 次提交
    • H
      Unite ReadBufferWithFork, ReadBufferWithStrategy, and ZeroOrReadBuffer · 19c8dc83
      Heikki Linnakangas 提交于
      functions into one ReadBufferExtended function, that takes the strategy
      and mode as argument. There's three modes, RBM_NORMAL which is the default
      used by plain ReadBuffer(), RBM_ZERO, which replaces ZeroOrReadBuffer, and
      a new mode RBM_ZERO_ON_ERROR, which allows callers to read corrupt pages
      without throwing an error. The FSM needs the new mode to recover from
      corrupt pages, which could happend if we crash after extending an FSM file,
      and the new page is "torn".
      
      Add fork number to some error messages in bufmgr.c, that still lacked it.
      19c8dc83
  23. 30 9月, 2008 1 次提交
    • H
      Rewrite the FSM. Instead of relying on a fixed-size shared memory segment, the · 15c121b3
      Heikki Linnakangas 提交于
      free space information is stored in a dedicated FSM relation fork, with each
      relation (except for hash indexes; they don't use FSM).
      
      This eliminates the max_fsm_relations and max_fsm_pages GUC options; remove any
      trace of them from the backend, initdb, and documentation.
      
      Rewrite contrib/pg_freespacemap to match the new FSM implementation. Also
      introduce a new variant of the get_raw_page(regclass, int4, int4) function in
      contrib/pageinspect that let's you to return pages from any relation fork, and
      a new fsm_page_contents() function to inspect the new FSM pages.
      15c121b3
  24. 11 9月, 2008 1 次提交
    • A
      Initialize the minimum frozen Xid in vac_update_datfrozenxid using · d53a5668
      Alvaro Herrera 提交于
      GetOldestXmin() instead of RecentGlobalXmin; this is safer because we do not
      depend on the latter being correctly set elsewhere, and while it is more
      expensive, this code path is not performance-critical.  This is a real
      risk for autovacuum, because it can execute whole cycles without doing
      a single vacuum, which would mean that RecentGlobalXmin would stay at its
      initialization value, FirstNormalTransactionId, causing a bogus value to be
      inserted in pg_database.  This bug could explain some recent reports of
      failure to truncate pg_clog.
      
      At the same time, change the initialization of RecentGlobalXmin to
      InvalidTransactionId, and ensure that it's set to something else whenever
      it's going to be used.  Using it as FirstNormalTransactionId in HOT page
      pruning could incur in data loss.  InitPostgres takes care of setting it
      to a valid value, but the extra checks are there to prevent "special"
      backends from behaving in unusual ways.
      
      Per Tom Lane's detailed problem dissection in 29544.1221061979@sss.pgh.pa.us
      d53a5668
  25. 13 8月, 2008 1 次提交
  26. 05 6月, 2008 1 次提交
  27. 15 5月, 2008 1 次提交
  28. 13 5月, 2008 1 次提交
    • A
      Improve snapshot manager by keeping explicit track of snapshots. · 5da9da71
      Alvaro Herrera 提交于
      There are two ways to track a snapshot: there's the "registered" list, which
      is used for arbitrary long-lived snapshots; and there's the "active stack",
      which is used for the snapshot that is considered "active" at any time.
      This also allows users of snapshots to stop worrying about snapshot memory
      allocation and freeing, and about using PG_TRY blocks around ActiveSnapshot
      assignment.  This is all done automatically now.
      
      As a consequence, this allows us to reset MyProc->xmin when there are no
      more snapshots registered in the current backend, reducing the impact that
      long-running transactions have on VACUUM.
      5da9da71
  29. 12 5月, 2008 1 次提交
    • A
      Restructure some header files a bit, in particular heapam.h, by removing some · f8c4d7db
      Alvaro Herrera 提交于
      unnecessary #include lines in it.  Also, move some tuple routine prototypes and
      macros to htup.h, which allows removal of heapam.h inclusion from some .c
      files.
      
      For this to work, a new header file access/sysattr.h needed to be created,
      initially containing attribute numbers of system columns, for pg_dump usage.
      
      While at it, make contrib ltree, intarray and hstore header files more
      consistent with our header style.
      f8c4d7db
  30. 27 3月, 2008 3 次提交
  31. 19 3月, 2008 1 次提交
  32. 15 3月, 2008 1 次提交
  33. 10 3月, 2008 1 次提交
  34. 20 2月, 2008 1 次提交
  35. 12 2月, 2008 1 次提交
    • T
      Repair VACUUM FULL bug introduced by HOT patch: the original way of · c931c071
      Tom Lane 提交于
      calculating a page's initial free space was fine, and should not have been
      "improved" by letting PageGetHeapFreeSpace do it.  VACUUM FULL is going to
      reclaim LP_DEAD line pointers later, so there is no need for a guard
      against the page being too full of line pointers, and having one risks
      rejecting pages that are perfectly good move destinations.
      
      This also exposed a second bug, which is that the empty_end_pages logic
      assumed that any page with no live tuples would get entered into the
      fraged_pages list automatically (by virtue of having more free space than
      the threshold in the do_frag calculation).  This assumption certainly
      seems risky when a low fillfactor has been chosen, and even without
      tunable fillfactor I think it could conceivably fail on a page with many
      unused line pointers.  So fix the code to force do_frag true when notup
      is true, and patch this part of the fix all the way back.
      
      Per report from Tomas Szepe.
      c931c071
  36. 04 1月, 2008 1 次提交
    • T
      Make standard maintenance operations (including VACUUM, ANALYZE, REINDEX, · eedb068c
      Tom Lane 提交于
      and CLUSTER) execute as the table owner rather than the calling user, using
      the same privilege-switching mechanism already used for SECURITY DEFINER
      functions.  The purpose of this change is to ensure that user-defined
      functions used in index definitions cannot acquire the privileges of a
      superuser account that is performing routine maintenance.  While a function
      used in an index is supposed to be IMMUTABLE and thus not able to do anything
      very interesting, there are several easy ways around that restriction; and
      even if we could plug them all, there would remain a risk of reading sensitive
      information and broadcasting it through a covert channel such as CPU usage.
      
      To prevent bypassing this security measure, execution of SET SESSION
      AUTHORIZATION and SET ROLE is now forbidden within a SECURITY DEFINER context.
      
      Thanks to Itagaki Takahiro for reporting this vulnerability.
      
      Security: CVE-2007-6600
      eedb068c
  37. 02 1月, 2008 1 次提交