  1. 02 Jun 2007, 2 commits
  2. 01 Jun 2007, 2 commits
    • Tom Lane · 1f559b7d
      Fix several hash functions that were taking chintzy shortcuts instead of
      delivering a well-randomized hash value.  I got religion on this after
      observing that performance of multi-batch hash join degrades terribly if the
      higher-order bits of hash values aren't random, as indeed was true for say
      hashes of small integer values.  It's now expected and documented that hash
      functions should use hash_any or some comparable method to ensure that all
      bits of their output are about equally random.
      
      initdb forced because this change invalidates existing hash indexes.  For the
      same reason, this isn't back-patchable; the hash join performance problem
      will get a band-aid fix in the back branches.
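      
      The point about well-randomized high-order bits can be seen with a tiny standalone C sketch
      (illustrative only; this is not PostgreSQL's hash_any): a shortcut hash of small integers leaves
      the high bits constant, while a generic avalanche finalizer spreads entropy across all 32 bits,
      which is what multi-batch hash join needs when it splits on those bits.
      
          #include <stdint.h>
          #include <stdio.h>
          
          /* Shortcut "hash": the value itself.  For small integers the high-order
           * bits are always zero, so bucketing on those bits degenerates. */
          static uint32_t shortcut_hash(uint32_t v) { return v; }
          
          /* A generic 32-bit avalanche finalizer (Murmur3-style), shown only to
           * illustrate "make every output bit depend on every input bit"; it is
           * not PostgreSQL's hash_any. */
          static uint32_t mixed_hash(uint32_t v)
          {
              v ^= v >> 16;
              v *= 0x85ebca6bu;
              v ^= v >> 13;
              v *= 0xc2b2ae35u;
              v ^= v >> 16;
              return v;
          }
          
          int main(void)
          {
              for (uint32_t v = 1; v <= 4; v++)
                  printf("value %u: shortcut high byte = 0x%02x, mixed high byte = 0x%02x\n",
                         (unsigned) v,
                         (unsigned) (shortcut_hash(v) >> 24),
                         (unsigned) (mixed_hash(v) >> 24));
              return 0;
          }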
    • Tom Lane · 10f719af
      Change build_index_pathkeys() so that the expressions it builds to represent
      index key columns always have the type expected by the index's associated
      operators, ie, we add RelabelType nodes when dealing with binary-compatible
      index opclasses.  This is needed to get varchar indexes to play nicely with
      the new EquivalenceClass machinery, as per recent gripe from Josh Berkus that
      CVS HEAD was failing to match a varchar index column to a constant restriction
      in the query.
      
      It seems likely that this change will allow removal of a lot of ugly ad-hoc
      RelabelType-stripping that the planner has traditionally done while matching
      expressions to other expressions, but I'll worry about that some other day.
  3. 31 May 2007, 2 commits
    • Tom Lane · d526575f
      Make large sequential scans and VACUUMs work in a limited-size "ring" of
      buffers, rather than blowing out the whole shared-buffer arena.  Aside from
      avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
      to cause a WAL flush for every page it modified, because we had it hacked to
      use only a single buffer.  Those flushes will now occur only once per
      ring-ful.  The exact ring size, and the threshold for seqscans to switch into
      the ring usage pattern, remain under debate; but the infrastructure seems
      done.  The key bit of infrastructure is a new optional BufferAccessStrategy
      object that can be passed to ReadBuffer operations; this replaces the former
      StrategyHintVacuum API.
      
      This patch also changes the buffer usage-count methodology a bit: we now
      advance usage_count when first pinning a buffer, rather than when last
      unpinning it.  To preserve the behavior that a buffer's lifetime starts to
      decrease when it's released, the clock sweep code is modified to not decrement
      usage_count of pinned buffers.
      
      Work not done in this commit: teach GiST and GIN indexes to use the vacuum
      BufferAccessStrategy for vacuum-driven fetches.
      
      Original patch by Simon, reworked by Heikki and again by Tom.
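      
      A minimal standalone sketch of the ring idea, with hypothetical names rather than the real
      BufferAccessStrategy API: a bulk scan recycles a small fixed set of buffer slots round-robin, so
      even an arbitrarily large seqscan or VACUUM touches only a handful of shared buffers.
      
          #include <stdio.h>
          
          #define RING_SIZE 8     /* illustrative; the real ring size was still under debate */
          
          /* A "strategy" is just a ring of buffer slot numbers plus a cursor. */
          typedef struct
          {
              int slots[RING_SIZE];
              int current;
          } RingStrategy;
          
          /* Pick the next buffer for a bulk scan: reuse the ring slots round-robin,
           * so a large sequential scan cycles within RING_SIZE buffers instead of
           * spreading across the whole buffer pool. */
          static int ring_next_buffer(RingStrategy *ring)
          {
              ring->current = (ring->current + 1) % RING_SIZE;
              return ring->slots[ring->current];
          }
          
          int main(void)
          {
              RingStrategy ring = {{10, 11, 12, 13, 14, 15, 16, 17}, -1};
          
              for (int page = 0; page < 20; page++)       /* pretend 20-page seqscan */
                  printf("page %2d -> buffer slot %d\n", page, ring_next_buffer(&ring));
              return 0;
          }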
    • Tom Lane · 14c4d3de
      Fix trivial misspelling in comment.
  4. 29 May 2007, 1 commit
  5. 28 May 2007, 1 commit
  6. 27 May 2007, 2 commits
    • Tom Lane · 8d675c85
      pgstat's on-proc-exit hook has to execute after the last transaction commit
      or abort within a backend; rearrange InitPostgres processing to make it so.
      Revealed by just-added Asserts along with ECPG regression tests (hm, I wonder
      why the core regression tests didn't expose it?).  This possibly is another
      reason for missing stats updates ...
    • Tom Lane · 77947c51
      Fix up pgstats counting of live and dead tuples to recognize that committed
      and aborted transactions have different effects; also teach it not to assume
      that prepared transactions are always committed.
      
      Along the way, simplify the pgstats API by tying counting directly to
      Relations; I cannot detect any redeeming social value in having stats
      pointers in HeapScanDesc and IndexScanDesc structures.  And fix a few
      corner cases in which counts might be missed because the relation's
      pgstat_info pointer hadn't been set.
  7. 26 May 2007, 1 commit
    • Tom Lane · 604ffd28
      Create hooks to let a loadable plugin monitor (or even replace) the planner
      and/or create plans for hypothetical situations; in particular, investigate
      plans that would be generated using hypothetical indexes.  This is a
      heavily-rewritten version of the hooks proposed by Gurjeet Singh for his
      Index Advisor project.  In this formulation, the index advisor can be
      entirely a loadable module instead of requiring a significant part to be
      in the core backend, and plans can be generated for hypothetical indexes
      without requiring the creation and rolling-back of system catalog entries.
      
      The index advisor patch as-submitted is not compatible with these hooks,
      but it needs significant work anyway due to other 8.2-to-8.3 planner
      changes.  With these hooks in the core backend, development of the advisor
      can proceed as a pgfoundry project.
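      
      The general shape of such a hook, as a self-contained C sketch (types and names here are
      illustrative, not the exact backend signatures): a module saves the existing hook pointer,
      installs its own function, and chains to the saved hook or to the standard implementation.
      
          #include <stdio.h>
          
          typedef int Plan;                             /* stand-in for a planner result */
          typedef Plan (*planner_hook_fn)(int query);
          
          static planner_hook_fn planner_hook = NULL;   /* global hook slot */
          
          static Plan standard_planner_stub(int query)  /* the built-in behavior */
          {
              return query * 10;
          }
          
          /* Entry point the core would call: use the hook if one is installed. */
          static Plan plan_query(int query)
          {
              return planner_hook ? planner_hook(query) : standard_planner_stub(query);
          }
          
          /* A loadable "advisor" hook: observe the query, possibly inject hypothetical
           * indexes, then delegate to whatever was installed before us. */
          static planner_hook_fn prev_planner_hook = NULL;
          
          static Plan advisor_hook(int query)
          {
              printf("advisor saw query %d (could add hypothetical indexes here)\n", query);
              return prev_planner_hook ? prev_planner_hook(query)
                                       : standard_planner_stub(query);
          }
          
          int main(void)
          {
              prev_planner_hook = planner_hook;   /* save whatever was there */
              planner_hook = advisor_hook;        /* install ourselves */
          
              printf("plan = %d\n", plan_query(7));
              return 0;
          }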
  8. 23 May 2007, 1 commit
    • Tom Lane · 11086f2f
      Repair planner bug introduced in 8.2 by ability to rearrange outer joins:
      in cases where a sub-SELECT inserts a WHERE clause between two outer joins,
      that clause may prevent us from re-ordering the two outer joins.  The code
      was considering only the joins' own ON-conditions in determining reordering
      safety, which is not good enough.  Add a "delay_upper_joins" flag to
      OuterJoinInfo to flag that we have detected such a clause and higher-level
      outer joins shouldn't be permitted to commute with this one.  (This might
      seem overly coarse, but given the current rules for OJ reordering, it's
      sufficient AFAICT.)
      
      The failure case is actually pretty narrow: it needs a WHERE clause within
      the RHS of a left join that checks the RHS of a lower left join, but is not
      strict for that RHS (else we'd have simplified the lower join to a plain
      join).  Even then no failure will be manifest unless the planner chooses to
      rearrange the join order.
      
      Per bug report from Adam Terrey.
  9. 22 May 2007, 3 commits
    • Tom Lane · d7153c5f
      Fix best_inner_indexscan to return both the cheapest-total-cost and
      cheapest-startup-cost innerjoin indexscans, and make joinpath.c consider
      both of these (when different) as the inside of a nestloop join.  The
      original design was based on the assumption that indexscan paths always
      have negligible startup cost, and so total cost is the only important
      figure of merit; an assumption that's obviously broken by bitmap
      indexscans.  This oversight could lead to choosing poor plans in cases
      where fast-start behavior is more important than total cost, such as
      LIMIT and IN queries.  8.1-vintage brain fade exposed by an example from
      Chuck D.
    • Tom Lane · 2415ad98
      Teach tuplestore.c to throw away data before the "mark" point when the caller
      is using mark/restore but not rewind or backward-scan capability.  Insert a
      materialize plan node between a mergejoin and its inner child if the inner
      child is a sort that is expected to spill to disk.  The materialize shields
      the sort from the need to do mark/restore and thereby allows it to perform
      its final merge pass on-the-fly; while the materialize itself is normally
      cheap since it won't spill to disk unless the number of tuples with equal
      key values exceeds work_mem.
      
      Greg Stark, with some kibitzing from Tom Lane.
    • Peter Eisentraut · 3963574d
      XPath fixes:
       - Function renamed to "xpath".
       - Function is now strict, per discussion.
       - Return empty array in case when XPath expression detects nothing
         (previously, NULL was returned in such case), per discussion.
       - (bugfix) Work with fragments with prologue: select xpath('/a',
         '<?xml version="1.0"?><a /><b />'); // now XML datum is always wrapped
         with dummy <x>...</x>, XML prologue simply goes away (if any).
       - Some cleanup.
      
      Nikolay Samokhvalov
      
      Some code cleanup and documentation work by myself.
  10. 21 May 2007, 1 commit
    • Tom Lane · a8d539f1
      To support external compression of archived WAL data, add a flag bit to
      WAL records that shows whether it is safe to remove full-page images
      (ie, whether or not an on-line backup was in progress when the WAL entry
      was made).  Also make provision for an XLOG_NOOP record type that can be
      used to fill in the extra space when decompressing the data for restore.
      
      This is the portion of Koichi Suzuki's "full page writes" patch that
      has to go into the core database.  The remainder of that work is two
      external compression and decompression programs, which for the time being
      will undergo separate development on pgfoundry.  Per discussion.
      
      Also, twiddle the handling of BTREE_SPLIT records to ensure it'll be
      possible to compress them (the previous coding caused essential info
      to be omitted).  The other commonly-used record types seem OK already,
      with the possible exception of GIN and GIST WAL records, which I don't
      understand well enough to opine on.
  11. 19 May 2007, 1 commit
    • Alvaro Herrera · b40776d2
      Have CLUSTER advance the table's relfrozenxid. The new frozen point is the
      FreezeXid introduced in a recent commit, so there isn't any data loss in this
      approach.
      
      Doing it causes ALTER TABLE (or rather, the forms of it that cause a full table
      rewrite) to be affected as well.  In this case, the frozen point is RecentXmin,
      because after the rewrite all the tuples are relabeled with the rewriting
      transaction's Xid.
      
      TOAST tables are fixed automatically as well, as fallout of the way they were
      already being handled in the respective code paths.
      
      With this patch, there is no longer need to VACUUM tables for Xid wraparound
      purposes that have been cleaned up via TRUNCATE or CLUSTER.
  12. 18 May 2007, 2 commits
    • Tom Lane · dbb76935
      Temporary fix for the problem that pg_stat_activity, inet_client_addr(),
      and inet_server_addr() fail if the client connected over a "scoped" IPv6
      address.  In this case getnameinfo() will return a string ending with
      a poorly-standardized "%something" zone specifier, which these functions
      try to feed to network_in(), which won't take it.  So that we don't lose
      functionality altogether, suppress the zone specifier before giving the
      string to network_in().  Per report from Brian Hirt.
      
      TODO: probably someday the inet type should support scoped IPv6 addresses,
      and then this patch should be reverted.
      
      Backpatch to 8.2 ... is it worth going further?
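      
      A standalone sketch of the workaround described above (hypothetical helper name, not the
      committed code): cut the string at the '%' so the remainder is acceptable to a parser that
      knows nothing about scoped IPv6 addresses.
      
          #include <stdio.h>
          #include <string.h>
          
          /* Strip a "%zone" suffix, e.g. "fe80::1%eth0" -> "fe80::1", so the
           * remaining text can be handed to an address parser that does not
           * understand scoped IPv6 addresses. */
          static void strip_zone(char *addr)
          {
              char *pct = strchr(addr, '%');
          
              if (pct != NULL)
                  *pct = '\0';
          }
          
          int main(void)
          {
              char addr[] = "fe80::1%eth0";
          
              strip_zone(addr);
              printf("%s\n", addr);       /* prints fe80::1 */
              return 0;
          }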
    • Tom Lane · b11123b6
      Fix parameter recalculation for Limit nodes: during a ReScan call we must
      recompute the limit/offset immediately, so that the updated values are
      available when the child's ReScan function is invoked.  Add a regression
      test for this, too.  Bug is new in HEAD (due to the bounded-sorting patch)
      so no need for back-patch.
      
      I did not do anything about merging this signaling with chgParam processing,
      but if we were to do that we'd still need to compute the updated values
      at this point rather than during the first ProcNode call.
      
      Per observation and test case from Greg Stark, though I didn't use his patch.
  13. 17 May 2007, 2 commits
  14. 16 May 2007, 1 commit
  15. 12 May 2007, 3 commits
    • Tom Lane · 9aa3c782
      Fix the problem that creating a user-defined type named _foo, followed by one
      named foo, would work but the other ordering would not.  If a user-specified
      type or table name collides with an existing auto-generated array name, just
      rename the array type out of the way by prepending more underscores.  This
      should not create any backward-compatibility issues, since the cases in which
      this will happen would have failed outright in prior releases.
      
      Also fix an oversight in the arrays-of-composites patch: ALTER TABLE RENAME
      renamed the table's rowtype but not its array type.
    • Tom Lane · d8326119
      Fix my oversight in enabling domains-of-domains: ALTER DOMAIN ADD CONSTRAINT
      needs to check the new constraint against columns of derived domains too.
      
      Also, make it error out if the domain to be modified is used within any
      composite-type columns.  Eventually we should support that case, but it seems
      a bit painful, and not suitable for a back-patch.  For the moment just let the
      user know we can't do it.
      
      Backpatch to 8.2, which is the only released version that allows nested
      domains.  Possibly the other part should be back-patched further.
    • Tom Lane · bc8036fc
      Support arrays of composite types, including the rowtypes of regular tables
      and views (but not system catalogs, nor sequences or toast tables).  Get rid
      of the hardwired convention that a type's array type is named exactly "_type",
      instead using a new column pg_type.typarray to provide the linkage.  (It still
      will be named "_type", though, except in odd corner cases such as
      maximum-length type names.)
      
      Along the way, make tracking of owner and schema dependencies for types more
      uniform: a type directly created by the user has these dependencies, while a
      table rowtype or auto-generated array type does not have them, but depends on
      its parent object instead.
      
      David Fetter, Andrew Dunstan, Tom Lane
  16. 09 May 2007, 2 commits
    • Tom Lane · 5b7cf08d
      Reserve some pg_statistic "kind" codes for use by the ESRI ST_Geometry
      datatype project.  Per request from Ale Raza (araza at esri.com).
    • Neil Conway · ade493e0
      Add a hash function for "numeric". Mark the equality operator for
      numerics as "oprcanhash", and make the corresponding system catalog
      updates. As a result, hash indexes, hashed aggregation, and hash
      joins can now be used with the numeric type. Bump the catversion.
      
      The only tricky aspect to doing this is writing a correct hash
      function: it's possible for two Numerics to be equal according to
      their equality operator, but have different in-memory bit patterns.
      To cope with this, the hash function doesn't consider the Numeric's
      "scale" or "sign", and explicitly skips any leading or trailing
      zeros in the Numeric's digit buffer (the current implementation
      should suppress any such zeros, but it seems unwise to rely upon
      this). See discussion on pgsql-patches for more details.
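      
      The key idea, sketched as self-contained C with a toy digit layout (not the real Numeric
      representation): hash only the significant digits, ignoring sign and scale and skipping leading
      and trailing zero digits, so any two values that compare equal produce the same hash.
      
          #include <stdint.h>
          #include <stdio.h>
          
          /* Hash a digit buffer while skipping leading and trailing zero digits,
           * so equal values with different stored forms hash alike. */
          static uint32_t digits_hash(const int16_t *digits, int ndigits)
          {
              int start = 0, end = ndigits;
          
              while (start < end && digits[start] == 0)
                  start++;                            /* skip leading zeros */
              while (end > start && digits[end - 1] == 0)
                  end--;                              /* skip trailing zeros */
          
              uint32_t h = 0;
              for (int i = start; i < end; i++)
                  h = h * 31 + (uint32_t) digits[i];  /* toy mixing step */
              return h;
          }
          
          int main(void)
          {
              int16_t a[] = {0, 12, 3400, 0};     /* same significant digits ... */
              int16_t b[] = {12, 3400};           /* ... stored without the zeros */
          
              printf("%u %u\n", digits_hash(a, 4), digits_hash(b, 2));  /* equal */
              return 0;
          }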
  17. 05 May 2007, 1 commit
  18. 04 May 2007, 4 commits
    • c7464720
    • Tom Lane · 79ca7ffe
      A few fixups in error handling: mark pg_re_throw() as noreturn for gcc,
      and for other compilers, insert a dummy exit() call so that they understand
      PG_RE_THROW() doesn't return.  Insert fflush(stderr) in ExceptionalCondition,
      per recent buildfarm evidence that that might not happen automatically on some
      platforms.  And const-ify ExceptionalCondition's declaration while at it.
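      
      The idiom being described, as a generic standalone sketch (macro and function names are
      illustrative, not the elog.h definitions): gcc gets a noreturn attribute, and other compilers get
      a dummy exit() after the call so their flow analysis also sees that the macro cannot fall through.
      
          #include <stdio.h>
          #include <stdlib.h>
          
          #ifdef __GNUC__
          #define NORETURN __attribute__((noreturn))
          #define RE_THROW()  re_throw()
          #else
          #define NORETURN
          /* Dummy exit() so compilers without the attribute still see that
           * RE_THROW() cannot fall through. */
          #define RE_THROW()  (re_throw(), exit(1))
          #endif
          
          static NORETURN void re_throw(void)
          {
              fflush(stderr);     /* flush pending diagnostics before bailing out */
              abort();            /* stand-in for longjmp'ing back to an error handler */
          }
          
          int value_or_die(int v)
          {
              if (v >= 0)
                  return v;
              RE_THROW();         /* the attribute (or the dummy exit) keeps flow
                                   * analysis from warning about falling off the end */
          }
          
          int main(void)
          {
              printf("%d\n", value_or_die(42));
              return 0;
          }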
    • Tom Lane · d26559db
      Teach tuplesort.c about "top N" sorting, in which only the first N tuples
      need be returned.  We keep a heap of the current best N tuples and sift-up
      new tuples into it as we scan the input.  For M input tuples this means
      only about M*log(N) comparisons instead of M*log(M), not to mention a lot
      less workspace when N is small --- avoiding spill-to-disk for large M
      is actually the most attractive thing about it.  Patch includes planner
      and executor support for invoking this facility in ORDER BY ... LIMIT
      queries.  Greg Stark, with some editorialization by moi.
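      
      A self-contained C sketch of the bounded "top N" idea (not the tuplesort.c code): keep a heap of
      the N best values seen so far and sift each new input against it, so each of the M inputs costs at
      most about log(N) comparisons and only N values are ever stored.
      
          #include <stdio.h>
          
          #define LIMIT 3   /* keep only the 3 smallest inputs, as in ORDER BY ... LIMIT 3 */
          
          /* Sift heap[i] down in a max-heap of "count" elements. */
          static void sift_down(int *heap, int count, int i)
          {
              for (;;)
              {
                  int largest = i, l = 2 * i + 1, r = 2 * i + 2;
          
                  if (l < count && heap[l] > heap[largest]) largest = l;
                  if (r < count && heap[r] > heap[largest]) largest = r;
                  if (largest == i)
                      return;
                  int tmp = heap[i]; heap[i] = heap[largest]; heap[largest] = tmp;
                  i = largest;
              }
          }
          
          int main(void)
          {
              int input[] = {42, 7, 19, 3, 88, 15, 2, 61};
              int n = (int) (sizeof(input) / sizeof(input[0]));
              int heap[LIMIT];
              int filled = 0;
          
              for (int k = 0; k < n; k++)
              {
                  if (filled < LIMIT)
                  {
                      heap[filled++] = input[k];
                      if (filled == LIMIT)            /* heapify once the heap is full */
                          for (int i = LIMIT / 2 - 1; i >= 0; i--)
                              sift_down(heap, LIMIT, i);
                  }
                  else if (input[k] < heap[0])
                  {
                      /* New value beats the current worst survivor: replace and
                       * re-sift, costing O(log LIMIT) comparisons per input. */
                      heap[0] = input[k];
                      sift_down(heap, LIMIT, 0);
                  }
              }
          
              /* The heap now holds the LIMIT smallest values (unsorted). */
              for (int i = 0; i < LIMIT; i++)
                  printf("%d ", heap[i]);
              printf("\n");
              return 0;
          }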
    • Tom Lane · 0fef38da
      Tweak hash index AM to use the new ReadOrZeroBuffer bufmgr API when fetching
      pages it intends to zero immediately.  Just to show there is some use for that
      function besides WAL recovery :-).
      Along the way, fold _hash_checkpage and _hash_pageinit calls into _hash_getbuf
      and friends, instead of expecting callers to do that separately.
  19. 03 May 2007, 1 commit
    • Tom Lane · 8c3cc86e
      During WAL recovery, when reading a page that we intend to overwrite completely
      from the WAL data, don't bother to physically read it; just have bufmgr.c
      return a zeroed-out buffer instead.  This speeds recovery significantly,
      and also avoids unnecessary failures when a page-to-be-overwritten has corrupt
      page headers on disk.  This replaces a former kluge that accomplished the
      latter by pretending zero_damaged_pages was always ON during WAL recovery;
      which was OK when the kluge was put in, but is unsafe when restoring a WAL
      log that was written with full_page_writes off.
      
      Heikki Linnakangas
  20. 02 May 2007, 1 commit
    • Tom Lane · 88f1fd29
      Fix oversight in PG_RE_THROW processing: it's entirely possible that there
      isn't any place to throw the error to.  If so, we should treat the error
      as FATAL, just as we would have if it'd been thrown outside the PG_TRY
      block to begin with.
      
      Although this is clearly a *potential* source of bugs, it is not clear
      at the moment whether it is an *actual* source of bugs; there may not
      presently be any PG_TRY blocks in code that can be reached with no outer
      longjmp catcher.  So for the moment I'm going to be conservative and not
      back-patch this.  The change breaks ABI for users of PG_RE_THROW and hence
      might create compatibility problems for loadable modules, so we should not
      put it into released branches without proof that it's needed.
  21. 01 May 2007, 2 commits
  22. 30 Apr 2007, 1 commit
    • Tom Lane · 957d08c8
      Implement rate-limiting logic on how often backends will attempt to send
      messages to the stats collector.  This avoids the problem that enabling
      stats_row_level for autovacuum has a significant overhead for short
      read-only transactions, as noted by Arjen van der Meijden.  We can avoid
      an extra gettimeofday call by piggybacking on the one done for WAL-logging
      xact commit or abort (although that doesn't help read-only transactions,
      since they don't WAL-log anything).
      
      In my proposal for this, I noted that we could change the WAL log entries
      for commit/abort to record full TimestampTz precision, instead of only
      time_t as at present.  That's not done in this patch, but will be committed
      separately.
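      
      The rate-limiting pattern, as a standalone sketch with illustrative names and interval: remember
      when the last report was sent and skip further sends until a minimum interval has elapsed; the
      timestamp is passed in so one obtained for another purpose (such as commit WAL-logging) can be
      reused instead of calling gettimeofday again.
      
          #include <stdio.h>
          
          #define MIN_REPORT_INTERVAL 0.5     /* seconds between reports; illustrative value */
          
          static double last_report = -1.0;
          
          /* Send at most one report per MIN_REPORT_INTERVAL.  "now" is passed in so a
           * timestamp obtained elsewhere (e.g. at commit time) can be reused. */
          static void maybe_report(double now, int counter)
          {
              if (last_report >= 0.0 && now - last_report < MIN_REPORT_INTERVAL)
                  return;                             /* too soon, keep accumulating */
              printf("reporting counter=%d at t=%.1f\n", counter, now);
              last_report = now;
          }
          
          int main(void)
          {
              /* Simulated transaction end times: only some of them trigger a report. */
              double times[] = {0.0, 0.1, 0.2, 0.7, 0.8, 1.5};
          
              for (int i = 0; i < 6; i++)
                  maybe_report(times[i], i);
              return 0;
          }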
  23. 28 Apr 2007, 1 commit
    • Tom Lane · bbbe825f
      Modify processing of DECLARE CURSOR and EXPLAIN so that they can resolve the
      types of unspecified parameters when submitted via extended query protocol.
      This worked in 8.2 but I had broken it during plancache changes.  DECLARE
      CURSOR is now treated almost exactly like a plain SELECT through parse
      analysis, rewrite, and planning; only just before sending to the executor
      do we divert it away to ProcessUtility.  This requires a special-case check
      in a number of places, but practically all of them were already special-casing
      SELECT INTO, so it's not too ugly.  (Maybe it would be a good idea to merge
      the two by treating IntoClause as a form of utility statement?  Not going to
      worry about that now, though.)  That approach doesn't work for EXPLAIN,
      however, so for that I punted and used a klugy solution of running parse
      analysis an extra time if under extended query protocol.
  24. 27 Apr 2007, 2 commits
    • Tom Lane · a2e923a6
      Fix dynahash.c to suppress hash bucket splits while a hash_seq_search() scan
      is in progress on the same hashtable.  This seems the least invasive way to
      fix the recently-recognized problem that a split could cause the scan to
      visit entries twice or (with much lower probability) miss them entirely.
      The only field-reported problem caused by this is the "failed to re-find
      shared lock object" PANIC in COMMIT PREPARED reported by Michel Dorochevsky,
      which was caused by multiply visited entries.  However, it seems certain
      that mdsync() is vulnerable to missing required fsync's due to missed
      entries, and I am fearful that RelationCacheInitializePhase2() might be at
      risk as well.  Because of that and the generalized hazard presented by this
      bug, back-patch all the supported branches.
      
      Along the way, fix pg_prepared_statement() and pg_cursor() to not assume
      that the hashtables they are examining will stay static between calls.
      This is risky regardless of the newly noted dynahash problem, because
      hash_seq_search() has never promised to cope with deletion of table entries
      other than the just-returned one.  There may be no bug here because the only
      supported way to call these functions is via ExecMakeTableFunctionResult()
      which will cycle them to completion before doing anything very interesting,
      but it seems best to get rid of the assumption.  This affects 8.2 and HEAD
      only, since those functions weren't there earlier.
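      
      A toy standalone sketch of the guard being described (not dynahash itself): count active
      sequential scans and defer bucket splits while any are in progress, so a scan never sees an entry
      move to another bucket (and so never visits it twice or misses it).
      
          #include <stdio.h>
          #include <stdbool.h>
          
          /* Toy table state: just the bookkeeping relevant to the fix. */
          static int active_scans = 0;        /* how many seq scans are in progress */
          static bool split_pending = false;  /* a split we postponed */
          
          static void maybe_split_bucket(void)
          {
              if (active_scans > 0)
              {
                  /* Splitting now could move entries a concurrent scan has not yet
                   * visited (or revisit ones it has), so postpone the split. */
                  split_pending = true;
                  return;
              }
              printf("splitting bucket now\n");
              split_pending = false;
          }
          
          static void begin_scan(void) { active_scans++; }
          
          static void end_scan(void)
          {
              active_scans--;
              if (active_scans == 0 && split_pending)
                  maybe_split_bucket();       /* safe to catch up on deferred work */
          }
          
          int main(void)
          {
              begin_scan();
              maybe_split_bucket();   /* deferred: a scan is active */
              end_scan();             /* the deferred split happens here */
              return 0;
          }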
    • Neil Conway · 16efdb5e
      Rename the newly-added commands for discarding session state.
      RESET SESSION, RESET PLANS, and RESET TEMP are now DISCARD ALL,
      DISCARD PLANS, and DISCARD TEMP, respectively. This is to avoid
      confusion with the pre-existing RESET variants: the DISCARD
      commands are not actually similar to RESET. Patch from Marko
      Kreen, with some minor editorialization.