indexam.sgml 39.3 KB
Newer Older
1
<!-- $PostgreSQL: pgsql/doc/src/sgml/indexam.sgml,v 2.23 2007/04/06 22:33:41 tgl Exp $ -->
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35

<chapter id="indexam">
 <title>Index Access Method Interface Definition</title>

  <para>
   This chapter defines the interface between the core
   <productname>PostgreSQL</productname> system and <firstterm>index access
   methods</>, which manage individual index types.  The core system
   knows nothing about indexes beyond what is specified here, so it is
   possible to develop entirely new index types by writing add-on code.
  </para>

  <para>
   All indexes in <productname>PostgreSQL</productname> are what are known
   technically as <firstterm>secondary indexes</>; that is, the index is
   physically separate from the table file that it describes.  Each index
   is stored as its own physical <firstterm>relation</> and so is described
   by an entry in the <structname>pg_class</> catalog.  The contents of an
   index are entirely under the control of its index access method.  In
   practice, all index access methods divide indexes into standard-size
   pages so that they can use the regular storage manager and buffer manager
   to access the index contents.  (All the existing index access methods
   furthermore use the standard page layout described in <xref
   linkend="storage-page-layout">, and they all use the same format for index
   tuple headers; but these decisions are not forced on an access method.)
  </para>

  <para>
   An index is effectively a mapping from some data key values to
   <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
   (tuples) in the index's parent table.  A TID consists of a
   block number and an item number within that block (see <xref
   linkend="storage-page-layout">).  This is sufficient
   information to fetch a particular row version from the table.
36
   Indexes are not directly aware that under MVCC, there might be multiple
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
   extant versions of the same logical row; to an index, each tuple is
   an independent object that needs its own index entry.  Thus, an
   update of a row always creates all-new index entries for the row, even if
   the key values did not change.  Index entries for dead tuples are
   reclaimed (by vacuuming) when the dead tuples themselves are reclaimed.
  </para>

 <sect1 id="index-catalog">
  <title>Catalog Entries for Indexes</title>

  <para>
   Each index access method is described by a row in the
   <structname>pg_am</structname> system catalog (see
   <xref linkend="catalog-pg-am">).  The principal contents of a
   <structname>pg_am</structname> row are references to
   <link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
   entries that identify the index access
   functions supplied by the access method.  The APIs for these functions
   are defined later in this chapter.  In addition, the
   <structname>pg_am</structname> row specifies a few fixed properties of
57
   the access method, such as whether it can support multicolumn indexes.
58 59 60 61 62 63 64 65
   There is not currently any special support
   for creating or deleting <structname>pg_am</structname> entries;
   anyone able to write a new access method is expected to be competent
   to insert an appropriate row for themselves.
  </para>

  <para>
   To be useful, an index access method must also have one or more
66
   <firstterm>operator families</> and
67
   <firstterm>operator classes</> defined in
68
   <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
69 70 71 72 73
   <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
   <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
   <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
   These entries allow the planner
   to determine what kinds of query qualifications can be used with
74
   indexes of this access method.  Operator families and classes are described
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
   in <xref linkend="xindex">, which is prerequisite material for reading
   this chapter.
  </para>

  <para>
   An individual index is defined by a 
   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
   entry that describes it as a physical relation, plus a
   <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
   entry that shows the logical content of the index &mdash; that is, the set
   of index columns it has and the semantics of those columns, as captured by
   the associated operator classes.  The index columns (key values) can be
   either simple columns of the underlying table or expressions over the table
   rows.  The index access method normally has no interest in where the index
   key values come from (it is always handed precomputed key values) but it
   will be very interested in the operator class information in
   <structname>pg_index</structname>.  Both of these catalog entries can be
   accessed as part of the <structname>Relation</> data structure that is
   passed to all operations on the index.
  </para>

  <para>
   Some of the flag columns of <structname>pg_am</structname> have nonobvious
   implications.  The requirements of <structfield>amcanunique</structfield>
99
   are discussed in <xref linkend="index-unique-checks">.
100
   The <structfield>amcanmulticol</structfield> flag asserts that the
101
   access method supports multicolumn indexes, while
102 103 104 105 106 107 108 109 110 111
   <structfield>amoptionalkey</structfield> asserts that it allows scans
   where no indexable restriction clause is given for the first index column.
   When <structfield>amcanmulticol</structfield> is false,
   <structfield>amoptionalkey</structfield> essentially says whether the
   access method allows full-index scans without any restriction clause.
   Access methods that support multiple index columns <emphasis>must</>
   support scans that omit restrictions on any or all of the columns after
   the first; however they are permitted to require some restriction to
   appear for the first index column, and this is signaled by setting
   <structfield>amoptionalkey</structfield> false.
112 113 114
   <structfield>amindexnulls</structfield> asserts that index entries are
   created for NULL key values.  Since most indexable operators are
   strict and hence cannot return TRUE for NULL inputs,
115
   it is at first sight attractive to not store index entries for null values:
116
   they could never be returned by an index scan anyway.  However, this
117 118 119 120 121
   argument fails when an index scan has no restriction clause for a given
   index column.  In practice this means that
   indexes that have <structfield>amoptionalkey</structfield> true must
   index nulls, since the planner might decide to use such an index
   with no scan keys at all.  A related restriction is that an index
122 123
   access method that supports multiple index columns <emphasis>must</>
   support indexing null values in columns after the first, because the planner
124 125
   will assume the index can be used for queries that do not restrict
   these columns.  For example, consider an index on (a,b) and a query with
126 127 128 129
   <literal>WHERE a = 4</literal>.  The system will assume the index can be
   used to scan for rows with <literal>a = 4</literal>, which is wrong if the
   index omits rows where <literal>b</> is null.
   It is, however, OK to omit rows where the first indexed column is null.
T
Teodor Sigaev 已提交
130
   Thus, <structfield>amindexnulls</structfield> should be set true only if the
131
   index access method indexes all rows, including arbitrary combinations of
132 133 134 135
   null values.  An index access method that sets
   <structfield>amindexnulls</structfield> may also set
   <structfield>amsearchnulls</structfield>, indicating that it supports
   <literal>IS NULL</> clauses as search conditions.
136 137 138 139 140 141 142 143 144 145 146 147 148 149
  </para>

 </sect1>

 <sect1 id="index-functions">
  <title>Index Access Method Functions</title>

  <para>
   The index construction and maintenance functions that an index access
   method must provide are:
  </para>

  <para>
<programlisting>
150
IndexBuildResult *
151 152 153 154 155 156 157 158 159 160
ambuild (Relation heapRelation,
         Relation indexRelation,
         IndexInfo *indexInfo);
</programlisting>
   Build a new index.  The index relation has been physically created,
   but is empty.  It must be filled in with whatever fixed data the
   access method requires, plus entries for all tuples already existing
   in the table.  Ordinarily the <function>ambuild</> function will call
   <function>IndexBuildHeapScan()</> to scan the table for existing tuples
   and compute the keys that need to be inserted into the index.
161 162
   The function must return a palloc'd struct containing statistics about
   the new index.
163 164 165 166
  </para>

  <para>
<programlisting>
167
bool
168
aminsert (Relation indexRelation,
169 170
          Datum *values,
          bool *isnull,
171 172 173 174
          ItemPointer heap_tid,
          Relation heapRelation,
          bool check_uniqueness);
</programlisting>
175 176
   Insert a new tuple into an existing index.  The <literal>values</> and
   <literal>isnull</> arrays give the key values to be indexed, and
177 178 179
   <literal>heap_tid</> is the TID to be indexed.
   If the access method supports unique indexes (its
   <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
180
   <literal>check_uniqueness</> might be true, in which case the access method
181 182 183
   must verify that there is no conflicting row; this is the only situation in
   which the access method normally needs the <literal>heapRelation</>
   parameter.  See <xref linkend="index-unique-checks"> for details.
184 185 186
   The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE
   result does not denote an error condition, but is used for cases such
   as an index AM refusing to index a NULL.)
187 188 189 190 191
  </para>

  <para>
<programlisting>
IndexBulkDeleteResult *
192 193
ambulkdelete (IndexVacuumInfo *info,
              IndexBulkDeleteResult *stats,
194 195 196 197 198 199
              IndexBulkDeleteCallback callback,
              void *callback_state);
</programlisting>
   Delete tuple(s) from the index.  This is a <quote>bulk delete</> operation
   that is intended to be implemented by scanning the whole index and checking
   each entry to see if it should be deleted.
200
   The passed-in <literal>callback</> function must be called, in the style
201 202 203 204
   <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
   to determine whether any particular index entry, as identified by its
   referenced TID, is to be deleted.  Must return either NULL or a palloc'd
   struct containing statistics about the effects of the deletion operation.
205 206
   It is OK to return NULL if no information needs to be passed on to
   <function>amvacuumcleanup</>.
207 208
  </para>

209
  <para>
210
   Because of limited <varname>maintenance_work_mem</>,
211
   <function>ambulkdelete</> might need to be called more than once when many
212 213 214 215 216 217
   tuples are to be deleted.  The <literal>stats</> argument is the result
   of the previous call for this index (it is NULL for the first call within a
   <command>VACUUM</> operation).  This allows the AM to accumulate statistics
   across the whole operation.  Typically, <function>ambulkdelete</> will
   modify and return the same struct if the passed <literal>stats</> is not
   null.
218 219
  </para>

220 221 222
  <para>
<programlisting>
IndexBulkDeleteResult *
223
amvacuumcleanup (IndexVacuumInfo *info,
224 225
                 IndexBulkDeleteResult *stats);
</programlisting>
226 227
   Clean up after a <command>VACUUM</command> operation (zero or more
   <function>ambulkdelete</> calls).  This does not have to do anything
228
   beyond returning index statistics, but it might perform bulk cleanup
229 230 231 232 233 234 235 236 237
   such as reclaiming empty index pages.  <literal>stats</> is whatever the
   last <function>ambulkdelete</> call returned, or NULL if
   <function>ambulkdelete</> was not called because no tuples needed to be
   deleted.  If the result is not NULL it must be a palloc'd struct.
   The statistics it contains will be used to update <structname>pg_class</>,
   and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
   It is OK to return NULL if the index was not changed at all during the
   <command>VACUUM</command> operation, but otherwise correct stats should
   be returned.
238 239
  </para>

240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
  <para>
<programlisting>
void
amcostestimate (PlannerInfo *root,
                IndexOptInfo *index,
                List *indexQuals,
                RelOptInfo *outer_rel,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>
   Estimate the costs of an index scan.  This function is described fully
   in <xref linkend="index-cost-estimation">, below.
  </para>

  <para>
<programlisting>
bytea *
amoptions (ArrayType *reloptions,
           bool validate);
</programlisting>
   Parse and validate the reloptions array for an index.  This is called only
   when a non-null reloptions array exists for the index.
   <parameter>reloptions</> is a <type>text</> array containing entries of the
   form <replaceable>name</><literal>=</><replaceable>value</>.
   The function should construct a <type>bytea</> value, which will be copied
   into the <structfield>rd_options</> field of the index's relcache entry.
   The data contents of the <type>bytea</> value are open for the access
   method to define, but the standard access methods currently all use struct
   <structname>StdRdOptions</>.
   When <parameter>validate</> is true, the function should report a suitable
   error message if any of the options are unrecognized or have invalid
   values; when <parameter>validate</> is false, invalid entries should be
   silently ignored.  (<parameter>validate</> is false when loading options
   already stored in <structname>pg_catalog</>; an invalid entry could only
   be found if the access method has changed its rules for options, and in
   that case ignoring obsolete entries is appropriate.)
   It is OK to return NULL if default behavior is wanted.
  </para>

281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302
  <para>
   The purpose of an index, of course, is to support scans for tuples matching
   an indexable <literal>WHERE</> condition, often called a
   <firstterm>qualifier</> or <firstterm>scan key</>.  The semantics of
   index scanning are described more fully in <xref linkend="index-scanning">,
   below.  The scan-related functions that an index access method must provide
   are:
  </para>

  <para>
<programlisting>
IndexScanDesc
ambeginscan (Relation indexRelation,
             int nkeys,
             ScanKey key);
</programlisting>
   Begin a new scan.  The <literal>key</> array (of length <literal>nkeys</>)
   describes the scan key(s) for the index scan.  The result must be a
   palloc'd struct. For implementation reasons the index access method
   <emphasis>must</> create this struct by calling
   <function>RelationGetIndexScan()</>.  In most cases
   <function>ambeginscan</> itself does little beyond making that call;
303
   the interesting parts of index-scan startup are in <function>amrescan</>.
304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322
  </para>

  <para>
<programlisting>
boolean
amgettuple (IndexScanDesc scan,
            ScanDirection direction);
</programlisting>
   Fetch the next tuple in the given scan, moving in the given
   direction (forward or backward in the index).  Returns TRUE if a tuple was
   obtained, FALSE if no matching tuples remain.  In the TRUE case the tuple
   TID is stored into the <literal>scan</> structure.  Note that
   <quote>success</> means only that the index contains an entry that matches
   the scan keys, not that the tuple necessarily still exists in the heap or
   will pass the caller's snapshot test.
  </para>

  <para>
<programlisting>
323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344
boolean
amgetmulti (IndexScanDesc scan,
            ItemPointer tids,
            int32 max_tids,
            int32 *returned_tids);
</programlisting>
   Fetch multiple tuples in the given scan.  Returns TRUE if the scan should
   continue, FALSE if no matching tuples remain.  <literal>tids</> points to
   a caller-supplied array of <literal>max_tids</>
   <structname>ItemPointerData</> records, which the call fills with TIDs of
   matching tuples.  <literal>*returned_tids</> is set to the number of TIDs
   actually returned.  This can be less than <literal>max_tids</>, or even
   zero, even when the return value is TRUE.  (This provision allows the
   access method to choose the most efficient stopping points in its scan,
   for example index page boundaries.)  <function>amgetmulti</> and
   <function>amgettuple</> cannot be used in the same index scan; there
   are other restrictions too when using <function>amgetmulti</>, as explained
   in <xref linkend="index-scanning">.
  </para>

  <para>
<programlisting>
345 346 347 348 349 350 351
void
amrescan (IndexScanDesc scan,
          ScanKey key);
</programlisting>
   Restart the given scan, possibly with new scan keys (to continue using
   the old keys, NULL is passed for <literal>key</>).  Note that it is not
   possible for the number of keys to be changed.  In practice the restart
352
   feature is used when a new outer tuple is selected by a nested-loop join
353 354 355
   and so a new key comparison value is needed, but the scan key structure
   remains the same.  This function is also called by
   <function>RelationGetIndexScan()</>, so it is used for initial setup
356
   of an index scan as well as rescanning.
357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386
  </para>

  <para>
<programlisting>
void
amendscan (IndexScanDesc scan);
</programlisting>
   End a scan and release resources.  The <literal>scan</> struct itself
   should not be freed, but any locks or pins taken internally by the
   access method must be released.
  </para>

  <para>
<programlisting>
void
ammarkpos (IndexScanDesc scan);
</programlisting>
   Mark current scan position.  The access method need only support one
   remembered scan position per scan.
  </para>

  <para>
<programlisting>
void
amrestrpos (IndexScanDesc scan);
</programlisting>
   Restore the scan to the most recently marked position.
  </para>

  <para>
387
   By convention, the <literal>pg_proc</literal> entry for an index
388 389 390 391 392
   access method function should show the correct number of arguments,
   but declare them all as type <type>internal</> (since most of the arguments
   have types that are not known to SQL, and we don't want users calling
   the functions directly anyway).  The return type is declared as
   <type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
393 394 395 396
   The only exception is <function>amoptions</>, which should be correctly
   declared as taking <type>text[]</> and <type>bool</> and returning
   <type>bytea</>.  This provision allows client code to execute
   <function>amoptions</> to test validity of options settings.
397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416
  </para>

 </sect1>

 <sect1 id="index-scanning">
  <title>Index Scanning</title>

  <para>
   In an index scan, the index access method is responsible for regurgitating
   the TIDs of all the tuples it has been told about that match the
   <firstterm>scan keys</>.  The access method is <emphasis>not</> involved in
   actually fetching those tuples from the index's parent table, nor in
   determining whether they pass the scan's time qualification test or other
   conditions.
  </para>

  <para>
   A scan key is the internal representation of a <literal>WHERE</> clause of
   the form <replaceable>index_key</> <replaceable>operator</>
   <replaceable>constant</>, where the index key is one of the columns of the
417
   index and the operator is one of the members of the operator family
418 419 420 421 422 423
   associated with that index column.  An index scan has zero or more scan
   keys, which are implicitly ANDed &mdash; the returned tuples are expected
   to satisfy all the indicated conditions.
  </para>

  <para>
424
   The operator family can indicate that the index is <firstterm>lossy</> for a
425 426
   particular operator; this implies that the index scan will return all the
   entries that pass the scan key, plus possibly additional entries that do
427
   not.  The core system's index-scan machinery will then apply that operator
428 429 430 431 432 433 434 435 436
   again to the heap tuple to verify whether or not it really should be
   selected.  For non-lossy operators, the index scan must return exactly the
   set of matching entries, as there is no recheck.
  </para>

  <para>
   Note that it is entirely up to the access method to ensure that it
   correctly finds all and only the entries passing all the given scan keys.
   Also, the core system will simply hand off all the <literal>WHERE</>
437
   clauses that match the index keys and operator families, without any
438 439 440 441 442 443 444 445 446 447
   semantic analysis to determine whether they are redundant or
   contradictory.  As an example, given
   <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
   indexed column, it is left to the b-tree <function>amrescan</> function
   to realize that the first scan key is redundant and can be discarded.
   The extent of preprocessing needed during <function>amrescan</> will
   depend on the extent to which the index access method needs to reduce
   the scan keys to a <quote>normalized</> form.
  </para>

448 449 450 451 452 453 454 455 456
  <para>
   Some access methods return index entries in a well-defined order, others
   do not.  If entries are returned in sorted order, the access method should
   set <structname>pg_am</>.<structfield>amcanorder</> true to indicate that
   it supports ordered scans.
   All such access methods must use btree-compatible strategy numbers for
   their equality and ordering operators.
  </para>

457 458 459 460 461 462 463 464 465
  <para>
   The <function>amgettuple</> function has a <literal>direction</> argument,
   which can be either <literal>ForwardScanDirection</> (the normal case)
   or  <literal>BackwardScanDirection</>.  If the first call after
   <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
   set of matching index entries is to be scanned back-to-front rather than in
   the normal front-to-back direction, so <function>amgettuple</> must return
   the last matching tuple in the index, rather than the first one as it
   normally would.  (This will only occur for access
466
   methods that advertise they support ordered scans.)  After the
467 468 469 470 471 472
   first call, <function>amgettuple</> must be prepared to advance the scan in
   either direction from the most recently returned entry.
  </para>

  <para>
   The access method must support <quote>marking</> a position in a scan
473
   and later returning to the marked position.  The same position might be
474 475 476 477 478 479 480 481 482 483 484 485
   restored multiple times.  However, only one position need be remembered
   per scan; a new <function>ammarkpos</> call overrides the previously
   marked position.
  </para>

  <para>
   Both the scan position and the mark position (if any) must be maintained
   consistently in the face of concurrent insertions or deletions in the
   index.  It is OK if a freshly-inserted entry is not returned by a scan that
   would have found the entry if it had existed when the scan started, or for
   the scan to return such an entry upon rescanning or backing
   up even though it had not been returned the first time through.  Similarly,
486
   a concurrent delete might or might not be reflected in the results of a scan.
487 488
   What is important is that insertions or deletions not cause the scan to
   miss or multiply return entries that were not themselves being inserted or
489
   deleted.
490 491
  </para>

492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510
  <para>
   Instead of using <function>amgettuple</>, an index scan can be done with 
   <function>amgetmulti</> to fetch multiple tuples per call.  This can be
   noticeably more efficient than <function>amgettuple</> because it allows
   avoiding lock/unlock cycles within the access method.  In principle
   <function>amgetmulti</> should have the same effects as repeated
   <function>amgettuple</> calls, but we impose several restrictions to
   simplify matters.  In the first place, <function>amgetmulti</> does not
   take a <literal>direction</> argument, and therefore it does not support
   backwards scan nor intrascan reversal of direction.  The access method
   need not support marking or restoring scan positions during an
   <function>amgetmulti</> scan, either.  (These restrictions cost little
   since it would be difficult to use these features in an
   <function>amgetmulti</> scan anyway: adjusting the caller's buffered
   list of TIDs would be complex.)  Finally, <function>amgetmulti</> does
   not guarantee any locking of the returned tuples, with implications
   spelled out in <xref linkend="index-locking">.
  </para>

511 512 513 514 515 516
 </sect1>

 <sect1 id="index-locking">
  <title>Index Locking Considerations</title>

  <para>
517 518 519
   Index access methods must handle concurrent updates
   of the index by multiple processes.
   The core <productname>PostgreSQL</productname> system obtains
520
   <literal>AccessShareLock</> on the index during an index scan, and
521 522
   <literal>RowExclusiveLock</> when updating the index (including plain
   <command>VACUUM</>).  Since these lock
523
   types do not conflict, the access method is responsible for handling any
524
   fine-grained locking it might need.  An exclusive lock on the index as a whole
525 526
   will be taken only during index creation, destruction,
   <command>REINDEX</>, or <command>VACUUM FULL</>.
527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542
  </para>

  <para>
   Building an index type that supports concurrent updates usually requires
   extensive and subtle analysis of the required behavior.  For the b-tree
   and hash index types, you can read about the design decisions involved in
   <filename>src/backend/access/nbtree/README</> and
   <filename>src/backend/access/hash/README</>.
  </para>

  <para>
   Aside from the index's own internal consistency requirements, concurrent
   updates create issues about consistency between the parent table (the
   <firstterm>heap</>) and the index.  Because
   <productname>PostgreSQL</productname> separates accesses 
   and updates of the heap from those of the index, there are windows in
543
   which the index might be inconsistent with the heap.  We handle this problem
544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562
   with the following rules:

    <itemizedlist>
     <listitem>
      <para>
       A new heap entry is made before making its index entries.  (Therefore
       a concurrent index scan is likely to fail to see the heap entry.
       This is okay because the index reader would be uninterested in an
       uncommitted row anyway.  But see <xref linkend="index-unique-checks">.)
      </para>
     </listitem>
     <listitem>
      <para>
       When a heap entry is to be deleted (by <command>VACUUM</>), all its
       index entries must be removed first.
      </para>
     </listitem>
     <listitem>
      <para>
563
       An index scan must maintain a pin
564 565 566 567 568 569 570 571
       on the index page holding the item last returned by
       <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
       entries from pages that are pinned by other backends.  The need
       for this rule is explained below.
      </para>
     </listitem>
    </itemizedlist>

572
   Without the third rule, it is possible for an index reader to
573 574
   see an index entry just before it is removed by <command>VACUUM</>, and
   then to arrive at the corresponding heap entry after that was removed by
575
   <command>VACUUM</>.
576 577 578 579 580 581 582 583 584 585 586 587
   This creates no serious problems if that item
   number is still unused when the reader reaches it, since an empty
   item slot will be ignored by <function>heap_fetch()</>.  But what if a
   third backend has already re-used the item slot for something else?
   When using an MVCC-compliant snapshot, there is no problem because
   the new occupant of the slot is certain to be too new to pass the
   snapshot test.  However, with a non-MVCC-compliant snapshot (such as
   <literal>SnapshotNow</>), it would be possible to accept and return
   a row that does not in fact match the scan keys.  We could defend
   against this scenario by requiring the scan keys to be rechecked
   against the heap row in all cases, but that is too expensive.  Instead,
   we use a pin on an index page as a proxy to indicate that the reader
588
   might still be <quote>in flight</> from the index entry to the matching
589 590
   heap entry.  Making <function>ambulkdelete</> block on such a pin ensures
   that <command>VACUUM</> cannot delete the heap entry before the reader
591
   is done with it.  This solution costs little in run time, and adds blocking
592 593 594 595 596 597 598 599 600
   overhead only in the rare cases where there actually is a conflict.
  </para>

  <para>
   This solution requires that index scans be <quote>synchronous</>: we have
   to fetch each heap tuple immediately after scanning the corresponding index
   entry.  This is expensive for a number of reasons.  An
   <quote>asynchronous</> scan in which we collect many TIDs from the index,
   and only visit the heap tuples sometime later, requires much less index
601
   locking overhead and can allow a more efficient heap access pattern.
602
   Per the above analysis, we must use the synchronous approach for
603 604 605 606 607 608 609 610 611
   non-MVCC-compliant snapshots, but an asynchronous scan is workable
   for a query using an MVCC snapshot.
  </para>

  <para>
   In an <function>amgetmulti</> index scan, the access method need not
   guarantee to keep an index pin on any of the returned tuples.  (It would be
   impractical to pin more than the last one anyway.)  Therefore
   it is only safe to use such scans with MVCC-compliant snapshots.
612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663
  </para>

 </sect1>

 <sect1 id="index-unique-checks">
  <title>Index Uniqueness Checks</title>

  <para>
   <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
   using <firstterm>unique indexes</>, which are indexes that disallow
   multiple entries with identical keys.  An access method that supports this
   feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
   (At present, only b-tree supports it.)
  </para>

  <para>
   Because of MVCC, it is always necessary to allow duplicate entries to
   exist physically in an index: the entries might refer to successive
   versions of a single logical row.  The behavior we actually want to
   enforce is that no MVCC snapshot could include two rows with equal
   index keys.  This breaks down into the following cases that must be
   checked when inserting a new row into a unique index:

    <itemizedlist>
     <listitem>
      <para>
       If a conflicting valid row has been deleted by the current transaction,
       it's okay.  (In particular, since an UPDATE always deletes the old row
       version before inserting the new version, this will allow an UPDATE on
       a row without changing the key.)
      </para>
     </listitem>
     <listitem>
      <para>
       If a conflicting row has been inserted by an as-yet-uncommitted
       transaction, the would-be inserter must wait to see if that transaction
       commits.  If it rolls back then there is no conflict.  If it commits
       without deleting the conflicting row again, there is a uniqueness
       violation.  (In practice we just wait for the other transaction to
       end and then redo the visibility check in toto.)
      </para>
     </listitem>
     <listitem>
      <para>
       Similarly, if a conflicting valid row has been deleted by an
       as-yet-uncommitted transaction, the would-be inserter must wait
       for that transaction to commit or abort, and then repeat the test.
      </para>
     </listitem>
    </itemizedlist>
  </para>

664 665 666 667 668 669 670 671 672 673
  <para>
   Furthermore, immediately before raising a uniqueness violation
   according to the above rules, the access method must recheck the
   liveness of the row being inserted.  If it is committed dead then
   no error should be raised.  (This case cannot occur during the
   ordinary scenario of inserting a row that's just been created by
   the current transaction.  It can happen during
   <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.) 
  </para>

674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712
  <para>
   We require the index access method to apply these tests itself, which
   means that it must reach into the heap to check the commit status of
   any row that is shown to have a duplicate key according to the index
   contents.  This is without a doubt ugly and non-modular, but it saves
   redundant work: if we did a separate probe then the index lookup for
   a conflicting row would be essentially repeated while finding the place to
   insert the new row's index entry.  What's more, there is no obvious way
   to avoid race conditions unless the conflict check is an integral part
   of insertion of the new index entry.
  </para>

  <para>
   The main limitation of this scheme is that it has no convenient way
   to support deferred uniqueness checks.
  </para>

 </sect1>

 <sect1 id="index-cost-estimation">
  <title>Index Cost Estimation Functions</title>

  <para>
   The amcostestimate function is given a list of WHERE clauses that have
   been determined to be usable with the index.  It must return estimates
   of the cost of accessing the index and the selectivity of the WHERE
   clauses (that is, the fraction of parent-table rows that will be
   retrieved during the index scan).  For simple cases, nearly all the
   work of the cost estimator can be done by calling standard routines
   in the optimizer; the point of having an amcostestimate function is
   to allow index access methods to provide index-type-specific knowledge,
   in case it is possible to improve on the standard estimates.
  </para>

  <para>
   Each amcostestimate function must have the signature:

<programlisting>
void
713
amcostestimate (PlannerInfo *root,
714 715
                IndexOptInfo *index,
                List *indexQuals,
716
                RelOptInfo *outer_rel,
717 718 719 720 721 722 723 724 725 726 727 728 729
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>

   The first four parameters are inputs:

   <variablelist>
    <varlistentry>
     <term>root</term>
     <listitem>
      <para>
730
       The planner's information about the query being processed.
731 732 733 734 735 736 737 738
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>index</term>
     <listitem>
      <para>
739
       The index being considered.
740 741 742 743 744 745 746 747 748 749 750 751 752 753
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>indexQuals</term>
     <listitem>
      <para>
       List of index qual clauses (implicitly ANDed);
       a NIL list indicates no qualifiers are available.
       Note that the list contains expression trees, not ScanKeys.
      </para>
     </listitem>
    </varlistentry>
754 755 756 757 758 759 760 761 762 763 764 765 766 767

    <varlistentry>
     <term>outer_rel</term>
     <listitem>
      <para>
       If the index is being considered for use in a join inner indexscan,
       the planner's information about the outer side of the join.  Otherwise
       NULL.  When non-NULL, some of the qual clauses will be join clauses
       with this rel rather than being simple restriction clauses.  Also,
       the cost estimator should expect that the index scan will be repeated
       for each row of the outer rel.
      </para>
     </listitem>
    </varlistentry>
768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820
   </variablelist>
  </para>

  <para>
   The last four parameters are pass-by-reference outputs:

   <variablelist>
    <varlistentry>
     <term>*indexStartupCost</term>
     <listitem>
      <para>
       Set to cost of index start-up processing
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>*indexTotalCost</term>
     <listitem>
      <para>
       Set to total cost of index processing
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>*indexSelectivity</term>
     <listitem>
      <para>
       Set to index selectivity
      </para>
     </listitem>
    </varlistentry>

    <varlistentry>
     <term>*indexCorrelation</term>
     <listitem>
      <para>
       Set to correlation coefficient between index scan order and
       underlying table's order
      </para>
     </listitem>
    </varlistentry>
   </variablelist>
  </para>

  <para>
   Note that cost estimate functions must be written in C, not in SQL or
   any available procedural language, because they must access internal
   data structures of the planner/optimizer.
  </para>

  <para>
821
   The index access costs should be computed using the parameters used by
822
   <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
823 824 825 826 827 828
   disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
   has cost <varname>random_page_cost</>, and the cost of processing one index
   row should usually be taken as <varname>cpu_index_tuple_cost</>.  In
   addition, an appropriate multiple of <varname>cpu_operator_cost</> should
   be charged for any comparison operators invoked during index processing
   (especially evaluation of the indexQuals themselves).
829 830 831 832
  </para>

  <para>
   The access costs should include all disk and CPU costs associated with
833 834
   scanning the index itself, but <emphasis>not</> the costs of retrieving or
   processing the parent-table rows that are identified by the index.
835 836 837
  </para>

  <para>
838 839 840 841
   The <quote>start-up cost</quote> is the part of the total scan cost that
   must be expended before we can begin to fetch the first row.  For most
   indexes this can be taken as zero, but an index type with a high start-up
   cost might want to set it nonzero.
842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857
  </para>

  <para>
   The indexSelectivity should be set to the estimated fraction of the parent
   table rows that will be retrieved during the index scan.  In the case
   of a lossy index, this will typically be higher than the fraction of
   rows that actually pass the given qual conditions.
  </para>

  <para>
   The indexCorrelation should be set to the correlation (ranging between
   -1.0 and 1.0) between the index order and the table order.  This is used
   to adjust the estimate for the cost of fetching rows from the parent
   table.
  </para>

858 859 860 861 862
  <para>
   In the join case, the returned numbers should be averages expected for
   any one scan of the index.
  </para>

863 864 865 866 867 868 869 870 871 872 873 874 875 876
  <procedure>
   <title>Cost Estimation</title>
   <para>
    A typical cost estimator will proceed as follows:
   </para>

   <step>
    <para>
     Estimate and return the fraction of parent-table rows that will be visited
     based on the given qual conditions.  In the absence of any index-type-specific
     knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:

<programlisting>
*indexSelectivity = clauselist_selectivity(root, indexQuals,
877
                                           index-&gt;rel-&gt;relid, JOIN_INNER);
878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904
</programlisting>
    </para>
   </step>

   <step>
    <para>
     Estimate the number of index rows that will be visited during the
     scan.  For many index types this is the same as indexSelectivity times
     the number of rows in the index, but it might be more.  (Note that the
     index's size in pages and rows is available from the IndexOptInfo struct.)
    </para>
   </step>

   <step>
    <para>
     Estimate the number of index pages that will be retrieved during the scan.
     This might be just indexSelectivity times the index's size in pages.
    </para>
   </step>

   <step>
    <para>
     Compute the index access cost.  A generic estimator might do this:

<programlisting>
    /*
     * Our generic assumption is that the index pages will be read
905
     * sequentially, so they cost seq_page_cost each, not random_page_cost.
906 907 908
     * Also, we charge for evaluation of the indexquals at each index row.
     * All the costs are assumed to be paid incrementally during the scan.
     */
909
    cost_qual_eval(&amp;index_qual_cost, indexQuals, root);
910
    *indexStartupCost = index_qual_cost.startup;
911
    *indexTotalCost = seq_page_cost * numIndexPages +
912 913
        (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
</programlisting>
914 915 916

     However, the above does not account for amortization of index reads
     across repeated index scans in the join case.
917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934
    </para>
   </step>

   <step>
    <para>
     Estimate the index correlation.  For a simple ordered index on a single
     field, this can be retrieved from pg_statistic.  If the correlation
     is not known, the conservative estimate is zero (no correlation).
    </para>
   </step>
  </procedure>

  <para>
   Examples of cost estimator functions can be found in
   <filename>src/backend/utils/adt/selfuncs.c</filename>.
  </para>
 </sect1>
</chapter>