Add some real documentation about the overall filesystem layout used by
a Postgres database. Update page.sgml to match 8.0 tuple header layout.

7f4b5a00 · Tom Lane · c7866f66
4 changed files
--- a/doc/src/sgml/filelayout.sgml
+++ b/doc/src/sgml/filelayout.sgml
+<!--
+$PostgreSQL: pgsql/doc/src/sgml/filelayout.sgml,v 1.1 2004/11/12 21:50:53 tgl Exp $
+-->
+
+<chapter id="file-layout">
+
+<title>Database File Layout</title>
+
+<abstract>
+<para>
+A description of the database physical storage layout.
+</para>
+</abstract>
+
+<para>
+This section provides an overview of the physical format used by
+<productname>PostgreSQL</productname> databases.
+</para>
+
+<para>
+All the data needed for a database cluster is stored within the cluster's data
+directory, commonly referred to as <varname>PGDATA</> (after the name of the
+environment variable that can be used to define it).  A common location for
+<varname>PGDATA</> is <filename>/var/lib/pgsql/data</>.  Multiple clusters,
+managed by different postmasters, can exist on the same machine.
+</para>
+
+<para>
+The <varname>PGDATA</> directory contains several subdirectories and control
+files, as shown in <xref linkend="pgdata-contents-table">.  In addition to
+these required items, the cluster configuration files
+<filename>postgresql.conf</filename>, <filename>pg_hba.conf</filename>, and
+<filename>pg_ident.conf</filename> are traditionally stored in
+<varname>PGDATA</> (although beginning in
+<productname>PostgreSQL</productname> 8.0 it is possible to keep them
+elsewhere). 
+</para>
+
+<table tocentry="1" id="pgdata-contents-table">
+<title>Contents of <varname>PGDATA</></title>
+<tgroup cols="2">
+<thead>
+<row>
+<entry>
+Item
+</entry>
+<entry>Description</entry>
+</row>
+</thead>
+
+<tbody>
+
+<row>
+ <entry><filename>PG_VERSION</></entry>
+ <entry>A file containing the major version number of <productname>PostgreSQL</productname></entry>
+</row>
+
+<row>
+ <entry><filename>base</></entry>
+ <entry>Subdirectory containing per-database subdirectories</entry>
+</row>
+
+<row>
+ <entry><filename>global</></entry>
+ <entry>Subdirectory containing cluster-wide tables, such as
+ <structname>pg_database</></entry>
+</row>
+
+<row>
+ <entry><filename>pg_clog</></entry>
+ <entry>Subdirectory containing transaction commit status data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_subtrans</></entry>
+ <entry>Subdirectory containing subtransaction status data</entry>
+</row>
+
+<row>
+ <entry><filename>pg_tblspc</></entry>
+ <entry>Subdirectory containing symbolic links to tablespaces</entry>
+</row>
+
+<row>
+ <entry><filename>pg_xlog</></entry>
+ <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
+</row>
+
+<row>
+ <entry><filename>postmaster.opts</></entry>
+ <entry>A file recording the command-line options the postmaster was
+last started with</entry>
+</row>
+
+<row>
+ <entry><filename>postmaster.pid</></entry>
+ <entry>A lock file recording the current postmaster PID and shared memory
+segment ID (not present after postmaster shutdown)</entry>
+</row>
+
+</tbody>
+</tgroup>
+</table>
+
+<para>
+For each database in the cluster there is a subdirectory within
+<varname>PGDATA</><filename>/base</>, named after the database's OID in
+<structname>pg_database</>.  This subdirectory is the default location
+for the database's files; in particular, its system catalogs are stored
+there.
+</para>
+
+<para>
+Each table and index is stored in a separate file, named after the table
+or index's <firstterm>filenode</> number, which can be found in
+<structname>pg_class</>.<structfield>relfilenode</>.
+</para>
+
+<caution>
+<para>
+Note that while a table's filenode often matches its OID, this is
+<emphasis>not</> necessarily the case; some operations, like
+<command>TRUNCATE</>, <command>REINDEX</>, <command>CLUSTER</> and some forms
+of <command>ALTER TABLE</>, can change the filenode while preserving the OID.
+Avoid assuming that filenode and table OID are the same.
+</para>
+</caution>
+
+<para>
+When a table or index exceeds 1Gb, it is divided into gigabyte-sized
+<firstterm>segments</>.  The first segment's file name is the same as the
+filenode; subsequent segments are named filenode.1, filenode.2, etc.
+This arrangement avoids problems on platforms that have file size limitations.
+The contents of tables and indexes are discussed further in
+<xref linkend="page">.
+</para>
+
+<para>
+A table that has columns with potentially large entries will have an
+associated <firstterm>TOAST</> table, which is used for out-of-line storage of
+field values that are too large to keep in the table rows proper.
+<structname>pg_class</>.<structfield>reltoastrelid</> links from a table to
+its TOAST table, if any.
+</para>
+
+<para>
+Tablespaces make the scenario more complicated.  Each non-default tablespace
+has a symbolic link inside the <varname>PGDATA</><filename>/pg_tblspc</>
+directory, which points to the physical tablespace directory (as specified in
+its <command>CREATE TABLESPACE</> command).  The symbolic link is named after
+the tablespace's OID.  Inside the physical tablespace directory there is
+a subdirectory for each database that has elements in the tablespace, named
+after the database's OID.  Tables within that directory follow the filenode
+naming scheme.  The <literal>pg_default</> tablespace is not accessed through
+<filename>pg_tblspc</>, but corresponds to
+<varname>PGDATA</><filename>/base</>.  Similarly, the <literal>pg_global</>
+tablespace is not accessed through <filename>pg_tblspc</>, but corresponds to
+<varname>PGDATA</><filename>/global</>.
+</para>
+
+</chapter>
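
Reader's note on filelayout.sgml: the naming rules described in the new chapter (per-database directory under base, file named after the filenode, gigabyte-sized segments with a numeric suffix, tablespace symlinks under pg_tblspc) can be sketched as a small path builder. This is only an illustration of the layout as the chapter states it for 8.0; the function names, the example OIDs, and the byte_offset parameter are hypothetical and are not part of PostgreSQL.

```python
import posixpath

# Assumption: segments are exactly one gigabyte, as the chapter describes.
SEGMENT_SIZE = 1 << 30


def segment_file_name(filenode: int, segno: int) -> str:
    """The first segment is named after the filenode; later ones get a .N suffix."""
    return str(filenode) if segno == 0 else f"{filenode}.{segno}"


def relation_path(pgdata, db_oid, filenode, byte_offset=0, tablespace_oid=None):
    """Build the path of the segment file holding byte_offset of a relation.

    Hypothetical helper: with no tablespace OID the file lives under
    PGDATA/base/<database OID>/; otherwise it is reached through the
    symbolic link PGDATA/pg_tblspc/<tablespace OID>/<database OID>/.
    """
    segno = byte_offset // SEGMENT_SIZE
    name = segment_file_name(filenode, segno)
    if tablespace_oid is None:
        return posixpath.join(pgdata, "base", str(db_oid), name)
    return posixpath.join(
        pgdata, "pg_tblspc", str(tablespace_oid), str(db_oid), name
    )


if __name__ == "__main__":
    # Made-up OIDs, for illustration only.
    print(relation_path("/var/lib/pgsql/data", 17001, 16384))
    print(relation_path("/var/lib/pgsql/data", 17001, 16384, byte_offset=3 * SEGMENT_SIZE))
```

As the caution in the chapter says, the filenode has to be looked up in pg_class.relfilenode rather than assumed to equal the table's OID.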
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
-<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.38 2004/06/07 04:04:47 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.39 2004/11/12 21:50:53 tgl Exp $ -->

 <!entity history    SYSTEM "history.sgml">
 <!entity info       SYSTEM "info.sgml">
@@ -74,6 +74,7 @@
 <!entity arch-dev   SYSTEM "arch-dev.sgml">
 <!entity bki        SYSTEM "bki.sgml">
 <!entity catalogs   SYSTEM "catalogs.sgml">
+<!entity filelayout SYSTEM "filelayout.sgml">
 <!entity geqo       SYSTEM "geqo.sgml">
 <!entity gist       SYSTEM "gist.sgml">
 <!entity indexcost  SYSTEM "indexcost.sgml">

--- a/doc/src/sgml/page.sgml
+++ b/doc/src/sgml/page.sgml
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.18 2004/07/21 22:31:18 tgl Exp $
+$PostgreSQL: pgsql/doc/src/sgml/page.sgml,v 1.19 2004/11/12 21:50:53 tgl Exp $
 -->

 <chapter id="page">

-<title>Page Files</title>
+<title>Database Page Layout</title>

 <abstract>
 <para>
@@ -14,11 +14,15 @@ A description of the database file page format.

 <para>
 This section provides an overview of the page format used by
-<productname>PostgreSQL</productname> tables and indexes.  (Index
-access methods need not use this page format.  At present, all index
-methods do use this basic format, but the data kept on index metapages
-usually doesn't follow the item layout rules exactly.)  TOAST tables
-and sequences are formatted just like a regular table.
+<productname>PostgreSQL</productname> tables and indexes.<footnote>
+  <para>
+    Actually, index access methods need not use this page format.
+    All the existing index methods do use this basic format,
+    but the data kept on index metapages usually doesn't follow
+    the item layout rules.
+  </para>
+</footnote>
+TOAST tables and sequences are formatted just like a regular table.
 </para>

 <para>
@@ -31,14 +35,22 @@ an item is a row; in an index, an item is an index entry.
 </para>

 <para>
+Every table and index is stored as an array of <firstterm>pages</> of a
+fixed size (usually 8K, although a different page size can be selected
+when compiling the server).  In a table, all the pages are logically
+equivalent, so a particular item (row) can be stored in any page.  In
+indexes, the first page is generally reserved as a <firstterm>metapage</>
+holding control information, and there may be different types of pages
+within the index, depending on the index access method.
+</para>

-<xref linkend="page-table"> shows the basic layout of a page.
+<para>
+<xref linkend="page-table"> shows the overall layout of a page.
 There are five parts to each page.
-
 </para>

 <table tocentry="1" id="page-table">
-<title>Sample Page Layout</title>
+<title>Overall Page Layout</title>
 <titleabbrev>Page Layout</titleabbrev>
 <tgroup cols="2">
 <thead>
@@ -60,12 +72,14 @@ free space pointers.</entry>

 <row>
 <entry>ItemPointerData</entry>
-<entry>Array of (offset,length) pairs pointing to the actual items.</entry>
+<entry>Array of (offset,length) pairs pointing to the actual items.
+4 bytes per item.</entry>
 </row>

 <row>
 <entry>Free space</entry>
-<entry>The unallocated space. All new rows are allocated from here, generally from the end.</entry>
+<entry>The unallocated space. New item pointers are allocated from the start
+of this area, new items from the end.</entry>
 </row>

 <row>
@@ -74,7 +88,7 @@ free space pointers.</entry>
 </row>

 <row>
-<entry>Special Space</entry>
+<entry>Special space</entry>
 <entry>Index access method specific data. Different methods store different
 data. Empty in ordinary tables.</entry>
 </row>
@@ -87,13 +101,24 @@ data. Empty in ordinary tables.</entry>

  The first 20 bytes of each page consists of a page header
  (PageHeaderData). Its format is detailed in <xref
-  linkend="pageheaderdata-table">. The first two fields deal with WAL
-  related stuff. This is followed by three 2-byte integer fields
+  linkend="pageheaderdata-table">. The first two fields track the most
+  recent WAL entry related to this page. They are followed by three 2-byte
+  integer fields
  (<structfield>pd_lower</structfield>, <structfield>pd_upper</structfield>,
-  and <structfield>pd_special</structfield>). These represent byte offsets to
-  the start
+  and <structfield>pd_special</structfield>). These contain byte offsets
+  from the page start to the start
  of unallocated space, to the end of unallocated space, and to the start of
  the special space. 
+  The last 2 bytes of the page header,
+  <structfield>pd_pagesize_version</structfield>, store both the page size
+  and a version indicator.  Beginning with
+  <productname>PostgreSQL</productname> 8.0 the version number is 2; 
+  <productname>PostgreSQL</productname> 7.3 and 7.4 used version number 1;
+  prior releases used version number 0.
+  (The basic page layout and header format has not changed in these versions,
+  but the layout of heap row headers has.)  The page size
+  is basically only present as a cross-check; there is no support for having
+  more than one page size in an installation.
  
 </para>
 
@@ -156,25 +181,12 @@ data. Empty in ordinary tables.</entry>
  <filename>src/include/storage/bufpage.h</filename>.
 </para>

- <para>  
-  Special space is a region at the end of the page that is allocated at page
-  initialization time and contains information specific to an access method. 
-  The last 2 bytes of the page header,
-  <structfield>pd_pagesize_version</structfield>, store both the page size
-  and a version indicator.  Beginning with
-  <productname>PostgreSQL</productname> 7.3 the version number is 1; prior
-  releases used version number 0.  (The basic page layout and header format
-  has not changed, but the layout of heap row headers has.)  The page size
-  is basically only present as a cross-check; there is no support for having
-  more than one page size in an installation.
- </para>
-
 <para>

  Following the page header are item identifiers
  (<type>ItemIdData</type>), each requiring four bytes.
  An item identifier contains a byte-offset to
-  the start of an item, its length in bytes, and a set of attribute bits
+  the start of an item, its length in bytes, and a few attribute bits
  which affect its interpretation.
  New item identifiers are allocated
  as needed from the beginning of the unallocated space.
@@ -203,16 +215,18 @@ data. Empty in ordinary tables.</entry>
 <para>
 
  The final section is the <quote>special section</quote> which may
-  contain anything the access method wishes to store. Ordinary tables
-  do not use this at all (indicated by setting
-  <structfield>pd_special</> to equal the pagesize).
+  contain anything the access method wishes to store.  For example,
+  b-tree indexes store links to the page's left and right siblings,
+  as well as some other data relevant to the index structure.
+  Ordinary tables do not use a special section at all (indicated by setting
+  <structfield>pd_special</> to equal the page size).
  
 </para>
 
 <para>

-  All table rows are structured the same way. There is a fixed-size
-  header (occupying 23 bytes on most machines), followed by an optional null
+  All table rows are structured in the same way. There is a fixed-size
+  header (occupying 27 bytes on most machines), followed by an optional null
  bitmap, an optional object ID field, and the user data. The header is
  detailed
  in <xref linkend="heaptupleheaderdata-table">.  The actual user data
@@ -258,7 +272,7 @@ data. Empty in ordinary tables.</entry>
   <entry>t_cmin</entry>
   <entry>CommandId</entry>
   <entry>4 bytes</entry>
-   <entry>insert CID stamp (overlays with t_xmax)</entry>
+   <entry>insert CID stamp</entry>
  </row>
  <row>
   <entry>t_xmax</entry>
@@ -276,7 +290,7 @@ data. Empty in ordinary tables.</entry>
   <entry>t_xvac</entry>
   <entry>TransactionId</entry>
   <entry>4 bytes</entry>
-   <entry>XID for VACUUM operation moving row version</entry>
+   <entry>XID for VACUUM operation moving a row version</entry>
  </row>
  <row>
   <entry>t_ctid</entry>
@@ -294,7 +308,7 @@ data. Empty in ordinary tables.</entry>
   <entry>t_infomask</entry>
   <entry>uint16</entry>
   <entry>2 bytes</entry>
-   <entry>various flags</entry>
+   <entry>various flag bits</entry>
  </row>
  <row>
   <entry>t_hoff</entry>
@@ -314,9 +328,10 @@ data. Empty in ordinary tables.</entry>
 <para>
 
  Interpreting the actual data can only be done with information obtained
-  from other tables, mostly <firstterm>pg_attribute</firstterm>. The
-  particular fields are <structfield>attlen</structfield> and
-  <structfield>attalign</structfield>. There is no way to directly get a
+  from other tables, mostly <structname>pg_attribute</structname>. The
+  key values needed to identify field locations are
+  <structfield>attlen</structfield> and <structfield>attalign</structfield>.
+  There is no way to directly get a
  particular attribute, except when there are only fixed width fields and no
  NULLs. All this trickery is wrapped up in the functions
  <firstterm>heap_getattr</firstterm>, <firstterm>fastgetattr</firstterm>
@@ -329,10 +344,11 @@ data. Empty in ordinary tables.</entry>
  whether the field is NULL according to the null bitmap. If it is, go to
  the next. Then make sure you have the right alignment.  If the field is a
  fixed width field, then all the bytes are simply placed. If it's a
-  variable length field (attlen == -1) then it's a bit more complicated,
-  using the variable length structure <type>varattrib</type>.
-  Depending on the flags, the data may be either inline, compressed or in
-  another table (TOAST).
+  variable length field (attlen = -1) then it's a bit more complicated.
+  All variable-length datatypes share the common header structure
+  <type>varattrib</type>, which includes the total length of the stored
+  value and some flag bits.  Depending on the flags, the data may be either
+  inline or in another table (TOAST); it might be compressed, too.
  
 </para>
 </chapter>
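
Reader's note on page.sgml: the revised header description (a 20-byte header whose WAL-related fields are followed by pd_lower, pd_upper, pd_special and pd_pagesize_version, with 4-byte item identifiers after it) can be sketched as a small decoder. This is a rough sketch under stated assumptions, not the server's code: it assumes a little-endian page image, treats the first 12 bytes as opaque, and assumes the page size and version share pd_pagesize_version as high bits and low byte respectively, which should be checked against src/include/storage/bufpage.h. The function name and the returned keys are hypothetical.

```python
import struct

HEADER_SIZE = 20   # per the text: the first 20 bytes of each page
ITEM_ID_SIZE = 4   # per the text: each item identifier requires four bytes


def parse_page_header(page: bytes, byteorder: str = "<") -> dict:
    """Decode the offset fields of a page header from a raw page image."""
    # Skip the 12 bytes of WAL-related fields, then read four 2-byte integers.
    lower, upper, special, size_version = struct.unpack_from(
        byteorder + "12x4H", page, 0
    )
    return {
        "pd_lower": lower,
        "pd_upper": upper,
        "pd_special": special,
        # Assumption: size in the high bits, layout version in the low byte.
        "page_size": size_version & 0xFF00,
        "layout_version": size_version & 0x00FF,
        # Free space lies between the item identifiers and the items.
        "free_space": upper - lower,
        "item_count": (lower - HEADER_SIZE) // ITEM_ID_SIZE,
        # Ordinary tables set pd_special equal to the page size.
        "has_special_space": special < len(page),
    }


if __name__ == "__main__":
    # Synthetic 8K page with two item identifiers and no special space.
    page = bytearray(8192)
    struct.pack_into("<12x4H", page, 0, 28, 8000, 8192, 8192 | 2)
    print(parse_page_header(bytes(page)))
```

On a machine with a different byte order, pass ">" as byteorder; the field widths come from the text, the byte order does not.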
--- a/doc/src/sgml/postgres.sgml
+++ b/doc/src/sgml/postgres.sgml
 <!--
-$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.64 2004/04/20 01:11:49 momjian Exp $
+$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.65 2004/11/12 21:50:53 tgl Exp $
 -->

 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
@@ -235,6 +235,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.64 2004/04/20 01:11:49 momjian
  &geqo;
  &indexcost;
  &gist;
+  &filelayout;
  &page;
  &bki;