Commit ea2610e2 authored by Mel Kiyama, committed by David Yozie

docs - update bloat best practices (#10067)

* docs - update bloat best practices

information from dev.
--Remove copying or redistributing table data as alternatives to VACUUM FULL
--Mention that VACUUM (without FULL) maintenance is for both heap and AO tables.

Also
Reorganized information.
Clarified ACCESS EXCLUSIVE lock is reason users cannot access table during VACUUM FULL

* docs - updates based on review comments.

* docs - removed warning about stopping VACUUM FULL.
Parent b486393e
@@ -411,11 +411,14 @@ WHERE logseverity in ('FATAL', 'ERROR')
that cannot be recovered by a regular <codeph>VACUUM</codeph>
command. <p>Recommended frequency: weekly or
monthly</p><p>Severity: WARNING</p></entry>
<entry>Check the <codeph>gp_bloat_diag</codeph> view in each database:
<entry>Check the <codeph>gp_bloat_diag</codeph> view in each
database:
<codeblock>SELECT * FROM gp_toolkit.gp_bloat_diag;</codeblock></entry>
<entry>Execute a <codeph>VACUUM FULL</codeph> statement at a time
when users are not accessing the table to remove bloat and
compact the data.</entry>
<entry><codeph>VACUUM FULL</codeph> acquires an <codeph>ACCESS
EXCLUSIVE</codeph> lock on tables. Run <codeph>VACUUM
FULL</codeph> during a time when users and applications do
not require access to the tables, such as during a time of low
activity, or during a maintenance window.</entry>
</row>
</tbody>
</tgroup>
......
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_gft_h11_bp">
<title>Managing Bloat in the Database</title>
<title>Managing Bloat in a Database</title>
<body>
<p>Greenplum Database heap tables use the PostgreSQL Multiversion Concurrency Control (MVCC)
storage implementation. A deleted or updated row is logically deleted from the database, but a
non-visible image of the row remains in the table. These deleted rows, also called expired
rows, are tracked in a free space map. Running <codeph>VACUUM</codeph> marks the expired rows
as free space that is available for reuse by subsequent inserts. </p>
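As an illustrative sketch (the table name is hypothetical), a plain <codeph>VACUUM</codeph> after a large <codeph>DELETE</codeph> makes the expired rows reusable:

```sql
-- Hypothetical table name; adjust to your schema.
DELETE FROM sales WHERE sale_date < '2019-01-01';

-- Plain VACUUM (without FULL) records the expired rows in the free
-- space map so subsequent inserts can reuse the space. It does not
-- shrink the table file and does not take an ACCESS EXCLUSIVE lock,
-- so concurrent reads and writes can continue.
VACUUM sales;
```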
<p>If the free space map is not large enough to accommodate all of the expired rows, the
<codeph>VACUUM</codeph> command is unable to reclaim space for expired rows that overflowed
the free space map. The disk space may only be recovered by running <codeph>VACUUM
FULL</codeph>, which locks the table, copies rows one-by-one to the beginning of the file,
and truncates the file. This is an expensive operation that can take an exceptional amount of
time to complete with a large table. It should be used only on smaller tables. If you attempt
to kill a <codeph>VACUUM FULL</codeph> operation, the system can be disrupted. </p>
<note type="important">
<p>It is very important to run <codeph>VACUUM</codeph> after large <codeph>UPDATE</codeph> and
<codeph>DELETE</codeph> operations to avoid the necessity of ever running <codeph>VACUUM
FULL</codeph>.</p>
</note>
<p>If the free space map overflows and it is necessary to recover the space it is recommended to
use the <codeph>CREATE TABLE...AS SELECT</codeph> command to copy the table to a new table,
which will create a new compact table. Then drop the original table and rename the copied
table. </p>
<p>It is normal for tables that have frequent updates to have a small or moderate amount of
expired rows and free space that will be reused as new data is added. But when the table is
allowed to grow so large that active data occupies just a small fraction of the space, the
table has become significantly "bloated." Bloated tables require more disk storage and
additional I/O that can slow down query execution.</p>
<p>Bloat affects heap tables, system catalogs, and indexes. </p>
<p>Running the <codeph>VACUUM</codeph> statement on tables regularly prevents them from growing
too large. If the table does become significantly bloated, the <codeph>VACUUM FULL</codeph>
statement (or an alternative procedure) must be used to compact the file. If a large table
becomes significantly bloated, it is better to use one of the alternative methods described in
<xref href="#topic_gft_h11_bp/remove_bloat" format="dita" type="section"/> to remove the
bloat.</p>
<note type="caution"><b>Never</b> run <codeph>VACUUM FULL &lt;database_name></codeph> and do not
run <codeph>VACUUM FULL</codeph> on large tables in a Greenplum Database.</note>
<section>
<p>Database bloat occurs in heap tables, append-optimized tables, indexes, and system catalogs
and affects database performance and disk usage. You can detect database bloat and remove it
from the database.</p>
<ul id="ul_dwh_5s3_plb">
<li><xref href="#topic_gft_h11_bp/about_bloat" format="dita"/></li>
<li><xref href="#topic_gft_h11_bp/detect_bloat" format="dita"/></li>
<li><xref href="#topic_gft_h11_bp/remove_bloat" format="dita"/></li>
<li><xref href="#topic_gft_h11_bp/bloat_ao_tables" format="dita"/></li>
<li><xref href="#topic_gft_h11_bp/index_bloat" format="dita"/></li>
<li><xref href="#topic_gft_h11_bp/bloat_catalog" format="dita"/></li>
</ul>
<section id="about_bloat">
<title>About Bloat</title>
<p>Database bloat is disk space that was used by a table or index and is available for reuse
by the database but has not been reclaimed. Bloat is created when updating tables or
indexes.</p>
<p>Because Greenplum Database heap tables use the PostgreSQL Multiversion Concurrency Control
(MVCC) storage implementation, a deleted or updated row is logically deleted from the
database, but a non-visible image of the row remains in the table. These deleted rows, also
called expired rows, are tracked in a free space map. Running <codeph>VACUUM</codeph> marks
the expired rows as free space that is available for reuse by subsequent inserts. </p>
<p>It is normal for tables that have frequent updates to have a small or moderate amount of
expired rows and free space that will be reused as new data is added. But when the table is
allowed to grow so large that active data occupies just a small fraction of the space, the
table has become significantly bloated. Bloated tables require more disk storage and
additional I/O that can slow down query execution.</p>
<note type="important">
<p>It is very important to run <codeph>VACUUM</codeph> on individual tables after large
<codeph>UPDATE</codeph> and <codeph>DELETE</codeph> operations to avoid the necessity of
ever running <codeph>VACUUM FULL</codeph>.</p>
</note>
<p>Running the <codeph>VACUUM</codeph> command regularly on tables prevents them from growing
too large. If the table does become significantly bloated, the <codeph>VACUUM FULL</codeph>
command must be used to compact the table data. </p>
<p>If the free space map is not large enough to accommodate all of the expired rows, the
<codeph>VACUUM</codeph> command is unable to reclaim space for expired rows that
overflowed the free space map. The disk space may only be recovered by running
<codeph>VACUUM FULL</codeph>, which locks the table, creates a new table, copies the table
data to the new table, and then drops the old table. This is an expensive operation that can
take an exceptional amount of time to complete with a large table. </p>
<note type="warning"><codeph>VACUUM FULL</codeph> acquires an <codeph>ACCESS
EXCLUSIVE</codeph> lock on tables. You should not run <codeph>VACUUM FULL
&lt;database_name></codeph>. If you run <codeph>VACUUM FULL</codeph> on tables, run it
during a time when users and applications do not require access to the tables, such as
during a time of low activity, or during a maintenance window.</note>
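For example, a significantly bloated table could be compacted during a maintenance window with a sketch like the following (the table name is hypothetical):

```sql
-- Run during a maintenance window: VACUUM FULL takes an ACCESS
-- EXCLUSIVE lock, blocking all access to the table until it finishes.
VACUUM FULL mytable;

-- Refresh statistics for the planner after the rewrite.
ANALYZE mytable;
```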
</section>
<section id="detect_bloat">
<title>Detecting Bloat</title>
<p>The statistics collected by the <codeph>ANALYZE</codeph> statement can be used to calculate
the expected number of disk pages required to store a table. The difference between the
expected number of pages and the actual number of pages is a measure of bloat. The
<codeph>gp_toolkit</codeph> schema provides a <codeph>gp_bloat_diag</codeph> view that
identifies table bloat by comparing the ratio of expected to actual pages. To use it, make
sure statistics are up to date for all of the tables in the database, then run the following
<codeph>gp_toolkit</codeph> schema provides the <codeph><xref
href="../ref_guide/gp_toolkit.xml#topic3" type="topic" format="dita"
class="- topic/xref "/></codeph> view that identifies table bloat by comparing the ratio
of expected to actual pages. To use it, make sure statistics are up to date for all of the
tables in the database, then run the following
SQL:<codeblock>gpadmin=# SELECT * FROM gp_toolkit.gp_bloat_diag;
bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages | bdidiag
----------+------------+------------+-------------+-------------+---------------------------------------
@@ -89,34 +102,31 @@
reclaim space used by rows that overflowed the free space map and reduce the size of the
table file. However, a <codeph>VACUUM FULL</codeph> statement is an expensive operation that
requires an <codeph>ACCESS EXCLUSIVE</codeph> lock and may take an exceptionally long and
unpredictable amount of time to finish. Rather than run <codeph>VACUUM FULL</codeph> on a
large table, an alternative method is required to remove bloat from a large file. Note that
every method for removing bloat from large tables is resource intensive and should be done
only under extreme circumstances. </p>
<p>The first method to remove bloat from a large table is to create a copy of the table
excluding the expired rows, drop the original table, and rename the copy. This method uses
the <codeph>CREATE TABLE &lt;table_name> AS SELECT</codeph> statement to create the new
table, for
example:<codeblock>gpadmin=# CREATE TABLE mytable_tmp AS SELECT * FROM mytable;
gpadmin=# DROP TABLE mytable;
gpadmin=# ALTER TABLE mytable_tmp RENAME TO mytable;</codeblock></p>
<p>A second way to remove bloat from a table is to redistribute the table, which rebuilds the
table without the expired rows. Follow these steps:<ol id="ol_bqc_xhq_bp">
<li>Make a note of the table's distribution columns.</li>
<li>Change the table's distribution policy to
random:<codeblock>ALTER TABLE mytable SET WITH (REORGANIZE=false)
DISTRIBUTED randomly;</codeblock><p>This
changes the distribution policy for the table, but does not move any data. The command
should complete instantly. </p></li>
<li>Change the distribution policy back to its initial
setting:<codeblock>ALTER TABLE mytable SET WITH (REORGANIZE=true)
DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>This
step redistributes the data. Since the table was previously distributed with the same
distribution key, the rows are simply rewritten on the same segment, excluding expired
rows. </p></li>
</ol></p>
unpredictable amount of time to finish for large tables. You should run <codeph>VACUUM
FULL</codeph> on tables during a time when users and applications do not require access to
the tables being vacuumed, such as during a time of low activity, or during a maintenance
window.</p>
</section>
<section id="bloat_ao_tables">
<title>Removing Bloat from Append-Optimized Tables</title>
<p>Append-optimized tables are handled much differently than heap tables. Although
append-optimized tables allow update, insert, and delete operations, these operations are
not optimized and are not recommended with append-optimized tables. If you heed this advice
and use append-optimized tables for <i>load-once/read-many</i> workloads, <codeph>VACUUM</codeph>
on an append-optimized table runs almost instantaneously. </p>
<p>If you do run <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> commands on an
append-optimized table, expired rows are tracked in an auxiliary bitmap instead of the free
space map. <codeph>VACUUM</codeph> is the only way to recover the space. Running
<codeph>VACUUM</codeph> on an append-optimized table with expired rows compacts a table by
rewriting the entire table without the expired rows. However, no action is performed if the
percentage of expired rows in the table exceeds the value of the
<codeph>gp_appendonly_compaction_threshold</codeph> configuration parameter, which is 10
(10%) by default. The threshold is checked on each segment, so it is possible that a
<codeph>VACUUM</codeph> statement will compact an append-only table on some segments and
not others. Compacting append-only tables can be disabled by setting the
<codeph>gp_appendonly_compaction</codeph> parameter to <codeph>no</codeph>.</p>
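As a sketch (the table name is hypothetical), you can lower the compaction threshold for a session before vacuuming an append-optimized table so that smaller amounts of bloat are compacted:

```sql
-- Compact an append-optimized table even if only 1% of its rows have
-- expired (the default threshold is 10%). Session-level setting.
SET gp_appendonly_compaction_threshold = 1;

-- Rewrites the table without the expired rows on each segment that
-- meets the threshold.
VACUUM my_ao_table;
```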
</section>
<section>
<section id="index_bloat">
<title>Removing Bloat from Indexes</title>
<p>The <codeph>VACUUM</codeph> command only recovers space from tables. To recover the space
from indexes, recreate them using the <codeph>REINDEX</codeph> command.</p>
@@ -126,9 +136,9 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
to 0 (zero) for the index. To update those statistics, run <codeph>ANALYZE</codeph> on the
table after reindexing. </p>
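For example (index and table names are hypothetical):

```sql
-- Rebuild one bloated index, or all indexes on a table.
REINDEX INDEX my_index;
REINDEX TABLE mytable;

-- REINDEX resets index statistics to zero; update them afterwards.
ANALYZE mytable;
```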
</section>
<section>
<section id="bloat_catalog">
<title>Removing Bloat from System Catalogs</title>
<p>Greenplum Database system catalogs are also heap tables and can become bloated over time.
<p>Greenplum Database system catalog tables are heap tables and can become bloated over time.
As database objects are created, altered, or dropped, expired rows are left in the system
catalogs. Using <codeph>gpload</codeph> to load data contributes to the bloat since
<codeph>gpload</codeph> creates and drops external tables. (Rather than use
@@ -136,11 +146,12 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
<p>Bloat in the system catalogs increases the time required to scan the tables, for example,
when creating explain plans. System catalogs are scanned frequently and if they become
bloated, overall system performance is degraded. </p>
<p>It is recommended to run <codeph>VACUUM</codeph> on the system catalog nightly and at least
weekly. At the same time, running <codeph>REINDEX SYSTEM</codeph> removes bloat from the
indexes. Alternatively, you can reindex system tables using the <codeph>reindexdb</codeph>
utility with the <codeph>-s</codeph> (<codeph>--system</codeph>) option. After removing
catalog bloat, run <codeph>ANALYZE</codeph> to update catalog table statistics. </p>
<p>It is recommended to run <codeph>VACUUM</codeph> on system catalog tables nightly and at
least weekly. At the same time, running <codeph>REINDEX SYSTEM</codeph> removes bloat from
the indexes. Alternatively, you can reindex system tables using the
<codeph>reindexdb</codeph> utility with the <codeph>-s</codeph>
(<codeph>--system</codeph>) option. After removing catalog bloat, run
<codeph>ANALYZE</codeph> to update catalog table statistics. </p>
<p>The following are the regular Greenplum Database system catalog maintenance steps.<ol id="ol_un5_p1l_f2b">
<li>Perform a <codeph>REINDEX</codeph> on the system catalog tables to rebuild the system
catalog indexes. This removes bloat in the indexes and improves <codeph>VACUUM</codeph>
@@ -153,10 +164,10 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
<li>Perform an <codeph>ANALYZE</codeph> on the system catalog tables to update the table
statistics. </li>
</ol></p>
<p>If you are performing catalog maintenance during a maintenance period and you need to stop
a process due to time constraints, run the Greenplum Database function
<codeph>pg_cancel_backend(&lt;<varname>PID</varname>>)</codeph> to safely stop a
Greenplum Database process.</p>
<p>If you are performing system catalog maintenance during a maintenance period and you need
to stop a process due to time constraints, run the Greenplum Database function
<codeph>pg_cancel_backend(&lt;PID>)</codeph> to safely stop a Greenplum Database
process.</p>
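For example, one way to find and safely cancel a long-running maintenance process is a sketch like the following (the exact <codeph>pg_stat_activity</codeph> column names depend on the PostgreSQL version underlying your Greenplum release):

```sql
-- Identify the backend running the maintenance statement.
-- (On older Greenplum releases the column is named procpid, and the
-- statement column is current_query.)
SELECT pid, query FROM pg_stat_activity;

-- Cancel the current statement for that backend without terminating
-- the session; substitute the PID found above.
SELECT pg_cancel_backend(12345);
```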
<p>The following script runs <codeph>REINDEX</codeph>, <codeph>VACUUM</codeph>, and
<codeph>ANALYZE</codeph> on the system
catalogs.<pre>#!/bin/bash
@@ -167,13 +178,11 @@ where a.relnamespace=b.oid and b.nspname='pg_catalog' and a.relkind='r'"
reindexdb -s -d $DBNAME
psql -tc "SELECT 'VACUUM' || $SYSTABLES" $DBNAME | psql -a $DBNAME
analyzedb -s pg_catalog -d $DBNAME</pre></p>
<p>If the system catalogs become significantly bloated, you must perform an intensive system
catalog maintenance procedure. The <codeph>CREATE TABLE AS SELECT</codeph> and
redistribution key methods for removing bloat cannot be used with system catalogs. You must
instead run <codeph>VACUUM FULL</codeph> during a scheduled downtime period. During this
period, stop all catalog activity on the system; <codeph>VACUUM FULL</codeph> takes
exclusive locks against the system catalog. Running <codeph>VACUUM</codeph> regularly can
prevent the need for this more costly procedure.</p>
<p>If the system catalogs become significantly bloated, you must run <codeph>VACUUM
FULL</codeph> during a scheduled downtime period. During this period, stop all catalog
activity on the system; <codeph>VACUUM FULL</codeph> takes <codeph>ACCESS EXCLUSIVE</codeph>
locks against the system catalog. Running <codeph>VACUUM</codeph> regularly on system
catalog tables can prevent the need for this more costly procedure.</p>
<p>These are steps for intensive system catalog maintenance.<ol id="ol_trp_xqs_f2b">
<li>Stop all catalog activity on the Greenplum Database system.</li>
<li>Perform a <codeph>REINDEX</codeph> on the system catalog tables to rebuild the system
@@ -195,24 +204,5 @@ analyzedb -s pg_catalog -d $DBNAME</pre></p>
of bloat</codeph> in the <codeph>gp_toolkit.gp_bloat_diag</codeph> view.</li>
</ul></note>
</section>
<section>
<title>Removing Bloat from Append-Optimized Tables</title>
<p>Append-optimized tables are handled much differently than heap tables. Although
append-optimized tables allow updates, inserts, and deletes, they are not optimized for
these operations, and running these operations on append-optimized tables is not recommended. If you
heed this advice and use append-optimized for <i>load-once/read-many</i> workloads,
<codeph>VACUUM</codeph> on an append-optimized table runs almost instantaneously. </p>
<p>If you do run <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> commands on an
append-optimized table, expired rows are tracked in an auxiliary bitmap instead of the free
space map. <codeph>VACUUM</codeph> is the only way to recover the space. Running
<codeph>VACUUM</codeph> on an append-optimized table with expired rows compacts a table by
rewriting the entire table without the expired rows. However, no action is performed if the
percentage of expired rows in the table exceeds the value of the
<codeph>gp_appendonly_compaction_threshold</codeph> configuration parameter, which is 10
(10%) by default. The threshold is checked on each segment, so it is possible that a
<codeph>VACUUM</codeph> statement will compact an append-only table on some segments and
not others. Compacting append-only tables can be disabled by setting the
<codeph>gp_appendonly_compaction</codeph> parameter to <codeph>no</codeph>.</p>
</section>
</body>
</topic>