Commit ea2610e2 authored by Mel Kiyama, committed by David Yozie

docs - update bloat best practices (#10067)

* docs - update bloat best practices

information from dev.
--Remove copying or redistributing table data as alternatives to VACUUM FULL
--Mention that VACUUM (without FULL) maintenance is for both heap and AO tables.

Also
Reorganized information.
Clarified ACCESS EXCLUSIVE lock is reason users cannot access table during VACUUM FULL

* docs - updates based on review comments.

* docs - removed warning about stopping VACUUM FULL.
Parent b486393e
@@ -411,11 +411,14 @@ WHERE logseverity in ('FATAL', 'ERROR')
                that cannot be recovered by a regular <codeph>VACUUM</codeph>
                command. <p>Recommended frequency: weekly or
                monthly</p><p>Severity: WARNING</p></entry>
              <entry>Check the <codeph>gp_bloat_diag</codeph> view in each
                database:
                <codeblock>SELECT * FROM gp_toolkit.gp_bloat_diag;</codeblock></entry>
              <entry><codeph>VACUUM FULL</codeph> acquires an <codeph>ACCESS
                EXCLUSIVE</codeph> lock on tables. Run <codeph>VACUUM
                FULL</codeph> during a time when users and applications do
                not require access to the tables, such as during a time of low
                activity, or during a maintenance window.</entry>
            </row>
          </tbody>
        </tgroup>
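The monitoring row above recommends comparing the actual page count of each table against the page count that `ANALYZE` statistics predict. As an illustrative sketch of that comparison (plain Python, not the view's SQL; the 4x and 10x cutoffs below are assumptions for illustration, not the view's documented thresholds):

```python
# Sketch of the comparison gp_bloat_diag performs: actual on-disk pages
# vs. the page count predicted from ANALYZE statistics. The 4x/10x
# cutoffs are assumed for illustration; consult the gp_toolkit reference
# for the view's actual diagnostic thresholds.
def bloat_diag(actual_pages, expected_pages,
               moderate_ratio=4, significant_ratio=10):
    if expected_pages <= 0:
        return "no data"                 # no usable statistics
    ratio = actual_pages / expected_pages
    if ratio >= significant_ratio:
        return "significant amount of bloat suspected"
    if ratio >= moderate_ratio:
        return "moderate amount of bloat suspected"
    return "no bloat detected"

# A table occupying 1000 pages that statistics say should fit in 80:
print(bloat_diag(actual_pages=1000, expected_pages=80))
```

Keeping table statistics current (via regular `ANALYZE`) is what makes the expected-page estimate, and therefore the diagnosis, meaningful.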
...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_gft_h11_bp">
  <title>Managing Bloat in a Database</title>
  <body>
    <p>Database bloat occurs in heap tables, append-optimized tables, indexes, and system catalogs
      and affects database performance and disk usage. You can detect database bloat and remove it
      from the database.</p>
    <ul id="ul_dwh_5s3_plb">
      <li><xref href="#topic_gft_h11_bp/about_bloat" format="dita"/></li>
      <li><xref href="#topic_gft_h11_bp/detect_bloat" format="dita"/></li>
      <li><xref href="#topic_gft_h11_bp/remove_bloat" format="dita"/></li>
      <li><xref href="#topic_gft_h11_bp/bloat_ao_tables" format="dita"/></li>
      <li><xref href="#topic_gft_h11_bp/index_bloat" format="dita"/></li>
      <li><xref href="#topic_gft_h11_bp/bloat_catalog" format="dita"/></li>
    </ul>
    <section id="about_bloat">
      <title>About Bloat</title>
      <p>Database bloat is disk space that was used by a table or index and is available for reuse
        by the database but has not been reclaimed. Bloat is created when updating tables or
        indexes.</p>
      <p>Because Greenplum Database heap tables use the PostgreSQL Multiversion Concurrency Control
        (MVCC) storage implementation, a deleted or updated row is logically deleted from the
        database, but a non-visible image of the row remains in the table. These deleted rows, also
        called expired rows, are tracked in a free space map. Running <codeph>VACUUM</codeph> marks
        the expired rows as free space that is available for reuse by subsequent inserts.</p>
      <p>It is normal for tables that have frequent updates to have a small or moderate amount of
        expired rows and free space that will be reused as new data is added. But when the table is
        allowed to grow so large that active data occupies just a small fraction of the space, the
        table has become significantly bloated. Bloated tables require more disk storage and
        additional I/O that can slow down query execution.</p>
      <note type="important">
        <p>It is very important to run <codeph>VACUUM</codeph> on individual tables after large
          <codeph>UPDATE</codeph> and <codeph>DELETE</codeph> operations to avoid the necessity of
          ever running <codeph>VACUUM FULL</codeph>.</p>
      </note>
      <p>Running the <codeph>VACUUM</codeph> command regularly on tables prevents them from growing
        too large. If a table does become significantly bloated, the <codeph>VACUUM FULL</codeph>
        command must be used to compact the table data.</p>
      <p>If the free space map is not large enough to accommodate all of the expired rows, the
        <codeph>VACUUM</codeph> command is unable to reclaim space for expired rows that
        overflowed the free space map. The disk space may only be recovered by running
        <codeph>VACUUM FULL</codeph>, which locks the table, creates a new table, copies the table
        data to the new table, and then drops the old table. This is an expensive operation that can
        take an exceptionally long time to complete for a large table.</p>
      <note type="warning"><codeph>VACUUM FULL</codeph> acquires an <codeph>ACCESS
        EXCLUSIVE</codeph> lock on tables. You should not run <codeph>VACUUM FULL
        &lt;database_name></codeph>. If you run <codeph>VACUUM FULL</codeph> on tables, run it
        during a time when users and applications do not require access to the tables, such as
        during a time of low activity, or during a maintenance window.</note>
    </section>
    <section id="detect_bloat">
      <title>Detecting Bloat</title>
      <p>The statistics collected by the <codeph>ANALYZE</codeph> statement can be used to calculate
        the expected number of disk pages required to store a table. The difference between the
        expected number of pages and the actual number of pages is a measure of bloat. The
        <codeph>gp_toolkit</codeph> schema provides the <codeph><xref
          href="../ref_guide/gp_toolkit.xml#topic3" type="topic" format="dita"
          class="- topic/xref "/></codeph> view that identifies table bloat by comparing the ratio
        of expected to actual pages. To use it, make sure statistics are up to date for all of the
        tables in the database, then run the following
        SQL:<codeblock>gpadmin=# SELECT * FROM gp_toolkit.gp_bloat_diag;
 bdirelid | bdinspname | bdirelname | bdirelpages | bdiexppages | bdidiag
----------+------------+------------+-------------+-------------+---------------------------------------
@@ -89,34 +102,31 @@
        reclaim space used by rows that overflowed the free space map and reduce the size of the
        table file. However, a <codeph>VACUUM FULL</codeph> statement is an expensive operation that
        requires an <codeph>ACCESS EXCLUSIVE</codeph> lock and may take an exceptionally long and
        unpredictable amount of time to finish for large tables. You should run <codeph>VACUUM
        FULL</codeph> on tables during a time when users and applications do not require access to
        the tables being vacuumed, such as during a time of low activity, or during a maintenance
        window.</p>
    </section>
    <section id="bloat_ao_tables">
      <title>Removing Bloat from Append-Optimized Tables</title>
      <p>Append-optimized tables are handled much differently than heap tables. Although
        append-optimized tables allow update, insert, and delete operations, these operations are
        not optimized and are not recommended with append-optimized tables. If you heed this advice
        and use append-optimized tables for <i>load-once/read-many</i> workloads,
        <codeph>VACUUM</codeph> on an append-optimized table runs almost instantaneously.</p>
      <p>If you do run <codeph>UPDATE</codeph> or <codeph>DELETE</codeph> commands on an
        append-optimized table, expired rows are tracked in an auxiliary bitmap instead of the free
        space map. <codeph>VACUUM</codeph> is the only way to recover the space. Running
        <codeph>VACUUM</codeph> on an append-optimized table with expired rows compacts the table by
        rewriting the entire table without the expired rows. However, no action is performed if the
        percentage of expired rows in the table does not exceed the value of the
        <codeph>gp_appendonly_compaction_threshold</codeph> configuration parameter, which is 10
        (10%) by default. The threshold is checked on each segment, so it is possible that a
        <codeph>VACUUM</codeph> statement will compact an append-optimized table on some segments
        and not others. Compacting append-optimized tables can be disabled by setting the
        <codeph>gp_appendonly_compaction</codeph> parameter to <codeph>no</codeph>.</p>
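The per-segment decision can be sketched as follows (plain Python, illustrative only; it assumes `gp_appendonly_compaction_threshold` acts as the minimum expired-row percentage that triggers compaction on a segment, so check the parameter reference for the exact semantics):

```python
# Sketch of the per-segment compaction decision VACUUM makes for an
# append-optimized table. Assumption: a segment is compacted only when
# its expired-row percentage exceeds the threshold (default 10%).
def segments_to_compact(expired_pct_by_segment, threshold=10,
                        compaction_enabled=True):
    if not compaction_enabled:   # models gp_appendonly_compaction = no
        return []
    return [seg for seg, pct in expired_pct_by_segment.items()
            if pct > threshold]

# The threshold is checked per segment, so one VACUUM can compact the
# table on some segments and not others:
print(segments_to_compact({"seg0": 25.0, "seg1": 4.0, "seg2": 12.5}))
```

Here seg1 falls below the threshold and keeps its expired rows, which is exactly the partially-compacted outcome the paragraph above describes.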
    </section>
    <section id="index_bloat">
      <title>Removing Bloat from Indexes</title>
      <p>The <codeph>VACUUM</codeph> command only recovers space from tables. To recover the space
        from indexes, recreate them using the <codeph>REINDEX</codeph> command.</p>
@@ -126,9 +136,9 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
        to 0 (zero) for the index. To update those statistics, run <codeph>ANALYZE</codeph> on the
        table after reindexing.</p>
    </section>
    <section id="bloat_catalog">
      <title>Removing Bloat from System Catalogs</title>
      <p>Greenplum Database system catalog tables are heap tables and can become bloated over time.
        As database objects are created, altered, or dropped, expired rows are left in the system
        catalogs. Using <codeph>gpload</codeph> to load data contributes to the bloat since
        <codeph>gpload</codeph> creates and drops external tables. (Rather than use
@@ -136,11 +146,12 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
      <p>Bloat in the system catalogs increases the time required to scan the tables, for example,
        when creating explain plans. System catalogs are scanned frequently and if they become
        bloated, overall system performance is degraded.</p>
      <p>It is recommended to run <codeph>VACUUM</codeph> on system catalog tables nightly and at
        least weekly. At the same time, running <codeph>REINDEX SYSTEM</codeph> removes bloat from
        the indexes. Alternatively, you can reindex system tables using the
        <codeph>reindexdb</codeph> utility with the <codeph>-s</codeph>
        (<codeph>--system</codeph>) option. After removing catalog bloat, run
        <codeph>ANALYZE</codeph> to update catalog table statistics.</p>
      <p>These are Greenplum Database system catalog maintenance steps.<ol id="ol_un5_p1l_f2b">
          <li>Perform a <codeph>REINDEX</codeph> on the system catalog tables to rebuild the system
            catalog indexes. This removes bloat in the indexes and improves <codeph>VACUUM</codeph>
@@ -153,10 +164,10 @@ DISTRIBUTED BY (<i>&lt;original distribution columns&gt;</i>);</codeblock><p>Thi
          <li>Perform an <codeph>ANALYZE</codeph> on the system catalog tables to update the table
            statistics.</li>
        </ol></p>
      <p>If you are performing system catalog maintenance during a maintenance period and you need
        to stop a process due to time constraints, run the Greenplum Database function
        <codeph>pg_cancel_backend(&lt;PID>)</codeph> to safely stop a Greenplum Database
        process.</p>
      <p>The following script runs <codeph>REINDEX</codeph>, <codeph>VACUUM</codeph>, and
        <codeph>ANALYZE</codeph> on the system
        catalogs.<pre>#!/bin/bash
@@ -167,13 +178,11 @@ where a.relnamespace=b.oid and b.nspname='pg_catalog' and a.relkind='r'"
reindexdb -s -d $DBNAME
psql -tc "SELECT 'VACUUM' || $SYSTABLES" $DBNAME | psql -a $DBNAME
analyzedb -s pg_catalog -d $DBNAME</pre></p>
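The maintenance sequence the bash script implements (reindex, then vacuum, then analyze) can also be sketched as a command builder (plain Python, illustrative only; it assembles shell commands as strings and never connects to a database, and the single `VACUUM` statement shown stands in for the loop over all `pg_catalog` tables that the script performs):

```python
# Sketch that assembles the catalog maintenance steps from the script
# above as shell command strings, in the required order. Nothing here is
# executed; the VACUUM line shows one table where the real script
# generates a statement per pg_catalog table.
def catalog_maintenance_commands(dbname):
    return [
        # 1. Rebuild system catalog indexes first, so the following
        #    VACUUM runs faster against healthy indexes.
        f"reindexdb -s -d {dbname}",
        # 2. VACUUM the system catalog tables (one table shown).
        f'psql -c "VACUUM pg_catalog.pg_class" {dbname}',
        # 3. Refresh catalog statistics last.
        f"analyzedb -s pg_catalog -d {dbname}",
    ]

for cmd in catalog_maintenance_commands("gpadmin"):
    print(cmd)
```

Keeping the order fixed (REINDEX before VACUUM, ANALYZE last) mirrors the numbered maintenance steps above.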
      <p>If the system catalogs become significantly bloated, you must run <codeph>VACUUM
        FULL</codeph> during a scheduled downtime period. During this period, stop all catalog
        activity on the system; <codeph>VACUUM FULL</codeph> takes <codeph>ACCESS
        EXCLUSIVE</codeph> locks against the system catalog. Running <codeph>VACUUM</codeph>
        regularly on system catalog tables can prevent the need for this more costly procedure.</p>
      <p>These are steps for intensive system catalog maintenance.<ol id="ol_trp_xqs_f2b">
          <li>Stop all catalog activity on the Greenplum Database system.</li>
          <li>Perform a <codeph>REINDEX</codeph> on the system catalog tables to rebuild the system
@@ -195,24 +204,5 @@ analyzedb -s pg_catalog -d $DBNAME</pre></p>
            of bloat</codeph> in the <codeph>gp_toolkit.gp_bloat_diag</codeph> view.</li>
        </ul></note>
    </section>
  </body>
</topic>