ddl-index.xml 15.7 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_gk2_nwy_sp">
  <title id="im143866">Using Indexes in Greenplum Database</title>
  <body>
    <p>In most traditional databases, indexes can greatly improve data access times. However, in a
      distributed database such as Greenplum, indexes should be used more sparingly. Greenplum
      Database performs very fast sequential scans; indexes use a random seek pattern to locate
      records on disk. Greenplum data is distributed across the segments, so each segment scans a
      smaller portion of the overall data to get the result. With table partitioning, the total data
      to scan may be even smaller. Because business intelligence (BI) query workloads generally
      return very large data sets, using indexes is not efficient.</p>
13
    <p>First try your query workload without adding indexes. Indexes are more
14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
      likely to improve performance for OLTP workloads, where the query is returning a single record
      or a small subset of data. Indexes can also improve performance on compressed append-optimized
      tables for queries that return a targeted set of rows, as the optimizer can use an index
      access method rather than a full table scan when appropriate. For compressed data, an index
      access method means only the necessary rows are uncompressed. </p>
    <p>Greenplum Database automatically creates <codeph>PRIMARY KEY</codeph> constraints for tables
      with primary keys. To create an index on a partitioned table, create an index on the
      partitioned table that you created. The index is propagated to all the child tables created by
      Greenplum Database. Creating an index on a table that is created by Greenplum Database for use
      by a partitioned table is not supported. </p>
    <p>Note that a <codeph>UNIQUE CONSTRAINT</codeph> (such as a <codeph>PRIMARY KEY
        CONSTRAINT</codeph>) implicitly creates a <codeph>UNIQUE INDEX</codeph> that must include
      all the columns of the distribution key and any partitioning key. The <codeph>UNIQUE
        CONSTRAINT</codeph> is enforced across the entire table, including all table partitions (if
      any).</p>
    <p>Indexes add some database overhead — they use storage space and must be maintained when the
      table is updated. Ensure that the query workload uses the indexes that you create, and check
      that the indexes you add improve query performance (as compared to a sequential scan of the
      table). To determine whether indexes are being used, examine the query
        <codeph>EXPLAIN</codeph> plans. See <xref href="../query/topics/query-profiling.xml#topic39"
      />.</p>
    <p>Consider the following points when you create indexes.</p>
    <ul id="ul_cgn_nwy_sp">
      <li id="im151751"><b>Your Query Workload.</b> Indexes improve performance for workloads where
        queries return a single record or a very small data set, such as OLTP workloads.</li>
      <li id="im174599">Compressed Tables. Indexes can improve performance on compressed
        append-optimized tables for queries that return a targeted set of rows. For compressed data,
        an index access method means only the necessary rows are uncompressed.</li>
      <li id="im151752"><b>Avoid indexes on frequently updated columns.</b> Creating an index on a
        column that is frequently updated increases the number of writes required when the column is
        updated.</li>
      <li id="im151753"><b>Create selective B-tree indexes.</b> Index selectivity is a ratio of the
        number of distinct values a column has divided by the number of rows in a table. For
        example, if a table has 1000 rows and a column has 800 distinct values, the selectivity of
        the index is 0.8, which is considered good. Unique indexes always have a selectivity ratio
        of 1.0, which is the best possible. Greenplum Database allows unique indexes only on
        distribution key columns.</li>
      <li id="im151760"><b>Use Bitmap indexes for low selectivity columns. </b>The Greenplum
        Database Bitmap index type is not available in regular PostgreSQL. See <xref href="#topic93"
          type="topic" format="dita"/>.</li>
      <li id="im151764"><b>Index columns used in joins.</b> An index on a column used for frequent
        joins (such as a foreign key column) can improve join performance by enabling more join
        methods for the query optimizer to use.</li>
      <li id="im151765"><b>Index columns frequently used in predicates.</b> Columns that are
        frequently referenced in <codeph>WHERE</codeph> clauses are good candidates for
        indexes.</li>
      <li id="im151766"><b>Avoid overlapping indexes.</b> Indexes that have the same leading column
        are redundant.</li>
      <li id="im151767"><b>Drop indexes for bulk loads.</b> For mass loads of data into a table,
        consider dropping the indexes and re-creating them after the load completes. This is often
        faster than updating the indexes. </li>
      <li id="im151768"><b>Consider a clustered index.</b> Clustering an index means that the
        records are physically ordered on disk according to the index. If the records you need are
        distributed randomly on disk, the database has to seek across the disk to fetch the records
        requested. If the records are stored close together, the fetching operation is more
        efficient. For example, a clustered index on a date column where the data is ordered
        sequentially by date. A query against a specific date range results in an ordered fetch from
        the disk, which leverages fast sequential access.</li>
    </ul>
    <section id="im151772">
      <title>To cluster an index in Greenplum Database</title>
      <p>Using the <codeph>CLUSTER</codeph> command to physically reorder a table based on an index
        can take a long time with very large tables. To achieve the same results much faster, you
        can manually reorder the data on disk by creating an intermediate table and loading the data
        in the desired order. For example:</p>
      <p>
        <codeblock>CREATE TABLE new_table (LIKE old_table) 
       AS SELECT * FROM old_table ORDER BY myixcolumn;
DROP old_table;
ALTER TABLE new_table RENAME TO old_table;
CREATE INDEX myixcolumn_ix ON old_table;
VACUUM ANALYZE old_table;
</codeblock>
      </p>
    </section>
  </body>
  <topic id="topic92" xml:lang="en">
    <title>Index Types</title>
    <body>
H
Heikki Linnakangas 已提交
93
      <p>Greenplum Database supports the Postgres index types B-tree, GiST, and GIN. Hash indexes
94 95 96
        are not supported. Each index type uses a different algorithm that is best suited to
        different types of queries. B-tree indexes fit the most common situations and are the
        default index type. See <xref
97
          href="https://www.postgresql.org/docs/8.3/static/indexes-types.html" scope="external"
98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
          format="html"><ph>Index Types</ph></xref> in the PostgreSQL documentation for a
        description of these types.</p>
      <note type="note">Greenplum Database allows unique indexes only if the columns of the index
        key are the same as (or a superset of) the Greenplum distribution key. Unique indexes are
        not supported on append-optimized tables. On partitioned tables, a unique index cannot be
        enforced across all child table partitions of a partitioned table. A unique index is
        supported only within a partition.</note>
    </body>
    <topic id="topic93" xml:lang="en">
      <title id="im143283">About Bitmap Indexes</title>
      <body>
        <p>Greenplum Database provides the Bitmap index type. Bitmap indexes are best suited to data
          warehousing applications and decision support systems with large amounts of data, many ad
          hoc queries, and few data modification (DML) transactions.</p>
        <p>An index provides pointers to the rows in a table that contain a given key value. A
          regular index stores a list of tuple IDs for each key corresponding to the rows with that
          key value. Bitmap indexes store a bitmap for each key value. Regular indexes can be
          several times larger than the data in the table, but bitmap indexes provide the same
          functionality as a regular index and use a fraction of the size of the indexed data.</p>
        <p>Each bit in the bitmap corresponds to a possible tuple ID. If the bit is set, the row
          with the corresponding tuple ID contains the key value. A mapping function converts the
          bit position to a tuple ID. Bitmaps are compressed for storage. If the number of distinct
          key values is small, bitmap indexes are much smaller, compress better, and save
          considerable space compared with a regular index. The size of a bitmap index is
          proportional to the number of rows in the table times the number of distinct values in the
          indexed column.</p>
        <p>Bitmap indexes are most effective for queries that contain multiple conditions in the
            <codeph>WHERE</codeph> clause. Rows that satisfy some, but not all, conditions are
          filtered out before the table is accessed. This improves response time, often
          dramatically.</p>
      </body>
      <topic id="topic94" xml:lang="en">
        <title>When to Use Bitmap Indexes</title>
        <body>
          <p>Bitmap indexes are best suited to data warehousing applications where users query the
            data rather than update it. Bitmap indexes perform best for columns that have between
            100 and 100,000 distinct values and when the indexed column is often queried in
            conjunction with other indexed columns. Columns with fewer than 100 distinct values,
            such as a gender column with two distinct values (male and female), usually do not
            benefit much from any type of index. On a column with more than 100,000 distinct values,
            the performance and space efficiency of a bitmap index decline. </p>
          <p>Bitmap indexes can improve query performance for ad hoc queries. <codeph>AND</codeph>
            and <codeph>OR</codeph> conditions in the <codeph>WHERE</codeph> clause of a query can
            be resolved quickly by performing the corresponding Boolean operations directly on the
            bitmaps before converting the resulting bitmap to tuple ids. If the resulting number of
            rows is small, the query can be answered quickly without resorting to a full table
            scan.</p>
        </body>
      </topic>
      <topic id="topic95" xml:lang="en">
        <title>When Not to Use Bitmap Indexes</title>
        <body>
          <p>Do not use bitmap indexes for unique columns or columns with high cardinality data,
            such as customer names or phone numbers. The performance gains and disk space advantages
            of bitmap indexes start to diminish on columns with 100,000 or more unique values,
            regardless of the number of rows in the table.</p>
          <p>Bitmap indexes are not suitable for OLTP applications with large numbers of concurrent
            transactions modifying the data.</p>
          <p>Use bitmap indexes sparingly. Test and compare query performance with and without an
            index. Add an index only if query performance improves with indexed columns.</p>
        </body>
      </topic>
    </topic>
  </topic>
  <topic id="topic96" xml:lang="en">
    <title>Creating an Index</title>
    <body>
      <p>The <codeph>CREATE INDEX</codeph> command defines an index on a table. A B-tree index is
        the default index type. For example, to create a B-tree index on the column <i>gender</i> in
        the table <i>employee</i>:</p>
      <p>
        <codeblock>CREATE INDEX gender_idx ON employee (gender);
</codeblock>
      </p>
      <p>To create a bitmap index on the column <i>title</i> in the table <i>films</i>:</p>
      <p>
        <codeblock>CREATE INDEX title_bmp_idx ON films USING bitmap (title);
</codeblock>
      </p>
    </body>
  </topic>
  <topic id="topic97" xml:lang="en">
    <title>Examining Index Usage</title>
    <body>
      <p>Greenplum Database indexes do not require maintenance and tuning. You can check which
        indexes are used by the real-life query workload. Use the <codeph>EXPLAIN</codeph> command
        to examine index usage for a query.</p>
      <p>The query plan shows the steps or <i>plan nodes</i> that the database will take to answer a
        query and time estimates for each plan node. To examine the use of indexes, look for the
        following query plan node types in your <codeph>EXPLAIN</codeph> output:</p>
      <ul id="ul_f3n_nwy_sp">
        <li id="im143464"><b>Index Scan</b> - A scan of an index.</li>
        <li id="im143465"><b>Bitmap Heap Scan</b> - Retrieves all </li>
        <li id="im207555">from the bitmap generated by BitmapAnd, BitmapOr, or BitmapIndexScan and
          accesses the heap to retrieve the relevant rows.</li>
        <li id="im143466"><b>Bitmap Index Scan</b> - Compute a bitmap by OR-ing all bitmaps that
          satisfy the query predicates from the underlying index.</li>
        <li id="im143467"><b>BitmapAnd</b> or <b>BitmapOr</b> - Takes the bitmaps generated from
          multiple BitmapIndexScan nodes, ANDs or ORs them together, and generates a new bitmap as
          its output.</li>
      </ul>
      <p>You have to experiment to determine the indexes to create. Consider the following
        points.</p>
      <ul id="ul_njn_nwy_sp">
        <li id="im143930">Run <codeph>ANALYZE</codeph> after you create or update an index.
            <codeph>ANALYZE</codeph> collects table statistics. The query optimizer uses table
          statistics to estimate the number of rows returned by a query and to assign realistic
          costs to each possible query plan.</li>
        <li id="im143932">Use real data for experimentation. Using test data for setting up indexes
          tells you what indexes you need for the test data, but that is all. </li>
        <li id="im143934">Do not use very small test data sets as the results can be unrealistic or
          skewed. </li>
        <li id="im205776">Be careful when developing test data. Values that are similar, completely
          random, or inserted in sorted order will skew the statistics away from the distribution
          that real data would have. </li>
        <li id="im144062">You can force the use of indexes for testing purposes by using run-time
          parameters to turn off specific plan types. For example, turn off sequential scans
            (<codeph>enable_seqscan</codeph>) and nested-loop joins
            (<codeph>enable_nestloop</codeph>), the most basic plans, to force the system to use a
          different plan. Time your query with and without indexes and use the <codeph>EXPLAIN
            ANALYZE</codeph> command to compare the results.</li>
      </ul>
    </body>
  </topic>
  <topic id="topic98" xml:lang="en">
    <title>Managing Indexes</title>
    <body>
      <p>Use the <codeph>REINDEX</codeph> command to rebuild a poorly-performing index.
          <codeph>REINDEX</codeph> rebuilds an index using the data stored in the index's table,
        replacing the old copy of the index. </p>
      <section id="im143476">
        <title>To rebuild all indexes on a table</title>
        <codeblock>REINDEX my_table;
</codeblock>
        <title>To rebuild a particular index</title>
        <codeblock>REINDEX my_index;
</codeblock>
      </section>
    </body>
  </topic>
  <topic id="topic99" xml:lang="en">
    <title>Dropping an Index</title>
    <body>
      <p>The <codeph>DROP INDEX</codeph> command removes an index. For example:</p>
      <p>
        <codeblock>DROP INDEX title_idx;
</codeblock>
      </p>
      <p>When loading data, it can be faster to drop all indexes, load, then recreate the indexes.
      </p>
    </body>
  </topic>
</topic>