perf_issues.xml 13.0 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic
  PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<topic id="topic1" xml:lang="en">
  <title id="jb177957">Common Causes of Performance Issues</title>
  <shortdesc>This section explains the troubleshooting processes for common performance issues and
    potential solutions to these issues. </shortdesc>
  <body/>
  <topic id="topic2" xml:lang="en">
    <title id="jb150555">Identifying Hardware and Segment Failures</title>
    <body>
      <p>The performance of Greenplum Database depends on the hardware and IT infrastructure on
        which it runs. Greenplum Database is comprised of several servers (hosts) acting together as
        one cohesive system (array); as a first step in diagnosing performance problems, ensure that
        all Greenplum Database segments are online. Greenplum Database's performance will be as fast
        as the slowest host in the array. Problems with CPU utilization, memory management, I/O
        processing, or network load affect performance. Common hardware-related issues are:</p>
      <ul>
        <li id="jb155623"><b>Disk Failure</b> – Although a single disk failure should not
          dramatically affect database performance if you are using RAID, disk resynchronization
          does consume resources on the host with failed disks. The <codeph>gpcheckperf</codeph>
          utility can help identify segment hosts that have disk I/O issues.</li>
        <li id="jb155625"><b>Host Failure</b> – When a host is offline, the segments on that host
          are nonoperational. This means other hosts in the array must perform twice their usual
          workload because they are running the primary segments and multiple mirrors. If mirrors
          are not enabled, service is interrupted. Service is temporarily interrupted to recover
          failed segments. The <codeph>gpstate</codeph> utility helps identify failed segments.</li>
        <li id="jb155741"><b>Network Failure</b> – Failure of a network interface card, a switch, or
          DNS server can bring down segments. If host names or IP addresses cannot be resolved
          within your Greenplum array, these manifest themselves as interconnect errors in Greenplum
          Database. The <codeph>gpcheckperf</codeph> utility helps identify segment hosts that have
          network issues.</li>
        <li id="jb155762"><b>Disk Capacity</b> – Disk capacity on your segment hosts should never
          exceed 70 percent full. Greenplum Database needs some free space for runtime processing.
            <ph>To reclaim disk space that deleted rows occupy, run <codeph>VACUUM</codeph> after
            loads or updates. </ph>The <i>gp_toolkit</i> administrative schema has many views for
          checking the size of distributed database objects. <p>See the <i>Greenplum Database
              Reference Guide</i> for information about checking database object sizes and disk
            space.</p></li>
      </ul>
    </body>
  </topic>
  <topic id="topic3" xml:lang="en">
    <title id="jb155843">Managing Workload</title>
    <body>
      <p>A database system has a limited CPU capacity, memory, and disk I/O resources. When multiple
        workloads compete for access to these resources, database performance suffers. Workload
        management maximizes system throughput while meeting varied business requirements. With
        role-based resource queues, Greenplum Database workload management limits active queries and
        conserves system resources.</p>
      <p>A resource queue limits the size and/or total number of queries that users or roles can
        execute in the particular queue. By assigning database roles to the appropriate
        resource queue, administrators can control concurrent user queries and prevent system
        overload. See <xref href="./workload_mgmt.xml#topic1" type="topic" format="dita"/> for more
        information about setting up resource queues.</p>
      <p>Greenplum Database administrators should run maintenance workloads such as data loads and
          <codeph>VACUUM ANALYZE</codeph> operations after business hours. Do not compete with
        database users for system resources; perform administrative tasks at low-usage times.</p>
    </body>
  </topic>
  <topic id="topic4" xml:lang="en">
    <title id="jb155291">Avoiding Contention</title>
    <body>
      <p>Contention arises when multiple users or workloads try to use the system in a conflicting
        way; for example, contention occurs when two transactions try to update a table
        simultaneously. A transaction that seeks a table-level or row-level lock will wait
        indefinitely for conflicting locks to be released. Applications should not hold transactions
        open for long periods of time, for example, while waiting for user input.</p>
    </body>
  </topic>
  <topic id="topic5" xml:lang="en">
    <title id="jb155292">Maintaining Database Statistics</title>
    <body>
      <p>Greenplum Database uses a cost-based query optimizer that relies on database statistics.
        Accurate statistics allow the query optimizer to better estimate the number of rows
        retrieved by a query to choose the most efficient query plan. Without database statistics,
        the query optimizer cannot estimate how many records will be returned. The optimizer does
        not assume it has sufficient memory to perform certain operations such as aggregations, so
        it takes the most conservative action and does these operations by reading and writing from
        disk. This is significantly slower than doing them in memory. <cmdname>ANALYZE</cmdname>
        collects statistics about the database that the query optimizer needs.<note>When executing
          an SQL command with GPORCA, Greenplum Database issues a warning if
          the command performance could be improved by collecting statistics on a column or set of
          columns referenced by the command. The warning is issued on the command line and
          information is added to the Greenplum Database log file. For information about collecting
          statistics on table columns, see the <cmdname>ANALYZE</cmdname> command in the
            <cite>Greenplum Database Reference Guide</cite></note></p>
    </body>
    <topic id="topic6" xml:lang="en">
      <title>Identifying Statistics Problems in Query Plans</title>
      <body>
        <p>Before you interpret a query plan for a query using <ph>EXPLAIN</ph> or <codeph>EXPLAIN
            ANALYZE</codeph>, familiarize yourself with the data to help identify possible
          statistics problems. Check the plan for the following indicators of inaccurate
          statistics:</p>
        <ul>
          <li id="jb176751"><b>Are the optimizer's estimates close to reality?</b> Run
              <codeph>EXPLAIN ANALYZE</codeph> and see if the number of rows the optimizer estimated
            is close to the number of rows the query operation returned .</li>
          <li id="jb158716"><b>Are selective predicates applied early in the plan?</b> The most
            selective filters should be applied early in the plan so fewer rows move up the plan
            tree.</li>
          <li id="jb158755"><b>Is the optimizer choosing the best join order?</b> When you have a
            query that joins multiple tables, make sure the optimizer chooses the most selective
            join order. Joins that eliminate the largest number of rows should be done earlier in
            the plan so fewer rows move up the plan tree.</li>
        </ul>
        <p>See <xref href="query/topics/query-profiling.xml#topic39"/> for more information about
          reading query plans.</p>
      </body>
    </topic>
    <topic id="topic7" xml:lang="en">
      <title>Tuning Statistics Collection</title>
      <body>
        <p>The following configuration parameters control the amount of data sampled for statistics
          collection:</p>
        <ul>
          <li id="jb158767">
            <codeph>default_statistics_target</codeph>
          </li>
          <li id="jb158800">
            <codeph>gp_analyze_relative_error</codeph>
          </li>
        </ul>
        <p>These parameters control statistics sampling at the system level. It is better to sample
          only increased statistics for columns used most frequently in query predicates. You can
          adjust statistics for a particular column using the command:</p>
        <p>
          <codeph>ALTER TABLE...SET STATISTICS</codeph>
        </p>
        <p>For example:</p>
        <p>
          <codeblock>ALTER TABLE sales ALTER COLUMN region SET STATISTICS 50;
</codeblock>
        </p>
        <p>This is equivalent to increasing <codeph>default_statistics_target</codeph> for a
          particular column. Subsequent <codeph>ANALYZE</codeph> operations will then gather more
          statistics data for that column and produce better query plans as a result.</p>
      </body>
    </topic>
  </topic>
  <topic id="topic8" xml:lang="en">
    <title id="jb155293">Optimizing Data Distribution</title>
    <body>
      <p>When you create a table in Greenplum Database, you must declare a distribution key that
        allows for even data distribution across all segments in the system. Because the segments
        work on a query in parallel, Greenplum Database will always be as fast as the slowest
        segment. If the data is unbalanced, the segments that have more data will return their
        results slower and therefore slow down the entire system.</p>
    </body>
  </topic>
  <topic id="topic9" xml:lang="en">
    <title id="jb155264">Optimizing Your Database Design</title>
    <body>
      <p>Many performance issues can be improved by database design. Examine your database design
        and consider the following: </p>
      <ul>
        <li id="jb155523">Does the schema reflect the way the data is accessed? </li>
        <li id="jb155527">Can larger tables be broken down into partitions? </li>
        <li id="jb155528">Are you using the smallest data type possible to store column values? </li>
        <li id="jb155534">Are columns used to join tables of the same datatype? </li>
        <li id="jb155543">Are your indexes being used?</li>
      </ul>
    </body>
    <topic id="topic10" xml:lang="en">
      <title>Greenplum Database Maximum Limits</title>
      <body>
        <p>To help optimize database design, review the maximum limits that Greenplum Database
          supports:</p>
        <table id="jb163544">
          <title>Maximum Limits of Greenplum Database</title>
          <tgroup cols="2">
            <colspec colnum="1" colname="col1" colwidth="180pt"/>
            <colspec colnum="2" colname="col2" colwidth="180pt"/>
            <thead>
              <row>
                <entry colname="col1">Dimension</entry>
                <entry colname="col2">Limit</entry>
              </row>
            </thead>
            <tbody>
              <row>
                <entry colname="col1">Database Size</entry>
                <entry colname="col2">Unlimited</entry>
              </row>
              <row>
                <entry colname="col1">Table Size</entry>
                <entry colname="col2">Unlimited, 128 TB per partition per segment</entry>
              </row>
              <row>
                <entry colname="col1">Row Size</entry>
                <entry colname="col2">1.6 TB (1600 columns * 1 GB)</entry>
              </row>
              <row>
                <entry colname="col1">Field Size</entry>
                <entry colname="col2">1 GB</entry>
              </row>
              <row>
                <entry colname="col1">Rows per Table</entry>
                <entry colname="col2">281474976710656 (2^48)</entry>
              </row>
              <row>
                <entry colname="col1">Columns per Table/View</entry>
                <entry colname="col2">1600</entry>
              </row>
              <row>
                <entry colname="col1">Indexes per Table</entry>
                <entry colname="col2">Unlimited</entry>
              </row>
              <row>
                <entry colname="col1">Columns per Index</entry>
                <entry colname="col2">32</entry>
              </row>
              <row>
                <entry colname="col1">Table-level Constraints per Table</entry>
                <entry colname="col2">Unlimited</entry>
              </row>
              <row>
                <entry colname="col1">Table Name Length</entry>
                <entry colname="col2">63 Bytes (Limited by <i>name</i> data type)</entry>
              </row>
            </tbody>
          </tgroup>
        </table>
        <p>Dimensions listed as unlimited are not intrinsically limited by Greenplum Database.
          However, they are limited in practice to available disk space and memory/swap space.
          Performance may suffer when these values are unusually large.</p>
        <note type="note">
          <p>There is a maximum limit on the number of objects (tables<ph>, indexes,</ph> and views,
            but not rows) that may exist at one time. This limit is 4294967296 (2^32).</p>
        </note>
      </body>
    </topic>
  </topic>
</topic>