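## 14.2. Statistics Used by the Planner
### 14.2.1. Single-Column Statistics
As we saw in the previous section, the query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. This section provides a quick look at the statistics that the system uses for these estimates.
One component of the statistics is the total number of entries in each table and index, as well as the number of disk blocks occupied by each table and index. This information is kept in the table `pg_class`, in the columns `reltuples` and `relpages`. We can look at it with queries similar to this one: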
SELECT relname, relkind, reltuples, relpages
FROM pg_class
WHERE relname LIKE 'tenk1%';
relname | relkind | reltuples | relpages
### 14.2.2. Extended Statistics
It is common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. The planner normally assumes that multiple conditions are independent of each other, an assumption that does not hold when column values are correlated. Regular statistics, because of their per-individual-column nature, cannot capture any knowledge about cross-column correlation. However, PostgreSQL has the ability to compute *multivariate statistics*, which can capture such information.
Because the number of possible column combinations is very large, it's impractical to compute multivariate statistics automatically. Instead, *extended statistics objects*, more often called just *statistics objects*, can be created to instruct the server to obtain statistics across interesting sets of columns.
Statistics objects are created using the [`CREATE STATISTICS`](sql-createstatistics.html) command. Creation of such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by `ANALYZE` (either a manual command, or background auto-analyze). The collected values can be examined in the [`pg_statistic_ext_data`](catalog-pg-statistic-ext-data.html) catalog.
`ANALYZE` computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics. Since the sample size is increased by increasing the statistics target for the table or any of its columns (as described in the previous section), a larger statistics target will normally result in more accurate extended statistics, as well as more time spent calculating them.
The following subsections describe the kinds of extended statistics that are currently supported.
#### 14.2.2.1. Functional Dependencies
The simplest kind of extended statistics tracks *functional dependencies*, a concept used in definitions of database normal forms. We say that column `b` is functionally dependent on column `a` if knowledge of the value of `a` is sufficient to determine the value of `b`; that is, there are no two rows having the same value of `a` but different values of `b`. In a fully normalized database, functional dependencies should exist only on primary keys and superkeys. However, in practice many data sets are not fully normalized for various reasons; intentional denormalization for performance reasons is a common example. Even in a fully normalized database, there may be partial correlation between some columns, which can be expressed as partial functional dependency.
The existence of functional dependencies directly affects the accuracy of estimates in certain queries. If a query contains conditions on both the independent and the dependent column(s), the conditions on the dependent columns do not further reduce the result size; but without knowledge of the functional dependency, the query planner will assume that the conditions are independent, resulting in underestimating the result size.
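To make the underestimate concrete, here is a small Python sketch over made-up data. The rows and the helper `selectivity` are illustrative assumptions, not anything the planner exposes. It compares the estimate obtained by multiplying selectivities (the independence assumption) with the true row count when `zip` fully determines `city`.

```python
# Toy data (hypothetical): every zip belongs to exactly one city,
# so city is functionally dependent on zip.
rows = (
    [("94105", "San Francisco")] * 10
    + [("94107", "San Francisco")] * 10
    + [("90210", "Beverly Hills")] * 20
    + [("10001", "New York")] * 60
)

def selectivity(predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)

sel_zip = selectivity(lambda r: r[0] == "94105")
sel_city = selectivity(lambda r: r[1] == "San Francisco")

# Independence assumption: multiply the two selectivities.
independent_estimate = sel_zip * sel_city * len(rows)

# True result: the city condition is redundant once zip is fixed.
actual = sum(1 for r in rows if r[0] == "94105" and r[1] == "San Francisco")

print(f"independent estimate: {independent_estimate:.1f} rows")
print(f"actual: {actual} rows")
```

In this toy table the multiplied estimate is five times smaller than the real result, because the condition on the dependent column removes no additional rows.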
To inform the planner about functional dependencies, `ANALYZE` can collect measurements of cross-column dependency. Assessing the degree of dependency between all sets of columns would be prohibitively expensive, so data collection is limited to those groups of columns appearing together in a statistics object defined with the `dependencies` option. It is advisable to create `dependencies` statistics only for column groups that are strongly correlated, to avoid unnecessary overhead in both `ANALYZE` and later query planning.
Here is an example of collecting functional-dependency statistics:
CREATE STATISTICS stts (dependencies) ON city, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxname, stxkeys, stxddependencies
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts';
stxname | stxkeys | stxddependencies
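##### 14.2.2.1.1. Limitations of Functional Dependencies
Functional dependencies are currently only applied when considering simple equality conditions that compare columns to constant values, and `IN` clauses with constant values. They are not used to improve estimates for equality conditions comparing two columns or comparing a column to an expression, nor for range clauses, `LIKE`, or any other type of condition.
When estimating with functional dependencies, the planner assumes that conditions on the involved columns are compatible and hence redundant. If they are incompatible, the correct estimate would be zero rows, but that possibility is not considered. For example, given a query like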
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '94105';
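the planner will disregard the `city` clause as not changing the selectivity, which is correct. However, it will make the same assumption about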
SELECT * FROM zipcodes WHERE city = 'San Francisco' AND zip = '90210';
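even though there will really be zero rows satisfying this query. Functional dependency statistics do not provide enough information to conclude that, however.
In many practical situations, this assumption is usually satisfied; for example, there might be a GUI in the application that only allows selecting compatible city and ZIP code values to use in a query. But if that is not the case, functional dependencies may not be a viable option.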
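The following Python sketch illustrates why both queries receive the same estimate. It uses a simplified combination rule, P(a, b) = P(a) * (f + (1 - f) * P(b)), where f is the degree of the dependency `zip => city`. The formula, the selectivity values, and the helper `estimate_rows` are assumptions for illustration; the planner's actual computation may differ between versions. The point is only that the estimate is built from per-column selectivities and the dependency degree, never from whether the specific constants occur together.

```python
# Hypothetical per-column selectivities, as single-column statistics might give.
zip_selectivity = {"94105": 0.001, "90210": 0.001}
city_selectivity = {"San Francisco": 0.05}

def estimate_rows(zip_value, city_value, degree, total_rows):
    """Estimate rows for zip = zip_value AND city = city_value, given a
    dependency zip => city of the stated degree (0.0 to 1.0)."""
    sel_zip = zip_selectivity[zip_value]
    sel_city = city_selectivity[city_value]
    # Simplified rule (an assumption): with degree 1.0 the city clause is
    # ignored; with degree 0.0 this falls back to plain independence.
    combined = sel_zip * (degree + (1.0 - degree) * sel_city)
    return combined * total_rows

total_rows = 10000  # hypothetical table size
compatible = estimate_rows("94105", "San Francisco", 1.0, total_rows)
incompatible = estimate_rows("90210", "San Francisco", 1.0, total_rows)
print(compatible, incompatible)
```

Both calls return the same number, although in a real table the second combination would match no rows at all.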
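#### 14.2.2.2. Multivariate N-Distinct Counts
Single-column statistics store the number of distinct values in each column. Estimates of the number of distinct values when combining more than one column (for example, for `GROUP BY a, b`) are frequently wrong when the planner only has single-column statistical data, causing it to select bad plans.
To improve such estimates, `ANALYZE` can collect n-distinct statistics for groups of columns. As before, it is impractical to do this for every possible column grouping, so data is collected only for those groups of columns appearing together in a statistics object defined with the `ndistinct` option. Data will be collected for each possible combination of two or more columns from the set of listed columns.
Continuing the previous example, the n-distinct counts in a table of ZIP codes might look like the following: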
CREATE STATISTICS stts2 (ndistinct) ON city, state, zip FROM zipcodes;
ANALYZE zipcodes;
SELECT stxkeys AS k, stxdndistinct AS nd
FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid)
WHERE stxname = 'stts2';
-[ RECORD 1 ]--------------------------------------------------------
k | 1 2 5
nd | {"1, 2": 33178, "1, 5": 33178, "2, 5": 27435, "1, 2, 5": 33178}
(1 row)
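This indicates that there are three combinations of columns that have 33178 distinct values: ZIP code and state; ZIP code and city; and ZIP code, city and state (the fact that they are all equal is expected given that ZIP code alone is unique in this table). On the other hand, the combination of city and state has only 27435 distinct values.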
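As a rough illustration of what is being counted, the Python sketch below computes the number of distinct value combinations for every group of two or more columns over a handful of hypothetical rows. The data and the helper `ndistinct` are assumptions for illustration; `ANALYZE` works from a sample of the table and produces estimates, not exact counts like these.

```python
from itertools import combinations

# Hypothetical rows: (zip, state, city).
rows = [
    ("94105", "CA", "San Francisco"),
    ("94107", "CA", "San Francisco"),
    ("90210", "CA", "Beverly Hills"),
    ("97201", "OR", "Portland"),
    ("04101", "ME", "Portland"),
]
columns = ("zip", "state", "city")

def ndistinct(column_indexes):
    """Count distinct value combinations over the given column positions."""
    return len({tuple(row[i] for i in column_indexes) for row in rows})

# One count per combination of two or more of the listed columns.
counts = {}
for size in range(2, len(columns) + 1):
    for combo in combinations(range(len(columns)), size):
        key = ", ".join(columns[i] for i in combo)
        counts[key] = ndistinct(combo)

# Naive guess from single-column counts: multiply them together.
naive_state_city = ndistinct((1,)) * ndistinct((2,))

for key, value in counts.items():
    print(f"{key}: {value}")
print(f"naive state, city estimate: {naive_state_city}")
```

Every group that includes the unique `zip` column has as many distinct combinations as there are rows, while `state, city` has fewer. Multiplying the single-column counts overshoots the real number of `state, city` groups, which is the kind of error these statistics are meant to correct.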
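It is advisable to create `ndistinct` statistics objects only on combinations of columns that are actually used for grouping, and for which misestimation of the number of groups is resulting in bad plans. Otherwise, the `ANALYZE` cycles are just wasted.
#### 14.2.2.3. Multivariate MCV Lists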
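Another type of statistic stored for each column is the most-common-value (MCV) list. This allows very accurate estimates for individual columns, but may result in significant misestimates for queries with conditions on multiple columns.
To improve such estimates, `ANALYZE` can collect MCV lists on combinations of columns. Similarly to functional dependencies and n-distinct coefficients, it is impractical to do this for every possible column grouping. Even more so in this case, as the MCV list (unlike functional dependencies and n-distinct coefficients) does store the common column values. So data is collected only for those groups of columns appearing together in a statistics object defined with the `mcv` option.
Continuing the previous example, the MCV list for a table of ZIP codes might look like the following (unlike for simpler types of statistics, a function is required for inspection of MCV contents):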
CREATE STATISTICS stts3 (mcv) ON city, state FROM zipcodes;
ANALYZE zipcodes;
SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid = stxoid),
pg_mcv_list_items(stxdmcv) m WHERE stxname = 'stts3';
index | values | nulls | frequency | base_frequency
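To give a feel for the `frequency` and `base_frequency` columns, the Python sketch below builds a small multi-column MCV list from hypothetical rows. It assumes that `base_frequency` is the frequency the combination would have if the columns were independent, that is, the product of the per-column frequencies. The data and variable names are illustrative only.

```python
from collections import Counter

# Hypothetical (city, state) rows.
rows = (
    [("Washington", "DC")] * 4
    + [("Springfield", "IL")] * 2
    + [("Springfield", "MA")] * 2
    + [("Portland", "OR")]
    + [("Portland", "ME")]
)
total = len(rows)

pair_counts = Counter(rows)
city_counts = Counter(city for city, _ in rows)
state_counts = Counter(state for _, state in rows)

# Keep the three most common combinations, as an MCV list would.
mcv = []
for (city, state), count in pair_counts.most_common(3):
    frequency = count / total
    # Assumption: base frequency = product of per-column frequencies.
    base_frequency = (city_counts[city] / total) * (state_counts[state] / total)
    mcv.append((city, state, frequency, base_frequency))

for city, state, frequency, base_frequency in mcv:
    print(f"{city}, {state}: frequency={frequency:.2f} base_frequency={base_frequency:.2f}")
```

When `frequency` is much larger than `base_frequency`, the columns are strongly correlated for that combination, and an estimate based on independence would be correspondingly too low.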