src/test/regress/expected/qp_subquery.out · f4d4835844a24f3b1056dbc97dabd0b81312786f · Greenplum / Gpdb

Make 'rows' estimate more accurate for plans that fetch only a few rows. · f4d48358

Committed by Heikki Linnakangas on Oct 26, 2020

In commit c5f6dbbe, we changed the row and cost estimates on plan nodes
to represent per-segment costs. That made some estimates worse, because
the effects of the estimate "clamping" compound. Per my comment on the
PR back then:

> One interesting effect of this change, that explains many of the
> plan changes: If you have a table with very few rows, or e.g. a qual
> like id = 123 that matches exactly one row, the Seq/Index Scan on it
> will be marked with rows=1. It now means that we estimate that every
> segment returns one row, although in reality, only one of them will
> return a row, and the rest will return nothing. That's because the
> row count estimates are "clamped" in the planner to at least
> 1. That's not a big deal on its own, but if you then have e.g. a
> Gather Motion on top of the Scan, the planner will estimate that the
> Gather Motion returns as many rows as there are segments. If you
> have e.g. 100 segments, that's relatively a big discrepancy, with
> 100 rows vs 1. I don't think that's a big problem in practice, I
> don't think most plans are very sensitive to that kind of a
> misestimate. What do you think?
>
> If we wanted to fix that, perhaps we should stop "clamping" the
> estimates to 1. I don't think there's any fundamental reason we need
> to do it. Perhaps clamp down to 1 / numsegments instead.
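
To make the arithmetic in the quoted comment concrete, here is a minimal
sketch of how the clamp compounds under a Gather Motion. The function
names are hypothetical and are not the planner's own routines; the clamp
is simplified to "never below 1" and omits any rounding the real planner
may do.

```c
#include <assert.h>

/* Simplified stand-in for the planner's clamp: never report fewer than 1 row. */
static double
sketch_clamp_rows(double nrows)
{
    return (nrows < 1.0) ? 1.0 : nrows;
}

/* Per-segment estimate for a scan whose cluster-wide estimate is total_rows. */
static double
sketch_per_segment_rows(double total_rows, int numsegments)
{
    assert(numsegments > 0);
    return sketch_clamp_rows(total_rows / numsegments);
}

/*
 * Naive Gather Motion estimate: every segment is assumed to send its
 * (clamped) per-segment row count to the receiver.
 */
static double
sketch_naive_gather_rows(double total_rows, int numsegments)
{
    return sketch_per_segment_rows(total_rows, numsegments) * numsegments;
}
```

With one matching row and 100 segments, the per-segment estimate is
clamped up to 1, so the naive Gather Motion estimate becomes 100 rows.
For a large table the clamp never triggers and the two figures agree.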

But I came up with a less intrusive idea, implemented in this commit:
Most Motion nodes have a "parent" RelOptInfo, and the RelOptInfo
contains an estimate of the total number of rows, before dividing it
by the number of segments or clamping. So if the row estimate we get
from the subpath seems clamped to 1.0, we look at the row estimate on
the underlying RelOptInfo instead, and use that if it's smaller. That
makes the row count estimates better for plans that fetch a single row
or a few rows, same as they were before commit c5f6dbbe. Not all
RelOptInfos have a row count estimate, and the subpath's estimate is
more accurate if the number of rows produced by the path differs from
the number of rows in the underlying relation, e.g. because of a
ProjectSet node, so we still prefer the subpath's estimate if it
doesn't seem clamped.

Reviewed-by: Zhenghua Lyu <zlv@pivotal.io>
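
The heuristic described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code in this
commit: SketchRel is a hypothetical, trimmed-down stand-in for
RelOptInfo, the function name is made up, and the Motion is assumed to
be a gather whose unclamped output would be the per-segment estimate
times the number of segments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of RelOptInfo the sketch needs. */
typedef struct SketchRel
{
    bool   has_rows;  /* not every RelOptInfo carries a row estimate */
    double rows;      /* total rows, before per-segment division or clamping */
} SketchRel;

/*
 * Estimate the total rows flowing into a gather-style Motion.
 * subpath_rows is the per-segment estimate of the subpath.
 */
static double
sketch_motion_total_rows(double subpath_rows, int numsegments,
                         const SketchRel *parent)
{
    double total;

    assert(numsegments > 0);
    total = subpath_rows * numsegments;

    /*
     * If the subpath estimate looks clamped to 1.0, consult the parent
     * relation's total estimate and use it only if it is smaller.
     */
    if (subpath_rows <= 1.0 && parent != NULL &&
        parent->has_rows && parent->rows < total)
        return parent->rows;

    /* Otherwise trust the subpath, e.g. when a ProjectSet changed the count. */
    return total;
}
```

In this sketch, a single-row lookup on 100 segments yields 1 row rather
than 100, while a subpath whose estimate is not clamped keeps its own
figure even when a parent estimate is available.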
