    Experimental cost model update (port from 6X) (#11115) · 9363718d
    Committed by Hans Zeller
    This is a cherry-pick of the change from PR https://github.com/greenplum-db/gporca/pull/607
    
    Avoid costing change for IN predicates on btree indexes
    
    Commit e5f1716 changed the way we handle IN predicates on indexes: it
    now uses a more efficient array comparison instead of treating the
    predicate like an OR. A side effect is that the cost function,
    CCostModelGPDB::CostBitmapTableScan, now goes through a different code
    path, using the "small NDV" or "large NDV" costing method. This produces
    very high cost estimates once the NDV exceeds 2, so we basically never
    choose an index for these cases, even though a btree index used in a
    bitmap scan is not very sensitive to the NDV.
    
    To avoid this, we go back to the old formula that we used before commit
    e5f1716. The fix is restricted to IN predicates on btree indexes used in
    a bitmap scan.
    
    Add an MDP for a larger IN list, using a btree index on an AO table
    
    Misc. changes to the calibration test program
    
    - Added tests for btree indexes (btree_scan_tests).
    - Changed data distribution so that all column values range from 1...n.
    - Parameter values for test queries are now proportional to selectivity;
      a parameter value of 0 produces a selectivity of 0% (see the sketch
      after this list).
    - Changed the logic for faking statistics somewhat; hopefully this will
      lead to more precise estimates. Incorporated the changes to the
      data distribution, which no longer contains 0 values. Added fake stats
      for unique columns.
    - Headers of tests now use semicolons to separate parts, to give
      a nicer output when pasting into Google Docs.
    - Some formatting changes.
    - Log fallbacks.
    - When using existing tables, the program now determines the table
      structure (heap or append-only) and the row count.
    - Split off two very slow tests into separate test units. These are
      not included when running "all" tests; they have to be run
      explicitly.
    - Add btree join tests, rename "bitmap_join_tests" to "index_join_tests"
      and run both bitmap and btree joins
    - Update min and max parameter values to cover a range that includes,
      or at least is closer to, the cross-over between index and table scan
    - Remove the "high NDV" tests, since the ranges in the general test
      now include both low and high NDV cases (NDV <= 200 and NDV > 200)
    - Print out selectivity of each query, if available
    - Suppress standard deviation output when we execute queries only once
    - Set search path when connecting
    - Decrease the parameter range when running bitmap scan tests on
      heap tables
    - Run btree scan tests only on AO tables; they are not designed
      for testing index scans
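
    A hedged illustration of the selectivity-proportional test parameters
    described above (the "col <= p" predicate shape and the helper function
    are assumptions for illustration, not the actual test queries):

        # With column values covering 1..n uniformly, a range predicate
        # "col <= p" selects about p of the n rows, so the parameter p
        # is directly proportional to selectivity.
        def selectivity(p, n):
            return p / n        # p = 0 -> 0%, p = n -> 100%

        assert selectivity(0, 1000) == 0.0     # parameter 0 => 0% selectivity
        assert selectivity(250, 1000) == 0.25  # parameter 250 => 25%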
    
    Updates to the experimental cost model, new calibration
    
    1. Simplify some of the formulas; the calibration process seemed to justify
       that. We might have to revisit this if problems come up. Changes:
       - Rewrite some of the formulas so that the costs per row and costs per
         byte are easier to see
       - Make the cost directly proportional to the width
       - Unify the formula for scans and joins: use the same per-byte costs
         and make NDV-dependent costs proportional to num_rebinds * dNDV,
         except for the logic in item 3.
    
       That makes the cost for the new experimental cost model a simple linear formula:
    
       num_rebinds * ( rows * c1 + rows * width * c2 + ndv * c3 + bitmap_union_cost + c4 ) + c5
    
       We have 5 constants, c1 ... c5:
    
       c1: cost per row (rows on one segment)
       c2: cost per byte
       c3: cost per distinct value (total NDV on all segments)
       c4: cost per rebind
       c5: initialization cost
       bitmap_union_cost: see item 3 below
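
       As a reading aid, here is the same formula written out as a small
       sketch; the function and parameter names are illustrative, not the
       actual ORCA code:

          def experimental_cost(num_rebinds, rows, width, ndv,
                                bitmap_union_cost, c1, c2, c3, c4, c5):
              # num_rebinds * (rows*c1 + rows*width*c2 + ndv*c3
              #                + bitmap_union_cost + c4) + c5
              per_rebind = (rows * c1             # cost per row (one segment)
                            + rows * width * c2   # cost per byte
                            + ndv * c3            # cost per distinct value
                            + bitmap_union_cost   # see item 3 below
                            + c4)                 # cost per rebind
              return num_rebinds * per_rebind + c5  # c5: initialization cost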
    
    2. Recalibrate some of the cost parameters, using the updated calibration
       program src/backend/gporca/scripts/cal_bitmap_test.py
    
    3. Add a cost penalty for bitmap index scans on heap tables. The added
       cost takes the form bitmap_union_cost = <base table rows> * (NDV-1) * c6.
    
       The reason for this is, as others have pointed out, that heap tables
       lead to much larger bit vectors, since their CTIDs are more spaced out
       than those of AO tables. The main factor seems to be the cost of unioning
       these bit vectors, and that cost is proportional to the number of bitmaps
       minus one and the size of the bitmaps, which is approximated here by the
       number of rows in the table.
    
       Note that because we use (NDV-1) in the formula, this penalty does not
       apply to usual index joins, which have an NDV of 1 per rebind. This is
       consistent with what we see in measurements and it also seems reasonable,
       since we don't have to union bitmaps in this case.
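
       A minimal sketch of the penalty term, using the same illustrative
       naming as the sketch in item 1 (is_heap is an assumed flag, not the
       actual ORCA interface):

          def bitmap_union_cost(base_table_rows, ndv, is_heap, c6):
              # Heap CTIDs are more spread out, so per-value bitmaps are
              # larger; unioning NDV bitmaps takes about NDV - 1 merges,
              # each proportional to the bitmap size (~ base table rows).
              if not is_heap:
                  return 0.0            # AO tables: no penalty
              return base_table_rows * (ndv - 1) * c6   # zero when NDV == 1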
    
    4. Fix to select CostModelGPDB for the 'experimental' model, as we do in 5X.
    
    5. Calibrate the constants involved (c1 ... c6), using the calibration program
       and running experiments with heap and append-only tables on a laptop and
       also on a Linux cluster with 24 segments. Also run some other workloads
       for validation.
    
    6. Give a small initial advantage to bitmap scans, so they will be chosen
       over table scans for small tables. Otherwise, small queries would have
       more or less random plans, all of which cost around 431, the value of
       the initial cost. Added a 10% advantage for the bitmap scan.
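
       A worked example of the tie-breaking effect (applying the advantage
       as a 10% discount is an assumption about the mechanism):

          initial_cost = 431.0                    # typical small-table plan cost
          bitmap_scan_cost = 0.9 * initial_cost   # 10% advantage => 387.9
          assert bitmap_scan_cost < initial_cost  # bitmap scan wins the tie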
    
    * Port calibration program to Python 3
    
    - Used the 2to3 program to do the basics.
    - The version parameter in argparse is no longer supported.
    - An additional option is needed in the connection string to keep the
      search path.
    - The dbconn.execSQL call can no longer be used to get a cursor;
      this was probably a non-observable defect in the Python 2 version.
    - Needed to use // (floor division) in some cases.
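
    For example, integer division changed between Python 2 and 3, which is
    why some expressions needed the floor-division operator:

        # Python 2: 7 / 2 == 3   (integer division)
        # Python 3: 7 / 2 == 3.5; use // where an integer is required
        buckets = 7 // 2   # == 3 in both Python 2 and Python 3
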
    Co-authored-by: David Kimura <dkimura@vmware.com>