    Experimental cost model update (port from 6X) (#11115) · 9363718d
    Committed by Hans Zeller
    This is a cherry-pick of the change from PR https://github.com/greenplum-db/gporca/pull/607
    
    Avoid costing change for IN predicates on btree indexes
    
    Commit e5f1716 changed the way we handle IN predicates on indexes: it
    now uses a more efficient array comparison instead of treating the
    predicate like an OR. A side effect is that the cost function,
    CCostModelGPDB::CostBitmapTableScan, now goes through a different code
    path, using the "small NDV" or "large NDV" costing method. This produces
    very high cost estimates once the NDV exceeds 2, so we basically never
    choose an index for these cases, even though a btree index used in a
    bitmap scan is not very sensitive to the NDV.
    
    To avoid this, we go back to the old formula that we used before commit
    e5f1716. The fix is restricted to IN predicates on btree indexes used in
    a bitmap scan.
    
    Add an MDP for a larger IN list, using a btree index on an AO table
    
    Misc. changes to the calibration test program
    
    - Added tests for btree indexes (btree_scan_tests).
    - Changed data distribution so that all column values range from 1...n.
    - Parameter values for test queries are now proportional to selectivity;
      a parameter value of 0 produces a selectivity of 0% (see the sketch
      after this list).
    - Changed the logic for faking statistics somewhat; hopefully this will
      lead to more precise estimates. Incorporated the changes to the
      data distribution, which no longer contains 0 values. Added fake stats
      for unique columns.
    - Headers of tests now use semicolons to separate parts, to give
      a nicer output when pasting into Google Docs.
    - Some formatting changes.
    - Log fallbacks.
    - When using existing tables, the program now determines the table
      structure (heap or append-only) and the row count.
    - Split off two very slow tests into separate test units. These are
      not included when running "all" tests; they have to be run
      explicitly.
    - Add btree join tests, rename "bitmap_join_tests" to "index_join_tests"
      and run both bitmap and btree joins
    - Update min and max parameter values to cover a range that includes,
      or at least is closer to, the cross-over between index and table scan
    - Remove the "high NDV" tests, since the ranges in the general test
      now include both low and high NDV cases (NDV <= 200 and NDV > 200)
    - Print out selectivity of each query, if available
    - Suppress standard deviation output when we execute queries only once
    - Set search path when connecting
    - Decrease the parameter range when running bitmap scan tests on
      heap tables
    - Run btree scan tests only on AO tables; they are not designed
      for testing index scans
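
    A hedged illustration of the selectivity-proportional test parameters
    described above (the "col <= p" predicate shape and the helper function
    are assumptions for illustration, not the actual test queries):

        # With column values covering 1..n uniformly, a range predicate
        # "col <= p" selects about p of the n rows, so the parameter p
        # is directly proportional to selectivity.
        def selectivity(p, n):
            return p / n        # p = 0 -> 0%, p = n -> 100%

        assert selectivity(0, 1000) == 0.0     # parameter 0 => 0% selectivity
        assert selectivity(250, 1000) == 0.25  # parameter 250 => 25%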
    
    Updates to the experimental cost model, new calibration
    
    1. Simplify some of the formulas; the calibration process seemed to justify
       that. We might have to revisit this if problems come up. Changes:
       - Rewrite some of the formulas so that the costs per row and costs per
         byte are easier to see
       - Make the cost directly proportional to the width
       - Unify the formula for scans and joins: use the same per-byte costs
         and make NDV-dependent costs proportional to num_rebinds * dNDV,
         except for the logic in item 3.
    
       That makes the cost for the new experimental cost model a simple linear formula:
    
       num_rebinds * ( rows * c1 + rows * width * c2 + ndv * c3 + bitmap_union_cost + c4 ) + c5
    
       We have 5 constants, c1 ... c5:
    
       c1: cost per row (rows on one segment)
       c2: cost per byte
       c3: cost per distinct value (total NDV on all segments)
       c4: cost per rebind
       c5: initialization cost
       bitmap_union_cost: see item 3 below
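
       As a reading aid, here is the same formula written out as a small
       sketch; the function and parameter names are illustrative, not the
       actual ORCA code:

          def experimental_cost(num_rebinds, rows, width, ndv,
                                bitmap_union_cost, c1, c2, c3, c4, c5):
              # num_rebinds * (rows*c1 + rows*width*c2 + ndv*c3
              #                + bitmap_union_cost + c4) + c5
              per_rebind = (rows * c1             # cost per row (one segment)
                            + rows * width * c2   # cost per byte
                            + ndv * c3            # cost per distinct value
                            + bitmap_union_cost   # see item 3 below
                            + c4)                 # cost per rebind
              return num_rebinds * per_rebind + c5  # c5: initialization cost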
    
    2. Recalibrate some of the cost parameters, using the updated calibration
       program src/backend/gporca/scripts/cal_bitmap_test.py
    
    3. Add a cost penalty for bitmap index scans on heap tables. The added
       cost takes the form bitmap_union_cost = <base table rows> * (NDV-1) * c6.
    
       The reason for this is, as others have pointed out, that heap tables
       lead to much larger bit vectors, since their CTIDs are more spaced out
       than those of AO tables. The main factor seems to be the cost of unioning
       these bit vectors, and that cost is proportional to the number of bitmaps
       minus one and the size of the bitmaps, which is approximated here by the
       number of rows in the table.
    
       Note that because we use (NDV-1) in the formula, this penalty does not
       apply to usual index joins, which have an NDV of 1 per rebind. This is
       consistent with what we see in measurements and it also seems reasonable,
       since we don't have to union bitmaps in this case.
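
       A minimal sketch of the penalty term, using the same illustrative
       naming as the sketch in item 1 (is_heap is an assumed flag, not the
       actual ORCA interface):

          def bitmap_union_cost(base_table_rows, ndv, is_heap, c6):
              # Heap CTIDs are more spread out, so per-value bitmaps are
              # larger; unioning NDV bitmaps takes about NDV - 1 merges,
              # each proportional to the bitmap size (~ base table rows).
              if not is_heap:
                  return 0.0            # AO tables: no penalty
              return base_table_rows * (ndv - 1) * c6   # zero when NDV == 1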
    
    4. Fix to select CostModelGPDB for the 'experimental' model, as we do in 5X.
    
    5. Calibrate the constants involved (c1 ... c6), using the calibration program
       and running experiments with heap and append-only tables on a laptop and
       also on a Linux cluster with 24 segments. Also run some other workloads
       for validation.
    
    6. Give a small initial advantage to bitmap scans, so they will be chosen
       over table scans for small tables. Otherwise, small queries would have
       more or less random plans, all of which cost around 431, the value of
       the initial cost. Added a 10% advantage for the bitmap scan.
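
       A worked example of the tie-breaking effect (applying the advantage
       as a 10% discount is an assumption about the mechanism):

          initial_cost = 431.0                    # typical small-table plan cost
          bitmap_scan_cost = 0.9 * initial_cost   # 10% advantage => 387.9
          assert bitmap_scan_cost < initial_cost  # bitmap scan wins the tie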
    
    * Port calibration program to Python 3
    
    - Used the 2to3 program to do the basics.
    - The version parameter in argparse is no longer supported.
    - An additional option is needed in the connection string to keep the
      search path.
    - The dbconn.execSQL call can no longer be used to get a cursor;
      this was probably a non-observable defect in the Python 2 version.
    - Needed to use // (floor division) in some cases.
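
    For example, integer division changed between Python 2 and 3, which is
    why some expressions needed the floor-division operator:

        # Python 2: 7 / 2 == 3   (integer division)
        # Python 3: 7 / 2 == 3.5; use // where an integer is required
        buckets = 7 // 2   # == 3 in both Python 2 and Python 3
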
    Co-authored-by: David Kimura <dkimura@vmware.com>