1. 05 Nov 2020, 1 commit
      Experimental cost model update (port from 6X) (#11115) · 9363718d
      Hans Zeller committed
      This is a cherry-pick of the change from PR https://github.com/greenplum-db/gporca/pull/607
      
      Avoid costing change for IN predicates on btree indexes
      
      Commit e5f1716 changed the way we handle IN predicates on indexes: it
      now uses a more efficient array comparison instead of treating them like
      an OR predicate. A side effect is that the cost function,
      CCostModelGPDB::CostBitmapTableScan, now goes through a different code
      path, using the "small NDV" or "large NDV" costing method. This produces
      very high cost estimates once the NDV increases beyond 2, so we
      basically never choose an index for these cases, even though a btree
      index used in a bitmap scan isn't very sensitive to the NDV.
      
      To avoid this, we go back to the old formula we used before commit e5f1716.
      The fix is restricted to IN predicates on btree indexes, used in a bitmap
      scan.
      
      Add an MDP for a larger IN list, using a btree index on an AO table
      
      Misc. changes to the calibration test program
      
      - Added tests for btree indexes (btree_scan_tests).
      - Changed data distribution so that all column values range from 1...n.
      - Parameter values for test queries are now proportional to selectivity;
        a parameter value of 0 produces a selectivity of 0% (see the sketch
        after this list).
      - Changed the logic to fake statistics somewhat; hopefully this will
        lead to more precise estimates. Incorporated the changes to the
        data distribution, which no longer contains 0 values. Added fake
        stats for unique columns.
      - Headers of tests now use semicolons to separate parts, to give
        a nicer output when pasting into Google Docs.
      - Some formatting changes.
      - Log fallbacks.
      - When using existing tables, the program now determines the table
        structure (heap or append-only) and the row count.
      - Split off two very slow tests into separate test units. These are
        not included when running "all" tests; they have to be run
        explicitly.
      - Add btree join tests, rename "bitmap_join_tests" to "index_join_tests"
        and run both bitmap and btree joins
      - Update min and max parameter values to cover a range that includes
        or at least is closer to the cross-over between index and table scan
      - Remove the "high NDV" tests, since the ranges in the general test
        now include both low and high NDV cases (<= and > 200)
      - Print out selectivity of each query, if available
      - Suppress standard deviation output when we execute queries only once
      - Set search path when connecting
      - Decrease the parameter range when running bitmap scan tests on
        heap tables
      - Run btree scan tests only on AO tables; they are not designed
        for testing index scans
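
      A minimal sketch of the parameter-to-selectivity relationship described
      above, assuming values uniformly distributed over 1...n and a
      hypothetical predicate of the form "col <= parameter"; the actual logic
      in cal_bitmap_test.py may differ:

        # Hypothetical helper, not taken from cal_bitmap_test.py: with column
        # values uniformly distributed over 1...n, a predicate "col <= p"
        # selects roughly p of the n rows.
        def selectivity_for_parameter(p: int, n: int) -> float:
            if n <= 0:
                raise ValueError("n must be positive")
            return min(max(p, 0), n) / n

        assert selectivity_for_parameter(0, 1000) == 0.0    # parameter 0 -> 0% selectivity
        assert selectivity_for_parameter(200, 1000) == 0.2  # proportional to the parameter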
      
      Updates to the experimental cost model, new calibration
      
      1. Simplify some of the formulas; the calibration process seemed to justify
         that. We might have to revisit this if problems come up. Changes:
         - Rewrite some of the formulas so the costs per row and costs per byte
           are easier to see
         - Make the cost for the width directly proportional
         - Unify the formula for scans and joins, use the same per-byte costs
           and make NDV-dependent costs proportional to num_rebinds * dNDV,
           except for the logic in item 3.
      
         That makes the cost for the new experimental cost model a simple linear
         formula (see the sketch after this numbered list):
      
         num_rebinds * ( rows * c1 + rows * width * c2 + ndv * c3 + bitmap_union_cost + c4 ) + c5
      
         We have 5 constants, c1 ... c5:
      
         c1: cost per row (rows on one segment)
         c2: cost per byte
         c3: cost per distinct value (total NDV on all segments)
         c4: cost per rebind
         c5: initialization cost
         bitmap_union_cost: see item 3 below
      
      2. Recalibrate some of the cost parameters, using the updated calibration
         program src/backend/gporca/scripts/cal_bitmap_test.py
      
      3. Add a cost penalty for bitmap index scans on heap tables. The added
         cost takes the form bitmap_union_cost = <base table rows> * (NDV-1) * c6.
      
         The reason for this is, as others have pointed out, that heap tables
         lead to much larger bit vectors, since their CTIDs are more spaced out
         than those of AO tables. The main factor seems to be the cost of unioning
         these bit vectors, and that cost is proportional to the number of bitmaps
         minus one and the size of the bitmaps, which is approximated here by the
         number of rows in the table.
      
         Note that because we use (NDV-1) in the formula, this penalty does not
         apply to usual index joins, which have an NDV of 1 per rebind. This is
         consistent with what we see in measurements and it also seems reasonable,
         since we don't have to union bitmaps in this case.
      
      4. Fix to select CostModelGPDB for the 'experimental' model, as we do in 5X.
      
      5. Calibrate the constants involved (c1 ... c6), using the calibration program
         and running experiments with heap and append-only tables on a laptop and
         also on a Linux cluster with 24 segments. Also run some other workloads
         for validation.
      
      6. Give a small initial advantage to bitmap scans, so they will be chosen over
         table scans for small tables. Otherwise, small queries will
         have more or less random plans, all of which cost around 431, the value
         of the initial cost. Added a 10% advantage for the bitmap scan.
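
      To make the linear formula from items 1 and 3 concrete, here is a minimal
      Python sketch of the cost computation; the function and argument names
      are illustrative only and are not taken from the actual CCostModelGPDB
      code:

        def experimental_bitmap_scan_cost(num_rebinds, rows, width, ndv,
                                          base_table_rows, is_heap_table,
                                          c1, c2, c3, c4, c5, c6):
            # c1: cost per row (rows on one segment)
            # c2: cost per byte
            # c3: cost per distinct value (total NDV on all segments)
            # c4: cost per rebind
            # c5: initialization cost
            # c6: per-row cost of unioning one extra bitmap (item 3)

            # Item 3: penalty for unioning (NDV - 1) bit vectors, applied to
            # heap tables only. With an NDV of 1 per rebind (the usual index
            # join case) the penalty is zero.
            bitmap_union_cost = (base_table_rows * (ndv - 1) * c6
                                 if is_heap_table else 0.0)

            return num_rebinds * (rows * c1
                                  + rows * width * c2
                                  + ndv * c3
                                  + bitmap_union_cost
                                  + c4) + c5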
      
      * Port calibration program to Python 3
      
      - Used 2to3 program to do the basics.
      - Version parameter in argparse no longer supported
      - Needs additional option in connection string to keep the search path
      - The dbconn.execSQL call can no longer be used to get a cursor;
        this was probably a non-observable defect in the Python 2 version
      - Needed to use // (floor division) in some cases (see the sketch below)
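
      A small sketch of the floor-division change; the values and variable
      names are placeholders, not taken from the calibration program:

        # Python 2's "/" on two ints did floor division; Python 3 returns a
        # float, so the port uses "//" wherever an integer result is needed.
        total_rows = 1000
        num_segments = 24
        rows_per_segment = total_rows // num_segments   # 41, not 41.666...
        assert isinstance(rows_per_segment, int)
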
      Co-authored-by: David Kimura <dkimura@vmware.com>
  2. 17 Jun 2020, 1 commit
      Refactor dbconn · 330db230
      Tyler Ramer committed
      One reason pygresql was previously modified was that it did not handle
      closing a connection very gracefully. In the process of updating
      pygresql, we've wrapped the connection it provides with a
      ClosingConnection function, which should handle gracefully closing the
      connection when the "with dbconn.connect as conn" syntax is used.
      
      This did, however, illustrate issues where a cursor might have been
      created as the result of a dbconn.execSQL() call, which seems to hold
      the connection open if not specifically closed.
      
      It is therefore necessary to remove the ability to get a cursor from
      dbconn.execSQL(). To highlight this difference, and to ensure that
      future use of this library is easy, I've cleaned up and clarified the
      dbconn execution code to include the following features (see the usage
      sketch after this list).
      
      - dbconn.execSQL() closes the cursor as part of the function. It returns
        no rows
      - function dbconn.query() is added, which behaves like dbconn.execSQL()
        except that it returns a cursor
      - function dbconn.execQueryforSingleton() is renamed
        dbconn.querySingleton()
      - function dbconn.execQueryforSingletonRow() is renamed
        dbconn.queryRow()
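
      A hedged usage sketch of the refactored API described above; the import
      path, DbURL arguments, and SQL statements are assumptions for
      illustration, and the exact signatures may differ:

        from gppylib.db import dbconn   # assumed import path

        # Hypothetical connection parameters.
        url = dbconn.DbURL(hostname='localhost', port=5432, dbname='postgres')

        with dbconn.connect(url) as conn:
            # execSQL(): runs a statement, closes its cursor internally,
            # and returns no rows.
            dbconn.execSQL(conn, "TRUNCATE TABLE my_table")

            # query(): like the old execSQL(), but returns a cursor.
            cursor = dbconn.query(conn, "SELECT id, name FROM my_table")
            rows = cursor.fetchall()
            cursor.close()

            # Renamed single-value / single-row helpers.
            row_count = dbconn.querySingleton(conn, "SELECT count(*) FROM my_table")
            first_row = dbconn.queryRow(conn, "SELECT id, name FROM my_table LIMIT 1")
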
      Authored-by: Tyler Ramer <tramer@pivotal.io>