1. 20 June 2018, 4 commits
    • Update README after concourse domain change · 165997a1
      Jim Doty committed
    • Add tests for subqueries nested inside a scalar expression · dd77c59c
      Dhanashree Kashid committed
      Add tests to ensure sane behavior when a subquery appears nested inside
      a scalar expression, for example a subquery inside an arithmetic
      expression such as `(SELECT max(b) FROM t2) + 1`. The intent is to
      check for correct results.
      
      Bump ORCA version to 2.63.0
      Signed-off-by: Shreedhar Hardikar <shardikar@pivotal.io>
    • Add pg_log directory to basebackup static exclusion list · 292ef134
      Jimmy Yih committed
      The pg_log directory has always been excluded using the pg_basebackup
      exclude option (-E ./pg_log). With this change, we add it to the
      static exclusion list inside basebackup. Because of this, we are able
      to remove all instances of mkdir pg_log in our management utilities.
      Previously, the utilities always had to create the pg_log directory
      after running pg_basebackup because the postmaster validates that the
      pg_log path exists.
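
      A rough Python sketch of the step the utilities can now drop; the
      function name, paths, and flags below are illustrative, not the actual
      utility code:

      ```python
      import subprocess

      def take_basebackup(host, port, target_dir):
          # illustrative pg_basebackup invocation; real utilities pass more options
          subprocess.check_call(["pg_basebackup", "-h", host, "-p", str(port),
                                 "-D", target_dir, "-X", "stream"])
          # Previously required after the backup, because the postmaster
          # validates that the pg_log path exists:
          #     os.makedirs(os.path.join(target_dir, "pg_log"))
          # With pg_log on basebackup's static exclusion list, this manual
          # step goes away.
      ```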
      
      This also helps us align better with upstream Postgres, since the
      pg_basebackup exclude option is Greenplum-specific and not really
      needed at all. Our dynamic exclusion list hasn't changed for a very
      long time (so it's pretty much static anyway) and is not well
      maintained in the utilities. We may actually remove the pg_basebackup
      exclude option in the near future.
    • b143cd71
  2. 19 June 2018, 9 commits
    • docs - docs and updates for pgbouncer 1.8.1 (#5151) · a99194e0
      Lisa Owen committed
      * docs - docs and updates for pgbouncer 1.8.1
      
      * some edits requested by david
      
      * add pgbouncer config page to see also, include directive
      
      * add auth_hba_type config param
      
      * ldap - add info to migrating section, remove ldap passwds
      
      * remove ldap note
    • Update utilities to capture hyperloglog counter · aa5fe3d5
      Omer Arap committed
      This commit updates the GPSD utility to capture the value of the
      `stainherit` column and the HLL counters stored in the `stavalues4`
      column of the `pg_statistic` table, generated by sample-based or
      full-table-scan HLL analyze.
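
      For illustration, a minimal sketch of the kind of `pg_statistic` query
      involved; the column names are real, but the function and cursor
      wiring are hypothetical, not the actual gpsd/minirepro code:

      ```python
      STATS_QUERY = """
          SELECT starelid, staattnum, stainherit,
                 stavalues4      -- HLL counter bytes written by HLL analyze
          FROM   pg_statistic
          WHERE  starelid = %s
      """

      def dump_column_stats(cursor, relid):
          """Fetch per-column stats rows, now including stainherit and stavalues4."""
          cursor.execute(STATS_QUERY, (relid,))
          return cursor.fetchall()
      ```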
      
      This commit also updates the minirepro utility to capture the
      hyperloglog counters.
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
    • Utilize hyperloglog and merge utilities to derive root table statistics · 9c1b1ae3
      Omer Arap committed
      This commit introduces an end-to-end, scalable solution for generating
      statistics of the root partitions. This is done by merging the
      statistics of the leaf partition tables to generate the statistics of
      the root partition. Being able to merge leaf table statistics for the
      root table makes analyze incremental and stable.
      
      **CHANGES IN LEAF TABLE STATS COLLECTION:**
      
      Incremental analyze still creates a sample for each partition, as in
      the previous version. While analyzing the sample and generating
      statistics for the partition, it also creates a `hyperloglog_counter`
      data structure, adds the sample values to it, and records auxiliary
      information such as the number of multiples and the sample size. Once
      the entire sample is processed, analyze saves the
      `hyperloglog_counter` as a byte array in the `pg_statistic` catalog
      table. We reserve a slot for the `hyperloglog_counter` in the table
      and mark it with a dedicated statistic kind, `STATISTIC_KIND_HLL`. We
      only keep the `hyperloglog_counter` in the catalog for the leaf
      partitions. If the user chooses to run a full scan for HLL, we mark
      the kind as `STATISTIC_KIND_FULLHLL`.
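
      A schematic Python sketch of this per-leaf collection step; an exact
      set stands in for the approximate, mergeable `hyperloglog_counter` so
      the sketch runs, and all names here are illustrative:

      ```python
      import pickle
      from collections import Counter

      def analyze_leaf_sample(sample_values, full_scan=False):
          counter = set(sample_values)     # stand-in for the hyperloglog_counter
          counts = Counter(sample_values)
          nmultiple = sum(1 for c in counts.values() if c > 1)  # values seen more than once
          kind = "STATISTIC_KIND_FULLHLL" if full_scan else "STATISTIC_KIND_HLL"
          # the serialized counter, together with the sample size and number
          # of multiples, is stored in a reserved pg_statistic slot under `kind`
          payload = pickle.dumps({"counter": counter,
                                  "sample_size": len(sample_values),
                                  "nmultiple": nmultiple})
          return kind, payload
      ```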
      
      **MERGING LEAF STATISTICS**
      
      Once all the leaf partitions are analyzed, we analyze the root
      partition. We first check whether all the partitions have been
      analyzed properly and have their statistics available in the
      `pg_statistic` catalog table. A partition with no tuples is considered
      analyzed even though it has no catalog entry. If for some reason any
      partition is not analyzed, we fall back to the original analyze
      algorithm, which acquires a sample for the root partition and
      calculates statistics based on that sample.
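
      Schematically, the check and fallback decision might look like this
      (field names are illustrative, not the actual catalog accessors):

      ```python
      def can_merge_root_stats(leaf_parts):
          """leaf_parts: per-leaf dicts with 'reltuples' and 'has_stats' flags
          gathered from the catalog (illustrative names)."""
          for part in leaf_parts:
              if part["reltuples"] == 0:
                  continue              # an empty partition counts as analyzed
              if not part["has_stats"]:
                  return False          # fall back to sampling the root partition
          return True
      ```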
      
      Merging the null fraction and average width from leaf partition
      statistics is trivial and poses no significant challenge, so we
      calculate them first (see the sketch after the list below). The
      remaining statistics are:
      
      - Number of distinct values (NDV)
      
      - Most common values (MCVs) and their frequencies, termed most common
      frequencies (MCF)
      
      - Histograms that represent the distribution of the data values in the
      table
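
      A minimal sketch of the "trivial" part, assuming the merged null
      fraction and average width are row-count weighted averages over the
      leaf statistics (names are illustrative):

      ```python
      def merge_nullfrac_and_width(leaf_stats):
          """leaf_stats: list of (reltuples, null_frac, avg_width) per leaf partition."""
          total_rows = sum(rows for rows, _, _ in leaf_stats) or 1.0
          null_frac = sum(rows * nf for rows, nf, _ in leaf_stats) / total_rows
          avg_width = sum(rows * w for rows, _, w in leaf_stats) / total_rows
          return null_frac, avg_width
      ```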
      
      **Merging NDV:**
      
      Hyperloglog provides the functionality to merge multiple
      `hyperloglog_counter`s into one and to calculate the number of
      distinct values from the aggregated `hyperloglog_counter`. This
      aggregated counter alone is sufficient only if the user chooses to run
      a full scan for hyperloglog. In the sample-based approach, deriving
      the number of distinct values without the hyperloglog algorithm is not
      possible. Hyperloglog enables us to merge the `hyperloglog_counter`s
      from each partition and calculate the NDV on the merged counter with
      an acceptable error rate. However, this does not give us the final NDV
      of the root partition; it gives us the NDV of the union of the samples
      from each partition.
      
      The rest of the NDV interpolation follows the formula used in
      Postgres, which depends on four metrics: the NDV in the sample, the
      number of multiple values (values seen more than once) in the sample,
      the sample size, and the total rows in the table. Using these values
      the algorithm calculates the approximate NDV for the table. While
      merging the statistics from the leaf partitions, hyperloglog lets us
      accurately derive the NDV of the sample, the sample size, and the
      total rows; however, the number of multiples in the accumulated sample
      is unknown, since we do not have access to the accumulated sample at
      this point.
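
      A sketch of that interpolation, following the Duj1-style estimator
      Postgres ANALYZE uses; the variable names are mine, and the
      merged-sample inputs come from the merged HLL counters:

      ```python
      def interpolate_root_ndv(sample_ndv, sample_nmultiple, sample_rows, total_rows):
          d = float(sample_ndv)                        # NDV of the merged sample
          n = float(sample_rows)                       # combined sample size
          N = float(total_rows)                        # total rows in the root partition
          if N <= 0:
              return 0.0
          f1 = max(d - float(sample_nmultiple), 0.0)   # values seen exactly once
          denom = (n - f1) + f1 * n / N                # Duj1-style denominator
          ndv = n * d / denom if denom > 0 else d
          return min(ndv, N)                           # NDV can never exceed the row count
      ```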
      
      _Number of Multiples_
      
      Our approach to estimating the number of multiples in the root's
      aggregated sample (which itself is unavailable) requires the NDV, the
      number of multiples, and the size of each leaf sample. The NDV of each
      sample is trivial to calculate using the partition's
      `hyperloglog_counter`. The number of multiples and the sample size for
      each partition are saved in the partition's `hyperloglog_counter`
      during leaf statistics gathering so that they can be used in the merge.
      
      Estimating the number of multiples in the aggregate sample for the
      root partition is a two-step process. First, we accurately estimate
      the number of values that reside in more than one partition's sample.
      Then, we estimate the number of multiples that exist uniquely within a
      single partition. Finally, we add these values to estimate the overall
      number of multiples in the aggregate sample of the root partition.
      
      To count the number of values that exist in only a single partition,
      we utilize hyperloglog functionality: we can easily estimate how many
      values appear only in a specific partition _i_. We call the NDV of the
      aggregate of all partitions `NDV_all`, and the NDV of the aggregate of
      all partitions except _i_ `NDV_minus_i`. The difference between
      `NDV_all` and `NDV_minus_i` yields the number of values that appear in
      only one partition. The remaining values contribute to the overall
      number of multiples in the root's aggregated sample; we call their
      count `nMultiple_inter`, the number of values that appear in more than
      one partition.
      
      However, that is not enough: even if a value resides in only one
      partition, that partition might contain multiple copies of it. We need
      a way to account for these values, which is why we also count the
      multiples that exist uniquely within a partition's sample. We already
      know the number of multiples inside each partition's sample, but we
      need to normalize this value by the ratio of the number of values
      unique to the partition's sample to the number of distinct values in
      the partition's sample. The normalized value is partition sample
      _i_'s contribution to the overall calculation of the nMultiple.
      
      Finally, `nMultiple_root` is the sum of `nMultiple_inter` and the
      `normalized_m_i` of each partition sample.
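
      Putting the pieces above together, a sketch of the estimate; the
      inputs are the per-partition HLL estimates described above, and the
      names are illustrative:

      ```python
      def estimate_root_nmultiple(ndv_all, ndv_minus, leaf_nmultiple, leaf_ndv):
          """ndv_all:            HLL estimate over all leaf samples merged together
          ndv_minus[i]:       HLL estimate over all leaf samples except partition i
          leaf_nmultiple[i]:  number of multiples in partition i's sample
          leaf_ndv[i]:        NDV of partition i's sample"""
          unique_to = [max(ndv_all - m, 0.0) for m in ndv_minus]  # values only in partition i
          nmultiple_inter = max(ndv_all - sum(unique_to), 0.0)    # values in >1 sample
          normalized = [m * (u / d)
                        for m, u, d in zip(leaf_nmultiple, unique_to, leaf_ndv)
                        if d > 0]
          return nmultiple_inter + sum(normalized)
      ```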
      
      **Merging MCVs:**
      
      We utilize the merge functionality imported from the 4.3 version of
      Greenplum DB. The algorithm is simple: we convert each MCV's frequency
      into a count and add the counts up for values that appear in more than
      one partition. After every candidate's count has been calculated, we
      sort the candidate values and pick the top ones, as limited by
      `default_statistics_target`. 4.3 blindly picked the values with the
      highest counts; we instead incorporate the same logic used in current
      Greenplum and Postgres and check whether a value is a real MCV by
      running the usual qualification tests. Therefore, even after the
      merge, the logic fully aligns with Postgres.
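
      A simplified sketch of that merge; the final qualification check here
      is only a stand-in for the fuller Postgres/Greenplum MCV test, and all
      names are illustrative:

      ```python
      def merge_leaf_mcvs(leaf_stats, target):
          """leaf_stats: list of (reltuples, mcv_values, mcv_freqs) per leaf partition.
          target: default_statistics_target, the maximum number of MCVs to keep."""
          counts, total_rows = {}, 0.0
          for reltuples, values, freqs in leaf_stats:
              total_rows += reltuples
              for value, freq in zip(values, freqs):
                  counts[value] = counts.get(value, 0.0) + freq * reltuples  # frequency -> count
          candidates = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:target]
          mcvs = []
          for value, count in candidates:
              freq = count / total_rows if total_rows else 0.0
              # stand-in qualification test: keep only values clearly above
              # the average per-candidate frequency
              if freq > 1.25 / max(len(counts), 1):
                  mcvs.append((value, freq))
          return mcvs
      ```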
      
      **Merging Histograms:**
      
      One of the main novel contributions of this commit is how we merge the
      histograms from the leaf partitions. In 4.3 we used a priority queue
      to merge the histograms from the leaf partitions. However, that
      approach is naive and loses important statistical information: in
      Postgres, the histogram is calculated over the values that did not
      qualify as MCVs. The 4.3 histogram merge logic did not take this into
      consideration, so significant statistical information was lost while
      merging the MCV values.
      
      We introduce a novel approach that feeds the MCVs from the leaf
      partitions that did not qualify as root MCVs into the histogram merge
      logic. To fully utilize the previously implemented priority queue
      logic, we treat each non-qualified MCV as the histogram of a so-called
      `dummy` partition. To be more precise, if an MCV m1 does not qualify,
      we create a histogram [m1, m1] that has only one bucket whose size is
      the count of this non-qualified MCV. When we merge the histograms of
      the leaf partitions and these dummy partitions, the merged histogram
      does not lose any statistical information.
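
      A schematic sketch of the merge with dummy partitions; it keeps only
      the sorted, priority-queue-style k-way merge idea and ignores the
      per-bucket count weighting of the real algorithm (names illustrative):

      ```python
      import heapq

      def merge_leaf_histograms(leaf_histograms, rejected_mcvs, nbuckets):
          """leaf_histograms: sorted bucket-bound lists, one per leaf partition.
          rejected_mcvs: (value, count) candidates that did not qualify as root
                         MCVs; each becomes a one-bucket [value, value] 'dummy'.
          nbuckets: number of buckets wanted in the merged histogram."""
          dummies = [[value, value] for value, _count in rejected_mcvs]
          merged = list(heapq.merge(*(leaf_histograms + dummies)))  # k-way merge
          if not merged:
              return []
          step = max(len(merged) // nbuckets, 1)
          bounds = merged[::step]              # keep roughly evenly spaced bounds
          if bounds[-1] != merged[-1]:
              bounds.append(merged[-1])
          return bounds
      ```
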
      Signed-off-by: Jesse Zhang <sbjesse@gmail.com>
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
    • Import and modify analyze utility functions from 4.3 · 7ea27fc5
      Omer Arap committed
      In the previous generation of analyze, GPDB provided features to merge
      statistics such as MCVs (most common values) and histograms for the
      root or mid-level partitions from the leaf partitions' statistics.
      
      This commit imports the utility functions for merging MCVs and
      histograms and modifies them based on the needs of the current version.
      Signed-off-by: Bhuvnesh Chaudhary <bchaudhary@pivotal.io>
    • 95279b10
    • Port hyperloglog extension into gpdb · a9301fdc
      Abhijit Subramanya committed
      - Port the hyperloglog extension into the contrib directory and make
      corresponding makefile changes to get it to compile.
      - Also modify initdb to install the HLL extension as part of gpinitsystem.
      Signed-off-by: Omer Arap <oarap@pivotal.io>
      Signed-off-by: Ekta Khanna <ekhanna@pivotal.io>
    • Fix COPY TO ON SEGMENT processed counting · cb63e543
      Adam Lee committed
      The processed variable should not be reset while looping over all
      partitions.
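
      The fix, reduced to a toy sketch (names are illustrative):

      ```python
      def total_processed(rows_per_partition):
          processed = 0                # accumulate across partitions...
          for nrows in rows_per_partition:
              # ...instead of resetting `processed` here, which would report
              # only the last partition's row count
              processed += nrows
          return processed
      ```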
    • Fix COPY TO IGNORE EXTERNAL PARTITIONS · f118f4bd
      Adam Lee committed
      BeginCopy() returns a brand-new CopyState, but the value of
      skip_ext_partition was ignored because it was only set after
      BeginCopy() returned.
      
      It's a simple boolean of struct CopyStmt; there is no need to wrap it
      in options.
    • Update .gitignore files · e0e8f475
      Adam Lee committed
      To have a clean `git status` output.
  3. 18 June 2018, 1 commit
    • docs - gpbackup/gprestore new functionality. (#5157) · 6df82183
      Mel Kiyama committed
      * docs - gpbackup/gprestore new functionality.
      
      --gpbackup new option --jobs to backup tables in parallel.
      --gprestore  --include-table* options support restoring views and sequences.
      
      * docs - gpbackup/gprestore. fixed typos. Updated backup/restore of sequences and views
      
      * docs - gpbackup/gprestore - clarified information on dependent objects.
      
      * docs - gpbackup/gprestore - updated information on locking/quiescent state.
      
      * docs - gpbackup/gprestore - clarify connection in --jobs option.
  4. 16 June 2018, 1 commit
    • Fix incorrect modification of storageAttributes.compress. · 7c82d50f
      Ashwin Agrawal committed
      For CO tables, storageAttributes.compress only conveys whether block
      compression should be applied or not. RLE is performed as stream
      compression within the block, and hence whether
      storageAttributes.compress is true or false doesn't relate to RLE at
      all. So, with rle_type compression, storageAttributes.compress is true
      for compression levels > 1, where block compression is performed along
      with the stream compression. For compression level = 1,
      storageAttributes.compress is always false, as no block compression is
      applied. Since RLE doesn't relate to storageAttributes.compress, there
      is no reason to touch it based on rle_type compression.
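
      The rule described above, as a small illustrative sketch (the function
      name is mine, not the actual code path):

      ```python
      def block_compression_applies(compresstype, compresslevel):
          if compresstype == "rle_type":
              # RLE itself is stream compression inside the block; only
              # levels > 1 add block compression on top of it
              return compresslevel > 1
          return compresstype not in (None, "", "none")
      ```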
      
      Also, the problem manifests more due to the fact that in the
      datumstream layer the AppendOnlyStorageAttributes in DatumStreamWrite
      (`acc->ao_attr.compress`) is used to decide the block type, whereas
      the cdb storage layer functions use the AppendOnlyStorageAttributes
      from AppendOnlyStorageWrite
      (`idesc->ds[i]->ao_write->storageAttributes.compress`). Given this
      difference, changing just one of them, and unnecessarily at that, is
      bound to cause issues during insert.
      
      So, removing the unnecessary and incorrect update to
      AppendOnlyStorageAttributes.
      
      The test case showcases the failing scenario without the patch.
  5. 15 June 2018, 2 commits
  6. 14 June 2018, 4 commits
  7. 13 June 2018, 1 commit
  8. 12 June 2018, 8 commits
  9. 11 June 2018, 5 commits
  10. 09 June 2018, 3 commits
  11. 08 June 2018, 2 commits
    • Update time zone data files to tzdata release 2018e. · f9c94e87
      Tom Lane committed
      This commit pulls in the latest tzdata from Postgres 11. We
      intentionally left out the comment changes to
      `src/backend/utils/adt/datetime.c` because they are not applicable (yet).
      
      > DST law changes in North Korea.  Redefinition of "daylight savings" in
      > Ireland, as well as for some past years in Namibia and Czechoslovakia.
      > Additional historical corrections for Czechoslovakia.
      >
      > With this change, the IANA database models Irish timekeeping as following
      > "standard time" in summer, and "daylight savings" in winter, so that the
      > daylight savings offset is one hour behind standard time not one hour
      > ahead.  This does not change their UTC offset (+1:00 in summer, 0:00 in
      > winter) nor their timezone abbreviations (IST in summer, GMT in winter),
      > though now "IST" is more correctly read as "Irish Standard Time" not "Irish
      > Summer Time".  However, the "is_dst" column in the pg_timezone_names view
      > will now be true in winter and false in summer for the Europe/Dublin zone.
      >
      > Similar changes were made for Namibia between 1994 and 2017, and for
      > Czechoslovakia between 1946 and 1947.
      >
      > So far as I can find, no Postgres internal logic cares about which way
      > tm_isdst is reported; in particular, since commit b2cbced9 we do not
      > rely on it to decide how to interpret ambiguous timestamps during DST
      > transitions.  So I don't think this change will affect any Postgres
      > behavior other than the timezone-view outputs.
      >
      > Discussion: https://postgr.es/m/30996.1525445902@sss.pgh.pa.us
      (cherry picked from commit 234bb985)
      Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
      Co-authored-by: Taylor Vesely <tvesely@pivotal.io>
    • Sync our copy of the timezone library with IANA release tzcode2018e. · 20269256
      Tom Lane committed
      The non-cosmetic changes involve teaching the "zic" tzdata compiler about
      negative DST.  While I'm not currently intending that we start using
      negative-DST data right away, it seems possible that somebody would try
      to use our copy of zic with bleeding-edge IANA data.  So we'd better be
      out in front of this change code-wise, even though it doesn't matter for
      the data file we're shipping.
      
      Discussion: https://postgr.es/m/30996.1525445902@sss.pgh.pa.us
      (cherry picked from commit b45f6613)