Commit 9e09ccdf authored by Lisa Owen, committed by David Yozie

docs - enhance pxf filter pushdown info (#5290)

* docs - enhance pxf filter pushdown info

* edits requested by david

* misc edits
Parent 88c0e5de
......@@ -26,7 +26,7 @@
<li><codeph>s3://</codeph> accesses files in an Amazon S3 bucket. See <xref
href="g-s3-protocol.xml#amazon-emr"/>.</li>
<li>The <codeph>pxf://</codeph> protocol accesses external HDFS files and Hive tables using the Greenplum Platform Extension Framework (PXF). See <xref href="g-pxf-protocol.xml"></xref>.</li>
<li>The <codeph>pxf://</codeph> protocol accesses external HDFS files and HBase and Hive tables using the Greenplum Platform Extension Framework (PXF). See <xref href="g-pxf-protocol.xml"></xref>.</li>
</ul></p>
<p>External tables access external files from within the database as if they are regular
database tables. External tables defined with the
......@@ -63,10 +63,11 @@
data warehousing</li>
<li id="du210102">Reading external table data in parallel from multiple Greenplum
database segment instances, to optimize large load operations</li>
<li id="du210103">Filter pushdown. If a query contains <codeph>WHERE</codeph> clause,
it may be passed to the external data source. See <xref
<li id="du210103">Filter pushdown. If a query contains a <codeph>WHERE</codeph> clause,
it may be passed to the external data source. Refer to the <xref
href="../../ref_guide/config_params/guc-list.xml#gp_external_enable_filter_pushdown"
/>. Note that this feature is currently supported only with the
/> server configuration parameter discussion for more information. Note that
this feature is currently supported only with the
<codeph>pxf</codeph> protocol (see <xref href="g-pxf-protocol.xml"/>).</li>
</ul><p>Readable external tables allow only <codeph>SELECT</codeph> operations.</p>
</li>
......
......@@ -5,7 +5,7 @@
<shortdesc>Data managed by your organization may already reside in external sources. The Greenplum Platform Extension Framework (PXF) provides access to this external data via built-in connectors that map an external data source to a Greenplum Database table definition.</shortdesc>
<body>
<p>PXF is installed with HDFS, Hive, and HBase connectors. These connectors enable you to read external HDFS file system and Hive and HBase table data stored in text, Avro, JSON, RCFile, Parquet, SequenceFile, and ORC formats.</p>
<note>PXF does not currently support filter predicate pushdown for the HDFS connector.</note>
<note>PXF supports filter pushdown in the Hive connector only.</note>
<p>The Greenplum Platform Extension Framework includes a protocol C library and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Greenplum Database segment host. This long-running process concurrently serves multiple query requests.</p>
<p>For detailed information about the architecture of and using PXF, refer to the <xref href="../../pxf/overview_pxf.html" type="topic" format="html">Greenplum Platform Extension Framework (PXF)</xref> documentation.</p>
</body>
......
......@@ -660,14 +660,108 @@ In the following example, you will create and populate a Hive table stored in OR
`intarray` and `propmap` are again serialized as text strings.
## <a id="partitionfiltering"></a>Partition Filter Pushdown
## <a id="hive_partitioning"></a>Accessing Heterogeneous Partitioned Data
The PXF Hive connector supports the Hive partitioning feature and directory structure. This enables partition exclusion on selected HDFS files comprising a Hive table. To use the partition filtering feature to reduce network traffic and I/O, run a query on a PXF external table using a `WHERE` clause that refers to a specific partition column in a partitioned Hive table.
You can use the PXF `Hive` profile with any Hive file storage types. With the `Hive` profile, you can access heterogeneous format data in a single Hive table where the partitions may be stored in different file formats.
To take advantage of PXF partition filtering pushdown, the Hive and PXF partition field names must be the same. Otherwise, PXF ignores partition filtering and the filtering is performed on the Greenplum Database side, impacting performance.
**Note:** The Hive connector filters only on partition columns, not on other table attributes.
PXF filter pushdown is disabled by default. You configure PXF filter pushdown as described in [Configuring Filter Pushdown](using_pxf.html#filter-pushdown).
### <a id="hive_homog_part"></a>Example: Using the Hive Profile to Access Partitioned Homogenous Data
In this example, you use the `Hive` profile to query a Hive table named `sales_part` that you partition on the `delivery_state` and `delivery_city` fields. You then create a Greenplum Database external table to query `sales_part`. The procedure includes specific examples that illustrate filter pushdown.
1. Create a Hive table named `sales_part` with two partition columns, `delivery_state` and `delivery_city`:
``` sql
hive> CREATE TABLE sales_part (name string, type string, supplier_key int, price double)
PARTITIONED BY (delivery_state string, delivery_city string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```
2. Load data into this Hive table and add some partitions:
``` sql
hive> INSERT INTO TABLE sales_part
PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno')
VALUES ('block', 'widget', 33, 15.17);
hive> INSERT INTO TABLE sales_part
PARTITION(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento')
VALUES ('cube', 'widget', 11, 1.17);
hive> INSERT INTO TABLE sales_part
PARTITION(delivery_state = 'NEVADA', delivery_city = 'Reno')
VALUES ('dowel', 'widget', 51, 31.82);
hive> INSERT INTO TABLE sales_part
PARTITION(delivery_state = 'NEVADA', delivery_city = 'Las Vegas')
VALUES ('px49', 'pipe', 52, 99.82);
```
3. Query the `sales_part` table:
``` sql
hive> SELECT * FROM sales_part;
```
A `SELECT *` statement on a Hive partitioned table shows the partition fields at the end of the record.
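   For example, with the four rows loaded above, the query returns output similar to the following (row order may vary); note the `delivery_state` and `delivery_city` partition columns appended at the end of each record:

   ``` pre
   block   widget  33   15.17   CALIFORNIA   Fresno
   cube    widget  11   1.17    CALIFORNIA   Sacramento
   dowel   widget  51   31.82   NEVADA       Reno
   px49    pipe    52   99.82   NEVADA       Las Vegas
   ```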
### <a id="example_hive_part_multi" class="no-quick-link"></a>Example: Using the Hive Profile with Heterogeneous Data
3. Examine the Hive/HDFS directory structure for the `sales_part` table:
In this example, you create a partitioned Hive external table. The table is composed of the HDFS data files associated with the `sales_info` (text format) and `sales_info_rcfile` (RC format) Hive tables you created in previous exercises. You will partition the data by year, assigning the data from `sales_info` to the year 2013, and the data from `sales_info_rcfile` to the year 2016. (Ignore at the moment the fact that the tables contain the same data.) You will then use the PXF `Hive` profile to query this partitioned Hive external table.
``` shell
$ sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/sales_part
/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Fresno/
/apps/hive/warehouse/sales_part/delivery_state=CALIFORNIA/delivery_city=Sacramento/
/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Reno/
/apps/hive/warehouse/sales_part/delivery_state=NEVADA/delivery_city=Las Vegas/
```
4. Create a PXF external table to read the partitioned `sales_part` Hive table. To take advantage of partition filter pushdown, define fields corresponding to the Hive partition fields at the end of the `CREATE EXTERNAL TABLE` attribute list.
``` shell
$ psql -d postgres
```
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_sales_part(
item_name TEXT, item_type TEXT,
supplier_key INTEGER, item_price DOUBLE PRECISION,
delivery_state TEXT, delivery_city TEXT)
LOCATION ('pxf://sales_part?Profile=Hive')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
5. Query the table:
``` sql
postgres=# SELECT * FROM pxf_sales_part;
```
6. Perform another query (no pushdown) on `pxf_sales_part` to return records where the `delivery_city` is `Sacramento` and `item_name` is `cube`:
``` sql
postgres=# SELECT * FROM pxf_sales_part WHERE delivery_city = 'Sacramento' AND item_name = 'cube';
```
   The query filters on the `delivery_city` partition `Sacramento`. The filter on `item_name` is not pushed down, since it is not a partition column; it is applied on the Greenplum Database side after all of the data in the `Sacramento` partition is transferred for processing.
7. Query (with pushdown) for all records where `delivery_state` is `CALIFORNIA`:
``` sql
postgres=# SET gp_external_enable_filter_pushdown=on;
postgres=# SELECT * FROM pxf_sales_part WHERE delivery_state = 'CALIFORNIA';
```
This query reads all of the data in the `CALIFORNIA` `delivery_state` partition, regardless of the city.
### <a id="hive_heter_part"></a>Example: Using the Hive Profile to Access Partitioned Heterogeneous Data
You can use the PXF `Hive` profile with any Hive file storage types. With the `Hive` profile, you can access heterogeneous format data in a single Hive table where the partitions may be stored in different file formats.
In this example, you create a partitioned Hive external table. The table is composed of the HDFS data files associated with the `sales_info` (text format) and `sales_info_rcfile` (RC format) Hive tables that you created in previous exercises. You will partition the data by year, assigning the data from `sales_info` to the year 2013, and the data from `sales_info_rcfile` to the year 2016. (Ignore at the moment the fact that the tables contain the same data.) You will then use the PXF `Hive` profile to query this partitioned Hive external table.
1. Create a Hive external table named `hive_multiformpart` that is partitioned by a string field named `year`:
......@@ -767,3 +861,38 @@ In this example, you create a partitioned Hive external table. The table is comp
-----
433
```
## <a id="default_part"></a>Using PXF with Hive Default Partitions
This topic describes a difference in query results between Hive and PXF queries when Hive tables use a default partition. When dynamic partitioning is enabled in Hive, a partitioned table may store data in a default partition. Hive creates a default partition when the value of a partitioning column does not match the defined type of the column (for example, when a NULL value is used for any partitioning column). In Hive, any query that includes a filter on a partition column *excludes* any data that is stored in the table's default partition.
Similar to Hive, PXF represents a table's partitioning columns as columns that are appended to the end of the table. However, PXF translates any column value in a default partition to a NULL value. This means that a Greenplum Database query that includes an `IS NULL` filter on a partitioning column can return different results than the same Hive query.
Consider a Hive partitioned table that is created with the statement:
``` sql
hive> CREATE TABLE sales (order_id bigint, order_amount float) PARTITIONED BY (xdate date);
```
The table is loaded with five rows that contain the following data:
``` pre
1.0 1900-01-01
2.2 1994-04-14
3.3 2011-03-31
4.5 NULL
5.0 2013-12-06
```
Inserting row 4 creates a Hive default partition, because the partition column `xdate` contains a null value.
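For illustration, one way to produce such a row is a dynamic-partition insert (the `order_id` value shown here is hypothetical). Hive stores the row under a directory named by the `hive.exec.default.partition.name` setting, `__HIVE_DEFAULT_PARTITION__` by default:

``` sql
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT INTO TABLE sales PARTITION (xdate)
      VALUES (4, 4.5, NULL);
```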
In Hive, any query that filters on the partition column omits data in the default partition. For example, the following query returns no rows:
``` sql
hive> SELECT * FROM sales WHERE xdate IS null;
```
However, if you map this Hive table to a PXF external table in Greenplum Database, all default partition values are translated into actual NULL values. In Greenplum Database, executing the same query against the PXF external table returns row 4 as the result, because the filter matches the NULL value.
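For example, a PXF external table mapped to the Hive `sales` table might be defined and queried as follows; the table name `pxf_sales` and the Greenplum column types are illustrative assumptions:

``` sql
postgres=# CREATE EXTERNAL TABLE pxf_sales (order_id bigint, order_amount float8, xdate date)
             LOCATION ('pxf://sales?Profile=Hive')
             FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
postgres=# SELECT * FROM pxf_sales WHERE xdate IS NULL;
```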
Keep this behavior in mind when you execute `IS NULL` queries on Hive partitioned tables.
......@@ -92,32 +92,41 @@ GRANT INSERT ON PROTOCOL pxf TO bill;
PXF supports filter pushdown. When filter pushdown is enabled, the constraints from the `WHERE` clause of a `SELECT` query can be extracted and passed to the external data source for filtering. This process can improve query performance, and can also reduce the amount of data that is transferred to Greenplum Database.
You enable or disable filter pushdown for all external table protocols, including `pxf`, by setting the `gp_external_enable_filter_pushdown` configuration parameter. The default value is `off`; set it to `on` to enable filter pushdown.
You enable or disable filter pushdown for all external table protocols, including `pxf`, by setting the `gp_external_enable_filter_pushdown` server configuration parameter. The default value of this configuration parameter is `off`; set it to `on` to enable filter pushdown. For example:
**Note:** Some data sources do not support filter pushdown. Also, filter pushdown may not be supported with certain data types or operators. If a query accesses a data source that does not support filter push-down for the query constraints, the query is instead executed without filter pushdown (the data is first transferred to Greenplum Database and then filtered).
``` sql
SHOW gp_external_enable_filter_pushdown;
SET gp_external_enable_filter_pushdown TO 'on';
```
**Note:** Some external data sources do not support filter pushdown. Also, filter pushdown may not be supported with certain data types or operators. If a query accesses a data source that does not support filter pushdown for the query constraints, the query is instead executed without filter pushdown (the data is filtered after it is transferred to Greenplum Database).
PXF accesses data sources using different connectors, and filter pushdown support is determined by the specific connector implementation. The following PXF connectors support filter pushdown:
PXF accesses data sources using different connectors, and filter pushdown support is determined by the specific connector implementation. The following connectors support filter pushdown:
* HBase
* Hive
- Hive Connector
PXF filter pushdown can be used with these data types:
* `INT`, array of `INT`
* `FLOAT`
* `NUMERIC`
* `BOOL`
* `CHAR`, `TEXT`, array of `TEXT`
* `DATE`, `TIMESTAMP`.
- `INT`, array of `INT`
- `FLOAT`
- `NUMERIC`
- `BOOL`
- `CHAR`, `TEXT`, array of `TEXT`
- `DATE`, `TIMESTAMP`
PXF filter pushdown can be used with these operators:
* `<`, `<=`, `>=`, `>`
* `<>`, `=`
* `IN`
* `LIKE` (only for `TEXT` fields).
- `<`, `<=`, `>=`, `>`
- `<>`, `=`
- `IN`
- `LIKE` (only for `TEXT` fields)
To summarize, all of the following criteria must be met for filter pushdown to occur:
* The Greenplum Database protocol that is used to access external data must use filter pushdown. The `pxf` external table protocol can use pushdown.
* The external data source that is being accessed must support pushdown. For example, both HBase and Hive support pushdown.
* For queries that use the `pxf` protocol, the underlying PXF connector must support filter pushdown. For example, the HBase and Hive connectors support pushdown.
* You enable external table filter pushdown by setting the `gp_external_enable_filter_pushdown` server configuration parameter to `'on'`.
* The Greenplum Database protocol that you use to access the external data source must support filter pushdown. The `pxf` external table protocol supports pushdown.
* The external data source that you are accessing must support pushdown. For example, HBase and Hive support pushdown.
* For queries on external tables that you create with the `pxf` protocol, the underlying PXF connector must also support filter pushdown. For example, only the PXF Hive connector supports pushdown.
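As a quick sketch of these criteria in practice, the following session enables pushdown and runs a query whose `WHERE` constraint uses a supported type and operator; the table name assumes a PXF external table like those created in the Hive examples:

``` sql
-- confirm the current setting, then enable pushdown for this session
SHOW gp_external_enable_filter_pushdown;
SET gp_external_enable_filter_pushdown TO 'on';
-- the constraint on a partition column can be passed to the Hive connector
SELECT * FROM pxf_sales_part WHERE delivery_state = 'NEVADA';
```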
## <a id="built-inprofiles"></a> PXF Profiles
......