Commit 46bb4284 authored by Lisa Owen, committed by David Yozie

docs - pxf projection and pushdown info to tbl format; add parquet pushdo… (#9618)

* docs - projection and pushdown info to tbl format; add parquet pushdown info

* combine arithmetic ops in single column

* multi-line header, center, combine is/not null

* add parquet data type write mapping
Parent 7883c3ac
......@@ -8,10 +8,16 @@ PXF supports column projection, and it is always enabled. With column projection
Column projection is automatically enabled for the `pxf` external table protocol. PXF accesses external data sources using different connectors, and column projection support is also determined by the specific connector implementation. The following PXF connector and profile combinations support column projection on read operations:
- PXF Hive Connector, `HiveORC` profile
- PXF JDBC Connector, `Jdbc` profile
- PXF Hadoop and Object Store Connectors, `hdfs:parquet`, `adl:parquet`, `gs:parquet`, `s3:parquet`, and `wasbs:parquet` profiles
- PXF S3 Connector using Amazon S3 Select service, `s3:parquet` and `s3:text` profiles
| Data Source | Connector | Profile(s) |
|-------------|---------------|---------|
| External SQL database | JDBC Connector | Jdbc |
| Hive | Hive Connector | HiveORC, HiveVectorizedORC |
| Hadoop | HDFS Connector | hdfs:parquet |
| Amazon S3 | S3-Compatible Object Store Connectors | s3:parquet |
| Amazon S3 using S3 Select | S3-Compatible Object Store Connectors | s3:parquet, s3:text |
| Google Cloud Storage | GCS Object Store Connector | gs:parquet |
| Azure Blob Storage | Azure Object Store Connector | wasbs:parquet |
| Azure Data Lake | Azure Object Store Connector | adl:parquet |
**Note:** PXF may disable column projection in cases where it cannot successfully serialize a query filter; for example, when the `WHERE` clause resolves to a `boolean` type.
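As an illustrative sketch (the table name, columns, and HDFS path below are hypothetical, not from this document), a query that references only a subset of an external table's columns lets PXF request just those columns from the external source:

```sql
-- Hypothetical readable external table over Parquet data on HDFS.
CREATE EXTERNAL TABLE pxf_sales (id int, region text, amount numeric, sale_date date)
LOCATION ('pxf://data/sales?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

-- Only the "region" and "amount" columns are requested from the
-- external source; the other columns are not transferred.
SELECT region, sum(amount) FROM pxf_sales GROUP BY region;
```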
......
......@@ -32,14 +32,7 @@ SET gp_external_enable_filter_pushdown TO 'on';
**Note:** Some external data sources do not support filter pushdown. Also, filter pushdown may not be supported with certain data types or operators. If a query accesses a data source that does not support filter pushdown for the query constraints, the query is instead executed without filter pushdown (the data is filtered after it is transferred to Greenplum Database).
PXF accesses data sources using different connectors, and filter pushdown support is determined by the specific connector implementation. The following PXF connectors support filter pushdown:
- Hive Connector, all profiles
- HBase Connector
- JDBC Connector
- S3 Connector using the Amazon S3 Select service to access CSV and Parquet data
PXF filter pushdown can be used with these data types (connector-specific):
PXF filter pushdown can be used with these data types (connector- and profile-specific):
- `INT2`, `INT4`, `INT8`
- `CHAR`, `TEXT`
......@@ -48,14 +41,32 @@ PXF filter pushdown can be used with these data types (connector-specific):
- `BOOL`
- `DATE`, `TIMESTAMP` (available only with the JDBC connector and the S3 connector when using S3 Select)
You can use PXF filter pushdown with these operators:
You can use PXF filter pushdown with these arithmetic and logical operators (connector- and profile-specific):
- `<`, `<=`, `>=`, `>`
- `<>`, `=`
- `AND`, `OR`, `NOT`
- `IN` operator on arrays of `INT` and `TEXT` (JDBC connector only)
- `LIKE` (`TEXT` fields, JDBC connector only)
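For example (a sketch; the table and column names are hypothetical), the following session enables filter pushdown and issues a query whose `WHERE` clause uses only pushdown-eligible data types and operators:

```sql
SET gp_external_enable_filter_pushdown TO 'on';

-- Both predicates use pushdown-eligible types (INT4, TEXT) and
-- operators (>=, =, AND), so a supporting connector can apply
-- the filter at the external data source instead of in Greenplum.
SELECT * FROM pxf_sales
WHERE amount >= 1000 AND region = 'EMEA';
```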
PXF accesses data sources using profiles exposed by different connectors, and filter pushdown support is determined by the specific connector implementation. The following PXF profiles support some aspect of filter pushdown:
|Profile | <,&nbsp;&nbsp; >,<br><=,&nbsp;&nbsp; >=,<br>=,&nbsp;&nbsp;<> | LIKE | IS [NOT] NULL | IN | AND | OR | NOT |
|-------|:------------------------:|:----:|:----:|:----:|:----:|:----:|:----:|
| Jdbc | Y | Y | Y | Y | Y | Y | Y |
| *:parquet | Y<sup>1</sup> | N | Y<sup>1</sup> | N | Y<sup>1</sup> | Y<sup>1</sup> | Y<sup>1</sup> |
| s3:parquet and s3:text with S3-Select | Y | N | Y | Y | Y | Y | Y |
| HBase | Y | N | Y | N | Y | Y | N |
| Hive | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveText | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveRC | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveORC | Y, Y<sup>2</sup> | N | Y | Y | Y, Y<sup>2</sup> | Y, Y<sup>2</sup> | Y |
| HiveVectorizedORC | Y, Y<sup>2</sup> | N | Y | Y | Y, Y<sup>2</sup> | Y, Y<sup>2</sup> | Y |
<br><sup>1</sup>&nbsp;PXF, rather than the remote system, applies the predicate; this reduces CPU usage and the memory footprint.
<br><sup>2</sup>&nbsp;PXF supports partition pruning based on partition keys.
PXF does not support filter pushdown for any profile not mentioned in the table above, including: `*:avro`, `*:AvroSequenceFile`, `*:SequenceFile`, `*:json`, `*:text`, and `*:text:multi`.
To summarize, all of the following criteria must be met for filter pushdown to occur:
* You enable external table filter pushdown by setting the `gp_external_enable_filter_pushdown` server configuration parameter to `'on'`.
......
......@@ -33,26 +33,61 @@ Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_
## <a id="datatype_map"></a>Data Type Mapping
To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type. The following table summarizes the external mapping rules:
To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type.
Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports. These annotations specify how to interpret the primitive type. For example, Parquet stores both `INTEGER` and `DATE` types as the `INT32` primitive type. An annotation identifies the original type as a `DATE`.
### <a id="datatype_map_read "></a>Read Mapping
<a id="p2g_type_mapping_table"></a>
| Parquet Data Type | PXF/Greenplum Data Type |
|-------------------|-------------------------|
| boolean | Boolean |
| byte_array | Bytea, Text |
| double | Float8 |
| fixed\_len\_byte\_array | Numeric |
| float | Real |
| int\_8, int\_16 | Smallint, Integer |
| int64 | Bigint |
| int96 | Timestamp, Timestamptz |
<div class="note">When writing to Parquet:
<ul>
<li>PXF localizes a <code>timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>.</li>
<li>PXF converts a <code>timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion.</li>
</ul></div>
PXF uses the following data type mapping when reading Parquet data:
| Parquet Data Type | Original Type | PXF/Greenplum Data Type |
|-------------------|---------------|--------------------------|
| binary (byte_array) | Date | Date |
| binary (byte_array) | Timestamp_millis | Timestamp |
| binary (byte_array) | all others | Text |
| binary (byte_array) | -- | Bytea |
| boolean | -- | Boolean |
| double | -- | Float8 |
| fixed\_len\_byte\_array | -- | Numeric |
| float | -- | Real |
| int32 | Date | Date |
| int32 | Decimal | Numeric |
| int32 | int_8 | Smallint |
| int32 | int_16 | Smallint |
| int32 | -- | Integer |
| int64 | Decimal | Numeric |
| int64 | -- | Bigint |
| int96 | -- | Timestamp |
**Note**: PXF supports filter predicate pushdown on all Parquet data types listed above, *except* the `fixed_len_byte_array` and `int96` types.
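The read mapping above might be used as follows (a sketch; the table name, column names, and HDFS path are hypothetical):

```sql
-- Hypothetical readable external table over Parquet data; each
-- Greenplum column type follows the read mapping above:
-- int32 -> Integer, int64 -> Bigint, binary -> Text,
-- double -> Float8, int32 (Date) -> Date.
CREATE EXTERNAL TABLE pxf_parquet_read (
  id       integer,
  total    bigint,
  label    text,
  price    float8,
  sold_on  date
)
LOCATION ('pxf://data/parquet_dir?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```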
### <a id="datatype_map_Write "></a>Write Mapping
PXF uses the following data type mapping when writing Parquet data:
| PXF/Greenplum Data Type | Original Type | Parquet Data Type |
|-------------------|---------------|--------------------------|
| Boolean | -- | boolean |
| Bytea | -- | binary |
| Bigint | -- | int64 |
| Smallint | int_16 | int32 |
| Integer | -- | int32 |
| Real | -- | float |
| Float8 | -- | double |
| Numeric/Decimal | Decimal | fixed\_len\_byte\_array |
| Timestamp<sup>1</sup> | -- | int96 |
| Timestamptz<sup>2</sup> | -- | int96 |
| Date | utf8 | binary |
| Time | utf8 | binary |
| Varchar | utf8 | binary |
| Text | utf8 | binary |
| OTHERS | -- | UNSUPPORTED |
<br><sup>1</sup>&nbsp;PXF localizes a <code>Timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>.
<br><sup>2</sup>&nbsp;PXF converts a <code>Timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion.
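The write mapping above might be exercised as follows (a sketch; the table name, column names, and HDFS path are hypothetical):

```sql
-- Hypothetical writable external table; PXF writes each column
-- using the Parquet types in the mapping above:
-- Integer -> int32, Numeric -> fixed_len_byte_array,
-- Text -> binary (utf8), Timestamp -> int96.
CREATE WRITABLE EXTERNAL TABLE pxf_parquet_write (
  id         integer,
  amount     numeric(10,2),
  note       text,
  created_at timestamp
)
LOCATION ('pxf://data/parquet_out?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```

Note that, per footnote 1 above, a `timestamp` value inserted into such a table is localized to the system timezone and converted to UTC before being stored as `int96`.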
## <a id="profile_cet"></a>Creating the External Table
......