---
title: Introduction to PXF
---

The Greenplum Platform Extension Framework (PXF) provides *connectors* that enable you to access data stored in sources external to your Greenplum Database deployment. These connectors map an external data source to a Greenplum Database *external table* definition. When you create the Greenplum Database external table, you identify the external data store and the format of the data via a *server* name and a *profile* name that you provide in the command.

You can query the external table via Greenplum Database, leaving the referenced data in place. Or, you can use the external table to load the data into Greenplum Database for higher performance.
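
For example, given a hypothetical PXF external table named `pxf_ext_orders` and an existing native table named `orders` with a matching column list, you might query the external data in place or load it with an `INSERT ... SELECT`:

``` sql
-- Query the external data in place; the referenced data stays in the
-- external data store (pxf_ext_orders is a hypothetical PXF external table).
SELECT count(*) FROM pxf_ext_orders;

-- Load the external data into a native Greenplum Database table named
-- orders for higher-performance repeated access.
INSERT INTO orders SELECT * FROM pxf_ext_orders;
```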

## <a id="suppplat"></a> Supported Platforms

### <a id="os"></a> Operating Systems

PXF supports the Red Hat Enterprise Linux 64-bit 7.x, CentOS 64-bit 7.x, and Ubuntu 18.04 LTS operating system platforms.

<div class="note">Starting in 6.x, Greenplum does not bundle <code>cURL</code> and instead loads the system-provided library. PXF requires <code>cURL</code> version 7.29.0 or newer. The officially-supported <code>cURL</code> for the CentOS 6.x and Red Hat Enterprise Linux 6.x operating systems is version 7.19.*. Greenplum Database 6 does not support running PXF on CentOS 6.x or RHEL 6.x due to this limitation.</div>

### <a id="java"></a> Java

PXF supports Java 8 and Java 11.


### <a id="hadoop"></a> Hadoop

PXF bundles all of the Hadoop JAR files on which it depends, and supports the following Hadoop component versions:

| PXF Version | Hadoop Version | Hive Server Version | HBase Server Version |
|-------------|----------------|---------------------|-------------|
| 5.10, 5.11, 5.12 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.9 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.8 | 2.x | 1.x | 1.3.2 |

## <a id="arch"></a> Architectural Overview

Your Greenplum Database deployment consists of a master node and multiple segment hosts. A single PXF agent process on each Greenplum Database segment host allocates a worker thread for each segment instance on that host that participates in a query against an external table. The PXF agents on multiple segment hosts communicate with the external data store in parallel.

## <a id="more"></a> About Connectors, Servers, and Profiles

*Connector* is a generic term for the component that encapsulates the implementation details required to read from or write to an external data store. PXF provides built-in connectors to Hadoop (HDFS, Hive, HBase), object stores (Azure, Google Cloud Storage, Minio, S3), and SQL databases (via JDBC).

A PXF *Server* is a named configuration for a connector. A server definition provides the information required for PXF to access an external data source. This configuration information is data-store-specific, and may include server location, access credentials, and other relevant properties.

The Greenplum Database administrator configures at least one server definition for each external data store that Greenplum Database users are permitted to access, and publishes the available server names as appropriate.

When you create the external table, you specify a `SERVER=<server_name>` setting to identify the server configuration from which PXF obtains the location information and credentials for the external data store.

PXF reserves the server name `default`. When configured, the `default` server provides the location and access information for the external data source whenever you omit the `SERVER=<server_name>` setting.

Finally, a PXF *profile* is a named mapping that identifies a specific data format or protocol supported by a specific external data store. PXF supports the text, Avro, JSON, RCFile, Parquet, SequenceFile, and ORC data formats and the JDBC protocol, and provides several built-in profiles as discussed in the following section.
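
For example, the following hypothetical definition (the table, column, path, and server names are illustrative only) selects the built-in `hdfs:text` profile and a server configuration named `hdp3`; omitting the `SERVER` setting would direct PXF to the reserved `default` server instead:

``` sql
-- Sketch of an external table that reads comma-delimited text from HDFS
-- through the hdfs:text profile, using the server configuration named hdp3.
CREATE EXTERNAL TABLE pxf_orders_hdp3 (order_id int, amount numeric)
  LOCATION ('pxf://data/orders?PROFILE=hdfs:text&SERVER=hdp3')
  FORMAT 'TEXT' (delimiter ',');
```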

## <a id="create_external_table"></a>Creating an External Table

PXF implements a Greenplum Database protocol named `pxf` that you can use to create an external table that references data in an external data store. The syntax for a [`CREATE EXTERNAL TABLE`](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command that specifies the `pxf` protocol follows:

``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
        ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION('pxf://<path-to-data>?PROFILE=<profile_name>[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```

The `LOCATION` clause in a `CREATE EXTERNAL TABLE` statement specifying the `pxf` protocol is a URI. This URI identifies the path to, or other information describing, the location of the external data. For example, if the external data store is HDFS, the \<path-to-data\> identifies the absolute path to a specific HDFS file. If the external data store is Hive, \<path-to-data\> identifies a schema-qualified Hive table name.

You use the query portion of the URI, introduced by the question mark (?), to identify the PXF server and profile names.

PXF may require additional information to read or write certain data formats. You provide profile-specific information using the optional \<custom-option\>=\<value\> component of the `LOCATION` string and formatting information via the \<formatting-properties\> component of the string. The custom options and formatting properties supported by a specific profile vary; they are identified in usage documentation for the profile.
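
For example, the following hypothetical definition (the Greenplum table, column, and Hive object names are illustrative only) references a Hive table named `default.sales_info` through the `Hive` profile and the `default` server; the `FORMATTER` formatting property accompanies the `CUSTOM` format that this profile uses:

``` sql
-- Sketch of an external table that reads the Hive table default.sales_info;
-- <path-to-data> is the schema-qualified Hive table name, and FORMATTER is
-- a formatting property supplied with the CUSTOM format.
CREATE EXTERNAL TABLE salesinfo_hiveprofile (location text, month text, total_sales float8)
  LOCATION ('pxf://default.sales_info?PROFILE=Hive')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```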

<caption><span class="tablecap">Table 1. CREATE EXTERNAL TABLE Parameter Values and Descriptions</span></caption>

<a id="creatinganexternaltable__table_pfy_htz_4p"></a>

| Keyword               | Value and Description                                                                                                                                                                                                                                                          |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \<path&#8209;to&#8209;data\>        | A directory, file name, wildcard pattern, table name, etc. The syntax of \<path-to-data\> is dependent upon the external data source.                                                                                                                                                                                                                    |
| PROFILE=\<profile_name\>  | The profile that PXF uses to access the data. PXF supports profiles that access text, Avro, JSON, RCFile, Parquet, SequenceFile, and ORC data in [Hadoop services](access_hdfs.html), [object stores](access_objstore.html), and [other SQL databases](jdbc_pxf.html).  |
| SERVER=\<server_name\>   | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| \<custom&#8209;option\>=\<value\> | Additional options and their values supported by the profile or the server. |
| FORMAT&nbsp;\<value\>| PXF profiles support the `TEXT`, `CSV`, and `CUSTOM` formats.  |
| \<formatting&#8209;properties\> | Formatting properties supported by the profile; for example, the `FORMATTER` or `delimiter`.                                                                   |

**Note:** When you create a PXF external table, you cannot use the `HEADER` option in your formatter specification.
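
To illustrate the optional \<custom-option\> and \<formatting-properties\> components, the hypothetical writable table below (paths and names are illustrative; it assumes a writable `hdfs:text` configuration that accepts the `COMPRESSION_CODEC` option) writes compressed, colon-delimited text to an HDFS directory:

``` sql
-- Sketch of a writable external table; COMPRESSION_CODEC is passed as a
-- profile-specific <custom-option>, and delimiter is a <formatting-property>.
CREATE WRITABLE EXTERNAL TABLE pxf_orders_export (order_id int, item text, price numeric)
  LOCATION ('pxf://data/pxf_examples/orders_out?PROFILE=hdfs:text&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
  FORMAT 'TEXT' (delimiter ':');

-- Inserting into the writable external table exports the rows to the
-- external data store.
INSERT INTO pxf_orders_export SELECT order_id, item, price FROM orders;
```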

## <a id="other"></a> Other PXF Features

Certain PXF connectors and profiles support filter pushdown and column projection. Refer to the following topics for detailed information about this support:

- [About PXF Filter Pushdown](filter_push.html)
- [About Column Projection in PXF](col_project.html)