Commit a796bc69 authored by Lisa Owen, committed by GitHub

docs - remove now unused pxf doc files, update exclusions (#10372)

Parent 95ef455d
@@ -84,4 +84,4 @@ template_variables:
 support_link: <a href="https://github.com/greenplum-db/gpdb/wiki">Wiki</a>
 support_url: https://greenplum.org
-broken_link_exclusions: iefix|arrowhead|overview_pxf.html|pxf.html|pxf-cluster.html|pxf_kerbhdfs.html|cfginitstart_pxf.html|init_pxf.html|access_hdfs.html|intro_pxf.html|using_pxf.html|client_instcfg.html|install_java.html
+broken_link_exclusions: iefix|arrowhead|overview_pxf.html|pxf.html|pxf-cluster.html|pxf_kerbhdfs.html|cfginitstart_pxf.html|init_pxf.html|access_hdfs.html|intro_pxf.html|using_pxf.html|client_instcfg.html|install_java.html|pxfuserimpers.html|jdbc_cfg.html
---
title: About the PXF Installation and Configuration Directories
---
PXF is installed on your master and segment nodes when you install Greenplum Database.
## <a id="installed"></a>PXF Installation Directories
The following PXF files and directories are installed in your Greenplum Database cluster when you install Greenplum. These files/directories are relative to the PXF installation directory `$GPHOME/pxf`:
| Directory | Description |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| apache&#8209;tomcat/ | The PXF Tomcat directory. |
| bin/ | The PXF script and executable directory. |
| conf/ | The PXF internal configuration directory. This directory contains the `pxf-env-default.sh` and `pxf-profiles-default.xml` configuration files. After initializing PXF, this directory will also include the `pxf-private.classpath` file. |
| lib/ | The PXF library directory. |
| templates/ | Configuration templates for PXF. |
## <a id="runtime"></a>PXF Runtime Directories
During initialization and startup, PXF creates the following internal directories in `$GPHOME/pxf`:
| Directory | Description |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| pxf&#8209;service/ | After initializing PXF, the PXF service instance directory. |
| run/ | After starting PXF, the PXF run directory. Includes a PXF Catalina process ID file. |
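For example, after you initialize and start PXF, a listing of `$GPHOME/pxf` might include the directories described above (a sketch; additional files may also be present, and contents vary by PXF version):
``` shell
gpadmin@gpmaster$ ls $GPHOME/pxf
apache-tomcat  bin  conf  lib  pxf-service  run  templates
```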
## <a id="usercfg"></a>PXF User Configuration Directories
Also during initialization, PXF populates a user configuration directory that you choose (`$PXF_CONF`) with the following subdirectories and template files:
| Directory | Description |
|-----------|---------------|
| conf/ | The location of user-customizable PXF configuration files: `pxf-env.sh`, `pxf-log4j.properties`, and `pxf-profiles.xml`. |
| keytabs/ | The default location for the PXF service Kerberos principal keytab file. |
| lib/ | The default PXF user runtime library directory. |
| logs/ | The PXF runtime log file directory. Includes `pxf-service.log` and the Tomcat-related log `catalina.out`. The `logs/` directory and log files are readable only by the `gpadmin` user. |
| servers/ | The server configuration directory; each subdirectory identifies the name of a server. The default server is named `default`. The Greenplum Database administrator may configure other servers. |
| templates/ | The configuration directory for connector server template files. |
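For example, listing `$PXF_CONF` after initialization might show these subdirectories (a sketch; assumes `$PXF_CONF` is set in your environment):
``` shell
gpadmin@gpmaster$ ls $PXF_CONF
conf  keytabs  lib  logs  servers  templates
```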
Refer to [Initializing PXF](init_pxf.html) and [Starting PXF](cfginitstart_pxf.html#start_pxf) for detailed information about the PXF initialization and startup commands and procedures.
---
title: Accessing Hadoop with PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
PXF is compatible with Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. PXF is installed with HDFS, Hive, and HBase connectors. You use these connectors to access varied formats of data from these Hadoop distributions.
<div class="note">In previous versions of Greenplum Database, you may have used the <code>gphdfs</code> external table protocol to access data stored in Hadoop. Greenplum Database version 6.0.0 removes the <code>gphdfs</code> protocol. Use PXF and the <code>pxf</code> external table protocol to access Hadoop in Greenplum Database version 6.x.</div>
## <a id="hdfs_arch"></a>Architecture
HDFS is the primary distributed storage mechanism used by Apache Hadoop. When a user or application performs a query on a PXF external table that references an HDFS file, the Greenplum Database master node dispatches the query to all segment hosts. Each segment instance contacts the PXF agent running on its host. When it receives the request from a segment instance, the PXF agent:
1. Allocates a worker thread to serve the request from a segment.
2. Invokes the HDFS Java API to request metadata information for the HDFS file from the HDFS NameNode.
3. Provides the metadata information returned by the HDFS NameNode to the segment instance.
<span class="figtitleprefix">Figure: </span>PXF-to-Hadoop Architecture
<img src="graphics/pxfarch.png" class="image" />
A segment instance uses its Greenplum Database `gp_segment_id` and the file block information described in the metadata to assign itself a specific portion of the query data. The segment instance then sends a request to the PXF agent to read the assigned data. This data may reside on one or more HDFS DataNodes.
The PXF agent invokes the HDFS Java API to read the data and delivers it to the segment instance. The segment instance delivers its portion of the data to the Greenplum Database master node. This communication occurs across segment hosts and segment instances in parallel.
## <a id="hadoop_prereq"></a>Prerequisites
Before working with Hadoop data using PXF, ensure that:
- You have configured and initialized PXF, and PXF is running on each Greenplum Database segment host. See [Configuring PXF](instcfg_pxf.html) for additional information.
- You have configured the PXF Hadoop Connectors that you plan to use. Refer to [Configuring PXF Hadoop Connectors](client_instcfg.html) for instructions. If you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
- If user impersonation is enabled (the default), ensure that each Greenplum Database user/role that will access HDFS files and directories via external tables has been granted read (and write, as appropriate) permission to those files and directories. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
- Time is synchronized between the Greenplum Database segment hosts and the external Hadoop systems.
## <a id="hdfs_cmdline"></a>HDFS Shell Command Primer
Examples in the PXF Hadoop topics access files on HDFS. You can access files that already exist in your HDFS cluster, or you can follow the steps in the examples to create new files.
A Hadoop installation includes command-line tools that interact directly with your HDFS file system. These tools support typical file system operations that include copying and listing files, changing file permissions, and so forth. You run these tools on a system with a Hadoop client installation. By default, Greenplum Database hosts do not
include a Hadoop client installation.
The HDFS file system command syntax is `hdfs dfs <options> [<file>]`. Invoked with no options, `hdfs dfs` lists the file system options supported by the tool.
The user invoking the `hdfs dfs` command must have read privileges on the HDFS data store to list and view directory and file contents, and write permission to create directories and files.
The `hdfs dfs` options used in the PXF Hadoop topics are:
| Option | Description |
|-------|-------------------------------------|
| `-cat` | Display file contents. |
| `-mkdir` | Create a directory in HDFS. |
| `-put` | Copy a file from the local file system to HDFS. |
Examples:
Create a directory in HDFS:
``` shell
$ hdfs dfs -mkdir -p /data/exampledir
```
Copy a text file from your local file system to HDFS:
``` shell
$ hdfs dfs -put /tmp/example.txt /data/exampledir/
```
Display the contents of a text file located in HDFS:
``` shell
$ hdfs dfs -cat /data/exampledir/example.txt
```
## <a id="hadoop_connectors"></a>Connectors, Data Formats, and Profiles
The PXF Hadoop connectors provide built-in profiles to support the following data formats:
- Text
- Avro
- JSON
- ORC
- Parquet
- RCFile
- SequenceFile
- AvroSequenceFile
The PXF Hadoop connectors expose the following profiles to read, and in many cases write, these supported data formats:
| Data Source | Data Format | Profile Name(s) | Deprecated Profile Name |
|-----|------|---------|------------|
| HDFS | delimited single line [text](hdfs_text.html#profile_text) | hdfs:text | HdfsTextSimple |
| HDFS | delimited [text with quoted linefeeds](hdfs_text.html#profile_textmulti) | hdfs:text:multi | HdfsTextMulti |
| HDFS | [Avro](hdfs_avro.html) | hdfs:avro | Avro |
| HDFS | [JSON](hdfs_json.html) | hdfs:json | Json |
| HDFS | [Parquet](hdfs_parquet.html) | hdfs:parquet | Parquet |
| HDFS | AvroSequenceFile | hdfs:AvroSequenceFile | n/a |
| HDFS | [SequenceFile](hdfs_seqfile.html) | hdfs:SequenceFile | SequenceWritable |
| [Hive](hive_pxf.html) | stored as TextFile | Hive, [HiveText](hive_pxf.html#hive_text) | n/a |
| [Hive](hive_pxf.html) | stored as SequenceFile | Hive | n/a |
| [Hive](hive_pxf.html) | stored as RCFile | Hive, [HiveRC](hive_pxf.html#hive_hiverc) | n/a |
| [Hive](hive_pxf.html) | stored as ORC | Hive, [HiveORC](hive_pxf.html#hive_orc), HiveVectorizedORC | n/a |
| [Hive](hive_pxf.html) | stored as Parquet | Hive | n/a |
| [HBase](hbase_pxf.html) | Any | HBase | n/a |
You provide the profile name when you specify the `pxf` protocol on a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table that references a Hadoop file, directory, or table. For example, the following command creates an external table that uses the default server and specifies the profile named `hdfs:text`:
``` sql
CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
```
---
title: Accessing Azure, Google Cloud Storage, Minio, and S3 Object Stores with PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
PXF is installed with connectors to Azure Blob Storage, Azure Data Lake, Google Cloud Storage, Minio, and S3 object stores.
## <a id="objstore_prereq"></a>Prerequisites
Before working with object store data using PXF, ensure that:
- You have configured and initialized PXF, and PXF is running on each Greenplum Database segment host. See [Configuring PXF](instcfg_pxf.html) for additional information.
- You have configured the PXF Object Store Connectors that you plan to use. Refer to [Configuring Connectors to Azure and Google Cloud Storage Object Stores](objstore_cfg.html) and [Configuring Connectors to Minio and S3 Object Stores](s3_objstore_cfg.html) for instructions.
- Time is synchronized between the Greenplum Database segment hosts and the external object store systems.
## <a id="objstore_connectors"></a>Connectors, Data Formats, and Profiles
The PXF object store connectors provide built-in profiles to support the following data formats:
- Text
- Avro
- JSON
- Parquet
- AvroSequenceFile
- SequenceFile
The PXF connectors to Azure expose the following profiles to read, and in many cases write, these supported data formats:
| Data Format | Azure Blob Storage | Azure Data Lake |
|-----|------|---------|
| delimited single line [plain text](objstore_text.html) | wasbs:text | adl:text |
| delimited [text with quoted linefeeds](objstore_text.html) | wasbs:text:multi | adl:text:multi |
| [Avro](objstore_avro.html) | wasbs:avro | adl:avro |
| [JSON](objstore_json.html) | wasbs:json | adl:json |
| [Parquet](objstore_parquet.html) | wasbs:parquet | adl:parquet |
| AvroSequenceFile | wasbs:AvroSequenceFile | adl:AvroSequenceFile |
| [SequenceFile](objstore_seqfile.html) | wasbs:SequenceFile | adl:SequenceFile |
Similarly, the PXF connectors to Google Cloud Storage, Minio, and S3 expose these profiles:
| Data Format | Google Cloud Storage | S3 or Minio |
|-----|------|---------|
| delimited single line [plain text](objstore_text.html) | gs:text | s3:text |
| delimited [text with quoted linefeeds](objstore_text.html) | gs:text:multi | s3:text:multi |
| [Avro](objstore_avro.html) | gs:avro | s3:avro |
| [JSON](objstore_json.html) | gs:json | s3:json |
| [Parquet](objstore_parquet.html) | gs:parquet | s3:parquet |
| AvroSequenceFile | gs:AvroSequenceFile | s3:AvroSequenceFile |
| [SequenceFile](objstore_seqfile.html) | gs:SequenceFile | s3:SequenceFile |
You provide the profile name when you specify the `pxf` protocol on a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table that references a file or directory in the specific object store.
## <a id="sample_ddl"></a>Sample CREATE EXTERNAL TABLE Commands
<div class="note">When you create an external table that references a file or directory in an object store, you must specify a <code>SERVER</code> in the <code>LOCATION</code> URI.</div>
The following command creates an external table that references a text file on S3. It specifies the profile named `s3:text` and the server configuration named `s3srvcfg`:
<pre>
CREATE EXTERNAL TABLE pxf_s3_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://S3_BUCKET/pxf_examples/pxf_s3_simple.txt?<b>PROFILE=s3:text&SERVER=s3srvcfg</b>')
FORMAT 'TEXT' (delimiter=E',');
</pre>
The following command creates an external table that references a text file on Azure Blob Storage. It specifies the profile named `wasbs:text` and the server configuration named `wasbssrvcfg`. You would provide the Azure Blob Storage container identifier and your Azure Blob Storage account name.
<pre>
CREATE EXTERNAL TABLE pxf_wasbs_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://<b>AZURE_CONTAINER@YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME</b>.blob.core.windows.net/path/to/blob/file?<b>PROFILE=wasbs:text&SERVER=wasbssrvcfg</b>')
FORMAT 'TEXT';
</pre>
The following command creates an external table that references a text file on Azure Data Lake. It specifies the profile named `adl:text` and the server configuration named `adlsrvcfg`. You would provide your Azure Data Lake account name.
<pre>
CREATE EXTERNAL TABLE pxf_adl_text(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://<b>YOUR_ADL_ACCOUNT_NAME</b>.azuredatalakestore.net/path/to/file?<b>PROFILE=adl:text&SERVER=adlsrvcfg</b>')
FORMAT 'TEXT';
</pre>
The following command creates an external table that references a JSON file on Google Cloud Storage. It specifies the profile named `gs:json` and the server configuration named `gcssrvcfg`:
<pre>
CREATE EXTERNAL TABLE pxf_gsc_json(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://dir/subdir/file.json?<b>PROFILE=gs:json&SERVER=gcssrvcfg</b>')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
</pre>
---
title: About Accessing the S3 Object Store
---
PXF is installed with a connector to the S3 object store. PXF supports the following additional runtime features with this connector:
- Overriding the S3 credentials specified in the server configuration by providing them in the `CREATE EXTERNAL TABLE` command DDL.
- Using the Amazon S3 Select service to read certain CSV and Parquet data from S3.
## <a id="s3_override"></a>Overriding the S3 Server Configuration with DDL
If you are accessing an S3-compatible object store, you can override the credentials in an S3 server configuration by directly specifying the S3 access ID and secret key via these custom options in the `CREATE EXTERNAL TABLE` `LOCATION` clause:
| Custom Option | Value Description |
|-------|-------------------------------------|
| accesskey | The AWS account access key ID. |
| secretkey | The secret key associated with the AWS access key ID. |
For example:
<pre>CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
LOCATION ('pxf://S3_BUCKET/dir/file.txt?PROFILE=s3:text&SERVER=s3srvcfg<b>&accesskey=YOURKEY&secretkey=YOURSECRET</b>')
FORMAT 'TEXT' (delimiter=E',');</pre>
<div class="note warning">Credentials that you provide in this manner are visible as part of the external table definition. Do not use this method of passing credentials in a production environment.</div>
PXF does not support overriding Azure, Google Cloud Storage, and Minio server credentials in this manner at this time.
Refer to [Configuration Property Precedence](cfg_server.html#override) for detailed information about the precedence rules that PXF uses to obtain configuration property settings for a Greenplum Database user.
## <a id="s3_select"></a>Using the Amazon S3 Select Service
Refer to [Reading CSV and Parquet Data from S3 Using S3 Select](read_s3_s3select.html) for specific information on how PXF can use the Amazon S3 Select service to read CSV and Parquet files stored on S3.
---
title: Configuring PXF Servers
---
This topic provides an overview of PXF server configuration. To configure a server, refer to the topic specific to the connector that you want to configure.
You read data from or write data to an external data store via a PXF connector. To access an external data store, you must provide the server location. You may also be required to provide client access credentials and other external data store-specific properties. PXF simplifies configuring access to external data stores by:
- Supporting file-based connector and user configuration
- Providing connector-specific template configuration files
A PXF *Server* definition is simply a named configuration that provides access to a specific external data store. A PXF server name is the name of a directory residing in `$PXF_CONF/servers/`. The information that you provide in a server configuration is connector-specific. For example, a PXF JDBC Connector server definition may include settings for the JDBC driver class name, URL, username, and password. You can also configure connection-specific and session-specific properties in a JDBC server definition.
PXF provides a server template file for each connector; this template identifies the typical set of properties that you must configure to use the connector.
You will configure a server definition for each external data store that Greenplum Database users need to access. For example, if you require access to two Hadoop clusters, you will create a PXF Hadoop server configuration for each cluster. If you require access to an Oracle and a MySQL database, you will create one or more PXF JDBC server configurations for each database.
A server configuration may include default settings for user access credentials and other properties for the external data store. You can allow Greenplum Database users to access the external data store using the default settings, or you can configure access and other properties on a per-user basis. This allows you to configure different Greenplum Database users with different external data store access credentials in a single PXF server definition.
## <a id="templates"></a>About Server Template Files
The configuration information for a PXF server resides in one or more `<connector>-site.xml` files in `$PXF_CONF/servers/<server_name>/`.
PXF provides a template configuration file for each connector. These server template configuration files are located in the `$PXF_CONF/templates/` directory after you initialize PXF:
```
gpadmin@gpmaster$ ls $PXF_CONF/templates
adl-site.xml hbase-site.xml jdbc-site.xml pxf-site.xml yarn-site.xml
core-site.xml hdfs-site.xml mapred-site.xml s3-site.xml
gs-site.xml hive-site.xml minio-site.xml wasbs-site.xml
```
For example, the contents of the `s3-site.xml` template file follow:
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.s3a.access.key</name>
<value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
</configuration>
```
<div class="note">You specify credentials to PXF in clear text in configuration files.</div>
**Note**: The template files for the Hadoop connectors are not intended to be modified and used for configuration, as they only provide an example of the information needed. Instead of modifying the Hadoop templates, you will copy several Hadoop `*-site.xml` files from the Hadoop cluster to your PXF Hadoop server configuration.
## <a id="default"></a>About the Default Server
PXF defines a special server named `default`. When you initialize PXF, it automatically creates a `$PXF_CONF/servers/default/` directory. This directory, initially empty, identifies the default PXF server configuration. You can configure and assign the default PXF server to any external data source. For example, you can assign the PXF default server to a Hadoop cluster, or to a MySQL database that your users frequently access.
PXF automatically uses the `default` server configuration if you omit the `SERVER=<server_name>` setting in the `CREATE EXTERNAL TABLE` command `LOCATION` clause.
## <a id="cfgproc"></a>Configuring a Server
When you configure a PXF connector to an external data store, you add a named PXF server configuration for the connector. Among the tasks that you perform, you may:
1. Determine if you are configuring the `default` PXF server, or choose a new name for the server configuration.
2. Create the directory `$PXF_CONF/servers/<server_name>`.
3. Copy template or other configuration files to the new server directory.
4. Fill in appropriate default values for the properties in the template file.
5. Add any additional configuration properties and values required for your environment.
6. Configure one or more users for the server configuration as described in [About Configuring a PXF User](#usercfg).
7. Synchronize the server and user configuration to the Greenplum Database cluster.
**Note**: You must re-sync the PXF configuration to the Greenplum Database cluster after you add or update a PXF server configuration.
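For example, the following is a minimal sketch of these steps for a hypothetical JDBC server configuration named `pgsrv1`; the template properties that you fill in depend on the connector and on your environment:
``` shell
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/pgsrv1
gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/pgsrv1/
gpadmin@gpmaster$ vi $PXF_CONF/servers/pgsrv1/jdbc-site.xml    # set the driver, URL, user, and password values
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```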
After you configure a PXF server, you publish the server name to Greenplum Database users who need access to the data store. A user only needs to provide the server name when they create an external table that accesses the external data store. PXF obtains the external data source location and access credentials from server and user configuration files residing in the server configuration directory identified by the server name.
To configure a PXF server, refer to the connector configuration topic:
- To configure a PXF server for Hadoop, refer to [Configuring PXF Hadoop Connectors ](client_instcfg.html).
- To configure a PXF server for an object store, refer to [Configuring Connectors to Minio and S3 Object Stores](s3_objstore_cfg.html) and [Configuring Connectors to Azure and Google Cloud Storage Object Stores](objstore_cfg.html).
- To configure a PXF JDBC server, refer to [Configuring the JDBC Connector ](jdbc_cfg.html).
## <a id="pxf-site"></a>About Kerberos and User Impersonation Configuration (pxf-site.xml)
PXF includes a template file named `pxf-site.xml`. You use the `pxf-site.xml` template file to specify Kerberos and/or user impersonation settings for a server configuration.
<div class="note">The settings in this file apply only to Hadoop and JDBC server configurations; they do not apply to object store server configurations.</div>
You configure properties in the `pxf-site.xml` file for a PXF server when one or more of the following conditions hold:
- The remote Hadoop system utilizes Kerberos authentication.
- You want to enable/disable user impersonation on the remote Hadoop or external database system.
`pxf-site.xml` includes the following properties:
| Property | Description | Default Value |
|----------------|--------------------------------------------|---------------|
| pxf.service.kerberos.principal | The Kerberos principal name. | gpadmin/\_HOST@EXAMPLE.COM |
| pxf.service.kerberos.keytab | The file system path to the Kerberos keytab file. | $PXF_CONF/keytabs/pxf.service.keytab |
| pxf.service.user.name | The login user for the remote system. | The operating system user that starts the pxf process, typically `gpadmin`. |
| pxf.service.user.impersonation | Enables/disables user impersonation on the remote system. | The value of the (deprecated) `PXF_USER_IMPERSONATION` property when that property is set. If the `PXF_USER_IMPERSONATION` property does not exist and the `pxf.service.user.impersonation` property is missing from `pxf-site.xml`, the default is `false`; user impersonation is disabled on the remote system. |
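For example, a minimal sketch of a `pxf-site.xml` file that disables user impersonation for a server configuration follows; you would copy the template into `$PXF_CONF/servers/<server_name>/` and set only the properties that you need:
``` xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>pxf.service.user.impersonation</name>
        <value>false</value>
    </property>
</configuration>
```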
Refer to [Configuring PXF Hadoop Connectors ](client_instcfg.html) and [Configuring the JDBC Connector ](jdbc_cfg.html) for information about relevant `pxf-site.xml` property settings for Hadoop and JDBC server configurations, respectively.
## <a id="usercfg"></a>Configuring a PXF User
You can configure access to an external data store on a per-server, per-Greenplum-user basis.
<div class="note info">PXF per-server, per-user configuration provides the most benefit for JDBC servers.</div>
You configure external data store user access credentials and properties for a specific Greenplum Database user by providing a `<greenplum_user_name>-user.xml` user configuration file in the PXF server configuration directory, `$PXF_CONF/servers/<server_name>/`. For example, you specify the properties for the Greenplum Database user named `bill` in the file `$PXF_CONF/servers/<server_name>/bill-user.xml`. You can configure zero, one, or more users in a PXF server configuration.
The properties that you specify in a user configuration file are connector-specific. You can specify any configuration property supported by the PXF connector server in a `<greenplum_user_name>-user.xml` configuration file.
For example, suppose you have configured access to a PostgreSQL database in the PXF JDBC server configuration named `pgsrv1`. To allow the Greenplum Database user named `bill` to access this database as the PostgreSQL user named `pguser1`, password `changeme`, you create the user configuration file `$PXF_CONF/servers/pgsrv1/bill-user.xml` with the following properties:
``` xml
<configuration>
<property>
<name>jdbc.user</name>
<value>pguser1</value>
</property>
<property>
<name>jdbc.password</name>
<value>changeme</value>
</property>
</configuration>
```
If you want to configure a specific search path and a larger read fetch size for `bill`, you would also add the following properties to the `bill-user.xml` user configuration file:
``` xml
<property>
<name>jdbc.session.property.search_path</name>
<value>bill_schema</value>
</property>
<property>
<name>jdbc.statement.fetchSize</name>
<value>2000</value>
</property>
```
### <a id="cfgproc_user"></a>Procedure
For each PXF user that you want to configure, you will:
1. Identify the name of the Greenplum Database user.
2. Identify the PXF server definition for which you want to configure user access.
3. Identify the name and value of each property that you want to configure for the user.
4. Create/edit the file `$PXF_CONF/servers/<server_name>/<greenplum_user_name>-user.xml`, and add the outer configuration block:
``` xml
<configuration>
</configuration>
```
5. Add each property/value pair that you identified in Step 3 within the configuration block in the `<greenplum_user_name>-user.xml` file.
6. If you are adding the PXF user configuration to a previously configured PXF server definition, synchronize the user configuration to the Greenplum Database cluster.
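For example, after adding or editing a user configuration file in an existing server configuration directory, synchronize it to the cluster:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```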
## <a id="override"></a>About Configuration Property Precedence
A PXF server configuration may include default settings for user access credentials and other properties for accessing an external data store. Some PXF connectors, such as the S3 and JDBC connectors, allow you to directly specify certain server properties via custom options in the `CREATE EXTERNAL TABLE` command `LOCATION` clause. A `<greenplum_user_name>-user.xml` file specifies property settings for an external data store that are specific to a Greenplum Database user.
For a given Greenplum Database user, PXF uses the following precedence rules (highest to lowest) to obtain configuration property settings for the user:
1. A property that you configure in `<server_name>/<greenplum_user_name>-user.xml` overrides any setting of the property elsewhere.
2. A property that is specified via custom options in the `CREATE EXTERNAL TABLE` command `LOCATION` clause overrides any setting of the property in a PXF server configuration.
3. Properties that you configure in the `<server_name>` PXF server definition identify the default property values.
These precedence rules allow you to create a single external table that can be accessed by multiple Greenplum Database users, each with their own unique external data store user credentials.
## <a id="using"></a>Using a Server Configuration
To access an external data store, the Greenplum Database user specifies the server name in the `CREATE EXTERNAL TABLE` command `LOCATION` clause `SERVER=<server_name>` option. The `<server_name>` that the user provides identifies the server configuration directory from which PXF obtains the configuration and credentials to access the external data store.
For example, the following command accesses an S3 object store using the server configuration defined in the `$PXF_CONF/servers/s3srvcfg/s3-site.xml` file:
<pre>
CREATE EXTERNAL TABLE pxf_ext_tbl(name text, orders int)
LOCATION ('pxf://BUCKET/dir/file.txt?PROFILE=s3:text&<b>SERVER=s3srvcfg</b>')
FORMAT 'TEXT' (delimiter=E',');
</pre>
PXF automatically uses the `default` server configuration when no `SERVER=<server_name>` setting is provided.
For example, if the `default` server configuration identifies a Hadoop cluster, the following example command references the HDFS file located at `/path/to/file.txt`:
<pre>
CREATE EXTERNAL TABLE pxf_ext_hdfs(location text, miles int)
LOCATION ('pxf://path/to/file.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
</pre>
<div class="note info">A Greenplum Database user who queries or writes to an external table accesses the external data store with the credentials configured for the <code>&lt;server_name></code> user. If no user-specific credentials are configured for <code>&lt;server_name></code>, the Greenplum user accesses the external data store with the default credentials configured for <code>&lt;server_name></code>.</div>
---
title: Configuring the PXF Agent Host and Port (Optional)
---
By default, a PXF agent started on a segment host listens on port number `5888` on `localhost`. You can configure PXF to start on a different port number, or use a different hostname or IP address. To change the default configuration, you will set one or both of the environment variables identified below:
| Environment Variable | Description |
|----------------------|-------------|
| PXF_HOST | The name of the host or IP address. The default host name is `localhost`. |
| PXF_PORT | The port number on which the PXF agent listens for requests on the host. The default port number is `5888`. |
Set the environment variables in the `gpadmin` user's `.bashrc` shell login file on each segment host.
<div class="note">You must restart both Greenplum Database and PXF when you configure the agent host and/or port in this manner. Consider performing this configuration during a scheduled down time.</div>
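For example, a sketch of `.bashrc` entries that direct the PXF agent to listen on a specific IP address and port (the address and port shown are illustrative values only):
``` shell
export PXF_HOST=192.168.1.100
export PXF_PORT=5998
```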
## <a id="proc"></a>Procedure
Perform the following procedure to configure the PXF agent host and/or port number on one or more Greenplum Database segment hosts:
1. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. For each Greenplum Database segment host:
1. Identify the host name or IP address of the PXF agent.
2. Identify the port number on which you want the PXF agent to run.
3. Log in to the Greenplum Database segment host:
``` shell
$ ssh gpadmin@<seghost>
```
4. Open the `~/.bashrc` file in the editor of your choice.
5. Set the `PXF_HOST` and/or `PXF_PORT` environment variables. For example, to set the PXF agent port number to 5998, add the following to the `.bashrc` file:
``` shell
export PXF_PORT=5998
```
6. Save the file and exit the editor.
3. Restart Greenplum Database as described in [Restarting Greenplum Database](../admin_guide/managing/startstop.html#task_gpdb_restart).
4. Restart PXF on each Greenplum Database segment host as described in [Restarting PXF](cfginitstart_pxf.html#restart_pxf).
---
title: Starting, Stopping, and Restarting PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
PXF provides two management commands:
- `pxf cluster` - manage all PXF service instances in the Greenplum Database cluster
- `pxf` - manage the PXF service instance on a specific Greenplum Database host
The [`pxf cluster`](ref/pxf-cluster.html) command supports `init`, `start`, `restart`, `status`, `stop`, and `sync` subcommands. When you run a `pxf cluster` subcommand on the Greenplum Database master host, you perform the operation on all segment hosts in the Greenplum Database cluster. PXF also runs the `init` and `sync` commands on the standby master host.
The [`pxf`](ref/pxf.html) command supports `init`, `start`, `stop`, `restart`, and `status` operations. These operations run locally. That is, if you want to start or stop the PXF agent on a specific Greenplum Database segment host, you log in to the host and run the command.
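For example, to check the status of the PXF service instances on all segment hosts, run the following command on the master host:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster status
```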
## <a id="start_pxf"></a>Starting PXF
After initializing PXF, you must start PXF on each segment host in your Greenplum Database cluster. The PXF service, once started, runs as the `gpadmin` user on default port 5888. Only the `gpadmin` user can start and stop the PXF service.
If you want to change the default PXF configuration, you must update the configuration before you start PXF.
`$PXF_CONF/conf` includes these user-customizable configuration files:
- `pxf-env.sh` - runtime configuration parameters
- `pxf-log4j.properties` - logging configuration parameters
- `pxf-profiles.xml` - custom profile definitions
The `pxf-env.sh` file exposes the following PXF runtime configuration parameters:
| Parameter | Description | Default Value |
|-----------|---------------| ------------|
| JAVA_HOME | The Java JRE home directory. | /usr/java/default |
| PXF_LOGDIR | The PXF log directory. | $PXF_CONF/logs |
| PXF_JVM_OPTS | Default options for the PXF Java virtual machine. | -Xmx2g -Xms1g |
| PXF_MAX_THREADS | Default for the maximum number of PXF threads. | 200 |
| PXF_FRAGMENTER_CACHE | Enable/disable fragment caching. | Enabled |
| PXF_OOM_KILL | Enable/disable PXF auto-kill on OutOfMemoryError. | Enabled |
| PXF_OOM_DUMP_PATH | Absolute pathname to dump file generated on OOM. | No dump file |
| PXF_KEYTAB | The absolute path to the PXF service Kerberos principal keytab file. Deprecated; specify the keytab in a server-specific `pxf-site.xml` file. | $PXF_CONF/keytabs/pxf.service.keytab |
| PXF_PRINCIPAL | The PXF service Kerberos principal. Deprecated; specify the principal in a server-specific `pxf-site.xml` file. | gpadmin/\_HOST@EXAMPLE.COM |
| PXF_USER_IMPERSONATION | Enable/disable end user identity impersonation. Deprecated; enable/disable impersonation in a server-specific `pxf-site.xml` file. | true |
You must synchronize any changes that you make to `pxf-env.sh`, `pxf-log4j.properties`, or `pxf-profiles.xml` to the Greenplum Database cluster, and (re)start PXF on each segment host.
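For example, a sketch of updating the PXF JVM memory settings in `pxf-env.sh` and propagating the change to the cluster (the memory values shown are illustrative):
``` shell
gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh        # for example, set PXF_JVM_OPTS="-Xmx3g -Xms1g"
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster restart
```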
### <a id="start_pxf_prereq" class="no-quick-link"></a>Prerequisites
Before you start PXF in your Greenplum Database cluster, ensure that:
- Your Greenplum Database cluster is up and running.
- You have previously initialized PXF.
### <a id="start_pxf_proc" class="no-quick-link"></a>Procedure
Perform the following procedure to start PXF on each segment host in your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Run the `pxf cluster start` command to start PXF on each segment host. For example:
```shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster start
```
## <a id="stop_pxf"></a>Stopping PXF
If you must stop PXF, for example if you are upgrading PXF, you must stop PXF on each segment host in your Greenplum Database cluster. Only the `gpadmin` user can stop the PXF service.
### <a id="stop_pxf_prereq" class="no-quick-link"></a>Prerequisites
Before you stop PXF in your Greenplum Database cluster, ensure that your Greenplum Database cluster is up and running.
### <a id="stop_pxf_proc" class="no-quick-link"></a>Procedure
Perform the following procedure to stop PXF on each segment host in your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Run the `pxf cluster stop` command to stop PXF on each segment host. For example:
```shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster stop
```
## <a id="restart_pxf"></a>Restarting PXF
If you must restart PXF, for example if you updated PXF user configuration files in `$PXF_CONF/conf`, you run `pxf cluster restart` to stop, and then start, PXF on all segment hosts in your Greenplum Database cluster.
Only the `gpadmin` user can restart the PXF service.
### <a id="restart_pxf_prereq" class="no-quick-link"></a>Prerequisites
Before you restart PXF in your Greenplum Database cluster, ensure that your Greenplum Database cluster is up and running.
### <a id="restart_pxf_proc" class="no-quick-link"></a>Procedure
Perform the following procedure to restart PXF in your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Restart PXF:
```shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster restart
```
---
title: Configuring PXF Hadoop Connectors (Optional)
---
PXF is compatible with Cloudera, Hortonworks Data Platform, MapR, and generic Apache Hadoop distributions. This topic describes how to configure the PXF Hadoop, Hive, and HBase connectors.
*If you do not want to use the Hadoop-related PXF connectors, then you do not need to perform this procedure.*
## <a id="prereq"></a>Prerequisites
Configuring PXF Hadoop connectors involves copying configuration files from your Hadoop cluster to the Greenplum Database master host. If you are using the MapR Hadoop distribution, you must also copy certain JAR files to the master host. Before you configure the PXF Hadoop connectors, ensure that you can copy files from hosts in your Hadoop cluster to the Greenplum Database master.
## <a id="client-pxf-config-steps"></a>Procedure
Perform the following procedure to configure the desired PXF Hadoop-related connectors on the Greenplum Database master host. After you configure the connectors, you will use the `pxf cluster sync` command to copy the PXF configuration to the Greenplum Database cluster.
In this procedure, you use the `default` PXF server configuration or create a new one. You copy Hadoop configuration files to the server configuration directory on the Greenplum Database master host. You identify Kerberos and user impersonation settings required for access, if applicable. You may also copy libraries to `$PXF_CONF/lib` for MapR support. You then synchronize the PXF configuration on the master host to the standby master and segment hosts. (PXF creates the `$PXF_CONF/*` directories when you run `pxf cluster init`.)
1. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Identify the name of your PXF Hadoop server configuration.
3. If you are not using the `default` PXF server, create the `$PXF_CONF/servers/<server_name>` directory. For example, use the following command to create a Hadoop server configuration named `hdp3`:
``` shell
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hdp3
```
4. Change to the server directory. For example:
```shell
gpadmin@gpmaster$ cd $PXF_CONF/servers/default
```
Or,
```shell
gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
```
5. PXF requires information from `core-site.xml` and other Hadoop configuration files. Copy the `core-site.xml`, `hdfs-site.xml`, `mapred-site.xml`, and `yarn-site.xml` Hadoop configuration files from your Hadoop cluster NameNode host to the current host using your tool of choice. Your file paths may differ based on the Hadoop distribution in use. For example, these commands use `scp` to copy the files:
``` shell
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml .
```
6. If you plan to use the PXF Hive connector to access Hive table data, similarly copy the Hive configuration to the Greenplum Database master host. For example:
``` shell
gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
```
7. If you plan to use the PXF HBase connector to access HBase table data, similarly copy the HBase configuration to the Greenplum Database master host. For example:
``` shell
gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml .
```
8. If you are using PXF with the MapR Hadoop distribution, you must copy certain JAR files from your MapR cluster to the Greenplum Database master host. (Your file paths may differ based on the version of MapR in use.) For example, these commands use `scp` to copy the files:
``` shell
gpadmin@gpmaster$ cd $PXF_CONF/lib
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/maprfs-5.2.2-mapr.jar .
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/hadoop-auth-2.7.0-mapr-1707.jar .
gpadmin@gpmaster$ scp mapruser@maprhost:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/hadoop-common-2.7.0-mapr-1707.jar .
```
9. Synchronize the PXF configuration to the Greenplum Database cluster. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```
10. PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access HDFS, Hive, and HBase using the identity of the Greenplum Database user account that logs into Greenplum Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive and HBase if you intend to use those PXF connectors. Follow procedures in [Configuring User Impersonation and Proxying](pxfuserimpers.html) to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation.
11. Grant read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
12. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html).
## <a id="client-cfg-update"></a>About Updating the Hadoop Configuration
If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated configuration to the `$PXF_CONF/servers/<server_name>` directory and re-sync the PXF configuration to your Greenplum Database cluster. For example:
``` shell
gpadmin@gpmaster$ cd $PXF_CONF/servers/<server_name>
gpadmin@gpmaster$ scp hiveuser@hivehost:/etc/hive/conf/hive-site.xml .
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```
---
title: About Column Projection in PXF
---
PXF supports column projection, and it is always enabled. With column projection, only the columns required by a `SELECT` query on an external table are returned from the external data source. This process can improve query performance, and can also reduce the amount of data that is transferred to Greenplum Database.
**Note:** Some external data sources do not support column projection. If a query accesses a data source that does not support column projection, the query is instead executed without it, and the data is filtered after it is transferred to Greenplum Database.
Column projection is automatically enabled for the `pxf` external table protocol. PXF accesses external data sources using different connectors, and column projection support is also determined by the specific connector implementation. The following PXF connector and profile combinations support column projection on read operations:
| Data Source | Connector | Profile(s) |
|-------------|---------------|---------|
| External SQL database | JDBC Connector | Jdbc |
| Hive | Hive Connector | Hive, HiveRC, HiveORC, HiveVectorizedORC |
| Hadoop | HDFS Connector | hdfs:parquet |
| Amazon S3 | S3-Compatible Object Store Connectors | s3:parquet |
| Amazon S3 using S3 Select | S3-Compatible Object Store Connectors | s3:parquet, s3:text |
| Google Cloud Storage | GCS Object Store Connector | gs:parquet |
| Azure Blob Storage | Azure Object Store Connector | wasbs:parquet |
| Azure Data Lake | Azure Object Store Connector | adl:parquet |
**Note:** PXF may disable column projection in cases where it cannot successfully serialize a query filter; for example, when the `WHERE` clause resolves to a `boolean` type.
To summarize, all of the following criteria must be met for column projection to occur:
* The external data source that you are accessing must support column projection. For example, Hive supports column projection for ORC-format data, and certain SQL databases support column projection.
* The underlying PXF connector and profile implementation must also support column projection. For example, the PXF Hive and JDBC connector profiles identified above support column projection, as do the PXF connectors that support reading Parquet data.
* PXF must be able to serialize the query filter.
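For example, given an external table created with the `hdfs:parquet` profile (the table name here is hypothetical), the following query references only two columns, so PXF requests only those columns from the external data source:
``` sql
SELECT location, total_sales FROM pxf_hdfs_parquet_example;
```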
---
title: About PXF Filter Pushdown
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
PXF supports filter pushdown. When filter pushdown is enabled, the constraints from the `WHERE` clause of a `SELECT` query can be extracted and passed to the external data source for filtering. This process can improve query performance, and can also reduce the amount of data that is transferred to Greenplum Database.
You enable or disable filter pushdown for all external table protocols, including `pxf`, by setting the `gp_external_enable_filter_pushdown` server configuration parameter. The default value of this configuration parameter is `on`; set it to `off` to disable filter pushdown. For example:
``` sql
SHOW gp_external_enable_filter_pushdown;
SET gp_external_enable_filter_pushdown TO 'on';
```
**Note:** Some external data sources do not support filter pushdown. Also, filter pushdown may not be supported with certain data types or operators. If a query accesses a data source that does not support filter pushdown for the query constraints, the query is instead executed without filter pushdown (the data is filtered after it is transferred to Greenplum Database).
PXF filter pushdown can be used with these data types (connector- and profile-specific):
- `INT2`, `INT4`, `INT8`
- `CHAR`, `TEXT`
- `FLOAT`
- `NUMERIC` (not available with the S3 connector when using S3 Select)
- `BOOL`
- `DATE`, `TIMESTAMP` (available only with the JDBC connector and the S3 connector when using S3 Select)
You can use PXF filter pushdown with these arithmetic and logical operators (connector- and profile-specific):
- `<`, `<=`, `>=`, `>`
- `<>`, `=`
- `AND`, `OR`, `NOT`
- `LIKE` (`TEXT` fields, JDBC connector only)
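For example, assuming an external table named `pxf_jdbc_orders` (a hypothetical name) that was created with the `Jdbc` profile, the predicates in the following query use supported data types and operators, so PXF can push the filter to the external SQL database:
``` sql
SELECT * FROM pxf_jdbc_orders
  WHERE num_orders > 10 AND month LIKE 'J%';
```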
PXF accesses data sources using profiles exposed by different connectors, and filter pushdown support is determined by the specific connector implementation. The following PXF profiles support some aspect of filter pushdown:
|Profile | <,&nbsp;&nbsp; >,<br><=,&nbsp;&nbsp; >=,<br>=,&nbsp;&nbsp;<> | LIKE | IS [NOT] NULL | IN | AND | OR | NOT |
|-------|:------------------------:|:----:|:----:|:----:|:----:|:----:|:----:|
| Jdbc | Y | Y | Y | N | Y | Y | Y |
| *:parquet | Y<sup>1</sup> | N | Y<sup>1</sup> | N | Y<sup>1</sup> | Y<sup>1</sup> | Y<sup>1</sup> |
| s3:parquet and s3:text with S3-Select | Y | N | Y | Y | Y | Y | Y |
| HBase | Y | N | Y | N | Y | Y | N |
| Hive | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveText | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveRC | Y<sup>2</sup> | N | N | N | Y<sup>2</sup> | Y<sup>2</sup> | N |
| HiveORC | Y, Y<sup>2</sup> | N | Y | Y | Y, Y<sup>2</sup> | Y, Y<sup>2</sup> | Y |
| HiveVectorizedORC | Y, Y<sup>2</sup> | N | Y | Y | Y, Y<sup>2</sup> | Y, Y<sup>2</sup> | Y |
<br><sup>1</sup>&nbsp;PXF applies the predicate, rather than the remote system, reducing CPU usage and the memory footprint.
<br><sup>2</sup>&nbsp;PXF supports partition pruning based on partition keys.
PXF does not support filter pushdown for any profile not mentioned in the table above, including: `*:avro`, `*:AvroSequenceFile`, `*:SequenceFile`, `*:json`, `*:text`, and `*:text:multi`.
To summarize, all of the following criteria must be met for filter pushdown to occur:
* You enable external table filter pushdown by setting the `gp_external_enable_filter_pushdown` server configuration parameter to `'on'`.
* The Greenplum Database protocol that you use to access the external data source must support filter pushdown. The `pxf` external table protocol supports pushdown.
* The external data source that you are accessing must support pushdown. For example, HBase and Hive support pushdown.
* For queries on external tables that you create with the `pxf` protocol, the underlying PXF connector must also support filter pushdown. For example, the PXF Hive, HBase, and JDBC connectors support pushdown.
- Refer to Hive [Partition Filter Pushdown](hive_pxf.html#partitionfiltering) for more information about Hive support for this feature.
---
title: Reading HBase Table Data
---
Apache HBase is a distributed, versioned, non-relational database on Hadoop.
The PXF HBase connector reads data stored in an HBase table. The HBase connector supports filter pushdown.
This section describes how to use the PXF HBase connector.
## <a id="prereq"></a>Prerequisites
Before working with HBase table data, ensure that you have:
- Copied `$GPHOME/pxf/lib/pxf-hbase-*.jar` to each node in your HBase cluster, and that the location of this PXF JAR file is in the `$HBASE_CLASSPATH`. This configuration is required for the PXF HBase connector to support filter pushdown.
- Met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq).
## <a id="hbase_primer"></a>HBase Primer
This topic assumes that you have a basic understanding of the following HBase concepts:
- An HBase column includes two components: a column family and a column qualifier. These components are delimited by a colon `:` character, \<column-family\>:\<column-qualifier\>.
- An HBase row consists of a row key and one or more column values. A row key is a unique identifier for the table row.
- An HBase table is a multi-dimensional map comprised of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table.
- An HBase cell is comprised of a row (column family, column qualifier, column value) and a timestamp. The column value and timestamp in a given cell represent a version of the value.
For detailed information about HBase, refer to the [Apache HBase Reference Guide](http://hbase.apache.org/book.html).
## <a id="hbase_shell"></a>HBase Shell
The HBase shell is a command-line subsystem similar to `psql`. To start the HBase shell:
``` shell
$ hbase shell
<hbase output>
hbase(main):001:0>
```
The default HBase namespace is named `default`.
### <a id="hbaseshell_example" class="no-quick-link"></a>Example: Creating an HBase Table
Create a sample HBase table.
1. Create an HBase table named `order_info` in the `default` namespace. `order_info` has two column families: `product` and `shipping_info`:
``` pre
hbase(main):> create 'order_info', 'product', 'shipping_info'
```
2. The `order_info` `product` column family has qualifiers named `name` and `location`. The `shipping_info` column family has qualifiers named `state` and `zipcode`. Add some data to the `order_info` table:
``` pre
put 'order_info', '1', 'product:name', 'tennis racquet'
put 'order_info', '1', 'product:location', 'out of stock'
put 'order_info', '1', 'shipping_info:state', 'CA'
put 'order_info', '1', 'shipping_info:zipcode', '12345'
put 'order_info', '2', 'product:name', 'soccer ball'
put 'order_info', '2', 'product:location', 'on floor'
put 'order_info', '2', 'shipping_info:state', 'CO'
put 'order_info', '2', 'shipping_info:zipcode', '56789'
put 'order_info', '3', 'product:name', 'snorkel set'
put 'order_info', '3', 'product:location', 'warehouse'
put 'order_info', '3', 'shipping_info:state', 'OH'
put 'order_info', '3', 'shipping_info:zipcode', '34567'
```
You will access the `order_info` HBase table directly via PXF in examples later in this topic.
3. Display the contents of the `order_info` table:
``` pre
hbase(main):> scan 'order_info'
ROW COLUMN+CELL
1 column=product:location, timestamp=1499074825516, value=out of stock
1 column=product:name, timestamp=1499074825491, value=tennis racquet
1 column=shipping_info:state, timestamp=1499074825531, value=CA
1 column=shipping_info:zipcode, timestamp=1499074825548, value=12345
2 column=product:location, timestamp=1499074825573, value=on floor
...
3 row(s) in 0.0400 seconds
```
## <a id="syntax3"></a>Querying External HBase Data
The PXF HBase connector supports a single profile named `HBase`.
Use the following syntax to create a Greenplum Database external table that references an HBase table:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hbase-table-name>?PROFILE=HBase[&SERVER=<server_name>]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
HBase connector-specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) call are described below.
| Keyword | Value |
|-------|-------------------------------------|
| \<hbase&#8209;table&#8209;name\> | The name of the HBase table. |
| PROFILE | The `PROFILE` keyword must specify `HBase`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| FORMAT | The `FORMAT` clause must specify `'CUSTOM' (FORMATTER='pxfwritable_import')`. |
## <a id="datatypemapping"></a>Data Type Mapping
HBase is byte-based; it stores all data types as an array of bytes. To represent HBase data in Greenplum Database, select a data type for your Greenplum Database column that matches the underlying content of the HBase column qualifier values.
**Note**: PXF does not support complex HBase objects.
## <a id="columnmapping"></a>Column Mapping
You can create a Greenplum Database external table that references all, or a subset of, the column qualifiers defined in an HBase table. PXF supports direct or indirect mapping between a Greenplum Database table column and an HBase table column qualifier.
### <a id="directmapping" class="no-quick-link"></a>Direct Mapping
When you use direct mapping to map Greenplum Database external table column names to HBase qualifiers, you specify column-family-qualified HBase qualifier names as quoted values. The PXF HBase connector passes these column names as-is to HBase as it reads the table data.
For example, to create a Greenplum Database external table accessing the following data:
- qualifier `name` in the column family named `product`
- qualifier `zipcode` in the column family named `shipping_info` 
from the `order_info` HBase table that you created in [Example: Creating an HBase Table](#hbaseshell_example), use this `CREATE EXTERNAL TABLE` syntax:
``` sql
CREATE EXTERNAL TABLE orderinfo_hbase ("product:name" varchar, "shipping_info:zipcode" int)
LOCATION ('pxf://order_info?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
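After you create the external table, reference the column-family-qualified names in double quotes when you query it. For example, the following query is a minimal sketch that uses the sample rows loaded earlier in this topic:
``` sql
SELECT "product:name", "shipping_info:zipcode"
FROM orderinfo_hbase
WHERE "product:name" = 'tennis racquet';
```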
### <a id="indirectmappingvialookuptable" class="no-quick-link"></a>Indirect Mapping via Lookup Table
When you use indirect mapping to map Greenplum Database external table column names to HBase qualifiers, you specify the mapping in a lookup table that you create in HBase. The lookup table maps a \<column-family\>:\<column-qualifier\> to a column name alias that you specify when you create the Greenplum Database external table.
You must name the HBase PXF lookup table `pxflookup`. And you must define this table with a single column family named `mapping`. For example:
``` pre
hbase(main):> create 'pxflookup', 'mapping'
```
While the direct mapping method is fast and intuitive, indirect mapping allows you to create a shorter, character-based alias for the HBase \<column-family\>:\<column-qualifier\> name. This helps reconcile HBase column qualifier names with Greenplum Database column naming restrictions:
- HBase qualifier names can be very long. Greenplum Database limits column names to 63 characters.
- HBase qualifier names can include binary or non-printable characters. Greenplum Database column names are character-based.
When populating the `pxflookup` HBase table, add rows to the table such that the:
- row key specifies the HBase table name
- `mapping` column family qualifier identifies the Greenplum Database column name, and the value identifies the HBase `<column-family>:<column-qualifier>` for which you are creating the alias.
For example, to use indirect mapping with the `order_info` table, add these entries to the `pxflookup` table:
``` pre
hbase(main):> put 'pxflookup', 'order_info', 'mapping:pname', 'product:name'
hbase(main):> put 'pxflookup', 'order_info', 'mapping:zip', 'shipping_info:zipcode'
```
Then create a Greenplum Database external table using the following `CREATE EXTERNAL TABLE` syntax:
``` sql
CREATE EXTERNAL TABLE orderinfo_map (pname varchar, zip int)
LOCATION ('pxf://order_info?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
## <a id="rowkey"></a>Row Key
The HBase table row key is a unique identifier for the table row. PXF handles the row key in a special way.
To use the row key in the Greenplum Database external table query, define the external table using the PXF reserved column named `recordkey`. The `recordkey` column name instructs PXF to return the HBase table record key for each row.
Define the `recordkey` using the Greenplum Database data type `bytea`.
For example:
``` sql
CREATE EXTERNAL TABLE <table_name> (recordkey bytea, ... )
LOCATION ('pxf://<hbase_table_name>?PROFILE=HBase')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
After you have created the external table, you can use the `recordkey` in a `WHERE` clause to filter the HBase table on a range of row key values.
**Note**: To enable filter pushdown on the `recordkey`, define the field as `text`.
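For example, the following sketch defines `recordkey` as `text` (per the note above) for the `order_info` table created earlier and filters on a range of row key values; the table name `orderinfo_key` is illustrative:
``` sql
CREATE EXTERNAL TABLE orderinfo_key (recordkey text, "product:name" varchar, "shipping_info:state" varchar)
  LOCATION ('pxf://order_info?PROFILE=HBase')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');

SELECT recordkey, "product:name" FROM orderinfo_key
  WHERE recordkey >= '1' AND recordkey <= '2';
```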
---
title: Reading and Writing HDFS Avro Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Use the PXF HDFS Connector to read and write Avro-format data. This section describes how to use PXF to read and write Avro data in HDFS, including how to create, query, and insert into an external table that references an Avro file in the HDFS data store.
**Note**: PXF does not support reading or writing compressed Avro files.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from HDFS.
## <a id="avro_work"></a>Working with Avro Data
Apache Avro is a data serialization framework where the data is serialized in a compact binary format. Avro specifies that data types be defined in JSON. Avro format data has an independent schema, also defined in JSON. An Avro schema, together with its data, is fully self-describing.
### <a id="profile_hdfsavrodatamap"></a>Data Type Mapping
Avro supports both primitive and complex data types.
To represent Avro primitive data types in Greenplum Database, map data values to Greenplum Database columns of the same type.
Avro supports complex data types including arrays, maps, records, enumerations, and fixed types. Map top-level fields of these complex data types to the Greenplum Database `TEXT` type. While Greenplum Database does not natively support these types, you can create Greenplum Database functions or application code to extract or further process subcomponents of these complex data types.
The following table summarizes external mapping rules for Avro data.
<a id="topic_oy3_qwm_ss__table_j4s_h1n_ss"></a>
| Avro Data Type | PXF/Greenplum Data Type |
|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| boolean | boolean |
| bytes | bytea |
| double | double |
| float | real |
| int | int or smallint |
| long | bigint |
| string | text |
| Complex type: Array, Map, Record, or Enum | text, with delimiters inserted between collection items, mapped key-value pairs, and record data. |
| Complex type: Fixed | bytea (supported for read operations only). |
| Union | Follows the above conventions for primitive or complex data types, depending on the union; must contain 2 elements, one of which must be null. |
### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts"></a>Avro Schemas and Data
Avro schemas are defined using JSON, and composed of the same primitive and complex types identified in the data type mapping section above. Avro schema files typically have a `.avsc` suffix.
Fields in an Avro schema file are defined via an array of objects, each of which is specified by a name and a type.
An Avro data file contains the schema and a compact binary representation of the data. Avro data files typically have the `.avro` suffix.
You can specify an Avro schema on both read and write operations to HDFS. You can provide either a binary `*.avro` file or a JSON-format `*.avsc` file for the schema file:
| External Table Type | Schema Specified? | Description
|-------|--------------------------|-----------|
| readable | yes | PXF uses the specified schema; this overrides the schema embedded in the Avro data file. |
| readable | no | PXF uses the schema embedded in the Avro data file. |
| writable | yes | PXF uses the specified schema. |
| writable | no | PXF creates the Avro schema based on the external table definition. |
When you provide the Avro schema file to PXF, the file must either reside in the same location on every Greenplum Database segment host or reside on the Hadoop file system. PXF first searches for an absolute file path on the Greenplum segment hosts. If PXF does not find the schema file there, it searches for the file relative to the PXF classpath. If PXF cannot find the schema file locally, it searches for the file on HDFS.
The `$PXF_CONF/conf` directory is in the PXF classpath. PXF can locate an Avro schema file that you add to this directory on every Greenplum Database segment host.
See [Writing Avro Data](#topic_avro_writedata) for additional schema considerations when writing Avro data to HDFS.
## <a id="profile_cet"></a>Creating the External Table
Use the `hdfs:avro` profile to read or write Avro-format data in HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:avro[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\> | The absolute path to the directory or file in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:avro`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are discussed below.|
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or \<column_name\> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
<a id="customopts"></a>
For complex types, the PXF `hdfs:avro` profile inserts default delimiters between collection items and values before display. You can use non-default delimiter characters by identifying values for specific `hdfs:avro` custom options in the `CREATE EXTERNAL TABLE` command.
The `hdfs:avro` profile supports the following \<custom-option\>s:
| Option Keyword | Description
|---------------|--------------------|
| COLLECTION_DELIM | The delimiter character(s) placed between entries in a top-level array, map, or record field when PXF maps an Avro complex data type to a text column. The default is the comma `,` character. (Read)|
| MAPKEY_DELIM | The delimiter character(s) placed between the key and value of a map entry when PXF maps an Avro complex data type to a text column. The default is the colon `:` character. (Read)|
| RECORDKEY_DELIM | The delimiter character(s) placed between the field name and value of a record entry when PXF maps an Avro complex data type to a text column. The default is the colon `:` character. (Read)|
| SCHEMA | The absolute path to the Avro schema file on the segment host or on HDFS, or the relative path to the schema file on the segment host. (Read and Write)|
| IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
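For example, the following sketch uses the `SCHEMA` option to override the schema embedded in an Avro data file; the table name `pxf_avro_schema_override` is illustrative, and the sketch assumes that the `/tmp/avro_schema.avsc` schema file and `/data/pxf_examples/pxf_avro.avro` data file (both created later in this topic) exist, with the schema file residing in the same location on every Greenplum Database segment host:
``` sql
CREATE EXTERNAL TABLE pxf_avro_schema_override (id bigint, username text, followers text, fmap text, relationship text, address text)
  LOCATION ('pxf://data/pxf_examples/pxf_avro.avro?PROFILE=hdfs:avro&SCHEMA=/tmp/avro_schema.avsc')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```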
## <a id="avro_example"></a>Example: Reading Avro Data
The examples in this section will operate on Avro data with the following field name and data type record schema:
- id - long
- username - string
- followers - array of string
- fmap - map of long
- relationship - enumerated type
- address - record comprised of street number (int), street name (string), and city (string)
You create an Avro schema and data file, and then create a readable external table to read the data.
### <a id="topic_tr3_dpg_ts__section_m2p_ztg_ts_99"></a>Create Schema
Perform the following operations to create an Avro schema to represent the example schema described above.
1. Create a file named `avro_schema.avsc`:
``` shell
$ vi /tmp/avro_schema.avsc
```
2. Copy and paste the following text into `avro_schema.avsc`:
``` json
{
"type" : "record",
"name" : "example_schema",
"namespace" : "com.example",
"fields" : [ {
"name" : "id",
"type" : "long",
"doc" : "Id of the user account"
}, {
"name" : "username",
"type" : "string",
"doc" : "Name of the user account"
}, {
"name" : "followers",
"type" : {"type": "array", "items": "string"},
"doc" : "Users followers"
}, {
"name": "fmap",
"type": {"type": "map", "values": "long"}
}, {
"name": "relationship",
"type": {
"type": "enum",
"name": "relationshipEnum",
"symbols": ["MARRIED","LOVE","FRIEND","COLLEAGUE","STRANGER","ENEMY"]
}
}, {
"name": "address",
"type": {
"type": "record",
"name": "addressRecord",
"fields": [
{"name":"number", "type":"int"},
{"name":"street", "type":"string"},
{"name":"city", "type":"string"}]
}
} ],
"doc:" : "A basic schema for storing messages"
}
```
### <a id="topic_tr3_dpgspk_15g_tsdata"></a>Create Avro Data File (JSON)
Perform the following steps to create a sample Avro data file conforming to the above schema.
1. Create a text file named `pxf_avro.txt`:
``` shell
$ vi /tmp/pxf_avro.txt
```
2. Enter the following data into `pxf_avro.txt`:
``` pre
{"id":1, "username":"john","followers":["kate", "santosh"], "relationship": "FRIEND", "fmap": {"kate":10,"santosh":4}, "address":{"number":1, "street":"renaissance drive", "city":"san jose"}}
{"id":2, "username":"jim","followers":["john", "pam"], "relationship": "COLLEAGUE", "fmap": {"john":3,"pam":3}, "address":{"number":9, "street":"deer creek", "city":"palo alto"}}
```
The sample data uses a comma `,` to separate top level records and a colon `:` to separate map/key values and record field name/values.
3. Convert the text file to Avro format. There are various ways to perform the conversion, both programmatically and via the command line. In this example, we use the [Java Avro tools](http://avro.apache.org/releases.html); the jar `avro-tools-1.9.1.jar` file resides in the current directory:
``` shell
$ java -jar ./avro-tools-1.9.1.jar fromjson --schema-file /tmp/avro_schema.avsc /tmp/pxf_avro.txt > /tmp/pxf_avro.avro
```
The generated Avro binary data file is written to `/tmp/pxf_avro.avro`.
4. Copy the generated Avro file to HDFS:
``` shell
$ hdfs dfs -put /tmp/pxf_avro.avro /data/pxf_examples/
```
### <a id="topic_avro_querydata"></a>Reading Avro Data
Perform the following operations to create and query an external table that references the `pxf_avro.avro` file that you added to HDFS in the previous section. When creating the table:
- Use the PXF default server.
- Map the top-level primitive fields, `id` (type long) and `username` (type string), to their equivalent Greenplum Database types (bigint and text).
- Map the remaining complex fields to type text.
- Explicitly set the record, map, and collection delimiters using the `hdfs:avro` profile custom options.
1. Use the `hdfs:avro` profile to create a queryable external table from the `pxf_avro.avro` file:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
LOCATION ('pxf://data/pxf_examples/pxf_avro.avro?PROFILE=hdfs:avro&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
2. Perform a simple query of the `pxf_hdfs_avro` table:
``` sql
postgres=# SELECT * FROM pxf_hdfs_avro;
```
``` pre
id | username | followers | fmap | relationship | address
----+----------+----------------+--------------------+--------------+---------------------------------------------------
1 | john | [kate,santosh] | {kate:10,santosh:4} | FRIEND | {number:1,street:renaissance drive,city:san jose}
2 | jim | [john,pam] | {pam:3,john:3} | COLLEAGUE | {number:9,street:deer creek,city:palo alto}
(2 rows)
```
The simple query of the external table shows the components of the complex type data separated with the delimiters specified in the `CREATE EXTERNAL TABLE` call.
3. Process the delimited components in the text columns as necessary for your application. For example, the following command uses the Greenplum Database internal `string_to_array` function to convert entries in the `followers` field to a text array column in a new view.
``` sql
postgres=# CREATE VIEW followers_view AS
SELECT username, address, string_to_array(substring(followers FROM 2 FOR (char_length(followers) - 2)), ',')::text[]
AS followers
FROM pxf_hdfs_avro;
```
4. Query the view to filter rows based on whether a particular follower appears in the view:
``` sql
postgres=# SELECT username, address FROM followers_view WHERE followers @> '{john}';
```
``` pre
username | address
----------+---------------------------------------------
jim | {number:9,street:deer creek,city:palo alto}
```
## <a id="topic_avro_writedata"></a>Writing Avro Data
When you create a writable external table to write data to an Avro file, each table row is an Avro record and each table column is an Avro field.
If you do not specify a `SCHEMA` file, PXF generates a schema for the Avro file based on the Greenplum Database external table definition. PXF assigns the name of the external table column to the Avro field name. Because Avro has a `null` type and Greenplum external tables do not support the `NOT NULL` column qualifier, PXF wraps each data type in an Avro `union` of the mapped type and `null`. For example, for a writable external table column that you define with the Greenplum Database `text` data type, PXF generates the following schema element:
``` pre
["string", "null"]
```
PXF returns an error if you provide a schema that does not include a `union` of the field data type with `null`, and PXF encounters a NULL data field.
PXF supports writing only Avro primitive data types. It does not support writing complex types to Avro:
- When you specify a `SCHEMA` file in the `LOCATION`, the schema must include only primitive data types.
- When PXF generates the schema, it writes any complex type that you specify in the writable external table column definition to the Avro file as a single Avro `string` type. For example, if you write an array of integers, PXF converts the array to a `string`, and you must read this data with a Greenplum `text`-type column.
### <a id="topic_avrowrite_example"></a>Example: Writing Avro Data
In this example, you create an external table that writes to an Avro file on HDFS, letting PXF generate the Avro schema. After you insert some data into the file, you create a readable external table to query the Avro data.
The Avro file that you create and read in this example includes the following fields:
- id: `int`
- username: `text`
- followers: `text[]`
Example procedure:
1. Create the writable external table:
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_avrowrite(id int, username text, followers text[])
LOCATION ('pxf://data/pxf_examples/pxfwrite.avro?PROFILE=hdfs:avro')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```
2. Insert some data into the `pxf_avrowrite` table:
``` sql
postgres=# INSERT INTO pxf_avrowrite VALUES (33, 'oliver', ARRAY['alex','frank']);
postgres=# INSERT INTO pxf_avrowrite VALUES (77, 'lisa', ARRAY['tom','mary']);
```
PXF uses the external table definition to generate the Avro schema.
3. Create an external table to read the Avro data that you just inserted into the table:
``` sql
postgres=# CREATE EXTERNAL TABLE read_pxfwrite(id int, username text, followers text)
LOCATION ('pxf://data/pxf_examples/pxfwrite.avro?PROFILE=hdfs:avro')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
Notice that the `followers` column is of type `text`.
4. Read the Avro data by querying the `read_pxfwrite` table:
``` sql
postgres=# SELECT * FROM read_pxfwrite;
```
``` pre
id | username | followers
----+----------+--------------
77 | lisa | {tom,mary}
33 | oliver | {alex,frank}
(2 rows)
```
`followers` is a single string comprised of the `text` array elements that you inserted into the table.
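If you need array semantics on the read side, you can convert the delimited string back to a text array, much as in the earlier `followers_view` example. The following is a minimal sketch that assumes the element values contain no embedded commas or braces:
``` sql
SELECT id, username,
       string_to_array(btrim(followers, '{}'), ',')::text[] AS followers_array
FROM read_pxfwrite;
```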
---
title: Reading a Multi-Line Text File into a Single Table Row
---
You can use the PXF HDFS connector to read one or more multi-line text files in HDFS each as a single table row. This may be useful when you want to read multiple files into the same Greenplum Database external table, for example when individual JSON files each contain a separate record.
PXF supports reading only text and JSON files in this manner.
**Note**: Refer to the [Reading JSON Data from HDFS](hdfs_json.html) topic if you want to use PXF to read JSON files that include more than one record.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read files from HDFS.
## <a id="fileasrow"></a>Reading Multi-Line Text and JSON Files
You can read single- and multi-line files into a single table row, including files with embedded linefeeds. If you are reading multiple JSON files, each file must be a complete record, and each file must contain the same record type.
PXF reads the complete file data into a single row and column. When you create the external table to read multiple files, you must ensure that all of the files that you want to read are of the same (text or JSON) type. You must also specify a single `text` or `json` column, depending upon the file type.
The following syntax creates a Greenplum Database readable external table that references one or more text or JSON files on HDFS:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> text|json | LIKE <other_table> )
LOCATION ('pxf://<path-to-files>?PROFILE=hdfs:text:multi[&SERVER=<server_name>][&IGNORE_MISSING_PATH=<boolean>]&FILE_AS_ROW=true')
FORMAT 'CSV';
```
The keywords and values used in this [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;files\> | The absolute path to the directory or files in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:text:multi`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| FILE\_AS\_ROW=true | The required option that instructs PXF to read each file into a single table row. |
| IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-files\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
| FORMAT | The `FORMAT` must specify `'CSV'`. |
**Note**: The `hdfs:text:multi` profile does not support additional format options when you specify the `FILE_AS_ROW=true` option.
For example, if `/data/pxf_examples/jdir` identifies an HDFS directory that contains a number of JSON files, the following statement creates a Greenplum Database external table that references all of the files in that directory:
``` sql
CREATE EXTERNAL TABLE pxf_readjfiles(j1 json)
LOCATION ('pxf://data/pxf_examples/jdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
```
When you query the `pxf_readjfiles` table with a `SELECT` statement, PXF returns the contents of each JSON file in `jdir/` as a separate row in the external table.
When you read JSON files, you can use the JSON functions provided in Greenplum Database to access individual data fields in the JSON record. For example, if the `pxf_readjfiles` external table above reads a JSON file that contains this JSON record:
``` json
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
}
}
]
}
```
You can use the `json_array_elements()` function to extract specific JSON fields from the table row. For example, the following command displays the `user->id` field:
``` sql
SELECT json_array_elements(j1->'root')->'record_obj'->'user'->'id'
AS userid FROM pxf_readjfiles;
userid
----------
31424214
(1 row)
```
Refer to [Working with JSON Data](../admin_guide/query/topics/json-data.html) for specific information on manipulating JSON data with Greenplum Database.
### <a id="example_fileasrow"></a>Example: Reading an HDFS Text File into a Single Table Row
Perform the following procedure to create 3 sample text files in an HDFS directory, and use the PXF `hdfs:text:multi` profile and the default PXF server to read all of these text files in a single external table query.
1. Create an HDFS directory for the text files. For example:
``` shell
$ hdfs dfs -mkdir -p /data/pxf_examples/tdir
```
2. Create a text data file named `file1.txt`:
``` shell
$ echo 'text file with only one line' > /tmp/file1.txt
```
3. Create a second text data file named `file2.txt`:
``` shell
$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/file2.txt
```
This file has multiple lines.
4. Create a third text file named `/tmp/file3.txt`:
``` shell
$ echo '"4627 Star Rd.
San Francisco, CA 94107":Sept:2017
"113 Moon St.
San Diego, CA 92093":Jan:2018
"51 Belt Ct.
Denver, CO 90123":Dec:2016
"93114 Radial Rd.
Chicago, IL 60605":Jul:2017
"7301 Brookview Ave.
Columbus, OH 43213":Dec:2018' > /tmp/file3.txt
```
This file includes embedded line feeds.
5. Save the file and exit the editor.
6. Copy the text files to HDFS:
``` shell
$ hdfs dfs -put /tmp/file1.txt /data/pxf_examples/tdir
$ hdfs dfs -put /tmp/file2.txt /data/pxf_examples/tdir
$ hdfs dfs -put /tmp/file3.txt /data/pxf_examples/tdir
```
7. Log in to a Greenplum Database system and start the `psql` subsystem.
8. Use the `hdfs:text:multi` profile to create an external table that references the `tdir` HDFS directory. For example:
``` sql
CREATE EXTERNAL TABLE pxf_readfileasrow(c1 text)
LOCATION ('pxf://data/pxf_examples/tdir?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
```
9. Turn on expanded display and query the `pxf_readfileasrow` table:
``` sql
postgres=# \x on
postgres=# SELECT * FROM pxf_readfileasrow;
```
``` pre
-[ RECORD 1 ]---------------------------
c1 | Prague,Jan,101,4875.33
| Rome,Mar,87,1557.39
| Bangalore,May,317,8936.99
| Beijing,Jul,411,11600.67
-[ RECORD 2 ]---------------------------
c1 | text file with only one line
-[ RECORD 3 ]---------------------------
c1 | "4627 Star Rd.
| San Francisco, CA 94107":Sept:2017
| "113 Moon St.
| San Diego, CA 92093":Jan:2018
| "51 Belt Ct.
| Denver, CO 90123":Dec:2016
| "93114 Radial Rd.
| Chicago, IL 60605":Jul:2017
| "7301 Brookview Ave.
| Columbus, OH 43213":Dec:2018
```
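Because PXF reads each file into a single table row, the row count of the external table should equal the number of files in the HDFS directory; for example, the following query is a quick sanity check on the table created above:
``` sql
SELECT count(*) FROM pxf_readfileasrow;
```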
---
title: Reading JSON Data from HDFS
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from HDFS.
## <a id="hdfsjson_work"></a>Working with JSON Data
JSON is a text-based data-interchange format. JSON data is typically stored in a file with a `.json` suffix.
A `.json` file contains a collection of objects. A JSON object is a collection of unordered name/value pairs. A value can be a string, a number, true, false, null, an object, or an array. You can define nested JSON objects and arrays.
Sample JSON data file content:
``` json
{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user": {
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":{
"type":"Point",
"values":[
13,
99
]
}
}
```
In the sample above, `user` is an object composed of fields named `id` and `location`. To specify the nested fields in the `user` object as Greenplum Database external table columns, use `.` projection:
``` pre
user.id
user.location
```
`coordinates` is an object composed of a text field named `type` and an array of integers named `values`. Use `[]` to identify specific elements of the `values` array as Greenplum Database external table columns:
``` pre
coordinates.values[0]
coordinates.values[1]
```
Refer to [Introducing JSON](http://www.json.org/) for detailed information on JSON syntax.
### <a id="datatypemap_json"></a>JSON to Greenplum Database Data Type Mapping</a>
To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON also supports the complex data types object and array. Use N-level projection to map members of nested objects and arrays to primitive data types.
The following table summarizes external mapping rules for JSON data.
<caption><span class="tablecap">Table 1. JSON Mapping</span></caption>
<a id="topic_table_jsondatamap"></a>
| JSON Data Type | PXF/Greenplum Data Type |
|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Primitive type (integer, float, string, boolean, null) | Use the corresponding Greenplum Database built-in data type; see [Greenplum Database Data Types](../ref_guide/data_types.html). |
| Array | Use `[]` brackets to identify a specific array index to a member of primitive type. |
| Object | Use dot `.` notation to specify each level of projection (nesting) to a member of a primitive type. |
### <a id="topic_jsonreadmodes"></a>JSON Data Read Modes
PXF supports two data read modes. The default mode expects one full JSON record per line. PXF also supports a read mode operating on JSON records that span multiple lines.
In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:
- "created_at" - text
- "id_str" - text
- "user" - object
- "id" - integer
- "location" - text
- "coordinates" - object (optional)
- "type" - text
- "values" - array
- [0] - integer
- [1] - integer
The single-JSON-record-per-line data set follows:
``` pre
{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{
"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values"
: [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{
"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{
"id":287819058,"location":""}, "coordinates": null}
```
The multi-line JSON record data set follows:
``` json
{
"root":[
{
"record_obj":{
"created_at":"MonSep3004:04:53+00002013",
"id_str":"384529256681725952",
"user":{
"id":31424214,
"location":"COLUMBUS"
},
"coordinates":null
},
"record_obj":{
"created_at":"MonSep3004:04:54+00002013",
"id_str":"384529260872228864",
"user":{
"id":67600981,
"location":"KryberWorld"
},
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
}
}
]
}
```
You will create JSON files for the sample data sets and add them to HDFS in the next section.
## <a id="jsontohdfs"></a>Loading the Sample JSON Data to HDFS
The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.
Copy and paste the single line JSON record sample data set above to a file named `singleline.json`. Similarly, copy and paste the multi-line JSON record data set to a file named `multiline.json`.
**Note**: Ensure that there are **no** blank lines in your JSON files.
Copy the JSON data files that you just created to your HDFS data store. Create the `/data/pxf_examples` directory if you did not do so in a previous exercise. For example:
``` shell
$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples
```
Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.
## <a id="json_cet"></a>Creating the External Table
Use the `hdfs:json` profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\> | The absolute path to the directory or file in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:json`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are discussed below.|
| FORMAT 'CUSTOM' | Use `FORMAT` `'CUSTOM'` with the `hdfs:json` profile. The `CUSTOM` `FORMAT` requires that you specify `(FORMATTER='pxfwritable_import')`. |
<a id="customopts"></a>
PXF supports single- and multi-line JSON records. When you want to read multi-line JSON records, you must provide an `IDENTIFIER` \<custom-option\> and value. Use this \<custom-option\> to identify the member name of the first field in the JSON record object.
The `hdfs:json` profile supports the following \<custom-option\>s:
| Option Keyword | &nbsp;&nbsp;Syntax,&nbsp;&nbsp;Example(s)&nbsp;&nbsp; | Description |
|-------|--------------|-----------------------|
| IDENTIFIER | `&IDENTIFIER=<value>`<br>`&IDENTIFIER=created_at`| You must include the `IDENTIFIER` keyword and \<value\> in the `LOCATION` string only when you are accessing JSON data comprised of multi-line records. Use the \<value\> to identify the member name of the first field in the JSON record object. |
| IGNORE_MISSING_PATH | `&IGNORE_MISSING_PATH=<boolean>` | Specify the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
## <a id="jsonexample1"></a>Example: Reading a JSON File with Single Line Records
Use the following [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) SQL command to create a readable external table that references the single-line-per-record JSON data file and uses the PXF default server.
``` sql
CREATE EXTERNAL TABLE singleline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
Notice the use of `.` projection to access the nested fields in the `user` and `coordinates` objects. Also notice the use of `[]` to access specific elements of the `coordinates.values[]` array.
To query the JSON data in the external table:
``` sql
SELECT * FROM singleline_json_tbl;
```
## <a id="jsonexample2" class="no-quick-link"></a>Example: Reading a JSON file with Multi-Line Records
The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single line data set above. You must additionally specify the `LOCATION` clause `IDENTIFIER` keyword and an associated value when you want to read multi-line JSON records. For example:
``` sql
CREATE EXTERNAL TABLE multiline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
`created_at` identifies the member name of the first field in the JSON record `record_obj` in the sample data schema.
To query the JSON data in this external table:
``` sql
SELECT * FROM multiline_json_tbl;
```
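You can also reference the projected columns in a `WHERE` clause. For example, the following sketch returns only the records that include coordinate values:
``` sql
SELECT "user.id", "user.location", "coordinates.values[0]", "coordinates.values[1]"
FROM multiline_json_tbl
WHERE "coordinates.values[0]" IS NOT NULL;
```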
---
title: Reading and Writing HDFS Parquet Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS files that are stored in Parquet format, including how to create, query, and insert into external tables that reference files in the HDFS data store.
PXF currently supports reading and writing primitive Parquet data types only.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from or write data to HDFS.
## <a id="datatype_map"></a>Data Type Mapping
To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type.
Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports. These annotations specify how to interpret the primitive type. For example, Parquet stores both `INTEGER` and `DATE` types as the `INT32` primitive type. An annotation identifies the original type as a `DATE`.
### <a id="datatype_map_read "></a>Read Mapping
<a id="p2g_type_mapping_table"></a>
PXF uses the following data type mapping when reading Parquet data:
| Parquet Data Type | Original Type | PXF/Greenplum Data Type |
|-------------------|---------------|--------------------------|
| binary (byte_array) | Date | Date |
| binary (byte_array) | Timestamp_millis | Timestamp |
| binary (byte_array) | all others | Text |
| binary (byte_array) | -- | Bytea |
| boolean | -- | Boolean |
| double | -- | Float8 |
| fixed\_len\_byte\_array | -- | Numeric |
| float | -- | Real |
| int32 | Date | Date |
| int32 | Decimal | Numeric |
| int32 | int_8 | Smallint |
| int32 | int_16 | Smallint |
| int32 | -- | Integer |
| int64 | Decimal | Numeric |
| int64 | -- | Bigint |
| int96 | -- | Timestamp |
**Note**: PXF supports filter predicate pushdown on all Parquet data types listed above, *except* the `fixed_len_byte_array` and `int96` types.
### <a id="datatype_map_Write "></a>Write Mapping
PXF uses the following data type mapping when writing Parquet data:
| PXF/Greenplum Data Type | Original Type | Parquet Data Type |
|-------------------|---------------|--------------------------|
| Boolean | -- | boolean |
| Bytea | -- | binary |
| Bigint | -- | int64 |
| SmallInt | int_16 | int32 |
| Integer | -- | int32 |
| Real | -- | float |
| Float8 | -- | double |
| Numeric/Decimal | Decimal | fixed\_len\_byte\_array |
| Timestamp<sup>1</sup> | -- | int96 |
| Timestamptz<sup>2</sup> | -- | int96 |
| Date | utf8 | binary |
| Time | utf8 | binary |
| Varchar | utf8 | binary |
| Text | utf8 | binary |
| OTHERS | -- | UNSUPPORTED |
</br><sup>1</sup>&nbsp;PXF localizes a <code>Timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>.
</br><sup>2</sup>&nbsp;PXF converts a <code>Timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion.
## <a id="profile_cet"></a>Creating the External Table
The PXF HDFS connector `hdfs:parquet` profile supports reading and writing HDFS data in Parquet-format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
Use the following syntax to create a Greenplum Database external table that references an HDFS directory:
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;dir\> | The absolute path to the directory in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:parquet`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are described below.|
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or \<column_name\> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
<a id="customopts"></a>
The PXF `hdfs:parquet` profile supports the following read option. You specify this option in the `CREATE EXTERNAL TABLE` `LOCATION` clause:
| Read Option | Value Description |
|-------|-------------------------------------|
| IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-dir\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
The PXF `hdfs:parquet` profile supports encoding- and compression-related write options. You specify these write options in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause. The `hdfs:parquet` profile supports the following custom write options:
| Write Option | Value Description |
|-------|-------------------------------------|
| COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing Parquet data include: `snappy`, `gzip`, `lzo`, and `uncompressed` . If this option is not provided, PXF compresses the data using `snappy` compression. |
| ROWGROUP_SIZE | A Parquet file consists of one or more row groups, a logical partitioning of the data into rows. `ROWGROUP_SIZE` identifies the size (in bytes) of the row group. The default row group size is `8 * 1024 * 1024` bytes. |
| PAGE_SIZE | A row group consists of column chunks that are divided up into pages. `PAGE_SIZE` is the size (in bytes) of such a page. The default page size is `1024 * 1024` bytes. |
| DICTIONARY\_PAGE\_SIZE | Dictionary encoding is enabled by default when PXF writes Parquet files. There is a single dictionary page per column, per row group. `DICTIONARY_PAGE_SIZE` is similar to `PAGE_SIZE`, but for the dictionary. The default dictionary page size is `512 * 1024` bytes. |
| PARQUET_VERSION | The Parquet version; values `v1` and `v2` are supported. The default Parquet version is `v1`. |
| SCHEMA | The location of the Parquet schema file on the file system of the specified `SERVER`. |
**Note**: You must explicitly specify `uncompressed` if you do not want PXF to compress the data.
Parquet files that you write to HDFS with PXF have the following naming format: `<file>.<compress_extension>.parquet`, for example `1547061635-0000004417_0.gz.parquet`.
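For example, the following sketch creates a writable external table that compresses the Parquet data with `gzip` and uses a 16 MB row group size; the table and directory names are illustrative:
``` sql
CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet_gzip (location text, month text, number_of_orders int, total_sales double precision)
  LOCATION ('pxf://data/pxf_examples/pxf_parquet_gzip?PROFILE=hdfs:parquet&COMPRESSION_CODEC=gzip&ROWGROUP_SIZE=16777216')
  FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```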
## <a id="parquet_write"></a> Example
This example utilizes the data schema introduced in [Example: Reading Text Data on HDFS](hdfs_text.html#profile_text_query).
| Column Name | Data Type |
|-------|-------------------------------------|
| location | text |
| month | text |
| number\_of\_orders | int |
| total\_sales | float8 |
In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.
1. Use the `hdfs:parquet` profile to create a writable external table. For example:
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```
2. Write a few records to the `pxf_parquet` HDFS directory by inserting directly into the `pxf_tbl_parquet` table. For example:
``` sql
postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
```
3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in `pxf_parquet`, create a readable external Greenplum Database table that references this HDFS directory:
``` sql
postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, total_sales double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
4. Query the readable external table `read_pxf_parquet`:
``` sql
postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
```
``` pre
location | month | number_of_orders | total_sales
-----------+-------+------------------+-------------
Frankfurt | Mar | 777 | 3956.98
Cleveland | Oct | 3812 | 96645.4
(2 rows)
```
---
title: Reading and Writing HDFS SequenceFile Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF HDFS connector supports SequenceFile format binary data. This section describes how to use PXF to read and write HDFS SequenceFile data, including how to create, insert, and query data in external tables that reference files in the HDFS data store.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from or write data to HDFS.
## <a id="hdfswrite_writeextdata"></a>Creating the External Table
The PXF HDFS connector `hdfs:SequenceFile` profile supports reading and writing HDFS data in SequenceFile binary format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
**Note**: External tables that you create with a writable profile can only be used for INSERT operations. If you want to query the data that you inserted, you must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database external table that references an HDFS directory: 
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:SequenceFile[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (<formatting-properties>)
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;dir\> | The absolute path to the directory in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:SequenceFile`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are described below.|
| FORMAT | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or \<column_name\> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
<a id="customopts"></a>
SequenceFile format data can optionally employ record or block compression. The PXF `hdfs:SequenceFile` profile supports the following compression codecs:
- org.apache.hadoop.io.compress.DefaultCodec
- org.apache.hadoop.io.compress.BZip2Codec
When you use the `hdfs:SequenceFile` profile to write SequenceFile format data, you must provide the name of the Java class to use for serializing/deserializing the binary data. This class must provide read and write methods for each data type referenced in the data schema.
You specify the compression codec and Java serialization class via custom options in the `CREATE EXTERNAL TABLE` `LOCATION` clause.
The `hdfs:SequenceFile` profile supports the following custom options:
| Option | Value Description |
|-------|-------------------------------------|
| COMPRESSION_CODEC | The compression codec Java class name. If this option is not provided, Greenplum Database performs no data compression. Supported compression codecs include:<br>`org.apache.hadoop.io.compress.DefaultCodec`<br>`org.apache.hadoop.io.compress.BZip2Codec`<br>`org.apache.hadoop.io.compress.GzipCodec` |
| COMPRESSION_TYPE | The compression type to employ; supported values are `RECORD` (the default) or `BLOCK`. |
| DATA-SCHEMA | The name of the writer serialization/deserialization class. The jar file in which this class resides must be in the PXF classpath. This option is required for the `hdfs:SequenceFile` profile and has no default value. |
| THREAD-SAFE | Boolean value determining if a table query can run in multi-threaded mode. The default value is `TRUE`. Set this option to `FALSE` to handle all requests in a single thread for operations that are not thread-safe (for example, compression). |
| IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-dir\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
## <a id="write_binary"></a>Reading and Writing Binary Data
Use the HDFS connector `hdfs:SequenceFile` profile when you want to read or write SequenceFile format data to HDFS. Files of this type consist of binary key/value pairs. SequenceFile format is a common data transfer format between MapReduce jobs.
### <a id="write_seqfile_example" class="no-quick-link"></a>Example: Writing Binary Data to HDFS
In this example, you create a Java class named `PxfExample_CustomWritable` that will serialize/deserialize the fields in the sample schema used in previous examples. You will then use this class to access a writable external table that you create with the `hdfs:SequenceFile` profile and that uses the default PXF server.
Perform the following procedure to create the Java class and writable table.
1. Prepare to create the sample Java class:
``` shell
$ mkdir -p pxfex/com/example/pxf/hdfs/writable/dataschema
$ cd pxfex/com/example/pxf/hdfs/writable/dataschema
$ vi PxfExample_CustomWritable.java
```
2. Copy and paste the following text into the `PxfExample_CustomWritable.java` file:
``` java
package com.example.pxf.hdfs.writable.dataschema;
import org.apache.hadoop.io.*;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.lang.reflect.Field;
/**
* PxfExample_CustomWritable class - used to serialize and deserialize data with
* text, int, and float data types
*/
public class PxfExample_CustomWritable implements Writable {
public String st1, st2;
public int int1;
public float ft;
public PxfExample_CustomWritable() {
st1 = new String("");
st2 = new String("");
int1 = 0;
ft = 0.f;
}
public PxfExample_CustomWritable(int i1, int i2, int i3) {
st1 = new String("short_string___" + i1);
st2 = new String("short_string___" + i1);
int1 = i2;
ft = i1 * 10.f * 2.3f;
}
String GetSt1() {
return st1;
}
String GetSt2() {
return st2;
}
int GetInt1() {
return int1;
}
float GetFt() {
return ft;
}
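// write() below serializes the fields in the order st1, st2, int1, ft, which corresponds
// to the (location, month, number_of_orders, total_sales) columns of the example tables.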
@Override
public void write(DataOutput out) throws IOException {
Text txt = new Text();
txt.set(st1);
txt.write(out);
txt.set(st2);
txt.write(out);
IntWritable intw = new IntWritable();
intw.set(int1);
intw.write(out);
FloatWritable fw = new FloatWritable();
fw.set(ft);
fw.write(out);
}
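// readFields() below deserializes the fields in the same order in which write() emitted them.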
@Override
public void readFields(DataInput in) throws IOException {
Text txt = new Text();
txt.readFields(in);
st1 = txt.toString();
txt.readFields(in);
st2 = txt.toString();
IntWritable intw = new IntWritable();
intw.readFields(in);
int1 = intw.get();
FloatWritable fw = new FloatWritable();
fw.readFields(in);
ft = fw.get();
}
public void printFieldTypes() {
Class myClass = this.getClass();
Field[] fields = myClass.getDeclaredFields();
for (int i = 0; i < fields.length; i++) {
System.out.println(fields[i].getType().getName());
}
}
}
```
3. Compile and create a Java class JAR file for `PxfExample_CustomWritable`. Provide a classpath that includes the `hadoop-common.jar` file for your Hadoop distribution. For example, if you installed the Hortonworks Data Platform Hadoop client:
``` shell
$ javac -classpath /usr/hdp/current/hadoop-client/hadoop-common.jar PxfExample_CustomWritable.java
$ cd ../../../../../../
$ jar cf pxfex-customwritable.jar com
$ cp pxfex-customwritable.jar /tmp/
```
(Your Hadoop library classpath may differ.)
4. Copy the `pxfex-customwritable.jar` file to the Greenplum Database master node. For example:
``` shell
$ scp pxfex-customwritable.jar gpadmin@gpmaster:/home/gpadmin
```
5. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
6. Copy the `pxfex-customwritable.jar` JAR file to the user runtime library directory, and note the location. For example, if `PXF_CONF=/usr/local/greenplum-pxf`:
``` shell
gpadmin@gpmaster$ cp /home/gpadmin/pxfex-customwritable.jar /usr/local/greenplum-pxf/lib/pxfex-customwritable.jar
```
7. Synchronize the PXF configuration to the Greenplum Database cluster. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```
8. Restart PXF on each Greenplum Database segment host as described in [Restarting PXF](cfginitstart_pxf.html#restart_pxf).
9. Use the PXF `hdfs:SequenceFile` profile to create a Greenplum Database writable external table. Identify the serialization/deserialization Java class you created above in the `DATA-SCHEMA` \<custom-option\>. Use `BLOCK` mode compression with `BZip2` when you create the writable table.
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_seqfile (location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable&COMPRESSION_TYPE=BLOCK&COMPRESSION_CODEC=org.apache.hadoop.io.compress.BZip2Codec')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```
Notice that the `'CUSTOM'` `FORMAT` \<formatting-properties\> specifies the built-in `pxfwritable_export` formatter.
10. Write a few records to the `pxf_seqfile` HDFS directory by inserting directly into the `pxf_tbl_seqfile` table. For example:
``` sql
postgres=# INSERT INTO pxf_tbl_seqfile VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_tbl_seqfile VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
```
11. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in `pxf_seqfile`, create a readable external Greenplum Database table that references this HDFS directory:
``` sql
postgres=# CREATE EXTERNAL TABLE read_pxf_tbl_seqfile (location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
You must specify the `DATA-SCHEMA` \<custom-option\> when you read HDFS data via the `hdfs:SequenceFile` profile. You need not provide compression-related options.
12. Query the readable external table `read_pxf_tbl_seqfile`:
``` sql
postgres=# SELECT * FROM read_pxf_tbl_seqfile ORDER BY total_sales;
```
``` pre
location | month | number_of_orders | total_sales
-----------+-------+------------------+-------------
Frankfurt | Mar | 777 | 3956.98
Cleveland | Oct | 3812 | 96645.4
(2 rows)
```
## <a id="read_recordkey"></a>Reading the Record Key
When a Greenplum Database external table references SequenceFile or another data format that stores rows in a key-value format, you can access the key values in Greenplum queries by using the `recordkey` keyword as a field name.
The field type of `recordkey` must correspond to the key type, much as the other fields must match the HDFS data. 
You can define `recordkey` to be any of the following Hadoop types:
- BooleanWritable
- ByteWritable
- DoubleWritable
- FloatWritable
- IntWritable
- LongWritable
- Text
If no record key is defined for a row, Greenplum Database returns the id of the segment that processed the row.
### <a id="read_recordkey_example"></a>Example: Using Record Keys
Create an external readable table to access the record keys from the writable table `pxf_tbl_seqfile` that you created in [Example: Writing Binary Data to HDFS](#write_seqfile_example). Define the `recordkey` in this example to be of type `int8`.
``` sql
postgres=# CREATE EXTERNAL TABLE read_pxf_tbl_seqfile_recordkey(recordkey int8, location text, month text, number_of_orders integer, total_sales real)
LOCATION ('pxf://data/pxf_examples/pxf_seqfile?PROFILE=hdfs:SequenceFile&DATA-SCHEMA=com.example.pxf.hdfs.writable.dataschema.PxfExample_CustomWritable')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
postgres=# SELECT * FROM read_pxf_tbl_seqfile_recordkey;
```
``` pre
recordkey | location | month | number_of_orders | total_sales
-----------+-------------+-------+------------------+-------------
2 | Frankfurt | Mar | 777 | 3956.98
1 | Cleveland | Oct | 3812 | 96645.4
(2 rows)
```
You did not define a record key when you inserted the rows into the writable table, so the `recordkey` identifies the segment on which the row data was processed.
---
title: Reading and Writing HDFS Text Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF HDFS Connector supports plain delimited and comma-separated value form text data. This section describes how to use PXF to access HDFS text data, including how to create, query, and insert data into an external table that references files in the HDFS data store.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from or write data to HDFS.
## <a id="profile_text"></a>Reading Text Data
Use the `hdfs:text` profile when you read plain text delimited or .csv data where each row is a single record. The following syntax creates a Greenplum Database readable external table that references such a text file on HDFS: 
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:text[&SERVER=<server_name>][&IGNORE_MISSING_PATH=<boolean>]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\> | The absolute path to the directory or file in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:text`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
| FORMAT | Use `FORMAT` `'TEXT'` when \<path-to-hdfs-file\> references plain text delimited data.<br> Use `FORMAT` `'CSV'` when \<path-to-hdfs-file\> references comma-separated value data. |
| delimiter | The delimiter character in the data. For `FORMAT` `'CSV'`, the default \<delim_value\> is a comma `,`. Preface the \<delim_value\> with an `E` when the value is an escape sequence. Examples: `(delimiter=E'\t')`, `(delimiter ':')`. |
**Note**: PXF does not support CSV files with a header row, nor does it support the `(HEADER)` formatter option in the `CREATE EXTERNAL TABLE` command.
### <a id="profile_text_query"></a>Example: Reading Text Data on HDFS
Perform the following procedure to create a sample text file, copy the file to HDFS, and use the `hdfs:text` profile and the default PXF server to create two PXF external tables to query the data:
1. Create an HDFS directory for PXF example data files. For example:
``` shell
$ hdfs dfs -mkdir -p /data/pxf_examples
```
2. Create a delimited plain text data file named `pxf_hdfs_simple.txt`:
``` shell
$ echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt
```
Note the use of the comma `,` to separate the four data fields.
3. Add the data file to HDFS:
``` shell
$ hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/
```
4. Display the contents of the `pxf_hdfs_simple.txt` file stored in HDFS:
``` shell
$ hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt
```
5. Start the `psql` subsystem:
``` shell
$ psql -d postgres
```
6. Use the PXF `hdfs:text` profile to create a Greenplum Database external table that references the `pxf_hdfs_simple.txt` file that you just created and added to HDFS:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=E',');
```
7. Query the external table:
``` sql
postgres=# SELECT * FROM pxf_hdfs_textsimple;
```
``` pre
location | month | num_orders | total_sales
---------------+-------+------------+-------------
Prague | Jan | 101 | 4875.33
Rome | Mar | 87 | 1557.39
Bangalore | May | 317 | 8936.99
Beijing | Jul | 411 | 11600.67
(4 rows)
```
8. Create a second external table that references `pxf_hdfs_simple.txt`, this time specifying the `CSV` `FORMAT`:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
FORMAT 'CSV';
postgres=# SELECT * FROM pxf_hdfs_textsimple_csv;
```
When you specify `FORMAT 'CSV'` for comma-separated value data, no `delimiter` formatter option is required because comma is the default delimiter value.
## <a id="profile_textmulti"></a>Reading Text Data with Quoted Linefeeds
Use the `hdfs:text:multi` profile to read plain text data with delimited single- or multi-line records that include embedded (quoted) linefeed characters. The following syntax creates a Greenplum Database readable external table that references such a text file on HDFS:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:text:multi[&SERVER=<server_name>][&IGNORE_MISSING_PATH=<boolean>]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>');
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\> | The absolute path to the directory or file in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:text:multi`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
| FORMAT | Use `FORMAT` `'TEXT'` when \<path-to-hdfs-file\> references plain text delimited data.<br> Use `FORMAT` `'CSV'` when \<path-to-hdfs-file\> references comma-separated value data. |
| delimiter | The delimiter character in the data. For `FORMAT` `'CSV'`, the default \<delim_value\> is a comma `,`. Preface the \<delim_value\> with an `E` when the value is an escape sequence. Examples: `(delimiter=E'\t')`, `(delimiter ':')`. |
### <a id="profile_textmulti_query"></a>Example: Reading Multi-Line Text Data on HDFS
Perform the following steps to create a sample text file, copy the file to HDFS, and use the PXF `hdfs:text:multi` profile and the default PXF server to create a Greenplum Database readable external table to query the data:
1. Create a second delimited plain text file:
``` shell
$ vi /tmp/pxf_hdfs_multi.txt
```
2. Copy/paste the following data into `pxf_hdfs_multi.txt`:
``` pre
"4627 Star Rd.
San Francisco, CA 94107":Sept:2017
"113 Moon St.
San Diego, CA 92093":Jan:2018
"51 Belt Ct.
Denver, CO 90123":Dec:2016
"93114 Radial Rd.
Chicago, IL 60605":Jul:2017
"7301 Brookview Ave.
Columbus, OH 43213":Dec:2018
```
Notice the use of the colon `:` to separate the three fields. Also notice the quotes around the first (address) field. This field includes an embedded line feed separating the street address from the city and state.
3. Copy the text file to HDFS:
``` shell
$ hdfs dfs -put /tmp/pxf_hdfs_multi.txt /data/pxf_examples/
```
4. Use the `hdfs:text:multi` profile to create an external table that references the `pxf_hdfs_multi.txt` HDFS file, making sure to identify the `:` (colon) as the field separator:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textmulti(address text, month text, year int)
LOCATION ('pxf://data/pxf_examples/pxf_hdfs_multi.txt?PROFILE=hdfs:text:multi')
FORMAT 'CSV' (delimiter ':');
```
Notice the alternate syntax for specifying the `delimiter`.
5. Query the `pxf_hdfs_textmulti` table:
``` sql
postgres=# SELECT * FROM pxf_hdfs_textmulti;
```
``` pre
address | month | year
--------------------------+-------+------
4627 Star Rd. | Sept | 2017
San Francisco, CA 94107
113 Moon St. | Jan | 2018
San Diego, CA 92093
51 Belt Ct. | Dec | 2016
Denver, CO 90123
93114 Radial Rd. | Jul | 2017
Chicago, IL 60605
7301 Brookview Ave. | Dec | 2018
Columbus, OH 43213
(5 rows)
```
## <a id="hdfswrite_text"></a>Writing Text Data to HDFS
The PXF HDFS connector "hdfs:text" profile supports writing single line plain text data to HDFS. When you create a writable external table with the PXF HDFS connector, you specify the name of a directory on HDFS. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.
**Note**: External tables that you create with a writable profile can only be used for `INSERT` operations. If you want to query the data that you inserted, you must create a separate readable external table that references the HDFS directory.
Use the following syntax to create a Greenplum Database writable external table that references an HDFS directory: 
``` sql
CREATE WRITABLE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
?PROFILE=hdfs:text[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV]' (delimiter[=|<space>][E]'<delim_value>')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;dir\> | The absolute path to the directory in the HDFS data store. |
| PROFILE | The `PROFILE` keyword must specify `hdfs:text`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. Optional; PXF uses the `default` server if not specified. |
| \<custom&#8209;option\> | \<custom-option\>s are described below.|
| FORMAT | Use `FORMAT` `'TEXT'` to write plain, delimited text to \<path-to-hdfs-dir\>.<br> Use `FORMAT` `'CSV'` to write comma-separated value text to \<path-to-hdfs-dir\>. |
| delimiter | The delimiter character in the data. For `FORMAT` `'CSV'`, the default \<delim_value\> is a comma `,`. Preface the \<delim_value\> with an `E` when the value is an escape sequence. Examples: `(delimiter=E'\t')`, `(delimiter ':')`. |
| DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or \<column_name\> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |
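The following sketch illustrates the `DISTRIBUTED BY` recommendation in the table above; the table names and columns are hypothetical, and the source table is assumed to be distributed by its `location` column:
``` sql
-- Hypothetical source table distributed by the location column
CREATE TABLE sales_staging (location text, month text, num_orders int, total_sales float8)
  DISTRIBUTED BY (location);

-- Writable external table declaring the same distribution column, so that
-- INSERT ... SELECT from sales_staging avoids redistributing rows between segments
CREATE WRITABLE EXTERNAL TABLE sales_hdfs_export (location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://data/pxf_examples/sales_export?PROFILE=hdfs:text')
  FORMAT 'TEXT' (delimiter=',')
  DISTRIBUTED BY (location);
```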
Writable external tables that you create using the `hdfs:text` profile can optionally use record or block compression. The PXF `hdfs:text` profile supports the following compression codecs:
- org.apache.hadoop.io.compress.DefaultCodec
- org.apache.hadoop.io.compress.GzipCodec
- org.apache.hadoop.io.compress.BZip2Codec
You specify the compression codec via custom options in the `CREATE EXTERNAL TABLE` `LOCATION` clause. The `hdfs:text` profile supports the following custom write options:
| Option | Value Description |
|-------|-------------------------------------|
| COMPRESSION_CODEC | The compression codec Java class name. If this option is not provided, Greenplum Database performs no data compression. Supported compression codecs include:<br>`org.apache.hadoop.io.compress.DefaultCodec`<br>`org.apache.hadoop.io.compress.BZip2Codec`<br>`org.apache.hadoop.io.compress.GzipCodec` |
| COMPRESSION_TYPE | The compression type to employ; supported values are `RECORD` (the default) or `BLOCK`. |
| THREAD-SAFE | Boolean value determining if a table query can run in multi-threaded mode. The default value is `TRUE`. Set this option to `FALSE` to handle all requests in a single thread for operations that are not thread-safe (for example, compression). |
### <a id="write_hdfstextsimple_example"></a>Example: Writing Text Data to HDFS
This example utilizes the data schema introduced in [Example: Reading Text Data on HDFS](#profile_text_query).
| Column Name | Data Type |
|-------|-------------------------------------|
| location | text |
| month | text |
| number\_of\_orders | int |
| total\_sales | float8 |
This example also optionally uses the Greenplum Database external table named `pxf_hdfs_textsimple` that you created in that exercise.
#### <a id="write_hdfstextsimple_proc" class="no-quick-link"></a>Procedure
Perform the following procedure to create Greenplum Database writable external tables utilizing the same data schema as described above, one of which will employ compression. You will use the PXF `hdfs:text` profile and the default PXF server to write data to the underlying HDFS directory. You will also create a separate, readable external table to read the data that you wrote to the HDFS directory.
1. Create a Greenplum Database writable external table utilizing the data schema described above. Write to the HDFS directory `/data/pxf_examples/pxfwritable_hdfs_textsimple1`. Create the table specifying a comma `,` as the delimiter:
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
FORMAT 'TEXT' (delimiter=',');
```
You specify the `FORMAT` subclause `delimiter` value as the single ASCII comma character `,`.
2. Write a few individual records to the `pxfwritable_hdfs_textsimple1` HDFS directory by invoking the SQL `INSERT` command on `pxf_hdfs_writabletbl_1`:
``` sql
postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_hdfs_writabletbl_1 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
```
3. (Optional) Insert the data from the `pxf_hdfs_textsimple` table that you created in [Example: Reading Text Data on HDFS](#profile_text_query) into `pxf_hdfs_writabletbl_1`:
``` sql
postgres=# INSERT INTO pxf_hdfs_writabletbl_1 SELECT * FROM pxf_hdfs_textsimple;
```
4. In another terminal window, display the data that you just added to HDFS:
``` shell
$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
Frankfurt,Mar,777,3956.98
Cleveland,Oct,3812,96645.37
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
```
Because you specified comma `,` as the delimiter when you created the writable external table, this character is the field separator used in each record of the HDFS data.
5. Greenplum Database does not support directly querying a writable external table. To query the data that you just added to HDFS, you must create a readable external Greenplum Database table that references the HDFS directory:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r1(location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple1?PROFILE=hdfs:text')
FORMAT 'CSV';
```
You specify the `'CSV'` `FORMAT` when you create the readable external table because you created the writable table with a comma `,` as the delimiter character, the default delimiter for `'CSV'` `FORMAT`.
6. Query the readable external table:
``` sql
postgres=# SELECT * FROM pxf_hdfs_textsimple_r1 ORDER BY total_sales;
```
``` pre
location | month | num_orders | total_sales
-----------+-------+------------+-------------
Rome | Mar | 87 | 1557.39
Frankfurt | Mar | 777 | 3956.98
Prague | Jan | 101 | 4875.33
Bangalore | May | 317 | 8936.99
Beijing | Jul | 411 | 11600.67
Cleveland | Oct | 3812 | 96645.37
(6 rows)
```
The `pxf_hdfs_textsimple_r1` table includes the records you individually inserted, as well as the full contents of the `pxf_hdfs_textsimple` table if you performed the optional step.
7. Create a second Greenplum Database writable external table, this time using Gzip compression and employing a colon `:` as the delimiter:
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_2 (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=hdfs:text&COMPRESSION_CODEC=org.apache.hadoop.io.compress.GzipCodec')
FORMAT 'TEXT' (delimiter=':');
```
8. Write a few records to the `pxfwritable_hdfs_textsimple2` HDFS directory by inserting directly into the `pxf_hdfs_writabletbl_2` table:
``` sql
postgres=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
postgres=# INSERT INTO pxf_hdfs_writabletbl_2 VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
```
9. In another terminal window, display the contents of the data that you added to HDFS; use the `-text` option to `hdfs dfs` to view the compressed data as text:
``` shell
$ hdfs dfs -text /data/pxf_examples/pxfwritable_hdfs_textsimple2/*
Frankfurt:Mar:777:3956.98
Cleveland:Oct:3812:96645.3
```
Notice that the colon `:` is the field separator in this HDFS data.
To query data from the newly-created HDFS directory named `pxfwritable_hdfs_textsimple2`, you can create a readable external Greenplum Database table as described above that references this HDFS directory and specifies `FORMAT 'CSV' (delimiter=':')`.
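A minimal sketch of such a readable external table follows; the table name is illustrative:
``` sql
postgres=# CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r2(location text, month text, num_orders int, total_sales float8)
            LOCATION ('pxf://data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=hdfs:text')
            FORMAT 'CSV' (delimiter=':');
```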
---
title: Configuring the JDBC Connector for Hive Access (Optional)
---
You can use the PXF JDBC Connector to retrieve data from Hive. You can also use a JDBC named query to submit a custom SQL query to Hive and retrieve the results using the JDBC Connector.
This topic describes how to configure the PXF JDBC Connector to access Hive. When you configure Hive access with JDBC, you must take into account the Hive user impersonation setting, as well as whether or not the Hadoop cluster is secured with Kerberos.
*If you do not plan to use the PXF JDBC Connector to access Hive, then you do not need to perform this procedure.*
## <a id="hive_cfg_server"></a>JDBC Server Configuration
The PXF JDBC Connector is installed with the JAR files required to access Hive via JDBC, `hive-jdbc-<version>.jar` and `hive-service-<version>.jar`, and automatically registers these JARs.
When you configure a PXF JDBC server for Hive access, you must specify the JDBC driver class name, database URL, and client credentials just as you would when configuring a client connection to an SQL database.
To access Hive via JDBC, you must specify the following properties and values in the `jdbc-site.xml` server configuration file:
| Property | Value |
|----------------|--------|
| jdbc.driver | org.apache.hive.jdbc.HiveDriver |
| jdbc.url | jdbc:hive2://\<hiveserver2_host>:\<hiveserver2_port>/\<database> |
The value of the HiveServer2 authentication (`hive.server2.authentication`) and impersonation (`hive.server2.enable.doAs`) properties, and whether or not the Hive service is utilizing Kerberos authentication, will inform the setting of other JDBC server configuration properties. These properties are defined in the `hive-site.xml` configuration file in the Hadoop cluster. You will need to obtain the values of these properties.
The following table enumerates the Hive2 authentication and impersonation combinations supported by the PXF JDBC Connector. It identifies the possible Hive user identities and the JDBC server configuration required for each.
Table heading key:
- *authentication* -> Hive `hive.server2.authentication` Setting
- *enable.doAs* -> Hive `hive.server2.enable.doAs` Setting
- *User Identity* -> Identity that HiveServer2 will use to access data
- *Configuration Required* -> PXF JDBC Connector or Hive configuration required for *User Identity*
| authentication | enable.doAs | User Identity | Configuration Required |
|------------------|---------------|----------------|-------------------|
| `NOSASL` | n/a | No authentication | Must set `jdbc.connection.property.auth` = `noSasl`. |
| `NONE`, or not specified | `TRUE` | User name that you provide | Set `jdbc.user`. |
| `NONE`, or not specified | `TRUE` | Greenplum user name | Set `pxf.service.user.impersonation` to `true` in `jdbc-site.xml`. |
| `NONE`, or not specified | `FALSE` | Name of the user who started Hive, typically `hive` | None |
| `KERBEROS` | `TRUE` | Identity provided in the PXF Kerberos principal, typically `gpadmin` | Must set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `TRUE` | User name that you provide | Set `hive.server2.proxy.user` in `jdbc.url` and set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `TRUE` | Greenplum user name | Set `pxf.service.user.impersonation` to `true` and `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `FALSE` | Identity provided in the `jdbc.url` `principal` parameter, typically `hive` | Must set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
**Note**: There are additional configuration steps required when Hive utilizes Kerberos authentication.
## <a id="hive_cfg_server_proc"></a>Example Configuration Procedure
Perform the following procedure to configure a PXF JDBC server for Hive:
1. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Choose a name for the JDBC server.
3. Create the `$PXF_CONF/servers/<server_name>` directory. For example, use the following command to create a JDBC server configuration named `hivejdbc1`:
``` shell
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hivejdbc1
```
4. Navigate to the server configuration directory. For example:
```shell
gpadmin@gpmaster$ cd $PXF_CONF/servers/hivejdbc1
```
5. Copy the PXF JDBC server template file to the server configuration directory. For example:
``` shell
gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml .
```
6. When you access Hive secured with Kerberos, you also need to specify configuration properties in the `pxf-site.xml` file. *If this file does not yet exist in your server configuration*, copy the `pxf-site.xml` template file to the server config directory. For example:
``` shell
gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
```
7. Open the `jdbc-site.xml` file in the editor of your choice and set the `jdbc.driver` and `jdbc.url` properties. Be sure to specify your Hive host, port, and database name:
``` xml
<property>
<name>jdbc.driver</name>
<value>org.apache.hive.jdbc.HiveDriver</value>
</property>
<property>
<name>jdbc.url</name>
<value>jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database></value>
</property>
```
8. Obtain the `hive-site.xml` file from your Hadoop cluster and examine the file.
9. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NOSASL`, HiveServer2 performs no authentication. Add the following connection-level property to `jdbc-site.xml`:
``` xml
<property>
<name>jdbc.connection.property.auth</name>
<value>noSasl</value>
</property>
```
Alternatively, you may choose to add `;auth=noSasl` to the `jdbc.url`.
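For example, assuming a HiveServer2 instance running on a host named `hs2server` at the default port, the `jdbc.url` property with this suffix might look like the following:
``` xml
<property>
    <name>jdbc.url</name>
    <value>jdbc:hive2://hs2server:10000/default;auth=noSasl</value>
</property>
```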
10. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NONE`, or the property is not specified, you must set the `jdbc.user` property. The value to which you set the `jdbc.user` property is dependent upon the `hive.server2.enable.doAs` impersonation setting in `hive-site.xml`:
1. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. *Choose/perform one of the following options*:
**Set** `jdbc.user` to specify the user that has read permission on all Hive data accessed by Greenplum Database. For example, to connect to Hive and run all requests as user `gpadmin`:
``` xml
<property>
<name>jdbc.user</name>
<value>gpadmin</value>
</property>
```
**Or**, turn on JDBC server-level user impersonation so that PXF automatically uses the Greenplum Database user name to connect to Hive; uncomment the `pxf.service.user.impersonation` property in `jdbc-site.xml` and set the value to `true`:
``` xml
<property>
<name>pxf.service.user.impersonation</name>
<value>true</value>
</property>
```
If you enable JDBC impersonation in this manner, you must not specify a `jdbc.user` nor include the setting in the `jdbc.url`.
2. If required, create a PXF user configuration file as described in [Configuring a PXF User](cfg_server.html#usercfg) to manage the password setting.
3. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations as the user who started the HiveServer2 process, usually the user `hive`. PXF ignores the `jdbc.user` setting in this circumstance.
11. If the `hive.server2.authentication` property in `hive-site.xml` is set to `KERBEROS`:
1. Identify the name of the server configuration.
2. Ensure that you have configured Kerberos authentication for PXF as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html), and that you have specified the Kerberos principal and keytab in the `pxf-site.xml` properties as described in the procedure.
3. Comment out the `pxf.service.user.impersonation` property in the `pxf-site.xml` file. If you require user impersonation, you will uncomment and set the property in an upcoming step.
4. Uncomment the `hadoop.security.authentication` setting in `$PXF_CONF/servers/<name>/jdbc-site.xml`:
``` xml
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
```
5. Add the `saslQop` property to `jdbc.url`, and set it to match the `hive.server2.thrift.sasl.qop` property setting in `hive-site.xml`. For example, if the `hive-site.xml` file includes the following property setting:
``` xml
<property>
<name>hive.server2.thrift.sasl.qop</name>
<value>auth-conf</value>
</property>
```
You would add `;saslQop=auth-conf` to the `jdbc.url`.
6. Add the HiveServer2 `principal` name to the `jdbc.url`. For example:
<pre>
jdbc:hive2://hs2server:10000/default;<b>principal=hive/hs2server@REALM</b>;saslQop=auth-conf
</pre>
7. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. *Choose/perform one of the following options*:
**Do not** specify any additional properties. In this case, PXF initiates all Hadoop access with the identity provided in the PXF Kerberos principal (usually `gpadmin`).
**Or**, set the `hive.server2.proxy.user` property in the `jdbc.url` to specify the user that has read permission on all Hive data. For example, to connect to Hive and run all requests as the user named `integration` use the following `jdbc.url`:
<pre>
jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf;<b>hive.server2.proxy.user=integration</b>
</pre>
**Or**, enable PXF JDBC impersonation in the `pxf-site.xml` file so that PXF automatically uses the Greenplum Database user name to connect to Hive. Add or uncomment the `pxf.service.user.impersonation` property and set the value to `true`. For example:
``` xml
<property>
<name>pxf.service.user.impersonation</name>
<value>true</value>
</property>
```
If you enable JDBC impersonation, you must not explicitly specify a `hive.server2.proxy.user` in the `jdbc.url`.
8. If required, create a PXF user configuration file to manage the password setting.
9. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations with the identity provided by the PXF Kerberos principal (usually `gpadmin`).
12. Save your changes and exit the editor.
13. Use the `pxf cluster sync` command to copy the new server configuration to the Greenplum Database cluster. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```
---
title: Initializing PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF server is a Java application. You must explicitly initialize the PXF Java service instance. This one-time initialization creates the PXF service web application and generates PXF configuration files and templates.
PXF provides two management commands that you can use for initialization:
- [`pxf cluster init`](ref/pxf-cluster.html) - initialize all PXF service instances in the Greenplum Database cluster
- [`pxf init`](ref/pxf.html) - initialize the PXF service instance on the current Greenplum Database host
PXF also provides similar `reset` commands that you can use to reset your PXF configuration.
## <a id="init_pxf"></a> Configuration Properties
PXF supports both internal and user-customizable configuration properties.
PXF internal configuration files are located in your Greenplum Database installation in the `$GPHOME/pxf/conf` directory.
You identify the PXF user configuration directory at PXF initialization time via an environment variable named `$PXF_CONF`. If you do not set `$PXF_CONF` prior to initializing PXF, PXF may prompt you to accept or decline the default user configuration directory, `$HOME/pxf`, during the initialization process.
**Note**: Choose a `$PXF_CONF` directory location that you can back up, and ensure that it resides outside of your Greenplum Database installation directory.
Refer to [PXF User Configuration Directories](about_pxf_dir.html#usercfg) for a list of `$PXF_CONF` subdirectories and their contents.
## <a id="init_descript"></a> Initialization Overview
The PXF server runs on Java 8 or 11. You identify the PXF `$JAVA_HOME` and `$PXF_CONF` settings at PXF initialization time.
Initializing PXF creates the PXF Java web application, and generates PXF internal configuration files, setting default properties specific to your configuration.
Initializing PXF also creates the `$PXF_CONF` user configuration directory if it does not already exist, and then populates `conf` and `templates` subdirectories with the following:
- `conf/` - user-customizable files for PXF runtime and logging configuration settings
- `templates/` - template configuration files
PXF remembers the `JAVA_HOME` setting that you specified during initialization by updating the property of the same name in the `$PXF_CONF/conf/pxf-env.sh` user configuration file. PXF sources this environment file on startup, allowing it to run with a Java installation that is different than the system default Java.
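For example, if you specified `JAVA_HOME=/usr/lib/jvm/jre` during initialization, `$PXF_CONF/conf/pxf-env.sh` might contain a line similar to the following:
``` shell
export JAVA_HOME=/usr/lib/jvm/jre
```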
If the `$PXF_CONF` directory that you specify during initialization already exists, PXF updates only the `templates` subdirectory and the `$PXF_CONF/conf/pxf-env.sh` environment configuration file.
## <a id="init-pxf-prereq"></a>Prerequisites
Before initializing PXF in your Greenplum Database cluster, ensure that:
- Your Greenplum Database cluster is up and running.
- You have identified the PXF user configuration directory filesystem location, `$PXF_CONF`, and that the `gpadmin` user has the necessary permissions to create, or write to, this directory.
- You can identify the Java 8 or 11 `$JAVA_HOME` setting for PXF.
## <a id="init-pxf-steps"></a>Procedure
Perform the following procedure to initialize PXF on each segment host in your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Export the PXF `JAVA_HOME` setting in your shell. For example:
``` shell
gpadmin@gpmaster$ export JAVA_HOME=/usr/lib/jvm/jre
```
3. Run the `pxf cluster init` command to initialize the PXF service on the master, standby master, and on each segment host. For example, the following command specifies `/usr/local/greenplum-pxf` as the PXF user configuration directory for initialization:
``` shell
gpadmin@gpmaster$ PXF_CONF=/usr/local/greenplum-pxf $GPHOME/pxf/bin/pxf cluster init
```
**Note**: The PXF service runs only on the segment hosts. However, `pxf cluster init` also sets up the PXF user configuration directories on the Greenplum Database master and standby master hosts.
## <a id="pxf-reset"></a>Resetting PXF
Should you need to, you can reset PXF to its uninitialized state. You might choose to reset PXF if you specified an incorrect `PXF_CONF` directory, or if you want to start the initialization procedure from scratch.
When you reset PXF, PXF prompts you to confirm the operation. If you confirm, PXF removes the following runtime files and directories (where `PXF_HOME=$GPHOME/pxf`):
- `$PXF_HOME/conf/pxf-private.classpath`
- `$PXF_HOME/pxf-service`
- `$PXF_HOME/run`
PXF does not remove the `$PXF_CONF` directory during a reset operation.
You must stop the PXF service instance on a segment host before you can reset PXF on the host.
### <a id="reset-pxf-steps"></a>Procedure
Perform the following procedure to reset PXF on each segment host in your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Stop the PXF service instances on each segment host. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster stop
```
3. Reset the PXF service instances on all Greenplum hosts. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster reset
```
**Note**: After you reset PXF, you must initialize and start PXF to use the service again.
---
title: Installing Java for PXF
---
PXF is a Java service. It requires a Java 8 or Java 11 installation on each Greenplum Database host.
## <a id="prereq"></a>Prerequisites
Ensure that you have access to, or superuser permissions to install, Java 8 or Java 11 on each Greenplum Database host.
## <a id="proc"></a>Procedure
Perform the following procedure to install Java on the master, standby master, and on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility where possible to run a command on multiple hosts.
1. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Determine the version(s) of Java installed on the system:
``` pre
gpadmin@gpmaster$ rpm -qa | grep java
```
3. If the system does not include a Java version 8 or 11 installation, install one of these Java versions on the master, standby master, and on each Greenplum Database segment host.
1. Create a text file that lists your Greenplum Database standby master host and segment hosts, one host name per line. For example, a file named `gphostfile` may include:
``` pre
gpmaster
mstandby
seghost1
seghost2
seghost3
```
2. Install the Java package on each host. For example, to install Java version 8:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f gphostfile sudo yum -y install java-1.8.0-openjdk-1.8.0*
```
4. Identify the Java 8 or 11 `$JAVA_HOME` setting for PXF. For example:
If you installed Java 8:
``` shell
JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.x86_64/jre
```
If you installed Java 11:
``` shell
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.4.11-0.el7_6.x86_64
```
If the superuser configures the newly-installed Java alternative as the system default:
``` shell
JAVA_HOME=/usr/lib/jvm/jre
```
5. Note the `$JAVA_HOME` setting; you provide this value when you initialize PXF.
---
title: Configuring PXF
---
Your Greenplum Database deployment consists of a master node and multiple segment hosts. When you initialize and configure the Greenplum Platform Extension Framework (PXF), you start a single PXF JVM process on each Greenplum Database segment host.
PXF provides connectors to Hadoop, Hive, HBase, object stores, and external SQL data stores. You must configure PXF to support the connectors that you plan to use.
To configure PXF, you must:
1. Install Java packages on each Greenplum Database segment host as described in [Installing Java for PXF](install_java.html).
2. [Initialize the PXF Service](init_pxf.html).
3. If you plan to use the Hadoop, Hive, or HBase PXF connectors, you must perform the configuration procedure described in [Configuring PXF Hadoop Connectors](client_instcfg.html).
4. If you plan to use the PXF connectors to access the Azure, Google Cloud Storage, Minio, or S3 object store(s), you must perform the configuration procedure described in [Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores](objstore_cfg.html).
5. If you plan to use the PXF JDBC Connector to access an external SQL database, perform the configuration procedure described in [Configuring the JDBC Connector](jdbc_cfg.html).
6. [Start PXF](cfginitstart_pxf.html).
---
title: Introduction to PXF
---
The Greenplum Platform Extension Framework (PXF) provides *connectors* that enable you to access data stored in sources external to your Greenplum Database deployment. These connectors map an external data source to a Greenplum Database *external table* definition. When you create the Greenplum Database external table, you identify the external data store and the format of the data via a *server* name and a *profile* name that you provide in the command.
You can query the external table via Greenplum Database, leaving the referenced data in place. Or, you can use the external table to load the data into Greenplum Database for higher performance.
## <a id="suppplat"></a> Supported Platforms
### <a id="os"></a> Operating Systems
PXF supports the Red Hat Enterprise Linux 64-bit 7.x, CentOS 64-bit 7.x, and Ubuntu 18.04 LTS operating system platforms.
<div class="note">Starting in 6.x, Greenplum does not bundle <code>cURL</code> and instead loads the system-provided library. PXF requires <code>cURL</code> version 7.29.0 or newer. The officially-supported <code>cURL</code> for the CentOS 6.x and Red Hat Enterprise Linux 6.x operating systems is version 7.19.*. Greenplum Database 6 does not support running PXF on CentOS 6.x or RHEL 6.x due to this limitation.</div>
### <a id="java"></a> Java
PXF supports Java 8 and Java 11.
### <a id="hadoop"></a> Hadoop
PXF bundles all of the Hadoop JAR files on which it depends, and supports the following Hadoop component versions:
| PXF Version | Hadoop Version | Hive Server Version | HBase Server Version |
|-------------|----------------|---------------------|-------------|
| 5.10, 5.11, 5.12 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.9 | 2.x, 3.1+ | 1.x, 2.x, 3.1+ | 1.3.2 |
| 5.8 | 2.x | 1.x | 1.3.2 |
## <a id="arch"></a> Architectural Overview
Your Greenplum Database deployment consists of a master node and multiple segment hosts. A single PXF agent process on each Greenplum Database segment host allocates a worker thread for each segment instance on a segment host that participates in a query against an external table. The PXF agents on multiple segment hosts communicate with the external data store in parallel.
## <a id="more"></a> About Connectors, Servers, and Profiles
*Connector* is a generic term that encapsulates the implementation details required to read from or write to an external data store. PXF provides built-in connectors to Hadoop (HDFS, Hive, HBase), object stores (Azure, Google Cloud Storage, Minio, S3), and SQL databases (via JDBC).
A PXF *Server* is a named configuration for a connector. A server definition provides the information required for PXF to access an external data source. This configuration information is data-store-specific, and may include server location, access credentials, and other relevant properties.
The Greenplum Database administrator will configure at least one server definition for each external data store that they will allow Greenplum Database users to access, and will publish the available server names as appropriate.
You specify a `SERVER=<server_name>` setting when you create the external table to identify the server configuration from which to obtain the configuration and credentials to access the external data store.
The default PXF server is named `default` (reserved), and when configured provides the location and access information for the external data source in the absence of a `SERVER=<server_name>` setting.
Finally, a PXF *profile* is a named mapping identifying a specific data format or protocol supported by a specific external data store. PXF supports text, Avro, JSON, RCFile, Parquet, SequenceFile, and ORC data formats, and the JDBC protocol, and provides several built-in profiles as discussed in the following section.
## <a id="create_external_table"></a>Creating an External Table
PXF implements a Greenplum Database protocol named `pxf` that you can use to create an external table that references data in an external data store. The syntax for a [`CREATE EXTERNAL TABLE`](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command that specifies the `pxf` protocol follows:
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION('pxf://<path-to-data>?PROFILE=<profile_name>[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```
The `LOCATION` clause in a `CREATE EXTERNAL TABLE` statement specifying the `pxf` protocol is a URI. This URI identifies the path to, or other information describing, the location of the external data. For example, if the external data store is HDFS, the \<path-to-data\> identifies the absolute path to a specific HDFS file. If the external data store is Hive, \<path-to-data\> identifies a schema-qualified Hive table name.
You use the query portion of the URI, introduced by the question mark (?), to identify the PXF server and profile names.
PXF may require additional information to read or write certain data formats. You provide profile-specific information using the optional \<custom-option\>=\<value\> component of the `LOCATION` string and formatting information via the \<formatting-properties\> component of the string. The custom options and formatting properties supported by a specific profile vary; they are identified in usage documentation for the profile.
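For example, the following command (the HDFS path is illustrative) creates a readable external table that uses the `hdfs:text` profile and the `default` server to access a comma-delimited text file in HDFS:
``` sql
CREATE EXTERNAL TABLE sales_ext (location text, month text, num_orders int, total_sales float8)
  LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text&SERVER=default')
  FORMAT 'TEXT' (delimiter=',');
```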
<caption><span class="tablecap">Table 1. CREATE EXTERNAL TABLE Parameter Values and Descriptions</span></caption>
<a id="creatinganexternaltable__table_pfy_htz_4p"></a>
| Keyword | Value and Description |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \<path&#8209;to&#8209;data\> | A directory, file name, wildcard pattern, table name, etc. The syntax of \<path-to-data\> is dependent upon the external data source. |
| PROFILE=\<profile_name\> | The profile that PXF uses to access the data. PXF supports profiles that access text, Avro, JSON, RCFile, Parquet, SequenceFile, and ORC data in [Hadoop services](access_hdfs.html), [object stores](access_objstore.html), and [other SQL databases](jdbc_pxf.html). |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| \<custom&#8209;option\>=\<value\> | Additional options and their values supported by the profile or the server. |
| FORMAT&nbsp;\<value\>| PXF profiles support the `TEXT`, `CSV`, and `CUSTOM` formats. |
| \<formatting&#8209;properties\> | Formatting properties supported by the profile; for example, the `FORMATTER` or `delimiter`.   |
**Note:** When you create a PXF external table, you cannot use the `HEADER` option in your formatter specification.
## <a id="other"></a> Other PXF Features
Certain PXF connectors and profiles support filter pushdown and column projection. Refer to the following topics for detailed information about this support:
- [About PXF Filter Pushdown](filter_push.html)
- [About Column Projection in PXF](col_project.html)
---
title: Monitoring PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The `pxf cluster status` command displays the status of the PXF service instance on all segment hosts in your Greenplum Database cluster. `pxf status` displays the status of the PXF service instance on the local (segment) host.
Only the `gpadmin` user can request the status of the PXF service.
Perform the following procedure to request the PXF status of your Greenplum Database cluster.
1. Log in to the Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Run the `pxf cluster status` command:
```shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster status
```
---
title: Reading and Writing Avro Data in an Object Store
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF object store connectors support reading Avro-format data. This section describes how to use PXF to read and write Avro data in an object store, including how to create, query, and insert into an external table that references an Avro file in the store.
**Note**: Accessing Avro-format data from an object store is very similar to accessing Avro-format data in HDFS. This topic identifies object store-specific information required to read Avro data, and links to the [PXF HDFS Avro documentation](hdfs_avro.html) where appropriate for common information.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Object Store [Prerequisites](access_objstore.html#objstore_prereq) before you attempt to read data from an object store.
## <a id="avro_work"></a>Working with Avro Data
Refer to [Working with Avro Data](hdfs_avro.html#avro_work) in the PXF HDFS Avro documentation for a description of the Apache Avro data serialization framework.
When you read or write Avro data in an object store:
- If the Avro schema file resides in the object store:
- You must include the bucket in the schema file path. This bucket need not be the same bucket as the Avro data file.
- The secrets that you specify in the `SERVER` configuration must provide access to both the data file and schema file buckets.
- The schema file path must not include spaces.
## <a id="avro_cet"></a>Creating the External Table
Use the `<objstore>:avro` profiles to read and write Avro-format files in an object store. PXF supports the following `<objstore>` profile prefixes:
| Object Store | Profile Prefix |
|-------|-------------------------------------|
| Azure Blob Storage | wasbs |
| Azure Data Lake | adl |
| Google Cloud Storage | gs |
| Minio | s3 |
| S3 | s3 |
The following syntax creates a Greenplum Database external table that references an Avro-format file:
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:avro&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;file\> | The absolute path to the directory or file in the object store. |
| PROFILE=\<objstore\>:avro | The `PROFILE` keyword must identify the specific object store. For example, `s3:avro`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. |
| \<custom&#8209;option\>=\<value\> | Avro-specific custom options are described in the [PXF HDFS Avro documentation](hdfs_avro.html#customopts). |
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read).|
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the `CREATE EXTERNAL TABLE` command as described in [Overriding the S3 Server Configuration with DDL](access_s3.html#s3_override).
## <a id="example"></a>Example
Refer to [Example: Reading Avro Data](hdfs_avro.html#avro_example) in the PXF HDFS Avro documentation for an Avro example. Modifications that you must make to run the example with an object store include:
- Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
``` shell
$ aws s3 cp /tmp/pxf_avro.avro s3://BUCKET/pxf_examples/
```
- Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above. For example, if your server name is `s3srvcfg`:
``` sql
CREATE EXTERNAL TABLE pxf_s3_avro(id bigint, username text, followers text, fmap text, relationship text, address text)
LOCATION ('pxf://BUCKET/pxf_examples/pxf_avro.avro?PROFILE=s3:avro&SERVER=s3srvcfg&COLLECTION_DELIM=,&MAPKEY_DELIM=:&RECORDKEY_DELIM=:')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
You make similar modifications to follow the steps in [Example: Writing Avro Data](hdfs_avro.html#topic_avro_writedata).
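For instance, a sketch of a writable external table that targets the same bucket (the write directory name is illustrative):
``` sql
-- Sketch only: avro_write is a hypothetical write directory
CREATE WRITABLE EXTERNAL TABLE pxf_s3_avro_write(id bigint, username text)
  LOCATION ('pxf://BUCKET/pxf_examples/avro_write?PROFILE=s3:avro&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
INSERT INTO pxf_s3_avro_write VALUES (1, 'alice'), (2, 'bob');
```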
---
title: Configuring Connectors to Azure and Google Cloud Storage Object Stores (Optional)
---
You can use PXF to access Azure Data Lake, Azure Blob Storage, and Google Cloud Storage object stores. This topic describes how to configure the PXF connectors to these external data sources.
*If you do not plan to use these PXF object store connectors, then you do not need to perform this procedure.*
## <a id="about_cfg"></a>About Object Store Configuration
To access data in an object store, you must provide a server location and client credentials. When you configure a PXF object store connector, you add at least one named PXF server configuration for the connector as described in [Configuring PXF Servers](cfg_server.html).
PXF provides a template configuration file for each object store connector. These template files are located in the `$PXF_CONF/templates/` directory.
### <a id="abs_cfg"></a>Azure Blob Storage Server Configuration
The template configuration file for Azure Blob Storage is `$PXF_CONF/templates/wasbs-site.xml`. When you configure an Azure Blob Storage server, you must provide the following server configuration properties, substituting your account name in the property name and your credentials in the template values:
| Property | Description | Value |
|----------------|--------------------------------------------|-------|
| fs.adl.oauth2.access.token.provider.type | The token type. | Must specify `ClientCredential`. |
| fs.azure.account.key.\<YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME\>.blob.core.windows.net | The Azure account key. | Replace \<YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME\> in the property name with your account name, and specify your account key as the property value. |
| fs.AbstractFileSystem.wasbs.impl | The file system class name. | Must specify `org.apache.hadoop.fs.azure.Wasbs`. |
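For reference, a populated `wasbs-site.xml` might look like the following sketch; the account name and key are placeholders, and your copy of the template may contain additional properties:
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.azure.account.key.YOUR_ACCOUNT_NAME.blob.core.windows.net</name>
        <value>YOUR_AZURE_ACCOUNT_KEY</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.wasbs.impl</name>
        <value>org.apache.hadoop.fs.azure.Wasbs</value>
    </property>
</configuration>
```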
### <a id="adl_cfg"></a>Azure Data Lake Server Configuration
The template configuration file for Azure Data Lake is `$PXF_CONF/templates/adl-site.xml`. When you configure an Azure Data Lake server, you must provide the following server configuration properties and replace the template values with your credentials:
| Property | Description | Value |
|----------------|--------------------------------------------|-------|
| fs.adl.oauth2.access.token.provider.type | The type of token. | Must specify `ClientCredential`. |
| fs.adl.oauth2.refresh.url | The Azure endpoint to which to connect. | Your refresh URL. |
| fs.adl.oauth2.client.id | The Azure account client ID. | Your client ID (UUID). |
| fs.adl.oauth2.credential | The password for the Azure account client ID. | Your password. |
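For reference, a populated `adl-site.xml` based on the properties above might look like the following sketch; all values other than the token provider type are placeholders:
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.adl.oauth2.access.token.provider.type</name>
        <value>ClientCredential</value>
    </property>
    <property>
        <name>fs.adl.oauth2.refresh.url</name>
        <value>YOUR_REFRESH_URL</value>
    </property>
    <property>
        <name>fs.adl.oauth2.client.id</name>
        <value>YOUR_CLIENT_ID</value>
    </property>
    <property>
        <name>fs.adl.oauth2.credential</name>
        <value>YOUR_CREDENTIAL</value>
    </property>
</configuration>
```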
### <a id="gcs_cfg"></a>Google Cloud Storage Server Configuration
The template configuration file for Google Cloud Storage is `$PXF_CONF/templates/gs-site.xml`. When you configure a Google Cloud Storage server, you must provide the following server configuration properties and replace the template values with your credentials:
| Property | Description | Value |
|----------------|--------------------------------------------|-------|
| google.cloud.auth.service.account.enable | Enable service account authorization. | Must specify `true`. |
| google.cloud.auth.service.account.json.keyfile | The Google Storage key file. | Path to your key file. |
| fs.AbstractFileSystem.gs.impl | The file system class name. | Must specify `com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS`. |
## <a id="cfg_proc"></a>Example Server Configuration Procedure
Ensure that you have initialized PXF before you configure an object store connector server.
In this procedure, you name and add a PXF server configuration in the `$PXF_CONF/servers` directory on the Greenplum Database master host for the Google Cloud Storage (GCS) connector. You then use the `pxf cluster sync` command to sync the server configuration(s) to the Greenplum Database cluster.
1. Log in to your Greenplum Database master node:
``` shell
$ ssh gpadmin@<gpmaster>
```
2. Choose a name for the server. You will provide the name to end users that need to reference files in the object store.
3. Create the `$PXF_CONF/servers/<server_name>` directory. For example, use the following command to create a server configuration for a Google Cloud Storage server named `gs_public`:
``` shell
gpadmin@gpmaster$ mkdir $PXF_CONF/servers/gs_public
```
4. Copy the PXF template file for GCS to the server configuration directory. For example:
``` shell
gpadmin@gpmaster$ cp $PXF_CONF/templates/gs-site.xml $PXF_CONF/servers/gs_public/
```
5. Open the template server configuration file in the editor of your choice, and provide appropriate property values for your environment. For example, if your Google Cloud Storage key file is located in `/home/gpadmin/keys/gcs-account.key.json`:
``` pre
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>google.cloud.auth.service.account.enable</name>
<value>true</value>
</property>
<property>
<name>google.cloud.auth.service.account.json.keyfile</name>
<value>/home/gpadmin/keys/gcs-account.key.json</value>
</property>
<property>
<name>fs.AbstractFileSystem.gs.impl</name>
<value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
</configuration>
```
6. Save your changes and exit the editor.
7. Use the `pxf cluster sync` command to copy the new server configurations to the Greenplum Database cluster. For example:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
```
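After the sync completes, end users reference the server by name when they create external tables. For example, a sketch of a table that reads comma-delimited text from a bucket through the `gs_public` server (the bucket and file names are hypothetical; see the object store reading topics for full syntax):
``` sql
-- Sketch only: bucket and file names are hypothetical
CREATE EXTERNAL TABLE pxf_gs_example(location text, month text, num_orders int)
  LOCATION ('pxf://YOUR_GCS_BUCKET/pxf_examples/sample.csv?PROFILE=gs:text&SERVER=gs_public')
FORMAT 'TEXT' (delimiter=',');
```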
---
title: Reading a Multi-Line Text File into a Single Table Row
---
The PXF object store connectors support reading a multi-line text file as a single table row. This section describes how to use PXF to read multi-line text and JSON data files in an object store, including how to create an external table that references multiple files in the store.
PXF supports reading only text and JSON files in this manner.
**Note**: Accessing multi-line files from an object store is very similar to accessing multi-line files in HDFS. This topic identifies the object store-specific information required to read these files. Refer to the [PXF HDFS documentation](hdfs_fileasrow.html) for more information.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Object Store [Prerequisites](access_objstore.html#objstore_prereq) before you attempt to read data from multiple files residing in an object store.
## <a id="objmulti_cet"></a>Creating the External Table
Use the `<objstore>:text:multi` profile to read multiple files in an object store, each into a single table row. PXF supports the following `<objstore>` profile prefixes:
| Object Store | Profile Prefix |
|-------|-------------------------------------|
| Azure Blob Storage | wasbs |
| Azure Data Lake | adl |
| Google Cloud Storage | gs |
| Minio | s3 |
| S3 | s3 |
The following syntax creates a Greenplum Database readable external table that references one or more text files in an object store:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> text|json | LIKE <other_table> )
LOCATION ('pxf://<path-to-files>?PROFILE=<objstore>:text:multi&SERVER=<server_name>[&IGNORE_MISSING_PATH=<boolean>]&FILE_AS_ROW=true')
FORMAT 'CSV';
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;files\> | The absolute path to the directory or files in the object store. |
| PROFILE=\<objstore\>:text:multi | The `PROFILE` keyword must identify the specific object store. For example, `s3:text:multi`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. |
| IGNORE_MISSING_PATH=\<boolean\> | Specify the action to take when \<path-to-files\> is missing or invalid. The default value is `false`; PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |
| FILE\_AS\_ROW=true | The required option that instructs PXF to read each file into a single table row. |
| FORMAT | The `FORMAT` must specify `'CSV'`. |
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the `CREATE EXTERNAL TABLE` command as described in [Overriding the S3 Server Configuration with DDL](access_s3.html#s3_override).
## <a id="example"></a>Example
Refer to [Example: Reading an HDFS Text File into a Single Table Row](hdfs_fileasrow.html#example_fileasrow) in the PXF HDFS documentation for an example. Modifications that you must make to run the example with an object store include:
- Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
``` shell
$ aws s3 cp /tmp/file1.txt s3://BUCKET/pxf_examples/tdir/
$ aws s3 cp /tmp/file2.txt s3://BUCKET/pxf_examples/tdir/
$ aws s3 cp /tmp/file3.txt s3://BUCKET/pxf_examples/tdir/
```
- Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above. For example, if your server name is `s3srvcfg`:
``` sql
CREATE EXTERNAL TABLE pxf_readfileasrow_s3( c1 text )
LOCATION('pxf://BUCKET/pxf_examples/tdir?PROFILE=s3:text:multi&SERVER=s3srvcfg&FILE_AS_ROW=true')
FORMAT 'CSV';
```
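Querying the table then returns one row per file that you copied to the `tdir` directory, for example:
``` sql
SELECT c1 FROM pxf_readfileasrow_s3;
```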
---
title: Reading JSON Data from an Object Store
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF object store connectors support reading JSON-format data. This section describes how to use PXF to access JSON data in an object store, including how to create and query an external table that references a JSON file in the store.
**Note**: Accessing JSON-format data from an object store is very similar to accessing JSON-format data in HDFS. This topic identifies object store-specific information required to read JSON data, and links to the [PXF HDFS JSON documentation](hdfs_json.html) where appropriate for common information.
## <a id="prereq"></a>Prerequisites
Ensure that you have met the PXF Object Store [Prerequisites](access_objstore.html#objstore_prereq) before you attempt to read data from an object store.
## <a id="json_work"></a>Working with JSON Data
Refer to [Working with JSON Data](hdfs_json.html#hdfsjson_work) in the PXF HDFS JSON documentation for a description of the JSON text-based data-interchange format.
## <a id="json_cet"></a>Creating the External Table
Use the `<objstore>:json` profile to read JSON-format files from an object store. PXF supports the following `<objstore>` profile prefixes:
| Object Store | Profile Prefix |
|-------|-------------------------------------|
| Azure Blob Storage | wasbs |
| Azure Data Lake | adl |
| Google Cloud Storage | gs |
| Minio | s3 |
| S3 | s3 |
The following syntax creates a Greenplum Database readable external table that references a JSON-format file:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>?PROFILE=<objstore>:json&SERVER=<server_name>[&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
The specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;file\> | The absolute path to the directory or file in the object store. |
| PROFILE=\<objstore\>:json | The `PROFILE` keyword must identify the specific object store. For example, `s3:json`. |
| SERVER=\<server_name\> | The named server configuration that PXF uses to access the data. |
| \<custom&#8209;option\>=\<value\> | JSON supports the custom option named `IDENTIFIER` as described in the [PXF HDFS JSON documentation](hdfs_json.html#customopts). |
| FORMAT 'CUSTOM' | Use `FORMAT` `'CUSTOM'` with the `<objstore>:json` profile. The `CUSTOM` `FORMAT` requires that you specify `(FORMATTER='pxfwritable_import')`. |
If you are accessing an S3 object store, you can provide S3 credentials via custom options in the `CREATE EXTERNAL TABLE` command as described in [Overriding the S3 Server Configuration with DDL](access_s3.html#s3_override).
## <a id="example"></a>Example
Refer to [Loading the Sample JSON Data to HDFS](hdfs_json.html#jsontohdfs) and [Example: Reading a JSON File with Single Line Records](hdfs_json.html#jsonexample1) in the PXF HDFS JSON documentation for a JSON example. Modifications that you must make to run the example with an object store include:
- Copying the file to the object store instead of HDFS. For example, to copy the file to S3:
``` shell
$ aws s3 cp /tmp/singleline.json s3://BUCKET/pxf_examples/
$ aws s3 cp /tmp/multiline.json s3://BUCKET/pxf_examples/
```
- Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and settings described above. For example, if your server name is `s3srvcfg`:
``` sql
CREATE EXTERNAL TABLE singleline_json_s3(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://BUCKET/pxf_examples/singleline.json?PROFILE=s3:json&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
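As in the HDFS example, you can then query the external table directly. For example:
``` sql
SELECT id_str, "user.location" FROM singleline_json_s3;
```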
---
title: PXF Utility Reference
---
The Greenplum Platform Extension Framework (PXF) includes the following utility reference pages:
- [pxf cluster](pxf-cluster.html)
- [pxf](pxf.html)
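Both utilities run from the PXF installation on the Greenplum Database master host. As a quick orientation, invocations look like the following sketch; refer to the reference pages above for the full set of subcommands and options:
``` shell
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster start
```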