Commit 79af424a authored by Lisa Owen, committed by David Yozie

docs - add pxf content into gpdb doc set (#3486)

* docs - integrate pxf markdown docs into gpdb dita

* incorporate initial review edits from david

* updates for david review (PART 2)

* incorporate comments from shivram

* hdfs/hive installcfg - copy cfg files from hdfs cluster

* update hadoop cluster dir for copy example, cfg chg note

* client inst/cfg - reiterate must be tarball in step
Parent ab71b4b1
<div id="sub-nav" class="js-sidenav nav-container" role="navigation">
<a class="sidenav-title" data-behavior="SubMenuMobile"> Doc Index</a>
<div class="nav-content">
<ul>
<li class="has_submenu">
<a href="/docs/pxf/intro_pxf.html" format="markdown">PXF Extension Framework</a>
<ul>
<li>
<a href="/docs/pxf/intro_pxf.html" format="markdown">Introducing the PXF Extension Framework</a>
</li>
<li class="has_submenu">
<a href="/docs/pxf/instcfg_pxf.html" format="markdown">Installing and Configuring PXF</a>
<ul>
<li>
<a href="/docs/pxf/hdfsclient_instcfg.html" format="markdown">Installing and Configuring the Hadoop Client for PXF</a>
</li>
<li>
<a href="/docs/pxf/hiveclient_instcfg.html" format="markdown">Installing and Configuring the Hive Client for PXF</a>
</li>
<li>
<a href="/docs/pxf/cfginitstart_pxf.html" format="markdown">Configuring, Initializing, and Starting PXF</a>
</li>
</ul>
</li>
<li>
<a href="/docs/pxf/using_pxf.html" format="markdown">Using PXF</a>
</li>
<li>
<a href="/docs/pxf/hdfs_read_pxf.html" format="markdown">Reading HDFS File Data with PXF</a>
</li>
<li>
<a href="/docs/pxf/hive_pxf.html" format="markdown">Accessing Hive Table Data with PXF</a>
</li>
<li>
<a href="/docs/pxf/troubleshooting_pxf.html" format="markdown">Troubleshooting PXF</a>
</li>
</ul>
</li>
</ul>
</div>
</div>
<!--end of sub-nav-->
......@@ -8,6 +8,7 @@
<topicref href="g-gpfdist-protocol.xml"/>
<topicref href="g-gpfdists-protocol.xml"/>
<topicref href="g-gphdfs-protocol.xml"/>
<topicref href="g-pxf-protocol.xml"/>
<topicref href="g-s3-protocol.xml"/>
<topicref href="g-accessing-ext-files-custom-protocol.xml"/>
<topicref href="g-handling-errors-ext-table-data.xml"/>
......@@ -26,16 +27,17 @@
type="topic"/>
</topicref>
</topicref>
<topicref href="pxf-overview.xml" navtitle="Accessing HDFS and Hive Data with PXF"/>
<topicref href="g-using-hadoop-distributed-file-system--hdfs--tables.xml" type="topic">
<topicref href="g-one-time-hdfs-protocol-installation.xml" type="topic"/>
<topicref href="g-grant-privileges-for-the-hdfs-protocol.xml" type="topic"/>
<topicref href="g-specify-hdfs-data-in-an-external-table-definition.xml" type="topic"/>
<topicref href="g-hdfs-avro-format.xml"/>
<topicref href="g-hdfs-parquet-format.xml"/>
<topicref href="g-hdfs-readable-and-writable-external-table-examples.xml" type="topic"/>
<topicref href="g-reading-and-writing-custom-formatted-hdfs-data.xml" type="topic"/>
<topicref href="g-hdfs-emr-config.xml"/>
</topicref>
<topicref href="g-one-time-hdfs-protocol-installation.xml" type="topic"/>
<topicref href="g-grant-privileges-for-the-hdfs-protocol.xml" type="topic"/>
<topicref href="g-specify-hdfs-data-in-an-external-table-definition.xml" type="topic"/>
<topicref href="g-hdfs-avro-format.xml"/>
<topicref href="g-hdfs-parquet-format.xml"/>
<topicref href="g-hdfs-readable-and-writable-external-table-examples.xml" type="topic"/>
<topicref href="g-reading-and-writing-custom-formatted-hdfs-data.xml" type="topic"/>
<topicref href="g-hdfs-emr-config.xml"/>
</topicref>
<topicref href="g-using-the-greenplum-parallel-file-server--gpfdist-.xml"/>
</topicref>
</map>
......@@ -3,8 +3,8 @@
PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<topic id="topic3">
<title>Defining External Tables</title>
<shortdesc>External tables enable accessing external files as if they are regular database
tables. They are often used to move data into and out of a Greenplum database.</shortdesc>
<shortdesc>External tables enable accessing external data as if it were a regular database
table. They are often used to move data into and out of a Greenplum database.</shortdesc>
<body>
<p>To create an external table definition, you specify the format of your input files and the
location of your external data sources. For information about input file formats, see <xref
......@@ -25,6 +25,8 @@
href="g-hdfs-emr-config.xml#amazon-emr"/>.</p></li>
<li><codeph>s3://</codeph> accesses files in an Amazon S3 bucket. See <xref
href="g-s3-protocol.xml#amazon-emr"/>.</li>
<li>The <codeph>pxf://</codeph> protocol accesses external HDFS files and Hive tables using the PXF Extension Framework. See <xref href="g-pxf-protocol.xml"></xref>.</li>
</ul></p>
<p>External tables access external files from within the database as if they are regular
database tables. External tables defined with the
......
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_z5g_l5h_kr1313">
<title>pxf:// Protocol</title>
<shortdesc>You can use the PXF Extension Framework <codeph>pxf://</codeph> protocol to access data on external HDFS and Hive systems.</shortdesc>
<body>
<p>The PXF Extension Framework <codeph>pxf</codeph> protocol is packaged as a Greenplum Database extension. The <codeph>pxf</codeph> protocol supports reading HDFS file and Hive table data. The protocol does not yet support writing to HDFS or Hive data stores.</p>
<p>When you use the <codeph>pxf</codeph> protocol to query HDFS and Hive systems, you specify the HDFS file or Hive table you want to access. PXF requests the data from HDFS and delivers the relevant portions in parallel to each Greenplum Database segment instance serving the query.</p>
<p>You must explicitly initialize and start the PXF Extension Framework before you can use the <codeph>pxf</codeph> protocol to read external data. You must also grant permissions on the <codeph>pxf</codeph> protocol and enable PXF in each database in which you want to create external tables to access external data.</p>
<p>For detailed information about configuring and using the PXF Extension Framework and the <codeph>pxf</codeph> protocol, refer to <xref href="pxf-overview.xml" type="topic">Accessing HDFS and Hive Data with PXF</xref>.</p>
</body>
</topic>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_u14_wtd_dbb">
<title>Working with Hadoop Distributed File System (HDFS) Data</title>
<shortdesc>Greenplum Database leverages the parallel architecture of a Hadoop Distributed File
System to read and write data files efficiently using the <codeph>gphdfs</codeph> protocol or
PXF framework. </shortdesc>
<body>
<p>
<note>PXF is an experimental feature and is not recommended for production use.</note>
</p>
</body>
</topic>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_u14_wtd_dbb">
<title>Accessing HDFS and Hive Data with PXF </title>
<shortdesc>Data managed by your organization may already reside in external sources. The Greenplum Database PXF Extension Framework (PXF) provides access to this external data via built-in connectors that map an external data source to a Greenplum Database table definition.</shortdesc>
<body>
<p>PXF is installed with HDFS and Hive connectors. These connectors enable you to read external HDFS file system and Hive table data stored in text, Avro, RCFile, Parquet, SequenceFile, and ORC formats.</p>
<p>The PXF Extension Framework includes a protocol C library and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Greenplum Database segment host. This long-running process concurrently serves multiple query requests.</p>
<p>For detailed information about the architecture and use of the PXF Extension Framework, refer to the <xref href="../../pxf/overview_pxf.html" type="topic" format="html">Using PXF with External Data</xref> documentation.</p>
</body>
</topic>
......@@ -15,6 +15,7 @@
       | ('gpfdists://<varname>filehost</varname>[:<varname>port</varname>]/<varname>file_pattern</varname>[#transform=<varname>trans_name</varname>]'
           [, ...])
       | ('gphdfs://<varname>hdfs_host</varname>[:port]/<varname>path</varname>/<varname>file</varname>')
        | ('pxf://<varname>path-to-data</varname>?<varname>PROFILE</varname>[&amp;<varname>custom-option</varname>=<varname>value</varname>[...]]')
       | ('s3://<varname>S3_endpoint</varname>[:<varname>port</varname>]/<varname>bucket_name</varname>/[<varname>S3_prefix</varname>]
[region=<varname>S3-region</varname>]
[config=<varname>config_file</varname>]')
......@@ -206,6 +207,17 @@ CREATE WRITABLE EXTERNAL WEB TABLE <varname>table_name</varname>
</plentry>
<plentry>
<pt>LOCATION <varname>('protocol://host[:port]/path/file' [, ...])</varname></pt>
<pd>If you use the <codeph>gphdfs</codeph> protocol to read or write a file to a Hadoop
file system (HDFS), refer to <xref
href="../../admin_guide/external/g-specify-hdfs-data-in-an-external-table-definition.xml"/> for additional information about
the <codeph>gphdfs</codeph> protocol <codeph>LOCATION</codeph> clause syntax.</pd>
<pd>If you use the <codeph>pxf</codeph> protocol to access an external data source,
refer to the <xref href="../../pxf/using_pxf.html#creatinganexternaltable" format="html">Creating an External Table Using PXF</xref>
documentation for detailed information about
the <codeph>pxf</codeph> protocol <codeph>LOCATION</codeph> clause syntax.</pd>
<pd>If you use the <codeph>s3</codeph> protocol to read or write to S3, refer to <xref
href="../../admin_guide/external/g-s3-protocol.xml#amazon-emr__section_stk_c2r_kx">About the S3 Protocol URL</xref> for additional information about
the <codeph>s3</codeph> protocol <codeph>LOCATION</codeph> clause syntax.</pd>
<pd>For readable external tables, specifies the URI of the external data source(s) to be
used to populate the external table or web table. Regular readable external tables allow
the <codeph>gpfdist</codeph> or <codeph>file</codeph> protocols. External web tables
......@@ -246,20 +258,13 @@ CREATE WRITABLE EXTERNAL WEB TABLE <varname>table_name</varname>
about specifying a transform, see <xref
href="../../utility_guide/admin_utilities/gpfdist.xml#topic1"
><codeph>gpfdist</codeph></xref> in the <cite>Greenplum Utility Guide</cite>. </pd>
<pd>If you use the <codeph>gphdfs</codeph> protocol to read or write a file to a Hadoop
file system (HDFS), refer to <xref
href="../../admin_guide/external/g-using-hadoop-distributed-file-system--hdfs--tables.xml"/> for detailed information about
the <codeph>gphdfs</codeph> protocol <codeph>LOCATION</codeph> clause syntax.</pd>
<pd>If you use the <codeph>s3</codeph> protocol to read or write to S3, refer to <xref
href="../../admin_guide/external/g-s3-protocol.xml#amazon-emr"/> for detailed information about
the <codeph>s3</codeph> protocol <codeph>LOCATION</codeph> clause syntax.</pd>
</plentry>
<plentry>
<pt>ON MASTER</pt>
<pd>Restricts all table-related operations to the Greenplum master segment. Permitted only
on readable and writable external tables created with the <codeph>s3</codeph> or custom
protocols. The <codeph>gpfdist</codeph>, <codeph>gpfdists</codeph>,
<codeph>gphdfs</codeph>, and <codeph>file</codeph> protocols do not support <codeph>ON
<codeph>gphdfs</codeph>, <codeph>pxf</codeph>, and <codeph>file</codeph> protocols do not support <codeph>ON
MASTER</codeph>. <note>Be aware of potential resource impacts when reading from or
writing to external tables you create with the <codeph>ON MASTER</codeph> clause. You
may encounter performance issues when you restrict table operations solely to the
......@@ -320,10 +325,10 @@ CREATE WRITABLE EXTERNAL WEB TABLE <varname>table_name</varname>
href="../../admin_guide/external/g-using-hadoop-distributed-file-system--hdfs--tables.xml"/>
for detailed information about the <codeph>gphdfs</codeph> protocol <codeph>FORMAT</codeph>
clause syntax.</pd>
<pd>The <codeph>s3</codeph> protocol supports only the
<codeph>TEXT</codeph> and <codeph>CSV</codeph> formats. Refer to <xref
href="../../admin_guide/external/g-s3-protocol.xml#amazon-emr"/> for detailed information about
the <codeph>s3</codeph> protocol <codeph>FORMAT</codeph> clause syntax.</pd>
<pd>If you use the <codeph>pxf</codeph> protocol to access an external data source,
refer to the PXF <xref href="../../pxf/using_pxf.html#creatinganexternaltable" format="html">Creating an External Table Using PXF</xref>
documentation for detailed information about
the <codeph>pxf</codeph> protocol <codeph>FORMAT</codeph> clause syntax.</pd>
</plentry>
<plentry>
<pt>FORMAT 'CUSTOM' (formatter=<varname>formatter_specification</varname>)</pt>
......@@ -331,7 +336,11 @@ CREATE WRITABLE EXTERNAL WEB TABLE <varname>table_name</varname>
specifies the function to use to format the data, followed by comma-separated parameters
to the formatter function. The length of the formatter specification, the string
including <codeph>Formatter=</codeph>, can be up to approximately 50K bytes.</pd>
<pd>For information about using a custom format, see "Loading and Unloading Data" in the
<pd>If you use the <codeph>pxf</codeph> protocol to access an external data source,
refer to the PXF <xref href="../../pxf/using_pxf.html#creatinganexternaltable" format="html">Creating an External Table Using PXF</xref>
documentation for detailed information about
the <codeph>pxf</codeph> protocol <codeph>FORMAT</codeph> clause syntax.</pd>
<pd>For general information about using a custom format, see "Loading and Unloading Data" in the
<cite>Greenplum Database Administrator Guide</cite>.</pd>
</plentry>
<plentry>
......@@ -389,6 +398,7 @@ CREATE WRITABLE EXTERNAL WEB TABLE <varname>table_name</varname>
<pd>For the <codeph>s3</codeph> protocol, the column names in the header row cannot
contain a newline character (<codeph>\n</codeph>) or a carriage return
(<codeph>\r</codeph>).</pd>
<pd>The <codeph>pxf</codeph> protocol does not support the <codeph>HEADER</codeph> formatting option.</pd>
</plentry>
<plentry>
<pt>QUOTE</pt>
......
---
title: Configuring, Initializing, and Starting PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF Extension Framework is composed of a Greenplum Database protocol and a Java service that map an external data source to a table definition. This topic describes how to configure, initialize, and start the PXF Extension Framework.
## <a id="install_info"></a>Installing PXF
The PXF Extension Framework is installed on your master node when you install Greenplum Database. You install PXF on your Greenplum Database segment hosts when you invoke the `gpseginstall` command.
You must explicitly initialize and start PXF before you can use the framework. You must also explicitly enable PXF in each database in which you plan to use it.
### <a id="pxf-install-dirs"></a>PXF Install Files/Directories
The following PXF files and directories are installed in your Greenplum Database cluster. These files/directories are relative to `$GPHOME`:
| Directory | Description |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `pxf` | The PXF installation directory. |
| `pxf/apache-tomcat` | The PXF tomcat directory. |
| `pxf/bin` | The PXF script and executable directory. |
| `pxf/conf` | The PXF configuration directory. This directory contains the `pxf-private.classpath` and `pxf-profiles.xml` configuration files. |
| `pxf/conf-templates` | Configuration templates for PXF. |
| `pxf/lib` | The PXF library directory. |
| `pxf/logs` | The PXF log file directory. Includes `pxf-service.log` and Tomcat-related logs, including `catalina.out`. The log directory and log files are readable only by the `gpadmin` user. |
| `pxf/pxf-service` | After initializing PXF, the PXF service instance directory. |
| `pxf/run` | After starting PXF, the PXF run directory. Includes a PXF catalina process id file. |
| `pxf/tomcat-templates` | Tomcat templates for PXF. |
## <a id="init_pxf"></a>Initializing PXF
You must explicitly initialize the PXF service instance. This one-time initialization creates the PXF service web application. It also updates PXF configuration files to include information specific to your Hadoop cluster configuration.
### <a id="init-pxf-prereq"></a>Prerequisites
Before initializing PXF in your Greenplum Database cluster, ensure that you have:
- Installed and configured the required Hadoop client on each Greenplum Database segment host. Refer to [Installing and Configuring the Hadoop Client for PXF](hdfsclient_instcfg.html) for instructions.
- Granted the `gpadmin` operating system user read permission on the relevant portions of your HDFS file system.
- Installed and configured the required Hive client on each Greenplum Database segment host if you plan to use the PXF Hive connector. Refer to [Installing and Configuring the Hive Client for PXF](hiveclient_instcfg.html) for instructions.
- Located and noted the full file system paths to the base install directories of the Hadoop and Hive clients; the procedures below refer to these as `$PXF_HADOOP_HOME` and `$PXF_HIVE_HOME`.
### <a id="init-pxf-steps"></a>Procedure
Perform the following procedure to initialize PXF on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility to run a command on multiple hosts.
1. Log in to the Greenplum Database master node and set up your environment:
``` shell
$ ssh gpadmin@<gpmaster>
gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
```
2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include:
``` pre
seghost1
seghost2
seghost3
```
3. Install the `unzip` package on each segment host:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "sudo yum -y install unzip"
```
4. Run the `pxf init` command to initialize the PXF service on each segment host. Provide your Hadoop base install directory in the `--hadoop-home` option value. If you plan to use the PXF Hive connector, also provide a `--hive-home` option and value. For example:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf init --hadoop-home \$PXF_HADOOP_HOME --hive-home \$PXF_HIVE_HOME"
```
The `init` command creates and initializes the PXF web application. It also updates the `pxf-private.classpath` file to include entries for your Hadoop distribution JAR files.
## <a id="start_pxf"></a>Starting PXF
After initializing PXF, you must explicitly start PXF on each segment host in your Greenplum Database cluster. The PXF service, once started, runs as the `gpadmin` user on default port 51200. Only the `gpadmin` user can start and stop the PXF service.
Perform the following procedure to start PXF on each segment host in your Greenplum Database cluster. You will use the `gpssh` command and a `seghostfile` to run the command on multiple hosts.
1. Log in to the Greenplum Database master node and set up your environment:
``` shell
$ ssh gpadmin@<gpmaster>
gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
```
2. Run the `pxf start` command to start PXF on each segment host. For example:
```shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf start"
```
## <a id="pxf_svc_mgmt"></a>PXF Service Management
The `pxf` command supports `stop`, `restart`, and `status` operations in addition to `init` and `start`. These operations run locally. That is, if you want to start or stop the PXF agent on a specific segment host, you can log in to the host and run the command. If you want to start or stop the PXF agent on multiple segment hosts, use the `gpssh` utility as shown above, or individually log in to each segment host and run the command.
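For example, a minimal sketch of checking status across the cluster from the master and restarting the agent on a single host; the host name `seghost1` is illustrative, and the install path and `seghostfile` match the earlier steps:
``` shell
# Check PXF status on every segment host from the master
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf status"
# Restart the PXF agent on one (illustrative) segment host
gpadmin@gpmaster$ ssh gpadmin@seghost1
gpadmin@seghost1$ /usr/local/greenplum-db/pxf/bin/pxf restart
```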
**Note**: If you update your Hadoop or Hive configuration while the PXF service is running, you must copy any updated configuration files to each Greenplum Database segment host and restart PXF on each host.
\ No newline at end of file
This diff is collapsed.
---
title: Writing HDFS File Data with PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF HDFS connector supports writable external tables using the `HdfsTextSimple` and `SequenceWritable` profiles. You might create a writable table to export data from a Greenplum Database internal table to binary or text HDFS files.
Use the `HdfsTextSimple` profile when writing text data. Use the `SequenceWritable` profile when dealing with binary data.
This section describes how to use these PXF profiles to write data to HDFS.
Note: Tables that you create with writable profiles can only be used for `INSERT` operations. If you want to query inserted data, you must create a separate external readable table that references the new HDFS file, specifying the equivalent readable profile.
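The following sketch illustrates the pattern this note describes, assuming a hypothetical HDFS directory `data/pxf_examples/sales` and the `HdfsTextSimple` profile; the writable and readable syntax details are covered in the profile sections below:
``` sql
-- Writable external table; usable for INSERT only (hypothetical HDFS path)
CREATE WRITABLE EXTERNAL TABLE sales_out (location text, month text, amount numeric)
  LOCATION ('pxf://data/pxf_examples/sales?PROFILE=HdfsTextSimple')
  FORMAT 'TEXT' (delimiter=',');
INSERT INTO sales_out VALUES ('Frankfurt', 'Mar', 3956.98);
-- Separate readable external table over the same HDFS directory, using the
-- equivalent readable profile, to query the data written above
CREATE EXTERNAL TABLE sales_in (location text, month text, amount numeric)
  LOCATION ('pxf://data/pxf_examples/sales?PROFILE=HdfsTextSimple')
  FORMAT 'TEXT' (delimiter=',');
SELECT * FROM sales_in;
```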
## <a id="hdfswrite_prereq"></a>Prerequisites
Before working with HDFS file data using PXF, ensure that:
- You have installed and configured a Hadoop client on each Greenplum Database segment host. Refer to [Installing and Configuring the Hadoop Client for PXF](hdfsclient_instcfg.html) for instructions.
- You have initialized and started PXF on your Greenplum Database segment hosts. See [Configuring, Initializing, and Starting PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information.
- You have granted the `gpadmin` user read and write permission to the appropriate directories in your HDFS file system.
## <a id="hdfswrite_writeextdata"></a>Writing to PXF External Tables
The PXF HDFS connector supports two writable profiles: `HdfsTextSimple` and `SequenceWritable`.
## <a id="hdfswrite_options"></a>Writing to PXF External Tables
## <a id="profile_options"></a>Custom Options
## <a id="profile_write_hdfstextsimple"></a>HdfsTextSimple Profile
### <a id="profile_write_hdfstextsimple_example"></a>Example: Writing Data Using the HdfsTextSimple Profile
## <a id="profile_write_sequencewritable"></a>SequenceWritable Profile
### <a id="profile_write_sequencewritable_example"></a>Example: Writing Data Using the SequenceWritable Profile
## <a id="read_recordkey"></a>Reading the Record Key
### <a id="read_recordkey_example"></a>Example: Using Record Keys
---
title: Installing and Configuring the Hadoop Client for PXF
---
You use the PXF HDFS connector to access HDFS file data. PXF requires a Hadoop client installation on each Greenplum Database segment host. The Hadoop client must be installed from a tarball.
This topic describes how to install and configure the Hadoop client for PXF access.
## <a id="hadoop-pxf-prereq"></a>Prerequisites
Compatible Hadoop clients for PXF include Cloudera, Hortonworks Data Platform, and generic Apache Hadoop.
Before setting up the Hadoop Client for PXF, ensure that you:
- Have `scp` access to the HDFS NameNode host on a running Hadoop cluster.
- Have access to or permission to install Java version 1.7 or 1.8 on each segment host.
## <a id="hadoop-pxf-config-steps"></a>Procedure
Perform the following procedure to install and configure the Hadoop client for PXF on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility where possible to run a command on multiple hosts.
1. Log in to your Greenplum Database master node and set up the environment:
``` shell
$ ssh gpadmin@<gpmaster>
gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
```
2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include:
``` pre
seghost1
seghost2
seghost3
```
3. If not already present, install Java on each Greenplum Database segment host. For example:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install java-1.8.0-openjdk-1.8.0*
```
4. Identify the Java base install directory. Update the `gpadmin` user's `.bash_profile` file on each segment host to include this `$JAVA_HOME` setting. For example:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre' >> /home/gpadmin/.bash_profile"
```
5. Download a compatible Hadoop client and install it on **each** Greenplum Database segment host. The Hadoop client must be a tarball distribution. You must install the same Hadoop client distribution in the same file system location on each host.
If you are running Cloudera Hadoop:
1. Download the Hadoop distribution:
``` shell
gpadmin@master$ wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.10.2.tar.gz -O /tmp/hadoop-2.6.0-cdh5.10.2.tar.gz
```
2. Copy the Cloudera Hadoop distribution to each Greenplum Database segment host. For example, to copy the distribution to the `/home/gpadmin` directory:
``` shell
gpadmin@master$ gpscp -v -f seghostfile /tmp/hadoop-2.6.0-cdh5.10.2.tar.gz =:/home/gpadmin
```
3. Unpack the Cloudera Hadoop distribution on each Greenplum Database segment host. For example:
``` shell
gpadmin@master$ gpssh -e -v -f seghostfile "tar zxf /home/gpadmin/hadoop-2.6.0-cdh5.10.2.tar.gz"
```
6. Ensure that the `gpadmin` user has read and execute permission on all Hadoop client libraries on each segment host. For example:
``` shell
gpadmin@master$ gpssh -e -v -f seghostfile "chmod -R 755 /home/gpadmin/hadoop-2.6.0-cdh5.10.2"
```
7. Locate the base install directory of the Hadoop client. Edit the `gpadmin` user's `.bash_profile` file on each segment host to include this `$PXF_HADOOP_HOME` setting. For example:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export PXF_HADOOP_HOME=/home/gpadmin/hadoop-2.6.0-cdh5.10.2' >> /home/gpadmin/.bash_profile"
```
8. The `fs.defaultFS` property in the Hadoop `core-site.xml` configuration file identifies the HDFS NameNode URI. PXF requires this information to access the Hadoop cluster. A sample `fs.defaultFS` setting follows:
``` xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://namenode.domain:8020</value>
</property>
```
Complete the PXF Hadoop client configuration by copying configuration files from your Hadoop cluster to each Greenplum Database segment host.
1. Copy the `core-site.xml` and `hdfs-site.xml` Hadoop configuration files from your Hadoop cluster NameNode host to the current host. For example:
``` shell
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml .
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml .
```
2. Next, copy these Hadoop configuration files to each Greenplum Database segment host. For example:
``` shell
gpadmin@gpmaster$ gpscp -v -f seghostfile core-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/core-site.xml
gpadmin@gpmaster$ gpscp -v -f seghostfile hdfs-site.xml =:\$PXF_HADOOP_HOME/etc/hadoop/hdfs-site.xml
```
9. The PXF HDFS connector supports the Avro file format. If you plan to access Avro format files, you must download a required JAR file and copy the JAR to each Greenplum Database segment host.
1. Download the Avro JAR file required by PXF:
``` shell
gpadmin@gpmaster$ wget "http://central.maven.org/maven2/org/apache/avro/avro-mapred/1.7.1/avro-mapred-1.7.1.jar"
```
2. Copy the Avro JAR file to each Greenplum Database segment host. You must copy the file to the `$PXF_HADOOP_HOME/share/hadoop/common/lib` directory. For example:
``` shell
gpadmin@gpmaster$ gpscp -v -f seghostfile avro-mapred-*.jar =:\$PXF_HADOOP_HOME/share/hadoop/common/lib
```
**Note**: If you update your Hadoop configuration while the PXF service is running, you must copy the updated `core-site.xml` and `hdfs-site.xml` files to each Greenplum Database segment host and restart PXF.
This diff is collapsed.
---
title: Installing and Configuring the Hive Client for PXF
---
You use the PXF Hive connector to access Hive table data. The PXF Hive connector requires a Hive client installation on each Greenplum Database segment host. You must install the Hive client from a tarball.
This topic describes how to install and configure the Hive client for PXF access.
## <a id="hive-pxf-prereq"></a>Prerequisites
Compatible Hive clients for PXF are Cloudera and Hortonworks Data Platform Hive.
Before setting up the Hive Client for PXF, ensure that you:
- Have `scp` access to a running Hadoop cluster with the Hive Metastore service.
- Have installed and configured a Hadoop client on each Greenplum Database segment host. Refer to [Installing and Configuring the Hadoop Client for PXF](hdfsclient_instcfg.html) for instructions.
## <a id="hive-pxf-config-steps"></a>Procedure
Perform the following procedure to install and configure the Hive client for PXF on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility where possible to run a command on multiple hosts.
1. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include:
``` pre
seghost1
seghost2
seghost3
```
2. Download a compatible Hive client and install it on **each** Greenplum Database segment host. The Hive client must be a tarball distribution. You must install the same Hive client distribution in the same file system location on each host.
If you are running Cloudera Hive:
1. Download the Hive distribution:
``` shell
gpadmin@master$ wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.10.2.tar.gz -O /tmp/hive-1.1.0-cdh5.10.2.tar.gz
```
2. Copy the Cloudera Hive distribution to each Greenplum Database segment host. For example, to copy the distribution to the `/home/gpadmin` directory:
``` shell
gpadmin@master$ gpscp -v -f seghostfile /tmp/hive-1.1.0-cdh5.10.2.tar.gz =:/home/gpadmin
```
3. Unpack the Cloudera Hive distribution on each Greenplum Database segment host. For example:
``` shell
gpadmin@master$ gpssh -e -v -f seghostfile "tar zxf /home/gpadmin/hive-1.1.0-cdh5.10.2.tar.gz"
```
4. Ensure that the `gpadmin` user has read and execute permission on all Hive client libraries on each segment host. For example:
``` shell
gpadmin@master$ gpssh -e -v -f seghostfile "chmod -R 755 /home/gpadmin/hive-1.1.0-cdh5.10.2"
```
3. Locate the base install directory of the Hive client. Edit the `gpadmin` user's `.bash_profile` file on each segment host to include this `$PXF_HIVE_HOME` setting. For example:
``` shell
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export PXF_HIVE_HOME=/home/gpadmin/hive-1.1.0-cdh5.10.2' >> /home/gpadmin/.bash_profile"
```
4. The `hive.metastore.uris` property in the Hive `hive-site.xml` configuration file identifies the Hive Metastore URI. PXF requires this information to access the Hive service. A sample `hive.metastore.uris` setting follows:
``` xml
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastorehost.domain:9083</value>
</property>
```
Complete the PXF Hive client configuration by copying the Hive configuration from your Hadoop cluster to each Greenplum Database segment host.
1. Copy the `hive-site.xml` Hive configuration file from your Hadoop cluster NameNode host to the current host. For example:
``` shell
gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hive/conf/hive-site.xml .
```
2. Next, copy the `hive-site.xml` configuration file to each Greenplum Database segment host. For example:
``` shell
gpadmin@gpmaster$ gpscp -v -f seghostfile hive-site.xml =:\$PXF_HIVE_HOME/conf/hive-site.xml
```
**Note**: If you update your Hive configuration while the PXF service is running, you must copy the updated `hive-site.xml` file to each Greenplum Database segment host and restart PXF.
---
title: Installing and Configuring PXF
---
The PXF Extension Framework provides connectors to Hadoop and Hive data stores. To use these PXF connectors, you must install Hadoop and Hive clients on each Greenplum Database segment host as described in these one-time installation and configuration procedures:
- **[Installing and Configuring the Hadoop Client for PXF](hdfsclient_instcfg.html)**
- **[Installing and Configuring the Hive Client for PXF](hiveclient_instcfg.html)**
You must also configure and initialize PXF itself, in addition to starting the service:
- **[Configuring, Initializing, and Starting PXF](cfginitstart_pxf.html)**
---
title: PXF Architecture
---
The PXF Extension Framework is composed of a Greenplum Database protocol and associated C client library plus a Java service. These components work together to provide you access to data stored in sources external to your Greenplum Database deployment.
Your Greenplum Database deployment consists of a master node and multiple segment hosts. After you configure and initialize PXF, you start a single PXF JVM process on each Greenplum Database segment host. This PXF process (referred to as the PXF agent) spawns a thread for each segment instance on a segment host that participates in a query against a PXF external table. Multiple segment instances on each segment host communicate via a REST API with PXF in parallel, and the PXF agents on multiple hosts communicate with HDFS in parallel.
<span class="figtitleprefix">Figure: </span>PXF Extension Framework Architecture
<img src="graphics/pxfarch.png" class="image" />
When a user or application performs a query on a PXF external table that references an HDFS file, the Greenplum Database master node dispatches the query to all segment hosts. Each segment instance contacts the PXF agent running on its host. When it receives the request from a segment instance, the PXF agent:
1. Spawns a thread for the segment instance.
2. Invokes the HDFS Java API to request metadata information for the HDFS file from the HDFS NameNode.
3. Provides the metadata information returned by the HDFS NameNode to the segment instance.
A segment instance uses its Greenplum Database `gp_segment_id` and the file block information described in the metadata to assign itself a specific portion of the query data. The segment instance then sends a request to the PXF agent to read the assigned data. This data may reside on one or more HDFS DataNodes.
The PXF agent invokes the HDFS Java API to read the data and deliver it to the segment instance. The segment instance delivers its portion of the data to the Greenplum Database master node. This communication occurs across segment hosts and segment instances in parallel.
---
title: Using PXF with External Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF Extension Framework (PXF) provides parallel, high throughput data access and federated queries across heterogeneous data sources via built-in connectors that map a Greenplum Database external table definition to an external data source. This Greenplum Database extension is based on [PXF](https://cwiki.apache.org/confluence/display/HAWQ/PXF) from Apache HAWQ (incubating).
- **[PXF Architecture](intro_pxf.html)**
This topic describes the architecture of PXF and its integration with Greenplum Database.
- **[Installing and Configuring PXF](instcfg_pxf.html)**
This topic details the PXF installation, configuration, and startup procedures.
- **[Using PXF](using_pxf.html)**
This topic describes important PXF procedures and concepts, including how to enable PXF in a database and how to define external tables that use the `pxf` protocol.
- **[Reading HDFS File Data](hdfs_read_pxf.html)**
This topic describes how to use the PXF HDFS connector and related profiles to read Text and Avro format HDFS files.
- **[Accessing Hive Table Data](hive_pxf.html)**
This topic describes how to use the PXF Hive connector and related profiles to read Hive tables stored in Text, RCFile, Parquet, and ORC storage formats.
- **[Troubleshooting PXF](troubleshooting_pxf.html)**
This topic details the service-level and database-level logging configuration procedures for PXF. It also identifies some common PXF errors.
---
title: Troubleshooting PXF
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
## <a id="pxf-errors"></a>PXF Errors
The following table describes some errors you may encounter while using PXF:
| Error Message | Discussion |
|-------------------------------|---------------------------------|
| Protocol "pxf" does not exist | **Cause**: The `pxf` extension was not registered.<br>**Remedy**: Create (enable) the PXF extension for the database as described in the PXF [Enable Procedure](using_pxf.html#enable-pxf-ext).|
| Invalid URI pxf://\<path-to-data\>: missing options section | **Cause**: The `LOCATION` URI does not include the profile or other required options.<br>**Remedy**: Provide the profile and required options in the URI. |
| org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://\<namenode\>:8020/\<path-to-file\> | **Cause**: The HDFS file you specified in \<path-to-file\> does not exist. <br>**Remedy**: Provide the path to an existing HDFS file. |
| NoSuchObjectException(message:\<schema\>.\<hivetable\> table not found) | **Cause**: The Hive table you specified with \<schema\>.\<hivetable\> does not exist. <br>**Remedy**: Provide the name of an existing Hive table. |
| Failed to connect to \<segment-host\> port 51200: Connection refused (libchurl.c:944) (\<segment-id\> slice\<N\> \<segment-host\>:40000 pid=\<process-id\>)<br> ... |**Cause**: PXF is not running on \<segment-host\>.<br>**Remedy**: Restart PXF on \<segment-host\>. |
| *ERROR*: failed to acquire resources on one or more segments<br>*DETAIL*: could not connect to server: Connection refused<br>&nbsp;&nbsp;&nbsp;&nbsp;Is the server running on host "\<segment-host\>" and accepting<br>&nbsp;&nbsp;&nbsp;&nbsp;TCP/IP connections on port 40000?(seg\<N\> \<segment-host\>:40000) | **Cause**: The Greenplum Database segment host \<segment-host\> is down. |
## <a id="pxf-logging"></a>PXF Logging
Enabling more verbose logging may aid PXF troubleshooting efforts. PXF provides two categories of message logging: service-level and client-level.
### <a id="pxfsvclogmsg"></a>Service-Level Logging
PXF utilizes `log4j` for service-level logging. PXF-service-related log messages are captured in a log file specified by PXF's `log4j` properties file, `$GPHOME/pxf/conf/pxf-log4j.properties`. The default PXF logging configuration will write `INFO` and more severe level logs to `$GPHOME/pxf/logs/pxf-service.log`. You can configure the logging level and log file location.
PXF provides more detailed logging when the `DEBUG` level is enabled. To configure PXF `DEBUG` logging, uncomment the following line in `pxf-log4j.properties`:
``` shell
#log4j.logger.org.apache.hawq.pxf=DEBUG
```
Copy the `pxf-log4j.properties` file to each segment host and restart the PXF service on *each* Greenplum Database segment host. For example:
``` shell
gpadmin@gpmaster$ gpscp -v -f seghostfile $GPHOME/pxf/conf/pxf-log4j.properties =:/usr/local/greenplum-db/pxf/conf/pxf-log4j.properties
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"
```
With `DEBUG` level logging now enabled, perform your PXF operations; for example, create and query an external table. (Make note of the time; this will direct you to the relevant log messages in `$GPHOME/pxf/logs/pxf-service.log`.)
``` shell
$ date
Wed Oct 4 09:30:06 MDT 2017
$ psql -d <dbname>
```
``` sql
dbname=> CREATE EXTERNAL TABLE hdfstest(id int, newid int)
LOCATION ('pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E',');
dbname=> SELECT * FROM hdfstest;
<select output>
```
Examine/collect the log messages from `pxf-service.log`.
**Note**: `DEBUG` logging is quite verbose and has a performance impact. Remember to turn off PXF service `DEBUG` logging after you have collected the desired information.
### <a id="pxfdblogmsg"></a>Client-Level Logging
Database-level client logging may provide insight into internal PXF service operations.
Enable Greenplum Database and PXF debug message logging during operations on PXF external tables by setting the `client_min_messages` server configuration parameter to `DEBUG2` in your `psql` session.
``` shell
$ psql -d <dbname>
```
``` sql
dbname=# SET client_min_messages=DEBUG2;
dbname=# SELECT * FROM hdfstest;
...
DEBUG2: churl http header: cell #19: X-GP-URL-HOST: seghost1 (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2: churl http header: cell #20: X-GP-URL-PORT: 51200 (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2: churl http header: cell #21: X-GP-DATA-DIR: data/dir/hdfsfile (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2: churl http header: cell #22: X-GP-PROFILE: HdfsTextSimple (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
DEBUG2: churl http header: cell #23: X-GP-URI: pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple (seg0 slice1 127.0.0.1:40000 pid=3981)
CONTEXT: External table hdfstest, file pxf://data/dir/hdfsfile?PROFILE=HdfsTextSimple
```
Examine/collect the log messages from `stdout`.
**Note**: `DEBUG2` database session logging has a performance impact. Remember to turn off `DEBUG2` logging after you have collected the desired information.
``` sql
dbname=# SET client_min_messages=NOTICE;
```
---
title: Using PXF to Read and Write External Data
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
The PXF Extension Framework implements a protocol named `pxf` that you can use to create an external table that references data in an external data store. The PXF Extension Framework protocol and Java service are packaged as a Greenplum Database extension.
You must enable the PXF extension in each database in which you plan to use the framework to access external data. You must also explicitly `GRANT` permission on the `pxf` protocol to those users/roles who require access.
After the extension is registered and privileges are assigned, you can use the `CREATE EXTERNAL TABLE` command to create an external table using the `pxf` protocol. PXF provides built-in HDFS and Hive connectors. These connectors define profiles that support different file formats. You specify the profile name in the `CREATE EXTERNAL TABLE` command `LOCATION` URI.
## <a id="enable-pxf-ext"></a>Enabling/Disabling PXF
You must explicitly enable the PXF extension in each database in which you plan to use it.
**Note**: You must have Greenplum Database administrator privileges to create an extension.
### <a id="enable-pxf-steps"></a>Enable Procedure
Perform the following procedure for **_each_** database in which you want to use PXF:
1. Connect to the database as the `gpadmin` user:
``` shell
gpadmin@gpmaster$ psql -d <database-name> -U gpadmin
```
2. Create the PXF extension. You must have Greenplum Database administrator privileges to create an extension. For example:
``` sql
database-name=# CREATE EXTENSION pxf;
```
Creating the `pxf` extension registers the `pxf` protocol and the call handlers required for PXF to access external data.
### <a id="disable-pxf-steps"></a>Disable Procedure
When you no longer want to use PXF on a specific database, you must explicitly disable the PXF extension for that database:
1. Connect to the database as the `gpadmin` user:
``` shell
gpadmin@gpmaster$ psql -d <database-name> -U gpadmin
```
2. Drop the PXF extension:
``` sql
database-name=# DROP EXTENSION pxf;
```
The `DROP` command fails if there are any currently defined external tables using the `pxf` protocol. Add the `CASCADE` option if you choose to forcibly remove these external tables.
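For example, to remove the extension together with any remaining `pxf` external tables in the database:
``` sql
database-name=# DROP EXTENSION pxf CASCADE;
```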
## <a id="access_pxf"></a>Granting Access to PXF
To access external data with PXF, you create an external table with the `CREATE EXTERNAL TABLE` command that specifies the `pxf` protocol. You must specifically grant `SELECT` permission on the `pxf` protocol to all non-`SUPERUSER` Greenplum Database roles that require such access.
To grant a specific role access to the `pxf` protocol, use the `GRANT` command. For example, to grant the role named `bill` read access to data referenced by an external table created with the `pxf` protocol:
``` sql
GRANT SELECT ON PROTOCOL pxf TO bill;
```
**Note**: The `pxf` protocol supports only read access at this time.
## <a id="built-inprofiles"></a> PXF Profiles
PXF is installed with HDFS and Hive connectors that provide a number of built-in profiles. These profiles simplify and unify access to external data sources of varied formats. You provide the profile name when you specify the `pxf` protocol on a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table referencing an external data store.
PXF provides the following built-in profiles:
| Data Source | Data Format | Profile Name(s) | Description |
|-------|---------|------------|----------------|
| HDFS | Text | HdfsTextSimple | Delimited single line records from plain text files on HDFS.|
| HDFS | Text | HdfsTextMulti | Delimited single or multi-line records with quoted linefeeds from plain text files on HDFS. |
| HDFS | Avro | Avro | Avro format binary files (\<filename\>.avro). |
| Hive | TextFile | Hive, HiveText | Data in comma-, tab-, or space-separated value format or JSON notation. |
| Hive | RCFile | Hive, HiveRC | Record columnar data consisting of binary key/value pairs. |
| Hive | SequenceFile | Hive | Data consisting of binary key/value pairs. |
| Hive | ORC | Hive, HiveORC, HiveVectorizedORC | Optimized row columnar data with stripe, footer, and postscript sections. |
| Hive | Parquet | Hive | Compressed columnar data representation. |
A PXF profile definition includes the name of the profile, a description, and the Java classes that implement parsing and reading external data for the profile. Built-in PXF profiles are defined in the `$GPHOME/pxf/conf/pxf-profiles-default.xml` configuration file. The built-in `HdfsTextSimple` profile definition is reproduced below:
``` xml
<profile>
<name>HdfsTextSimple</name>
<description>This profile is suitable for using when reading
delimited single line records from plain text files on HDFS
</description>
<plugins>
<fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor</accessor>
<resolver>org.apache.hawq.pxf.plugins.hdfs.StringPassResolver</resolver>
</plugins>
</profile>
```
**Note**: Profile `plugins` identify the Java classes that PXF uses to parse and access the external data. The typical PXF user need not concern themselves with the profile `plugins`.
## <a id="profile-dependencies"></a>PXF JAR Dependencies
You use the PXF Extension Framework to access data stored on external systems. Depending upon the external data source, this access may require that you install and/or configure additional components or services for the external data source. For example, to use PXF to access a file stored in HDFS, you must install a Hadoop client on each Greenplum Database segment host.
PXF depends on JAR files and other configuration information provided by these additional components. The `$GPHOME/pxf/conf/pxf-private.classpath` and `$GPHOME/pxf/conf/pxf-public.classpath` configuration files identify PXF JAR dependencies. PXF manages the `pxf-private.classpath` file, adding entries as necessary based on options that you provide to the `pxf init` command.
Should you need to add additional JAR dependencies for PXF, you must add them to the `pxf-public.classpath` file on each segment host, and then restart PXF on each host.
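For example, a minimal sketch of registering an additional dependency; the JAR name and directory are hypothetical, and the JAR itself must also exist at that path on every segment host:
``` shell
# Append the (hypothetical) JAR path to pxf-public.classpath on the master
gpadmin@gpmaster$ echo '/home/gpadmin/extra-jars/custom-serde.jar' >> /usr/local/greenplum-db/pxf/conf/pxf-public.classpath
# Copy the updated file to each segment host, then restart PXF on each host
gpadmin@gpmaster$ gpscp -v -f seghostfile /usr/local/greenplum-db/pxf/conf/pxf-public.classpath =:/usr/local/greenplum-db/pxf/conf/pxf-public.classpath
gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf restart"
```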
## <a id="creatinganexternaltable"></a>Creating an External Table using PXF
The syntax for a `CREATE EXTERNAL TABLE` command that specifies the `pxf` protocol is as follows:
``` sql
CREATE [READABLE] EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION('pxf://<path-to-data>?PROFILE[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);
```
**Note**: PXF does not currently support writable external tables.
The `LOCATION` clause in a `CREATE EXTERNAL TABLE` statement specifying the `pxf` protocol is a URI that identifies the path to, or other information describing, the location of the external data. For example, if the external data source is HDFS, the \<path-to-data\> would identify the full file system path to a specific HDFS file. If the external data source is Hive, \<path-to-data\> would identify a schema-qualified Hive table name.
Use the query portion of the URI, introduced by the question mark (?), to identify the PXF profile name.
You will provide profile-specific information using the optional &\<custom-option\>=\<value\> component of the `LOCATION` string and formatting information via the \<formatting-properties\> component of the string. The custom options and formatting properties supported by a specific profile are identified later in usage documentation.
Greenplum Database passes the parameters in the `LOCATION` string as headers to the PXF Java service.
<caption><span class="tablecap">Table 1. Create External Table Parameter Values and Descriptions</span></caption>
<a id="creatinganexternaltable__table_pfy_htz_4p"></a>
| Keyword | Value and Description |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \<path\-to\-data\> | A directory, file name, wildcard pattern, table name, etc. The syntax of \<path-to-data\> is dependent upon the profile currently in use. |
| PROFILE | The profile PXF uses to access the data. PXF supports HDFS and Hive connectors that expose profiles named `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `Hive`, `HiveText`, `HiveRC`, `HiveORC`, and `HiveVectorizedORC`. |
| \<custom-option\>=\<value\> | Additional options and values supported by the profile.  |
| FORMAT \<value\>| PXF profiles support the '`TEXT`', '`CSV`', and '`CUSTOM`' `FORMAT`s. |
| \<formatting-properties\> | Formatting properties supported by the profile; for example, the `formatter` or `delimiter`.   |
**Note:** When you create a PXF external table, you cannot use the `HEADER` option in your `FORMAT` specification.
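For example, a sketch of two external table definitions, one referencing a hypothetical HDFS file and one referencing a hypothetical Hive table; the column lists, paths, and table names are illustrative only:
``` sql
-- HDFS: <path-to-data> is an HDFS file or directory path, read here with the HdfsTextSimple profile
CREATE EXTERNAL TABLE pxf_hdfs_example (recordnum int, createdate date, amount numeric)
  LOCATION ('pxf://data/pxf_examples/example.txt?PROFILE=HdfsTextSimple')
  FORMAT 'TEXT' (delimiter=',');
-- Hive: <path-to-data> is a (schema-qualified) Hive table name, read here with the Hive profile;
-- pxfwritable_import is the custom formatter the Hive profiles typically use
CREATE EXTERNAL TABLE pxf_hive_example (location text, month text, amount numeric)
  LOCATION ('pxf://default.sales?PROFILE=Hive')
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```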