From e8fb4bc3ffa0a83b251c763468fa85e403ebb9a9 Mon Sep 17 00:00:00 2001
From: Lisa Owen
Date: Fri, 19 Oct 2018 15:50:14 -0700
Subject: [PATCH] docs - pxf no longer requires hadoop client install (#6031)

* docs - pxf no longer requires hadoop client install
* incorporate review comments, other simplifications
* unrelated - qualify filter pushdown data types and operators
* copy hadoop config to /etc/*/conf, bundled JARs used
* custom JARs
* misc edits from lav
* do not need to config hadoop if client installed
* remove private classpath from upgrade file list
* hbase prereq to copy pxf-hbase jar to hbase cluster nodes
---
 .../markdown/pxf/about_pxf_dir.html.md.erb    |  20 +++
 .../markdown/pxf/cfginitstart_pxf.html.md.erb |  51 ++----
 .../markdown/pxf/client_instcfg.html.md.erb   | 159 ++++--------------
 gpdb-doc/markdown/pxf/hbase_pxf.html.md.erb   |   6 +-
 .../markdown/pxf/hdfs_read_pxf.html.md.erb    |   2 +-
 .../markdown/pxf/hdfs_write_pxf.html.md.erb   |   2 +-
 gpdb-doc/markdown/pxf/hive_pxf.html.md.erb    |   2 +-
 .../markdown/pxf/install_java.html.md.erb     |  46 +++++
 gpdb-doc/markdown/pxf/instcfg_pxf.html.md.erb |  20 +--
 gpdb-doc/markdown/pxf/intro_pxf.html.md.erb   |   2 +-
 gpdb-doc/markdown/pxf/jdbc_pxf.html.md.erb    |   5 +-
 .../markdown/pxf/overview_pxf.html.md.erb     |   4 +-
 .../markdown/pxf/pxfuserimpers.html.md.erb    |   2 +-
 gpdb-doc/markdown/pxf/upgrade_pxf.html.md.erb |  13 +-
 gpdb-doc/markdown/pxf/using_pxf.html.md.erb   |  10 +-
 15 files changed, 144 insertions(+), 200 deletions(-)
 create mode 100644 gpdb-doc/markdown/pxf/about_pxf_dir.html.md.erb
 create mode 100644 gpdb-doc/markdown/pxf/install_java.html.md.erb

diff --git a/gpdb-doc/markdown/pxf/about_pxf_dir.html.md.erb b/gpdb-doc/markdown/pxf/about_pxf_dir.html.md.erb
new file mode 100644
index 0000000000..b205d7e863
--- /dev/null
+++ b/gpdb-doc/markdown/pxf/about_pxf_dir.html.md.erb
@@ -0,0 +1,20 @@
+---
+title: About the PXF Installation Directories
+---
+
+PXF is installed on your master node when you install Greenplum Database. You install PXF on your Greenplum Database segment hosts when you invoke the `gpseginstall` command.
+
+The following PXF files and directories are installed in your Greenplum Database cluster. These files/directories are relative to the PXF installation directory `$GPHOME/pxf`:
+
+| Directory | Description |
+|-----------|-------------|
+| `apache-tomcat` | The PXF Tomcat directory. |
+| `bin` | The PXF script and executable directory. |
+| `conf` | The PXF configuration directory. This directory contains the `pxf-env.sh`, `pxf-public.classpath`, `pxf-private.classpath`, and `pxf-profiles.xml` configuration files. |
+| `conf-templates` | Configuration templates for PXF. |
+| `lib` | The PXF library directory. |
+| `logs` | The PXF log file directory. Includes `pxf-service.log` and Tomcat-related logs, including `catalina.out`. The log directory and log files are readable only by the `gpadmin` user. |
+| `pxf-service` | After initializing PXF, the PXF service instance directory. |
+| `run` | After starting PXF, the PXF run directory. Includes a PXF Catalina process ID file. |
+| `tomcat-templates` | Tomcat templates for PXF. 
| + diff --git a/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb b/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb index eda98d104a..3d9c72ed03 100644 --- a/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb @@ -1,5 +1,5 @@ --- -title: Configuring, Initializing, and Managing PXF +title: Initializing and Managing PXF --- -The Greenplum Platform Extension Framework (PXF) is composed of a Greenplum Database protocol and a Java service that map an external data source to a table definition. This topic describes how to configure, initialize, and manage PXF. - -## Installing PXF - -PXF is installed on your master node when you install Greenplum Database. You install PXF on your Greenplum Database segment hosts when you invoke the `gpseginstall` command. - -You must explicitly initialize and start PXF before you can use the framework. You must also explicitly enable PXF in each database in which you plan to use it. - -### PXF Install Files/Directories - -The following PXF files and directories are installed in your Greenplum Database cluster. These files/directories are relative to `$GPHOME`: - -| Directory | Description | -|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `pxf` | The PXF installation directory. | -| `pxf/apache-tomcat` | The PXF tomcat directory. | -| `pxf/bin` | The PXF script and executable directory. | -| `pxf/conf` | The PXF configuration directory. This directory contains the `pxf-env.sh`, `pxf-public.classpath`, `pxf-private.classpath` and `pxf-profiles.xml` configuration files. | -| `pxf/conf-templates` | Configuration templates for PXF. | -| `pxf/lib` | The PXF library directory. | -| `pxf/logs`, | The PXF log file directory. Includes `pxf-service.log` and Tomcat-related logs including `catalina.out`. The log directory and log files are readable only by the `gpadmin` user. -| `pxf/pxf-service` | After initializing PXF, the PXF service instance directory. | -| `pxf/run` | After starting PXF, the PXF run directory. Includes a PXF catalina process id file. | -| `pxf/tomcat-templates` | Tomcat templates for PXF. | - +You must initialize and start PXF before you can use the framework. You must also explicitly enable PXF in each database in which you plan to use it. ## Initializing PXF -You must explicitly initialize the PXF service instance. This one-time initialization creates the PXF service web application. It also updates PXF configuration files to include information specific to your Hadoop cluster configuration. +You must explicitly initialize the PXF service instance. This one-time initialization creates the PXF service web application. It also updates PXF configuration files to include information specific to your configuration and the connectors that you will use. ### Prerequisites -Before initializing PXF in your Greenplum Database cluster, ensure that you have: +Before initializing PXF in your Greenplum Database cluster: -- Installed and configured the required Hadoop clients on each Greenplum Database segment host. Refer to [Installing and Configuring Hadoop Clients for PXF](client_instcfg.html) for instructions. -- Granted read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. 
If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user. +- *If you plan to use the PXF Hadoop connectors*, configure them as described in [Configuring the PXF Hadoop Connectors ](client_instcfg.html). ### Procedure @@ -69,7 +44,7 @@ Perform the following procedure to initialize PXF on each segment host in your G gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh ``` -2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include: +2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. For example, a file named `seghostfile` may include: ``` pre seghost1 @@ -77,7 +52,7 @@ Perform the following procedure to initialize PXF on each segment host in your G seghost3 ``` -3. If not already present, install the `unzip` package on each Greenplum Database segment host: +3. If not already present, install the `unzip` package on each Greenplum Database segment host. You must have operating system super user privileges to install packages: ``` shell gpadmin@gpmaster$ gpssh -e -v -f seghostfile "sudo yum -y install unzip" @@ -89,7 +64,7 @@ Perform the following procedure to initialize PXF on each segment host in your G gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf init" ``` - The `init` command creates and initializes the PXF web application. It also updates the `pxf-private.classpath` file to include entries for your Hadoop distribution JAR files. + The `init` command creates and initializes the PXF web application. It also creates the `pxf-private.classpath` file, which specifies the required PXF JAR files. ## Starting PXF @@ -105,10 +80,10 @@ Perform the following procedure to start PXF on each segment host in your Greenp gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh ``` -3. Run the `pxf start` command to start PXF on each segment host. For example, if `seghostfile` contains a list, one-host-per-line, of the segment hosts in your Greenplum Database cluster: +3. Run the `pxf start` command to start PXF on each segment host. For example: ```shell - $ gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf start" + gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf start" ``` ## Stopping PXF @@ -124,15 +99,15 @@ Perform the following procedure to stop PXF on each segment host in your Greenpl gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh ``` -3. Run the `pxf stop` command to stop PXF on each segment host. For example, if `seghostfile` contains a list, one-host-per-line, of the segment hosts in your Greenplum Database cluster: +3. Run the `pxf stop` command to stop PXF on each segment host. For example: ```shell - $ gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf stop" + gpadmin@gpmaster$ gpssh -e -v -f seghostfile "/usr/local/greenplum-db/pxf/bin/pxf stop" ``` ## PXF Service Management The `pxf` command supports `init`, `start`, `stop`, `restart`, and `status` operations. These operations run locally. That is, if you want to start or stop the PXF agent on a specific segment host, you can log in to the host and run the command. 
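For example, a minimal check of the PXF service state on a single host (assuming a segment host named `seghost1`):

``` shell
gpadmin@seghost1$ /usr/local/greenplum-db/pxf/bin/pxf status
```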
If you want to start or stop the PXF agent on multiple segment hosts, use the `gpssh` utility as shown above, or individually log in to each segment host and run the command.

-**Note**: If you update your Hadoop or Hive configuration while the PXF service is running, you must copy any updated configuration files to each Greenplum Database segment host and restart PXF on each host.
+**Note**: If you have configured PXF Hadoop connectors and you update your Hadoop (or Hive or HBase) configuration while the PXF service is running, you must copy any updated configuration files to each Greenplum Database segment host and restart PXF on each host.
diff --git a/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb b/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb
index 092ee39d99..9b71d65258 100644
--- a/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb
+++ b/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb
@@ -1,117 +1,47 @@
 ---
-title: Installing and Configuring Hadoop Clients for PXF
+title: Configuring PXF Hadoop Connectors
 ---
 
-You use PXF connectors to access external data sources. PXF requires that you install a Hadoop client on each Greenplum Database segment host. Hive and HBase client installation is required only if you plan to access those external data stores.
+PXF is compatible with Cloudera, Hortonworks Data Platform, and generic Apache Hadoop distributions. This topic describes how to configure the PXF Hadoop, Hive, and HBase connectors.
 
-Compatible Hadoop, Hive, and HBase clients for PXF include Cloudera, Hortonworks Data Platform, and generic Apache distributions.
-
-This topic describes how to install and configure Hadoop, Hive, and HBase client RPM distributions for PXF. When you install these clients via RPMs, PXF auto-detects your Hadoop distribution and optional Hive and HBase installations and sets certain configuration and class paths accordingly.
-
-If your Hadoop, Hive, and HBase installation is a custom or tarball distribution, refer to [Using a Custom Client Installation](#client-install-custom) for instructions.
+*If you do not plan to use the Hadoop-related PXF connectors, or you already have Hadoop client installations on your Greenplum Database segment hosts, you need not perform this procedure.*
 
 ## Prerequisites
 
-Before setting up the Hadoop, Hive, and HBase clients for PXF, ensure that you have:
+PXF bundles all of the JAR files on which it depends, including those for Hadoop services, and loads these JARs at runtime. Configuring the PXF Hadoop connectors involves copying configuration files from your Hadoop cluster to each Greenplum Database segment host. Before you configure the PXF Hadoop, Hive, and HBase connectors, ensure that you have:
 
 - `scp` access to hosts running the HDFS, Hive, and HBase services in your Hadoop cluster.
-- Superuser permissions to add `yum` repository files and install RPM packages on each Greenplum Database segment host.
-- Access to, or superuser permissions to install, Java version 1.7 or 1.8 on each Greenplum Database segment host.
 
-**Note**: If you plan to access JSON format data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution.
+In this procedure, you copy Hadoop configuration files to `/etc/<service>/conf` directories on each Greenplum Database segment host. You must create these directories if they do not exist, and assign read and write access permissions to the `gpadmin` user. You need only create the directories for the Hadoop services that you plan to use. 
For example, to create the directories and assign permission: +``` shell +root@seghost$ mkdir -p /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf +root@seghost$ chown -R gpadmin:gpadmin /etc/hadoop/conf /etc/hive/conf /etc/hbase/conf +``` ## Procedure -Perform the following procedure to install and configure the appropriate clients for PXF on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility where possible to run a command on multiple hosts. - -1. Log in to your Greenplum Database master node and set up the environment: - - ``` shell - $ ssh gpadmin@ - gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh - ``` - -2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include: - - ``` pre - seghost1 - seghost2 - seghost3 - ``` - -3. If not already present, install Java on each Greenplum Database segment host. For example: - - ``` shell - gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install java-1.8.0-openjdk-1.8.0* - ``` - -4. Identify the Java base install directory. Update the `gpadmin` user's `.bash_profile` file on each segment host to include this `$JAVA_HOME` setting if it is not already present. For example: - ``` shell - gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.144-0.b01.el7_4.x86_64/jre' >> /home/gpadmin/.bash_profile" - ``` - -5. Set up a `yum` repository for your desired Hadoop distribution on each segment host. +Perform the following procedure to configure the desired PXF Hadoop-related connectors on each segment host in your Greenplum Database cluster. - 1. Download the `.repo` file for your Hadoop distribution. For example, to download the file for RHEL 7: - For Cloudera distributions: +You will use the `gpssh` and `gpscp` utilities where possible to run a command on multiple hosts. - ``` shell - gpadmin@gpmaster$ wget https://archive.cloudera.com/cdh5/redhat/7/x86_64/cdh/cloudera-cdh5.repo - ``` - - For Hortonworks Data Platform distributions: - - ``` shell - gpadmin@gpmaster$ wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.0/hdp.repo - ``` - - 2. Copy the `.repo` file to each Greenplum Database segment host. For example: - - ``` shell - gpadmin@gpmaster$ gpscp -v -f seghostfile .repo =:/etc/yum.repos.d - ``` - - With the `.repo` file is in place, you can use the `yum` utility to install client RPM packages. - -6. Install the Hadoop client on each Greenplum Database segment host. For example: - - ``` shell - gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install hadoop-client - ``` - -7. If you plan to use the PXF Hive connector to access Hive table data, install the Hive client on each Greenplum Database segment host. For example: - - ``` shell - gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install hive - ``` - -8. If you plan to use the PXF HBase connector to access HBase table data, install the HBase client on each Greenplum Database segment host. For example: +1. Log in to your Greenplum Database master node and set up the environment: ``` shell - gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install hbase + $ ssh gpadmin@ + gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh ``` -You have installed the desired client packages on each segment host in your Greenplum Database cluster. 
Copy relevant HDFS, Hive, and HBase configuration from your Hadoop cluster to each Greenplum Database segment host. You will use the `gpscp` utility to copy files to multiple hosts. +2. PXF requires information from `core-site.xml` and other Hadoop configuration files. Copy relevant configuration from your Hadoop cluster to each Greenplum Database segment host. -1. The Hadoop `core-site.xml` configuration file `fs.defaultFS` property value identifies the HDFS NameNode URI. PXF requires this information to access your Hadoop cluster. A sample `fs.defaultFS` setting follows: - - ``` xml - - fs.defaultFS - hdfs://namenode.domain:8020 - - ``` - - PXF requires information from `core-site.xml` and other Hadoop configuration files. Copy these files from your Hadoop cluster to each Greenplum Database segment host. - - 1. Copy the `core-site.xml`, `hdfs-site.xml`, and `mapred-site.xml` Hadoop configuration files from your Hadoop cluster NameNode host to the current host. For example: + 1. Copy the `core-site.xml`, `hdfs-site.xml`, `mapred-site.xml`, and `yarn-site.xml` Hadoop configuration files from your Hadoop cluster NameNode host to the current host. For example: ``` shell gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/core-site.xml . gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/hdfs-site.xml . gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/mapred-site.xml . + gpadmin@gpmaster$ scp hdfsuser@namenode:/etc/hadoop/conf/yarn-site.xml . ``` 2. Next, copy these Hadoop configuration files to each Greenplum Database segment host. For example: @@ -120,18 +50,10 @@ You have installed the desired client packages on each segment host in your Gree gpadmin@gpmaster$ gpscp -v -f seghostfile core-site.xml =:/etc/hadoop/conf/core-site.xml gpadmin@gpmaster$ gpscp -v -f seghostfile hdfs-site.xml =:/etc/hadoop/conf/hdfs-site.xml gpadmin@gpmaster$ gpscp -v -f seghostfile mapred-site.xml =:/etc/hadoop/conf/mapred-site.xml + gpadmin@gpmaster$ gpscp -v -f seghostfile yarn-site.xml =:/etc/hadoop/conf/yarn-site.xml ``` -2. The Hive `hive-site.xml` configuration file `hive.metastore.uris` property value identifies the Hive Metastore URI. PXF requires this information to access the Hive service. A sample `hive.metastore.uris` setting follows: - - ``` xml - - hive.metastore.uris - thrift://metastorehost.domain:9083 - - ``` - - If you plan to use the PXF Hive connector to access Hive table data, copy Hive configuration to each Greenplum Database segment host. +2. If you plan to use the PXF Hive connector to access Hive table data, similarly copy Hive configuration to each Greenplum Database segment host. 1. Copy the `hive-site.xml` Hive configuration file from one of the hosts on which your Hive service is running to the current host. For example: @@ -143,53 +65,30 @@ You have installed the desired client packages on each segment host in your Gree ``` shell gpadmin@gpmaster$ gpscp -v -f seghostfile hive-site.xml =:/etc/hive/conf/hive-site.xml + ``` -3. The HBase `hbase-site.xml` configuration file `hbase.rootdir` property value identifies the location of the HBase data directory. PXF requires this information to access the HBase service. A sample `hbase.rootdir` setting follows: - - ``` xml - - hbase.rootdir - hdfs://hbasehost.domain:8020/apps/hbase/data - - ``` - - If you plan to use the PXF HBase connector to access HBase table data, copy HBase configuration to each Greenplum Database segment host. +3. 
If you plan to use the PXF HBase connector to access HBase table data, similarly copy HBase configuration to each Greenplum Database segment host. 1. Copy the `hbase-site.xml` HBase configuration file from one of the hosts on which your HBase service is running to the current host. For example: ``` shell - gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hive/conf/hbase-site.xml . + gpadmin@gpmaster$ scp hbaseuser@hbasehost:/etc/hbase/conf/hbase-site.xml . ``` 2. Next, copy the `hbase-site.xml` configuration file to each Greenplum Database segment host. For example: ``` shell - gpadmin@gpmaster$ gpscp -v -f seghostfile hive-site.xml =:/etc/hbase/conf/hbase-site.xml - - -## Updating Hadoop Configuration - -If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated `.xml` file(s) to each Greenplum Database segment host and restart PXF. - - -## Using a Custom Client Installation - -If you can not install your Hadoop, Hive, and HBase clients via RPMs from supported distributions, you have a custom installation. - -Use the `HADOOP_ROOT` and `HADOOP_DISTRO` environment variables to provide additional configuration information to PXF for custom client Hadoop distributions. As specified below, you must set the relevant environment variable on the command line or in the PXF `$GPHOME/pxf/conf/pxf-env.sh` configuration file prior to initializing PXF. - -If you must install your Hadoop and optional Hive and HBase client distributions from a *tarball*: + gpadmin@gpmaster$ gpscp -v -f seghostfile hbase-site.xml =:/etc/hbase/conf/hbase-site.xml + ``` -- You must install the Hadoop and optional Hive and HBase clients in peer directories that are all children of a Hadoop root directory. These client directories must be simply-named as `hadoop`, `hive`, and `hbase`. +4. PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access HDFS, Hive, and HBase using the identity of the Greenplum Database user account that logs into Greenplum Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive and HBase if you intend to use those PXF connectors. Follow procedures in [Configuring User Impersonation and Proxying](pxfuserimpers.html) to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation. -- You must identify the absolute path to the Hadoop root directory in the `HADOOP_ROOT` environment variable setting. +5. Grant read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant this permission to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user. -- The directory `$HADOOP_ROOT/hadoop/share/hadoop/common/lib` must exist. +6. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html). -If the requirements above are not applicable to your Hadoop distribution: -- You must set the `HADOOP_DISTRO` environment variable to the value `CUSTOM`. 
+## Updating Hadoop Configuration -- After you initialize PXF, you must manually edit the `$GPHOME/pxf/conf/pxf-private.classpath` file to identify absolute paths to the Hadoop, Hive, and HBase JAR and configuration files. You must edit this file *before* you start PXF. +If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated `.xml` file(s) to each Greenplum Database segment host and restart PXF. -**Note**: After you install a custom client distribution, you must copy Hadoop, Hive, and HBase configuration as described in the procedure above. diff --git a/gpdb-doc/markdown/pxf/hbase_pxf.html.md.erb b/gpdb-doc/markdown/pxf/hbase_pxf.html.md.erb index b652ac2961..edf72b2bd9 100644 --- a/gpdb-doc/markdown/pxf/hbase_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/hbase_pxf.html.md.erb @@ -12,9 +12,9 @@ This section describes how to use the PXF HBase connector. Before working with HBase table data, ensure that you have: -- Installed and configured an HBase client on each Greenplum Database segment host. Refer to [Installing and Configuring Clients for PXF](client_instcfg.html). +- You have copied `$GPHOME/pxf/lib/pxf-hbase-*.jar` to each node in your HBase cluster, and that the location of this PXF JAR file is in the `$HBASE_CLASSPATH`. This configuration is required for the PXF HBase connector to support filter pushdown. -- Initialized PXF on your Greenplum Database segment hosts, and started PXF on each host. See [Configuring, Initializing, and Starting PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information. +- Initialized PXF on your Greenplum Database segment hosts, and started PXF on each host. See [Initializing and Managing PXF](cfginitstart_pxf.html) for PXF initialization and startup information. ## HBase Primer @@ -185,3 +185,5 @@ CREATE EXTERNAL TABLE (recordkey bytea, ... ) ``` After you have created the external table, you can use the `recordkey` in a `WHERE` clause to filter the HBase table on a range of row key values. + +**Note**: To enable filter pushdown on the `recordkey`, define the field as `text`. diff --git a/gpdb-doc/markdown/pxf/hdfs_read_pxf.html.md.erb b/gpdb-doc/markdown/pxf/hdfs_read_pxf.html.md.erb index 7f5913f205..c4194fb84b 100644 --- a/gpdb-doc/markdown/pxf/hdfs_read_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/hdfs_read_pxf.html.md.erb @@ -30,7 +30,7 @@ This section describes how to use PXF to access HDFS data, including how to crea Before working with HDFS data using PXF, ensure that: - You have installed and configured a Hadoop client on each Greenplum Database segment host. Refer to [Installing and Configuring Hadoop Clients for PXF](client_instcfg.html) for instructions. If you plan to access JSON format data stored in a Cloudera Hadoop cluster, PXF requires a Cloudera version 5.8 or later Hadoop distribution. -- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Configuring, Initializing, and Managing PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information. +- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Initializing and Managing PXF](cfginitstart_pxf.html) for PXF initialization and startup information. 
- If user impersonation is enabled (the default), you have granted read permission to the HDFS files and directories that will be accessed as external tables in Greenplum Database to each Greenplum Database user/role name that will access the HDFS files and directories. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
diff --git a/gpdb-doc/markdown/pxf/hdfs_write_pxf.html.md.erb b/gpdb-doc/markdown/pxf/hdfs_write_pxf.html.md.erb
index 5a28e16f6a..843ca4d26c 100644
--- a/gpdb-doc/markdown/pxf/hdfs_write_pxf.html.md.erb
+++ b/gpdb-doc/markdown/pxf/hdfs_write_pxf.html.md.erb
@@ -32,7 +32,7 @@ This section describes how to use PXF to write HDFS data, including how to creat
 Before writing HDFS data using PXF, ensure that:
 
 - You have installed and configured a Hadoop client on each Greenplum Database segment host. Refer to [Installing and Configuring Hadoop Clients for PXF](client_instcfg.html) for instructions.
-- You have initialized and started PXF on your Greenplum Database segment hosts. See [Configuring, Initializing, and Managing PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information.
+- You have initialized and started PXF on your Greenplum Database segment hosts. See [Initializing and Managing PXF](cfginitstart_pxf.html) for PXF initialization and startup information.
 - You have granted both *read and write* permissions to the HDFS directories that will be accessed as external tables in Greenplum Database. If user impersonation is enabled (the default), you must grant these permissions to each Greenplum Database user/role name that will use external tables that reference the HDFS files. If user impersonation is not enabled, you must grant this permission to the `gpadmin` user.
diff --git a/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb b/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb
index 29d9483367..dcb64d7b82 100644
--- a/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb
+++ b/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb
@@ -30,7 +30,7 @@ The PXF Hive connector reads data stored in a Hive table. This section describes
 Before working with Hive table data using PXF, ensure that:
 
 - You have installed and configured a Hive client on each Greenplum Database segment host. Refer to [Installing and Configuring Hadoop Clients for PXF](client_instcfg.html) for instructions.
-- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Configuring, Initializing, and Managing PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information.
+- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Initializing and Managing PXF](cfginitstart_pxf.html) for PXF initialization and startup information.
 
 ## Hive Data Formats
diff --git a/gpdb-doc/markdown/pxf/install_java.html.md.erb b/gpdb-doc/markdown/pxf/install_java.html.md.erb
new file mode 100644
index 0000000000..89704ebf52
--- /dev/null
+++ b/gpdb-doc/markdown/pxf/install_java.html.md.erb
@@ -0,0 +1,46 @@
+---
+title: Installing Java for PXF
+---
+
+PXF is a Java service. It requires a Java 1.7 or 1.8 installation on each Greenplum Database segment host.
+
+*If an appropriate version of Java is already installed on each Greenplum Database segment host, you need not perform the procedure in this topic.*
+
+
+## Prerequisites
+
+Ensure that you have access to, or superuser permissions to install, Java version 1.7 or 1.8 on each Greenplum Database segment host. 
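+
+For example, to check whether a suitable Java version is already present on your segment hosts (assuming a `seghostfile` that lists your segment hosts, one per line, as created in the procedure below):
+
+``` shell
+gpadmin@gpmaster$ gpssh -e -v -f seghostfile "java -version"
+```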
+ +## Procedure + +Perform the following procedure to install Java on each segment host in your Greenplum Database cluster. You will use the `gpssh` utility where possible to run a command on multiple hosts. + +1. Log in to your Greenplum Database master node and set up the environment: + + ``` shell + $ ssh gpadmin@ + gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh + ``` + +2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. For example, a file named `seghostfile` may include: + + ``` pre + seghost1 + seghost2 + seghost3 + ``` + +3. Install Java on each Greenplum Database segment host and set up the Java environment on each host. + + 1. Install the Java package. For example, to install Java version 1.8: + + ``` shell + gpadmin@gpmaster$ gpssh -e -v -f seghostfile sudo yum -y install java-1.8.0-openjdk-1.8.0* + ``` + + 2. Identify the Java base install directory. Update the `gpadmin` user's `.bash_profile` file on each segment host to include this `$JAVA_HOME` setting if it is not already present. For example: + + ``` shell + gpadmin@gpmaster$ gpssh -e -v -f seghostfile "echo 'export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.x86_64/jre' >> /home/gpadmin/.bash_profile" + ``` + diff --git a/gpdb-doc/markdown/pxf/instcfg_pxf.html.md.erb b/gpdb-doc/markdown/pxf/instcfg_pxf.html.md.erb index c58debcda0..39349fcda5 100644 --- a/gpdb-doc/markdown/pxf/instcfg_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/instcfg_pxf.html.md.erb @@ -1,22 +1,16 @@ --- -title: Installing and Configuring PXF +title: Configuring PXF --- -The Greenplum Platform Extension Framework (PXF) provides connectors to Hadoop, Hive, HBase and external SQL data stores. To use these PXF connectors, you must install Hadoop, Hive, and HBase clients on each Greenplum Database segment host as described in this one-time installation and configuration procedure: +The Greenplum Platform Extension Framework (PXF) provides connectors to Hadoop, Hive, HBase and external SQL data stores. You must configure PXF to support the connectors that you plan to use. -- **[Installing and Configuring Hadoop Clients for PXF](client_instcfg.html)** +To configure PXF, you must: -PXF accesses Hadoop services on behalf of Greenplum Database end users. By default, PXF tries to access data source services (HDFS, Hive, HBase) using the identity of the Greenplum Database user account that logs into Greenplum Database. In order to support this functionality, you must configure proxy settings for Hadoop, as well as for Hive and HBase if you intend to use those PXF connectors. Follow procedures in: +1. Install Java packages on each Greenplum Database segment host as described in **[Installing Java for PXF](install_java.html)**. -- **[Configuring User Impersonation and Proxying](pxfuserimpers.html)** +2. If you plan to use the Hadoop, Hive, or HBase PXF connectors, you must perform the configuration procedure described in **[Configuring PXF Hadoop Connectors](client_instcfg.html)**. -to configure user impersonation and proxying for Hadoop services, or to turn off PXF user impersonation. +3. **[Initialize PXF](cfginitstart_pxf.html#init_pxf)**. 
-You must also configure and initialize PXF itself, and start the PXF service on each segment host: - -- **[Configuring, Initializing, and Managing PXF](cfginitstart_pxf.html)** - -If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host: - -- **[Configuring PXF for Secure HDFS](pxf_kerbhdfs.html)** +4. **[Start PXF](cfginitstart_pxf.html#start_pxf)**. diff --git a/gpdb-doc/markdown/pxf/intro_pxf.html.md.erb b/gpdb-doc/markdown/pxf/intro_pxf.html.md.erb index a423e1ce51..a55c305bed 100644 --- a/gpdb-doc/markdown/pxf/intro_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/intro_pxf.html.md.erb @@ -6,7 +6,7 @@ The Greenplum Platform Extension Framework (PXF) is composed of a Greenplum Data Your Greenplum Database deployment consists of a master node and multiple segment hosts. After you configure and initialize PXF, you start a single PXF JVM process on each Greenplum Database segment host. This PXF process (referred to as the PXF agent) spawns a thread for each segment instance on a segment host that participates in a query against a PXF external table. Multiple segment instances on each segment host communicate via a REST API with PXF in parallel, and the PXF agents on multiple hosts communicate with HDFS in parallel. -Figure: PXF Architecture +Figure: PXF-to-Hadoop Architecture diff --git a/gpdb-doc/markdown/pxf/jdbc_pxf.html.md.erb b/gpdb-doc/markdown/pxf/jdbc_pxf.html.md.erb index 59c003b4bd..1d90b65e91 100644 --- a/gpdb-doc/markdown/pxf/jdbc_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/jdbc_pxf.html.md.erb @@ -31,9 +31,12 @@ This section describes how to use the PXF JDBC connector to access data in an ex Before you access an external SQL database using the PXF JDBC connector, ensure that: -- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Configuring, Initializing, and Managing PXF](instcfg_pxf.html) for PXF initialization, configuration, and startup information. +- You have initialized PXF on your Greenplum Database segment hosts, and PXF is running on each host. See [Initializing and Managing PXF](cfginitstart_pxf.html) for PXF initialization and startup information. - Connectivity exists between all Greenplum Database segment hosts and the external SQL database. - You have configured your external SQL database for user access from all Greenplum Database segment hosts. + +The PXF JDBC connector is installed with the `postgresql-8.4-702.jdbc4.jar` JAR file. If you require a different JDBC JAR file(s), ensure that: + - You have installed the JDBC driver JAR files for the external SQL database in the same location on each segment host. Be sure to install JDBC driver JAR files that are compatible with your JRE version. - You have added the file system locations of the external SQL database JDBC JAR files to `$GPHOME/pxf/conf/pxf-public.classpath` and restarted PXF on each segment host. See [PXF JAR Dependencies](using_pxf.html#profile-dependencies) for additional information. diff --git a/gpdb-doc/markdown/pxf/overview_pxf.html.md.erb b/gpdb-doc/markdown/pxf/overview_pxf.html.md.erb index 87e046a3f6..36bc0adac2 100644 --- a/gpdb-doc/markdown/pxf/overview_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/overview_pxf.html.md.erb @@ -27,9 +27,9 @@ The Greenplum Platform Extension Framework (PXF) provides parallel, high through This topic describes the architecture of PXF and its integration with Greenplum Database. 
-- **[Installing and Configuring PXF](instcfg_pxf.html)**
+- **[Configuring PXF](instcfg_pxf.html)**
 
-  This topic details the installation, configuration, and startup procedures for PXF and supporting clients.
+  This topic details the PXF configuration, initialization, and startup procedures.
 
 - **[Upgrading PXF](upgrade_pxf.html)**
 
diff --git a/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb b/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
index 175ece2fb4..b42e75e832 100644
--- a/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
+++ b/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
@@ -10,7 +10,7 @@ As an alternative, you can disable PXF user impersonation. With user impersonati
 
 ## Configure PXF User Impersonation
 
-Perform the following procedure to turn PXF user impersonation on or off in your Greenplum Database cluster. User impersonation is enabled by default.
+Perform the following procedure to turn PXF user impersonation on or off in your Greenplum Database cluster. User impersonation is enabled by default, so if you are configuring PXF for the first time and want to keep the default setting, you need not perform this procedure.
 
 1. Log in to your Greenplum Database master node as the administrative user and set up the environment:
diff --git a/gpdb-doc/markdown/pxf/upgrade_pxf.html.md.erb b/gpdb-doc/markdown/pxf/upgrade_pxf.html.md.erb
index 84f40f0619..75cf8c11ff 100644
--- a/gpdb-doc/markdown/pxf/upgrade_pxf.html.md.erb
+++ b/gpdb-doc/markdown/pxf/upgrade_pxf.html.md.erb
@@ -6,12 +6,17 @@ If you are using PXF in your current Greenplum Database installation, you must u
 
 The PXF upgrade procedure describes how to upgrade PXF in your Greenplum Database installation. This procedure uses *PXF.from* to refer to your currently-installed PXF version and *PXF.to* to refer to the PXF version installed when you upgrade to the new version of Greenplum Database.
 
+Most PXF installations do not require modifications to PXF configuration files and should experience a seamless upgrade.
+
+**Note**: Starting in Greenplum Database version 5.12.0, PXF no longer requires a Hadoop client installation. PXF now bundles all of the JAR files on which it depends, and loads these JARs at runtime. PXF still requires that the Hadoop configuration files reside in `/etc/<service>/conf`.
+
 The PXF upgrade procedure has two parts. You perform one procedure before, and one procedure after, you upgrade to a new version of Greenplum Database:
 
 - [Step 1: PXF Pre-Upgrade Actions](#pxfpre)
 - Upgrade to a new Greenplum Database version
 - [Step 2: Upgrading PXF](#pxfup)
 
+
 ## Step 1: PXF Pre-Upgrade Actions
 
 Perform this procedure before you upgrade to a new version of Greenplum Database:
@@ -23,7 +28,7 @@ Perform this procedure before you upgrade to a new version of Greenplum Database
     gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
     ```
 
-2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. Ensure that there are no blank lines or extra spaces in the file. For example, a file named `seghostfile` may include:
+2. Create a text file that lists your Greenplum Database segment hosts, one host name per line. For example, a file named `seghostfile` may include:
 
     ``` pre
     seghost1
    seghost2
    seghost3
     ```
@@ -44,7 +49,7 @@ Perform this procedure before you upgrade to a new version of Greenplum Database
     gpadmin@gpmaster$ scp gpadmin@seghost1:/usr/local/greenplum-db/pxf/conf/* /save/pxf-from-conf/
     ```
 
-5. Note the locations of any JAR files you may have added to your *PXF.from* installation. 
Save a copy of these JAR files in case they are removed or altered when you upgrade to the new version of Greenplum Database. +5. Note the locations of any custom JAR files that you may have added to your *PXF.from* installation. Save a copy of these JAR files in case they are removed or altered when you upgrade to the new version of Greenplum Database. 6. Upgrade to the new version of Greenplum Database and then continue your PXF upgrade with [Step 2: Upgrading PXF](#pxfup). @@ -71,9 +76,9 @@ After you upgrade to the new version of Greenplum Database, perform the followin 3. Initialize PXF on each segment host as described in [Initializing PXF](cfginitstart_pxf.html#init_pxf). -4. If you updated any of the `pxf-profiles.xml`, `pxf-log4j.properties`, `pxf-private.classpath`, or `pxf-public.classpath` configuration files in your *PXF.from* installation, re-apply those changes and copy the updated file(s) to all segment hosts. Refer to Step 3 above for a similar `gpscp` command. +4. If you updated any of the `pxf-profiles.xml`, `pxf-log4j.properties`, or `pxf-public.classpath` configuration files in your *PXF.from* installation, re-apply those changes and copy the updated file(s) to all segment hosts. Refer to Step 3 above for a similar `gpscp` command. - **Note:** Starting in Greenplum Database version 5.12, the package name for PXF classes was changed to use the prefix `org.greenplum.*`. If you are upgrading from an older *PXF.from* version and you customized the `pxf-profiles.xml` file, you must change any `org.apache.hawq.*` references to `org.greenplum.*` when you re-apply your changes. + **Note:** Starting in Greenplum Database version 5.12.0, the package name for PXF classes was changed to use the prefix `org.greenplum.*`. If you are upgrading from an older *PXF.from* version and you customized the `pxf-profiles.xml` file, you must change any `org.apache.hawq.pxf.*` references to `org.greenplum.pxf.*` when you re-apply your changes. 5. If you added additional JAR files to your *PXF.from* installation, copy them to the corresponding directory in your *PXF.to* installation **on each segment host**. diff --git a/gpdb-doc/markdown/pxf/using_pxf.html.md.erb b/gpdb-doc/markdown/pxf/using_pxf.html.md.erb index fe533240e2..82d9785e2e 100644 --- a/gpdb-doc/markdown/pxf/using_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/using_pxf.html.md.erb @@ -109,19 +109,19 @@ PXF accesses data sources using different connectors, and filter pushdown suppor PXF filter pushdown can be used with these data types: -- `INT`, array of `INT` +- `INT` - `FLOAT` - `NUMERIC` - `BOOL` -- `CHAR`, `TEXT`, array of `TEXT` -- `DATE`, `TIMESTAMP`. +- `CHAR`, `TEXT` +- `DATE`, `TIMESTAMP` (JDBC connector only) PXF filter pushdown can be used with these operators: - `<`, `<=`, `>=`, `>` - `<>`, `=` -- `IN` -- `LIKE` (only for `TEXT` fields). +- `IN` operator on arrays of `INT` and `TEXT` +- `LIKE` (only for `TEXT` fields) (JDBC connector only) To summarize, all of the following criteria must be met for filter pushdown to occur: -- GitLab