From 0f0314abf478066d1df288afead47b65891db6f6 Mon Sep 17 00:00:00 2001 From: Lisa Owen Date: Thu, 12 Dec 2019 12:13:13 -0800 Subject: [PATCH] docs - pxf supports simultaneous access to multiple kerberized hadoop clusters (#9131) * WIP docs - pxf supports multiple kerberized hadoop clusters * some of the edits requested by david * change title again * address some comments from alex and francisco * impersonation default is on and other misc edits * remove the optional upgrade steps * fix formatting * when hive w/kerb, initially comment out user impersonation property * misc rewording on upgrade page * misc edits from my final review --- .../source/subnavs/pxf-subnav.erb | 6 +- gpdb-doc/markdown/pxf/cfg_server.html.md.erb | 36 ++- .../markdown/pxf/cfginitstart_pxf.html.md.erb | 9 +- .../markdown/pxf/client_instcfg.html.md.erb | 6 +- .../markdown/pxf/hive_jdbc_cfg.html.md.erb | 191 ++++++++++++++ gpdb-doc/markdown/pxf/hive_pxf.html.md.erb | 2 +- gpdb-doc/markdown/pxf/jdbc_cfg.html.md.erb | 243 ++++-------------- .../markdown/pxf/pxf_kerbhdfs.html.md.erb | 110 ++++++-- .../markdown/pxf/pxfuserimpers.html.md.erb | 115 +++++++-- .../pxf/troubleshooting_pxf.html.md.erb | 2 +- .../markdown/pxf/upgrade_pxf_6x.html.md.erb | 12 + 11 files changed, 482 insertions(+), 250 deletions(-) create mode 100644 gpdb-doc/markdown/pxf/hive_jdbc_cfg.html.md.erb diff --git a/gpdb-doc/book/master_middleman/source/subnavs/pxf-subnav.erb b/gpdb-doc/book/master_middleman/source/subnavs/pxf-subnav.erb index 1f5baf1504..f5ac7e9110 100644 --- a/gpdb-doc/book/master_middleman/source/subnavs/pxf-subnav.erb +++ b/gpdb-doc/book/master_middleman/source/subnavs/pxf-subnav.erb @@ -36,7 +36,11 @@
  • Configuring Connectors to Minio and S3 Object Stores (Optional)
  • Configuring Connectors to Azure and Google Cloud Storage Object Stores (Optional)
- • Configuring the JDBC Connector (Optional)
+ • Configuring the JDBC Connector (Optional)
+   • Configuring the JDBC Connector for Hive Access (Optional)
  • Configuring the PXF Agent Host and Port (Optional)
  • diff --git a/gpdb-doc/markdown/pxf/cfg_server.html.md.erb b/gpdb-doc/markdown/pxf/cfg_server.html.md.erb index bf3d0cdd83..57fdd31d07 100644 --- a/gpdb-doc/markdown/pxf/cfg_server.html.md.erb +++ b/gpdb-doc/markdown/pxf/cfg_server.html.md.erb @@ -26,9 +26,9 @@ PXF provides a template configuration file for each connector. These server tem ``` gpadmin@gpmaster$ ls $PXF_CONF/templates -adl-site.xml hbase-site.xml jdbc-site.xml s3-site.xml -core-site.xml hdfs-site.xml mapred-site.xml wasbs-site.xml -gs-site.xml hive-site.xml minio-site.xml yarn-site.xml +adl-site.xml hbase-site.xml jdbc-site.xml pxf-site.xml yarn-site.xml +core-site.xml hdfs-site.xml mapred-site.xml s3-site.xml +gs-site.xml hive-site.xml minio-site.xml wasbs-site.xml ``` For example, the contents of the `s3-site.xml` template file follow: @@ -62,8 +62,6 @@ PXF defines a special server named `default`. When you initialize PXF, it automa PXF automatically uses the `default` server configuration if you omit the `SERVER=` setting in the `CREATE EXTERNAL TABLE` command `LOCATION` clause. -**Note**: You *must* configure a Hadoop server as the PXF `default` server when your Hadoop cluster utilizes Kerberos authentication. - ## Configuring a Server @@ -84,15 +82,41 @@ After you configure a PXF server, you publish the server name to Greenplum Datab To configure a PXF server, refer to the connector configuration topic: - To configure a PXF server for Hadoop, refer to [Configuring PXF Hadoop Connectors ](client_instcfg.html). -- To configure a PXF server for an object store, refer to [Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores](objstore_cfg.html). +- To configure a PXF server for an object store, refer to [Configuring Connectors to Minio and S3 Object Stores](s3_objstore_cfg.html) and [Configuring Connectors to Azure and Google Cloud Storage Object Stores](objstore_cfg.html). - To configure a PXF JDBC server, refer to [Configuring the JDBC Connector ](jdbc_cfg.html). +## About Kerberos and User Impersonation Configuration (pxf-site.xml) + +PXF includes a template file named `pxf-site.xml`. You use the `pxf-site.xml` template file to specify Kerberos and/or user impersonation settings for a server configuration. + +
    The settings in this file apply only to Hadoop and JDBC server configurations; they do not apply to object store server configurations.
+
You configure properties in the `pxf-site.xml` file for a PXF server when one or more of the following conditions hold:

- The remote Hadoop system utilizes Kerberos authentication.
- You want to enable/disable user impersonation on the remote Hadoop or external database system.

`pxf-site.xml` includes the following properties:

| Property | Description | Default Value |
|----------------|--------------------------------------------|---------------|
| pxf.service.kerberos.principal | The Kerberos principal name. | gpadmin/\_HOST@EXAMPLE.COM |
| pxf.service.kerberos.keytab | The file system path to the Kerberos keytab file. | $PXF_CONF/keytabs/pxf.service.keytab |
| pxf.service.user.name | The login user for the remote system. | The operating system user that starts the pxf process, typically `gpadmin`. |
| pxf.service.user.impersonation | Enables/disables user impersonation on the remote system. | The value of the (deprecated) `PXF_USER_IMPERSONATION` property when that property is set. If the `PXF_USER_IMPERSONATION` property does not exist and the `pxf.service.user.impersonation` property is missing from `pxf-site.xml`, the default is `false`; user impersonation is disabled on the remote system. |

Refer to [Configuring PXF Hadoop Connectors ](client_instcfg.html) and [Configuring the JDBC Connector ](jdbc_cfg.html) for information about relevant `pxf-site.xml` property settings for Hadoop and JDBC server configurations, respectively.


## Configuring a PXF User

You can configure access to an external data store on a per-server, per-Greenplum-user basis.
    PXF per-server, per-user configuration provides the most benefit for JDBC servers.
    + You configure external data store user access credentials and properties for a specific Greenplum Database user by providing a `-user.xml` user configuration file in the PXF server configuration directory, `$PXF_CONF/servers//`. For example, you specify the properties for the Greenplum Database user named `bill` in the file `$PXF_CONF/servers//bill-user.xml`. You can configure zero, one, or more users in a PXF server configuration. + The properties that you specify in a user configuration file are connector-specific. You can specify any configuration property supported by the PXF connector server in a `-user.xml` configuration file. For example, suppose you have configured access to a PostgreSQL database in the PXF JDBC server configuration named `pgsrv1`. To allow the Greenplum Database user named `bill` to access this database as the PostgreSQL user named `pguser1`, password `changeme`, you create the user configuration file `$PXF_CONF/servers/pgsrv1/bill-user.xml` with the following properties: diff --git a/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb b/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb index 3a0c93fced..c3dddfbb42 100644 --- a/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/cfginitstart_pxf.html.md.erb @@ -51,8 +51,13 @@ The `pxf-env.sh` file exposes the following PXF runtime configuration parameters | JAVA_HOME | The Java JRE home directory. | /usr/java/default | | PXF_LOGDIR | The PXF log directory. | $PXF_CONF/logs | | PXF_JVM_OPTS | Default options for the PXF Java virtual machine. | -Xmx2g -Xms1g | -| PXF_KEYTAB | The absolute path to the PXF service Kerberos principal keytab file. | $PXF_CONF/keytabs/pxf.service.keytab | -| PXF_PRINCIPAL | The PXF service Kerberos principal. | gpadmin/\_HOST@EXAMPLE.COM | +| PXF_MAX_THREADS | Default for the maximum number of PXF threads. | 200 | +| PXF_FRAGMENTER_CACHE | Enable/disable fragment caching. | Enabled | +| PXF_OOM_KILL | Enable/disable PXF auto-kill on OutOfMemoryError. | Enabled | +| PXF_OOM_DUMP_PATH | Absolute pathname to dump file generated on OOM. | No dump file | +| PXF_KEYTAB | The absolute path to the PXF service Kerberos principal keytab file. Deprecated; specify the keytab in a server-specific `pxf-site.xml` file. | $PXF_CONF/keytabs/pxf.service.keytab | +| PXF_PRINCIPAL | The PXF service Kerberos principal. Deprecated; specify the principal in a server-specific `pxf-site.xml` file. | gpadmin/\_HOST@EXAMPLE.COM | +| PXF_USER_IMPERSONATION | Enable/disable end user identity impersonation. Deprecated; enable/disable impersonation in a server-specific `pxf-site.xml` file. | true | You must synchronize any changes that you make to `pxf-env.sh`, `pxf-log4j.properties`, or `pxf-profiles.xml` to the Greenplum Database cluster, and (re)start PXF on each segment host. diff --git a/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb b/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb index 1e6c8d7bab..cd1a1096ff 100644 --- a/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb +++ b/gpdb-doc/markdown/pxf/client_instcfg.html.md.erb @@ -15,7 +15,7 @@ Configuring PXF Hadoop connectors involves copying configuration files from your Perform the following procedure to configure the desired PXF Hadoop-related connectors on the Greenplum Database master host. After you configure the connectors, you will use the `pxf cluster sync` command to copy the PXF configuration to the Greenplum Database cluster. -In this procedure, you use the `default`, or create a new, PXF server configuration. 
You copy Hadoop configuration files to the server configuration directory on the Greenplum Database master host. You may also copy libraries to `$PXF_CONF/lib` for MapR support. You then synchronize the PXF configuration on the master host to the standby master and segment hosts. (PXF creates the`$PXF_CONF/*` directories when you run `pxf cluster init`.)
+In this procedure, you use the `default` PXF server configuration or create a new one. You copy Hadoop configuration files to the server configuration directory on the Greenplum Database master host. You identify Kerberos and user impersonation settings required for access, if applicable. You may also copy libraries to `$PXF_CONF/lib` for MapR support. You then synchronize the PXF configuration on the master host to the standby master and segment hosts. (PXF creates the `$PXF_CONF/*` directories when you run `pxf cluster init`.)

1. Log in to your Greenplum Database master node:

    ``` shell
    $ ssh gpadmin@<gpmaster>
    ```

-2. Identify the name of your PXF Hadoop server configuration. If your Hadoop cluster is Kerberized, you must use the `default` PXF server.
+2. Identify the name of your PXF Hadoop server configuration.

3. If you are not using the `default` PXF server, create the `$PXF_CONF/servers/<server_name>` directory. For example, use the following command to create a Hadoop server configuration named `hdp3`:

@@ -86,7 +86,7 @@ In this procedure, you use the `default`, or create a new, PXF server configurat

6. If your Hadoop cluster is secured with Kerberos, you must configure PXF and generate Kerberos principals and keytabs for each segment host as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html).

-## Updating Hadoop Configuration
+## About Updating the Hadoop Configuration

If you update your Hadoop, Hive, or HBase configuration while the PXF service is running, you must copy the updated configuration to the `$PXF_CONF/servers/<server_name>` directory and re-sync the PXF configuration to your Greenplum Database cluster. For example:

diff --git a/gpdb-doc/markdown/pxf/hive_jdbc_cfg.html.md.erb b/gpdb-doc/markdown/pxf/hive_jdbc_cfg.html.md.erb
new file mode 100644
index 0000000000..8a652e881c
--- /dev/null
+++ b/gpdb-doc/markdown/pxf/hive_jdbc_cfg.html.md.erb
@@ -0,0 +1,191 @@
+---
+title: Configuring the JDBC Connector for Hive Access (Optional)
+---
+
You can use the PXF JDBC Connector to retrieve data from Hive. You can also use a JDBC named query to submit a custom SQL query to Hive and retrieve the results using the JDBC Connector.

This topic describes how to configure the PXF JDBC Connector to access Hive. When you configure Hive access with JDBC, you must take into account the Hive user impersonation setting, as well as whether or not the Hadoop cluster is secured with Kerberos.

*If you do not plan to use the PXF JDBC Connector to access Hive, then you do not need to perform this procedure.*


## JDBC Server Configuration

The PXF JDBC Connector is installed with the JAR files required to access Hive via JDBC, `hive-jdbc-<version>.jar` and `hive-service-<version>.jar`, and automatically registers these JARs.

When you configure a PXF JDBC server for Hive access, you must specify the JDBC driver class name, database URL, and client credentials just as you would when configuring a client connection to an SQL database.
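As an illustration only (this sketch is not part of the change), a minimal Hive `jdbc-site.xml` might carry just these two properties. The host name `hiveserver`, the port `10000`, and the database `default` are hypothetical placeholder values:

``` xml
<!-- Sketch of a minimal Hive jdbc-site.xml; host/port/database are placeholders -->
<property>
    <name>jdbc.driver</name>
    <value>org.apache.hive.jdbc.HiveDriver</value>
</property>
<property>
    <name>jdbc.url</name>
    <value>jdbc:hive2://hiveserver:10000/default</value>
</property>
```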
+
To access Hive via JDBC, you must specify the following properties and values in the `jdbc-site.xml` server configuration file:

| Property | Value |
|----------------|--------|
| jdbc.driver | org.apache.hive.jdbc.HiveDriver |
| jdbc.url | jdbc:hive2://\<hiveserver2\_host\>:\<hiveserver2\_port\>/\<database\> |

The values of the HiveServer2 authentication (`hive.server2.authentication`) and impersonation (`hive.server2.enable.doAs`) properties, and whether or not the Hive service utilizes Kerberos authentication, inform the settings of other JDBC server configuration properties. These properties are defined in the `hive-site.xml` configuration file in the Hadoop cluster. You will need to obtain the values of these properties.

The following table enumerates the Hive2 authentication and impersonation combinations supported by the PXF JDBC Connector. It identifies the possible Hive user identities and the JDBC server configuration required for each.

Table heading key:

- *authentication* -> Hive `hive.server2.authentication` Setting
- *enable.doAs* -> Hive `hive.server2.enable.doAs` Setting
- *User Identity* -> Identity that HiveServer2 will use to access data
- *Configuration Required* -> PXF JDBC Connector or Hive configuration required for *User Identity*

| authentication | enable.doAs | User Identity | Configuration Required |
|------------------|---------------|----------------|-------------------|
| `NOSASL` | n/a | No authentication | Must set `jdbc.connection.property.auth` = `noSasl`. |
| `NONE`, or not specified | `TRUE` | User name that you provide | Set `jdbc.user`. |
| `NONE`, or not specified | `TRUE` | Greenplum user name | Set `pxf.service.user.impersonation` to `true` in `jdbc-site.xml`. |
| `NONE`, or not specified | `FALSE` | Name of the user who started Hive, typically `hive` | None |
| `KERBEROS` | `TRUE` | Identity provided in the PXF Kerberos principal, typically `gpadmin` | Must set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `TRUE` | User name that you provide | Set `hive.server2.proxy.user` in `jdbc.url` and set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `TRUE` | Greenplum user name | Set `pxf.service.user.impersonation` to `true` and `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |
| `KERBEROS` | `FALSE` | Identity provided in the `jdbc.url` `principal` parameter, typically `hive` | Must set `hadoop.security.authentication` to `kerberos` in `jdbc-site.xml`. |

**Note**: There are additional configuration steps required when Hive utilizes Kerberos authentication.

## Example Configuration Procedure

Perform the following procedure to configure a PXF JDBC server for Hive:

1. Log in to your Greenplum Database master node:

    ``` shell
    $ ssh gpadmin@<gpmaster>
    ```

2. Choose a name for the JDBC server.

3. Create the `$PXF_CONF/servers/<server_name>` directory. For example, use the following command to create a JDBC server configuration named `hivejdbc1`:

    ``` shell
    gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hivejdbc1
    ```

3. Navigate to the server configuration directory. For example:

    ```shell
    gpadmin@gpmaster$ cd $PXF_CONF/servers/hivejdbc1
    ```

4. Copy the PXF JDBC server template file to the server configuration directory. For example:

    ``` shell
    gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml .
    ```

4. When you access Hive secured with Kerberos, you also need to specify configuration properties in the `pxf-site.xml` file.
*If this file does not yet exist in your server configuration*, copy the `pxf-site.xml` template file to the server configuration directory. For example:

    ``` shell
    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
    ```

5. Open the `jdbc-site.xml` file in the editor of your choice and set the `jdbc.driver` and `jdbc.url` properties. Be sure to specify your Hive host, port, and database name:

    ``` xml
    <property>
        <name>jdbc.driver</name>
        <value>org.apache.hive.jdbc.HiveDriver</value>
    </property>
    <property>
        <name>jdbc.url</name>
        <value>jdbc:hive2://<hiveserver2_host>:<hiveserver2_port>/<database></value>
    </property>
    ```

6. Obtain the `hive-site.xml` file from your Hadoop cluster and examine the file.

7. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NOSASL`, HiveServer2 performs no authentication. Add the following connection-level property to `jdbc-site.xml`:

    ``` xml
    <property>
        <name>jdbc.connection.property.auth</name>
        <value>noSasl</value>
    </property>
    ```
    Alternatively, you may choose to add `;auth=noSasl` to the `jdbc.url`.

8. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NONE`, or the property is not specified, you must set the `jdbc.user` property. The value to which you set the `jdbc.user` property is dependent upon the `hive.server2.enable.doAs` impersonation setting in `hive-site.xml`:

    1. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. *Choose/perform one of the following options*:

        **Set** `jdbc.user` to specify the user that has read permission on all Hive data accessed by Greenplum Database. For example, to connect to Hive and run all requests as user `gpadmin`:

        ``` xml
        <property>
            <name>jdbc.user</name>
            <value>gpadmin</value>
        </property>
        ```
        **Or**, turn on JDBC server-level user impersonation so that PXF automatically uses the Greenplum Database user name to connect to Hive; uncomment the `pxf.service.user.impersonation` property in `jdbc-site.xml` and set the value to `true`:

        ``` xml
        <property>
            <name>pxf.service.user.impersonation</name>
            <value>true</value>
        </property>
        ```
        If you enable JDBC impersonation in this manner, you must not specify a `jdbc.user` nor include the setting in the `jdbc.url`.

    2. If required, create a PXF user configuration file as described in [Configuring a PXF User](cfg_server.html#usercfg) to manage the password setting.
    3. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations as the user who started the HiveServer2 process, usually the user `hive`. PXF ignores the `jdbc.user` setting in this circumstance.

9. If the `hive.server2.authentication` property in `hive-site.xml` is set to `KERBEROS`:
    1. Identify the name of the server configuration.
    2. Ensure that you have configured Kerberos authentication for PXF as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html), and that you have specified the Kerberos principal and keytab in the `pxf-site.xml` properties as described in the procedure.
    3. Comment out the `pxf.service.user.impersonation` property in the `pxf-site.xml` file. If you require user impersonation, you will uncomment and set the property in an upcoming step.
    3. Uncomment the `hadoop.security.authentication` setting in `$PXF_CONF/servers/<server_name>/jdbc-site.xml`:

        ``` xml
        <property>
            <name>hadoop.security.authentication</name>
            <value>kerberos</value>
        </property>
        ```
    4. Add the `saslQop` property to `jdbc.url`, and set it to match the `hive.server2.thrift.sasl.qop` property setting in `hive-site.xml`. For example, if the `hive-site.xml` file includes the following property setting:

        ``` xml
        <property>
            <name>hive.server2.thrift.sasl.qop</name>
            <value>auth-conf</value>
        </property>
        ```
        You would add `;saslQop=auth-conf` to the `jdbc.url`.
    5. Add the HiveServer2 `principal` name to the `jdbc.url`. For example:
        <pre>
        jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf
        </pre>
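        For reference, a sketch (not part of this change) of how this URL would typically be carried as the `jdbc.url` property in `jdbc-site.xml`; it reuses the illustrative `hs2server`, `10000`, `default`, and `REALM` values from the example above:

        ``` xml
        <!-- Sketch: full Kerberos-enabled URL as the jdbc.url property; values are illustrative -->
        <property>
            <name>jdbc.url</name>
            <value>jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf</value>
        </property>
        ```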
    6. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. *Choose/perform one of the following options*:

        **Do not** specify any additional properties. In this case, PXF initiates all Hadoop access with the identity provided in the PXF Kerberos principal (usually `gpadmin`).

        **Or**, set the `hive.server2.proxy.user` property in the `jdbc.url` to specify the user that has read permission on all Hive data. For example, to connect to Hive and run all requests as the user named `integration`, use the following `jdbc.url`:
        <pre>
        jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf;hive.server2.proxy.user=integration
        </pre>
    + + **Or**, enable PXF JDBC impersonation in the `pxf-site.xml` file so that PXF automatically uses the Greenplum Database user name to connect to Hive. Add or uncomment the `pxf.service.user.impersonation` property and set the value to `true`. For example: + + ``` xml + + pxf.service.user.impersonation + true + + ``` + If you enable JDBC impersonation, you must not explicitly specify a `hive.server2.proxy.user` in the `jdbc.url`. + 6. If required, create a PXF user configuration file to manage the password setting. + 7. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations with the identity provided by the PXF Kerberos principal (usually `gpadmin`). + +10. Save your changes and exit the editor. + +11. Use the `pxf cluster sync` command to copy the new server configuration to the Greenplum Database cluster. For example: + + ``` shell + gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync + ``` + diff --git a/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb b/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb index c3a5dcb37d..371e7e6cd8 100644 --- a/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb +++ b/gpdb-doc/markdown/pxf/hive_pxf.html.md.erb @@ -29,7 +29,7 @@ The PXF Hive connector reads data stored in a Hive table. This section describes Before working with Hive table data using PXF, ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq). -*If you plan to use PXF filter pushdown with Hive integral types*, ensure that the configuration parameter `hive.metastore.integral.jdo.pushdown` exists and is set to `true` in the `hive-site.xml` in both your Hadoop cluster **and** `$PXF_CONF/servers/default/hive-site.xml`. Refer to [Updating Hadoop Configuration](client_instcfg.html#client-cfg-update) for more information. +*If you plan to use PXF filter pushdown with Hive integral types*, ensure that the configuration parameter `hive.metastore.integral.jdo.pushdown` exists and is set to `true` in the `hive-site.xml` file in both your Hadoop cluster **and** `$PXF_CONF/servers/default/hive-site.xml`. Refer to [About Updating Hadoop Configuration](client_instcfg.html#client-cfg-update) for more information. ## Hive Data Formats diff --git a/gpdb-doc/markdown/pxf/jdbc_cfg.html.md.erb b/gpdb-doc/markdown/pxf/jdbc_cfg.html.md.erb index 21b5bc7ffb..8b856d9a29 100644 --- a/gpdb-doc/markdown/pxf/jdbc_cfg.html.md.erb +++ b/gpdb-doc/markdown/pxf/jdbc_cfg.html.md.erb @@ -134,23 +134,6 @@ Example: To set the `search_path` parameter before running a query in a PostgreS Ensure that the JDBC driver for the external SQL database supports any property that you specify. -### About JDBC User Impersonation - -The PXF JDBC Connector uses the `jdbc.user` setting or information in the `jdbc.url` to determine the identity of the user to connect to the external data store. When PXF JDBC user impersonation is disabled (the default), the behavior of the JDBC Connector is further dependent upon the external data store. For example, if you are using the JDBC Connector to access Hive, the Connector uses the settings of certain Hive authentication and impersonation properties to determine the user. You may be required to provide a `jdbc.user` setting, or add properties to the `jdbc.url` setting in the server `jdbc-site.xml` file. - -When you enable PXF JDBC user impersonation, the PXF JDBC Connector accesses the external data store on behalf of a Greenplum Database end user. 
The Connector uses the name of the Greenplum Database user that accesses the PXF external table to try to connect to the external data store.
-
-The `pxf.impersonation.jdbc` property governs JDBC user impersonation. JDBC user impersonation is disabled by default. To enable JDBC user impersonation for a server configuration, set the property to true:
-
-``` xml
-<property>
-    <name>pxf.impersonation.jdbc</name>
-    <value>true</value>
-</property>
-```
-
-When you enable JDBC user impersonation for a PXF server, PXF overrides the value of a `jdbc.user` property setting defined in either `jdbc-site.xml` or `<greenplum_user_name>-user.xml`, or specified in the external table DDL, with the Greenplum Database user name. For user impersonation to work effectively when the external data store requires passwords to authenticate connecting users, you must specify the `jdbc.password` setting for each user that can be impersonated in that user's `<greenplum_user_name>-user.xml` property override file. Refer to [Configuring a PXF User](cfg_server.html#usercfg) for more information about per-server, per-Greenplum-user configuration.
-
### About JDBC Connection Pooling

The PXF JDBC Connector uses JDBC connection pooling implemented by [HikariCP](https://github.com/brettwooldridge/HikariCP). When a user queries or writes to an external table, the Connector establishes a connection pool for the associated server configuration the first time that it encounters a unique combination of `jdbc.url`, `jdbc.user`, `jdbc.password`, connection property, and pool property settings. The Connector reuses connections in the pool subject to certain connection and timeout settings.

@@ -197,6 +180,53 @@ For example, if your Greenplum Database cluster has 16 segment hosts and the tar

In practice, you may choose to set `maxPoolSize` to a lower value, since the number of concurrent connections per JDBC query depends on the number of partitions used in the query. When a query uses no partitions, a single PXF JVM services the query. If a query uses 12 partitions, PXF establishes 12 concurrent JDBC connections to the remote database. Ideally, these connections are distributed equally among the PXF JVMs, but that is not guaranteed.

+## JDBC User Impersonation
+
+The PXF JDBC Connector uses the `jdbc.user` setting or information in the `jdbc.url` to determine the identity of the user to connect to the external data store. When PXF JDBC user impersonation is disabled (the default), the behavior of the JDBC Connector is further dependent upon the external data store. For example, if you are using the JDBC Connector to access Hive, the Connector uses the settings of certain Hive authentication and impersonation properties to determine the user. You may be required to provide a `jdbc.user` setting, or add properties to the `jdbc.url` setting in the server `jdbc-site.xml` file. Refer to [Configuring Hive Access via the JDBC Connector](hive_jdbc_cfg.html) for more information about this procedure.
+
+When you enable PXF JDBC user impersonation, the PXF JDBC Connector accesses the external data store on behalf of a Greenplum Database end user. The Connector uses the name of the Greenplum Database user that accesses the PXF external table to try to connect to the external data store.
+
+When you enable JDBC user impersonation for a PXF server, PXF overrides the value of a `jdbc.user` property setting defined in either `jdbc-site.xml` or `<greenplum_user_name>-user.xml`, or specified in the external table DDL, with the Greenplum Database user name. For user impersonation to work effectively when the external data store requires passwords to authenticate connecting users, you must specify the `jdbc.password` setting for each user that can be impersonated in that user's `<greenplum_user_name>-user.xml` property override file. Refer to [Configuring a PXF User](cfg_server.html#usercfg) for more information about per-server, per-Greenplum-user configuration.
+
+The `pxf.service.user.impersonation` property in the `jdbc-site.xml` configuration file governs JDBC user impersonation.
    In previous versions of Greenplum Database, you configured JDBC user impersonation via the now deprecated pxf.impersonation.jdbc property setting in the jdbc-site.xml configuration file.
    + + +### Example Configuration Procedure + +By default, PXF JDBC user impersonation is disabled. Perform the following procedure to turn PXF user impersonation on or off for a JDBC server configuration. + +1. Log in to your Greenplum Database master node as the administrative user: + + ``` shell + $ ssh gpadmin@ + ``` + +2. Identify the name of the PXF JDBC server configuration that you want to update. + +3. Navigate to the server configuration directory. For example, if the server is named `mysqldb`: + + ```shell + gpadmin@gpmaster$ cd $PXF_CONF/servers/mysqldb + ``` + +5. Open the `jdbc-site.xml` file in the editor of your choice, and add or uncomment the user impersonation property and setting. For example, if you require user impersonation for this server configuration, set the `pxf.service.user.impersonation` property to `true`: + + ``` xml + + pxf.service.user.impersonation + true + + ``` + +7. Save the `jdbc-site.xml` file and exit the editor. + +8. Use the `pxf cluster sync` command to synchronize the PXF JDBC server configuration to your Greenplum Database cluster. For example: + + ``` shell + gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync + ``` ## JDBC Named Query Configuration @@ -251,6 +281,11 @@ Refer to [About Using Named Queries](jdbc_pxf.html#about_nq) for information abo You can override the JDBC server configuration by directly specifying certain JDBC properties via custom options in the `CREATE EXTERNAL TABLE` command `LOCATION` clause. Refer to [Overriding the JDBC Server Configuration via DDL](jdbc_pxf.html#jdbc_override) for additional information. +## Configuring Access to Hive + +You can use the JDBC Connector to access Hive. Refer to [Configuring the JDBC Connector for Hive Access](hive_jdbc_cfg.html) for detailed information on this configuration procedure. + + ## Example Configuration Procedure Ensure that you have initialized PXF before you configure a JDBC Connector server. @@ -310,177 +345,3 @@ In this procedure, you name and add a PXF JDBC server configuration for a Postgr gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync ``` -## Configuring Hive Access - -You can use the PXF JDBC Connector to retrieve data from Hive. You can also use a JDBC named query to submit a custom SQL query to Hive and retrieve the results using the JDBC Connector. - -This topic describes how to configure the PXF JDBC Connector to access Hive. When you configure Hive access with JDBC, you must take into account the Hive user impersonation setting, as well as whether or not the Hadoop cluster is secured with Kerberos. - - -### JDBC Server Configuration - -The PXF JDBC Connector is installed with the JAR files required to access Hive via JDBC, `hive-jdbc-.jar` and `hive-service-.jar`, and automatically registers these JARs. - -When you configure a PXF JDBC server for Hive access, you must specify the JDBC driver class name, database URL, and client credentials just as you would when configuring a client connection to an SQL database. - -To access Hive via JDBC, you must specify the following properties and values in the `jdbc-site.xml` server configuration file: - -| Property | Value | -|----------------|--------| -| jdbc.driver | org.apache.hive.jdbc.HiveDriver | -| jdbc.url | jdbc:hive2://\:\/\ | - -The value of the HiveServer2 authentication (`hive.server2.authentication`) and impersonation (`hive.server2.enable.doAs`) properties, and whether or not the Hive service is utilizing Kerberos authentication, will inform the setting of other JDBC server configuration properties. 
These properties are defined in the `hive-site.xml` configuration file in the Hadoop cluster. You will need to obtain the values of these properties. - -The following table enumerates the Hive2 authentication and impersonation combinations supported by the PXF JDBC Connector. It identifies the possible Hive user identities and the JDBC server configuration required for each. - -Table heading key: - -- *authentication* -> Hive hive.server2.authentication Setting -- *enable.doAs* -> Hive hive.server2.enable.doAs Setting -- *User Identity* -> Identity that HiveServer2 will use to access data -- *Configuration Required* -> PXF JDBC Connector or Hive configuration required for *User Identity* - -| authentication | enable.doAs | User Identity | Configuration Required | -|------------------|---------------|----------------|-------------------| -| `NOSASL` | n/a | No authentication | Must set `jdbc.connection.property.auth` = `noSasl` | -| `NONE`, or not specified | `TRUE` | User name that you provide | Set `jdbc.user` | -| `NONE`, or not specified | `TRUE` | Greenplum user name | Set `pxf.impersonation.jdbc` = `true` | -| `NONE`, or not specified | `FALSE` | Name of the user who started Hive, typically `hive` | None | -| `KERBEROS` | `TRUE` | Identity provided in the PXF Kerberos principal, typically `gpadmin` | None | -| `KERBEROS` | `TRUE` | User name that you provide | Set `hive.server2.proxy.user` in `jdbc.url` | -| `KERBEROS` | `TRUE` | Greenplum user name | Set `pxf.impersonation.jdbc` = `true` | -| `KERBEROS` | `FALSE` | Identity provided in the `jdbc.url` `principal` parameter, typically `hive` | None | - -**Note**: There are additional configuration steps required when Hive utilizes Kerberos authentication. - -### Example Configuration Procedure - -Perform the following procedure to configure a PXF JDBC server for Hive: - -1. Log in to your Greenplum Database master node: - - ``` shell - $ ssh gpadmin@ - ``` - -2. Choose a name for the JDBC server. - -3. Create the `$PXF_CONF/servers/` directory. For example, use the following command to create a JDBC server configuration named `hivejdbc1`: - - ``` shell - gpadmin@gpmaster$ mkdir $PXF_CONF/servers/hivejdbc1 - ```` - -4. Copy the PXF JDBC server template file to the server configuration directory. For example: - - ``` shell - gpadmin@gpmaster$ cp $PXF_CONF/templates/jdbc-site.xml $PXF_CONF/servers/hivejdbc1/ - ``` - -5. Open the `jdbc-site.xml` file in the editor of your choice and set the `jdbc.driver` and `jdbc.url` properties. Be sure to specify your Hive host, port, and database name: - - ``` xml - - jdbc.driver - org.apache.hive.jdbc.HiveDriver - - - jdbc.url - jdbc:hive2://:/ - - ``` - -6. Obtain the `hive-site.xml` file from your Hadoop cluster and examine the file. - -7. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NOSASL`, HiveServer2 performs no authentication. Add the following connection-level property to `jdbc-site.xml`: - - ``` xml - - jdbc.connection.property.auth - noSasl - - ``` - Alternatively, you may choose to add `;auth=noSasl` to the `jdbc.url`. - -8. If the `hive.server2.authentication` property in `hive-site.xml` is set to `NONE`, or the property is not specified, you must set the `jdbc.user` property. The value to which you set the `jdbc.user` property is dependent upon the `hive.server2.enable.doAs` impersonation setting in `hive-site.xml`: - - 1. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. 
*Choose/perform one of the following options*: - - **Set** `jdbc.user` to specify the user that has read permission on all Hive data accessed by Greenplum Database. For example, to connect to Hive and run all requests as user `gpadmin`: - - ``` xml - - jdbc.user - gpadmin - - ``` - **Or**, turn on JDBC-level user impersonation so that PXF automatically uses the Greenplum Database user name to connect to Hive: - - ``` xml - - pxf.impersonation.jdbc - true - - ``` - If you enable JDBC impersonation in this manner, you must not specify a `jdbc.user` nor include the setting in the `jdbc.url`. - - 2. If required, create a PXF user configuration file to manage the password setting. - 3. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations as the user who started the HiveServer2 process, usually the user `hive`. PXF ignores the `jdbc.user` setting in this circumstance. - -9. If the `hive.server2.authentication` property in `hive-site.xml` is set to `KERBEROS`: - 1. Ensure that you have enabled Kerberos authentication for PXF as described in [Configuring PXF for Secure HDFS](pxf_kerbhdfs.html). - 2. Ensure that you have configured the Hadoop cluster as the `default` PXF server. - 3. Ensure that the `$PXF_CONF/servers/default/core-site.xml` file includes the following setting: - - ``` xml - - hadoop.security.authentication - kerberos - - ``` - 4. Add the `saslQop` property to `jdbc.url`, and set it to match the `hive.server2.thrift.sasl.qop` property setting in `hive-site.xml`. For example, if the `hive-site.xml` file includes the following property setting: - - ``` xml - - hive.server2.thrift.sasl.qop - auth-conf - - ``` - You would add `;saslQop=auth-conf` to the `jdbc.url`. - - 5. Add the HiverServer2 `principal` name to the `jdbc.url`. For example: - -
    -        jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf
    -        
    - 6. If `hive.server2.enable.doAs` is set to `TRUE` (the default), Hive runs Hadoop operations on behalf of the user connecting to Hive. *Choose/perform one of the following options*: - - **Do not** specify any additional properties. In this case, PXF initiates all Hadoop access with the identity provided in the PXF Kerberos principal (usually `gpadmin`). - - **Or**, set the `hive.server2.proxy.user` property in the `jdbc.url` to specify the user that has read permission on all Hive data. For example, to connect to Hive and run all requests as the user named `integration` use the following `jdbc.url`: - -
    -        jdbc:hive2://hs2server:10000/default;principal=hive/hs2server@REALM;saslQop=auth-conf;hive.server2.proxy.user=integration
    -        
    - - **Or**, enable PXF JDBC impersonation in the `jdbc-site.xml` file so that PXF automatically uses the Greenplum Database user name to connect to Hive. For example: - - ``` xml - - pxf.impersonation.jdbc - true - - ``` - If you enable JDBC impersonation, you must not explicitly specify a `hive.server2.proxy.user` in the `jdbc.url`. - 6. If required, create a PXF user configuration file to manage the password setting. - 7. If `hive.server2.enable.doAs` is set to `FALSE`, Hive runs Hadoop operations with the identity provided by the PXF Kerberos principal (usually `gpadmin`). - -10. Save your changes and exit the editor. - -11. Use the `pxf cluster sync` command to copy the new server configuration to the Greenplum Database cluster. For example: - - ``` shell - gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync - ``` - diff --git a/gpdb-doc/markdown/pxf/pxf_kerbhdfs.html.md.erb b/gpdb-doc/markdown/pxf/pxf_kerbhdfs.html.md.erb index bb0a347349..17e4675195 100644 --- a/gpdb-doc/markdown/pxf/pxf_kerbhdfs.html.md.erb +++ b/gpdb-doc/markdown/pxf/pxf_kerbhdfs.html.md.erb @@ -4,17 +4,38 @@ title: Configuring PXF for Secure HDFS When Kerberos is enabled for your HDFS filesystem, PXF, as an HDFS client, requires a principal and keytab file to authenticate access to HDFS. To read or write files on a secure HDFS, you must create and deploy Kerberos principals and keytabs for PXF, and ensure that Kerberos authentication is enabled and functioning. +PXF supports simultaneous access to multiple Kerberos-secured Hadoop clusters. -## Prerequisites +
    In previous versions of Greenplum Database, you configured the PXF Kerberos principal and keytab for the default Hadoop server via the now deprecated PXF_PRINCIPAL and PXF_KEYTAB settings in the pxf-env.sh configuration file.
+
When Kerberos is enabled, you access Hadoop with the PXF principal and keytab. You can also choose to access Hadoop using the identity of the Greenplum Database user.

You configure the impersonation setting and the Kerberos principal and keytab for a Hadoop server via the `pxf-site.xml` server-specific configuration file. Refer to [About Kerberos and User Impersonation Configuration (pxf-site.xml)](cfg_server.html#pxf-site) for more information about the configuration properties in this file.

Configure the Kerberos principal and keytab using the following `pxf-site.xml` properties:

| Property | Description | Default Value |
|----------------|--------------------------------------------|---------------|
| pxf.service.kerberos.principal | The Kerberos principal name. | gpadmin/\_HOST@EXAMPLE.COM |
| pxf.service.kerberos.keytab | The file system path to the Kerberos keytab file. | $PXF_CONF/keytabs/pxf.service.keytab |

The following table describes two scenarios for accessing Hadoop when Kerberos authentication is enabled:

| Scenario | Required Configuration |
|----------|---------------|
| PXF accesses Hadoop using the identity of the configured principal. | Set the `pxf.service.user.impersonation` property to `false` in the `pxf-site.xml` file to disable user impersonation. |
| PXF accesses Hadoop using the identity of the Greenplum Database user. | Set the `pxf.service.user.impersonation` property to `true` in the `pxf-site.xml` file to enable user impersonation. You must also configure Hadoop proxying for the Hadoop user identity specified in the *primary* component of the Kerberos principal. |


## Prerequisites

Before you configure PXF for access to a secure HDFS filesystem, ensure that you have:

-- Configured the Hadoop connectors using the default PXF server configuration.
+- Configured a PXF server for the Hadoop cluster, and can identify the server configuration name.

-- Initialized, configured, and started PXF as described in [Configuring PXF](instcfg_pxf.html), including enabling PXF and Hadoop user impersonation.
+- Initialized, configured, and started PXF as described in [Configuring PXF](instcfg_pxf.html).

-- Enabled Kerberos for your Hadoop cluster per the instructions for your specific distribution and verified the configuration.
+- Verified that Kerberos is enabled for your Hadoop cluster.

- Verified that the HDFS configuration parameter `dfs.block.access.token.enable` is set to `true`. You can find this setting in the `hdfs-site.xml` configuration file on a host in your Hadoop cluster.

@@ -44,7 +65,7 @@ When you configure PXF for secure HDFS using an AD Kerberos KDC server, you will

1. Start **Active Directory Users and Computers**.
2. Expand the forest domain and the top-level UNIX organizational unit that describes your Greenplum user domain.
3. Select **Service Accounts**, right-click, then select **New->User**.
4. Type a name, e.g. `ServiceGreenplumPROD1`, and change the login name to `gpadmin`. Note that the login name must comply with the POSIX standard and match `hadoop.proxyuser.<name>.hosts/groups` in the Hadoop `core-site.xml` and the Kerberos principal.
5. Type and confirm the Active Directory service account password.
Select the **User cannot change password** and **Password never expires** check boxes, then click **Next**. For security reasons, if you can't have **Password never expires** checked, you will need to generate a new keytab file (step 7) every time you change the password of the service account.
6. Click **Finish** to complete the creation of the new user principal.
7. Open Powershell or a command prompt and run the `ktpass` command to generate the keytab file. For example:

8. Copy the `pxf.service.keytab` file to the Greenplum master host.

-**Perform the following steps on the Greenplum Database master host**:
+**Perform the following procedure on the Greenplum Database master host**:

1. Log in to the Greenplum Database master host. For example:

    ``` shell
    $ ssh gpadmin@<gpmaster>
    ```

-2. Open the `$PXF_CONF/conf/pxf-env.sh` file in an editor. Update the `PXF_KEYTAB` and `PXF_PRINCIPAL` settings, if required, specifying the location of the keytab file and the Kerberos principal. Replace `EXAMPLE.COM` with your Kerberos realm.
-
-    ``` shell
-    export PXF_KEYTAB="${PXF_CONF}/keytabs/pxf.service.keytab"
-    export PXF_PRINCIPAL="gpadmin@EXAMPLE.COM"
-    ```

+2. Identify the name of the PXF Hadoop server configuration, and navigate to the server configuration directory. For example, if the server is named `hdp3`:
+
+    ```shell
+    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
+    ```
+
+3. If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to the directory. For example:
+
+    ``` shell
+    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
+    ```
+
+4. Open the `pxf-site.xml` file in the editor of your choice, and update the keytab and principal property settings, if required. Specify the location of the keytab file and the Kerberos principal, substituting your realm. For example:
+
+    ``` xml
+    <property>
+        <name>pxf.service.kerberos.principal</name>
+        <value>gpadmin@EXAMPLE.COM</value>
+    </property>
+    <property>
+        <name>pxf.service.kerberos.keytab</name>
+        <value>${pxf.conf}/keytabs/pxf.service.keytab</value>
+    </property>
+    ```
+
+5. Enable user impersonation as described in [Configure PXF User Impersonation](pxfuserimpers.html#pxf_cfg_impers), and configure or verify Hadoop proxying for the *primary* component of the Kerberos principal as described in [Configure Hadoop Proxying](pxfuserimpers.html#hadoop). For example, if your principal is `gpadmin@EXAMPLE.COM`, configure proxying for the Hadoop user `gpadmin`.

4. Save the file and exit the editor.

5. Synchronize the PXF configuration to your Greenplum Database cluster and restart PXF. For example:

    ``` shell
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    gpadmin@master$ $GPHOME/pxf/bin/pxf cluster stop
    gpadmin@master$ $GPHOME/pxf/bin/pxf cluster start
    ```

6. Step 7 does not synchronize the keytabs in `$PXF_CONF`. You must distribute the keytab file to `$PXF_CONF/keytabs/`. Locate the keytab file, copy the file to the `$PXF_CONF` user configuration directory, and set required permissions.
For example:

    ``` shell
    gpadmin@gpmaster$ gpscp -f hostfile_all pxf.service.keytab =:$PXF_CONF/keytabs/
    gpadmin@gpmaster$ gpssh -f hostfile_all chmod 400 $PXF_CONF/keytabs/pxf.service.keytab
    ```

### Configuring PXF with an MIT Kerberos KDC Server

@@ -163,29 +204,50 @@ When you configure PXF for secure HDFS using an MIT Kerberos KDC server, you wil

    ``` shell
    $ ssh gpadmin@<gpmaster>
    ```

-2. Open the PXF `pxf-env.sh` user configuration file in the editor of your choice. For example, to open the file with `vi` when `PXF_CONF=/usr/local/greenplum-pxf`:
-
-    ``` shell
-    gpadmin@seghost$ vi /usr/local/greenplum-pxf/conf/pxf-env.sh
-    ```
-
-3. Update the `PXF_KEYTAB` and `PXF_PRINCIPAL` settings, if required. Specify the location of the keytab file and the Kerberos principal, substituting your realm. *The default values for these settings are identified below*:
-
-    ``` shell
-    export PXF_KEYTAB="${PXF_CONF}/keytabs/pxf.service.keytab"
-    export PXF_PRINCIPAL="gpadmin/_HOST@EXAMPLE.COM"
-    ```

+2. Identify the name of the PXF Hadoop server configuration that requires Kerberos access.
+
+3. Navigate to the server configuration directory. For example, if the server is named `hdp3`:
+
+    ```shell
+    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
+    ```
+
+4. If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to the directory. For example:
+
+    ``` shell
+    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
+    ```
+
+5. Open the `pxf-site.xml` file in the editor of your choice, and update the keytab and principal property settings, if required. Specify the location of the keytab file and the Kerberos principal, substituting your realm. *The default values for these settings are identified below*:
+
+    ``` xml
+    <property>
+        <name>pxf.service.kerberos.principal</name>
+        <value>gpadmin/_HOST@EXAMPLE.COM</value>
+    </property>
+    <property>
+        <name>pxf.service.kerberos.keytab</name>
+        <value>${pxf.conf}/keytabs/pxf.service.keytab</value>
+    </property>
+    ```

    PXF automatically replaces `_HOST` with the FQDN of the segment host.

+6. If you want to access Hadoop as the Greenplum Database user:
+
+    1. Enable user impersonation as described in [Configure PXF User Impersonation](pxfuserimpers.html#pxf_cfg_impers).
+    2. Configure Hadoop proxying for the *primary* component of the Kerberos principal as described in [Configure Hadoop Proxying](pxfuserimpers.html#hadoop). For example, if your principal is `gpadmin/_HOST@EXAMPLE.COM`, configure proxying for the Hadoop user `gpadmin`.
+
+7. If you want to access Hadoop using the identity of the Kerberos principal, disable user impersonation as described in [Configure PXF User Impersonation](pxfuserimpers.html#pxf_cfg_impers).
+
+8. PXF ignores the `pxf.service.user.name` property when it uses Kerberos authentication to Hadoop. You may choose to remove this property from the `pxf-site.xml` file.
+
+8. Save the file and exit the editor.
+
+9. Synchronize the PXF configuration to your Greenplum Database cluster.
For example:

    ``` shell
    gpadmin@seghost$ $GPHOME/pxf/bin/pxf cluster sync
-   gpadmin@master$ $GPHOME/pxf/bin/pxf cluster stop
-   gpadmin@master$ $GPHOME/pxf/bin/pxf cluster start
    ```

diff --git a/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb b/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
index 3316fd17d2..247653a5be 100644
--- a/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
+++ b/gpdb-doc/markdown/pxf/pxfuserimpers.html.md.erb
@@ -1,16 +1,29 @@
---
-title: Configuring User Impersonation and Proxying
+title: Configuring the Hadoop User, User Impersonation, and Proxying
---

PXF accesses Hadoop services on behalf of Greenplum Database end users.

When user impersonation is enabled (the default), PXF accesses Hadoop services using the identity of the Greenplum Database user account that logs in to Greenplum and performs an operation that uses a PXF connector. Keep in mind that PXF uses only the _login_ identity of the user when accessing Hadoop services. For example, if a user logs in to Greenplum Database as the user `jane` and then executes `SET ROLE` or `SET SESSION AUTHORIZATION` to assume a different user identity, all PXF requests still use the identity `jane` to access Hadoop services. When user impersonation is enabled, you must explicitly configure each Hadoop data source (HDFS, Hive, HBase) to allow PXF to act as a proxy for impersonating specific Hadoop users or groups.

When user impersonation is disabled, PXF executes all Hadoop service requests as the PXF process owner (usually `gpadmin`) or the Hadoop user identity that you specify. This behavior provides no means to control access to Hadoop services for different Greenplum Database users. It requires that this user have access to all files and directories in HDFS, and all tables in Hive and HBase that are referenced in PXF external table definitions.
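As a sketch only, using the `pxf-site.xml` properties described in the scenarios below and a hypothetical user name `svcgpdb`, a server that disables impersonation and accesses Hadoop as a single dedicated user would set:

``` xml
<!-- Sketch: disable impersonation and access Hadoop as one dedicated user;
     the user name svcgpdb is a made-up example -->
<property>
    <name>pxf.service.user.impersonation</name>
    <value>false</value>
</property>
<property>
    <name>pxf.service.user.name</name>
    <value>svcgpdb</value>
</property>
```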
-## Configure PXF User Impersonation +You configure the Hadoop user and PXF user impersonation setting for a server via the `pxf-site.xml` server configuration file. Refer to [About Kerberos and User Impersonation Configuration (pxf-site.xml)](cfg_server.html#pxf-site) for more information about the configuration properties in this file. -Perform the following procedure to turn PXF user impersonation on or off in your Greenplum Database cluster. If you are configuring PXF for the first time, user impersonation is enabled by default. You need not perform this procedure. +The following table describes some of the PXF configuration scenarios for Hadoop access: + +| Scenario | pxf-site.xml Required | Impersonation Setting | Required Configuration | +|----------------|--------------|---------|---------------| +| PXF accesses Hadoop using the identity of the Greenplum Database user. | yes | true | Enable user impersonation, identify the Hadoop proxy user in the `pxf.service.user.name`, and configure Hadoop proxying for this Hadoop user identity. | +| PXF accesses Hadoop using the identity of the operating system user that started the PXF process. | yes | false | Disable user impersonation. | +| PXF accesses Hadoop using a user identity that you specify. | yes | false | Disable user impersonation and identify the Hadoop user identity in the `pxf.service.user.name` property setting. | + + +## Configure the Hadoop User + +By default, PXF accesses Hadoop using the identity of the Greenplum Database user, and you are required to set up a proxy Hadoop user. You can configure PXF to access Hadoop as a different user on a per-server basis. + +Perform the following procedure to configure the Hadoop user: 1. Log in to your Greenplum Database master node as the administrative user: @@ -18,52 +31,112 @@ Perform the following procedure to turn PXF user impersonation on or off in your $ ssh gpadmin@ ``` -2. Recall the location of the PXF user configuration directory (`$PXF_CONF`). Open the `$PXF_CONF/conf/pxf-env.sh` configuration file in a text editor. For example: +2. Identify the name of the PXF Hadoop server configuration that you want to update. - ``` shell - gpadmin@gpmaster$ vi $PXF_CONF/conf/pxf-env.sh +3. Navigate to the server configuration directory. For example, if the server is named `hdp3`: + + ```shell + gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3 ``` -3. Locate the `PXF_USER_IMPERSONATION` setting in the `pxf-env.sh` file. Set the value to `true` to turn PXF user impersonation on, or `false` to turn it off. For example: +4. If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to the directory. For example: ``` shell - PXF_USER_IMPERSONATION="true" + gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml . + ``` + +5. Open the `pxf-site.xml` file in the editor of your choice, and configure the Hadoop user name. When impersonation is disabled, this name identifies the Hadoop user identity that PXF will use to access the Hadoop system. When user impersonation is enabled, this name identifies the PXF proxy Hadoop user. For example, if you want to access Hadoop as the user `hdfsuser1`: + + ``` xml + + pxf.service.user.name + hdfsuser1 + ``` -4. Use the `pxf cluster sync` command to copy the updated `pxf-env.sh` file to the Greenplum Database cluster. For example: +7. Save the `pxf-site.xml` file and exit the editor. + +8. Use the `pxf cluster sync` command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster. 
    For example:

    ``` shell
    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
    ```
 
-5. If you have previously started PXF, restart it on each Greenplum Database segment host as described in [Restarting PXF](cfginitstart_pxf.html#restart_pxf) to apply the new setting.
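+
+For reference, a complete `pxf-site.xml` for this scenario might look like the following minimal sketch. It assumes the hypothetical `hdp3` server and `hdfsuser1` Hadoop user from the procedure above, and combines the `pxf.service.user.name` property with the user impersonation property described in the next section:
+
+``` xml
+<?xml version="1.0" encoding="UTF-8"?>
+<configuration>
+    <!-- the Hadoop user that PXF proxies as (impersonation on) or runs as (impersonation off) -->
+    <property>
+        <name>pxf.service.user.name</name>
+        <value>hdfsuser1</value>
+    </property>
+    <!-- user impersonation is enabled by default; set explicitly here for clarity -->
+    <property>
+        <name>pxf.service.user.impersonation</name>
+        <value>true</value>
+    </property>
+</configuration>
+```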
+
+## Configure PXF User Impersonation
+
+PXF user impersonation is enabled by default for Hadoop servers. You can configure PXF user impersonation on a per-server basis. Perform the following procedure to turn PXF user impersonation on or off for a Hadoop server configuration:
+
+    **Note**: In previous versions of Greenplum Database, you configured user impersonation globally for Hadoop clusters via the now-deprecated `PXF_USER_IMPERSONATION` setting in the `pxf-env.sh` configuration file.
+
+1. Navigate to the server configuration directory. For example, if the server is named `hdp3`:
+
+    ``` shell
+    gpadmin@gpmaster$ cd $PXF_CONF/servers/hdp3
+    ```
+
+2. If the server configuration does not yet include a `pxf-site.xml` file, copy the template file to the directory. For example:
+
+    ``` shell
+    gpadmin@gpmaster$ cp $PXF_CONF/templates/pxf-site.xml .
+    ```
+
+3. Open the `pxf-site.xml` file in the editor of your choice, and update the user impersonation property setting. For example, if you do not require user impersonation for this server configuration, set the `pxf.service.user.impersonation` property to `false`:
+
+    ``` xml
+    <property>
+        <name>pxf.service.user.impersonation</name>
+        <value>false</value>
+    </property>
+    ```
+
+    If you require user impersonation, turn it on:
+
+    ``` xml
+    <property>
+        <name>pxf.service.user.impersonation</name>
+        <value>true</value>
+    </property>
+    ```
+
+4. If you enabled user impersonation, you must configure Hadoop proxying as described in [Configure Hadoop Proxying](#hadoop). You must also configure [Hive User Impersonation](#hive) and [HBase User Impersonation](#hbase) if you plan to use those services.
+
+5. Save the `pxf-site.xml` file and exit the editor.
+
+6. Use the `pxf cluster sync` command to synchronize the PXF Hadoop server configuration to your Greenplum Database cluster. For example:
+
+    ``` shell
+    gpadmin@gpmaster$ $GPHOME/pxf/bin/pxf cluster sync
+    ```
 
 ## Configure Hadoop Proxying
 
-When PXF user personation is enabled (the default), you must configure the Hadoop `core-site.xml` configuration file to permit user impersonation for PXF. Follow these steps:
+When PXF user impersonation is enabled for a Hadoop server configuration, you must configure Hadoop to permit PXF to proxy Greenplum users. This configuration involves setting certain `hadoop.proxyuser.*` properties. Follow these steps to set up PXF Hadoop proxy users:
 
-1. On your Hadoop cluster, open the `core-site.xml` configuration file using a text editor, or use Ambari to add or edit the Hadoop property values described in this procedure.
+1. Log in to your Hadoop cluster and open the `core-site.xml` configuration file using a text editor, or use Ambari or another Hadoop cluster manager to add or edit the Hadoop property values described in this procedure.
 
-2. Set the property `hadoop.proxyuser.<name>.hosts` to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy user (generally `gpadmin`) for `<name>`, and provide multiple PXF host names in a comma-separated list. For example:
+2. Set the property `hadoop.proxyuser.<name>.hosts` to specify the list of PXF host names from which proxy requests are permitted. Substitute the PXF proxy Hadoop user for `<name>`. The PXF proxy Hadoop user is the `pxf.service.user.name` user that you configured in the procedure above, or, if you are using Kerberos authentication to Hadoop, the *primary* component of the Kerberos principal. If you have not configured `pxf.service.user.name`, the proxy user is the operating system user that started PXF. Provide multiple PXF host names in a comma-separated list. For example, if the PXF proxy user is named `hdfsuser2`:
 
     ``` xml
     <property>
-        <name>hadoop.proxyuser.gpadmin.hosts</name>
+        <name>hadoop.proxyuser.hdfsuser2.hosts</name>
         <value>pxfhost1,pxfhost2,pxfhost3</value>
     </property>
     ```
 
-3. Set the property `hadoop.proxyuser.<name>.groups` to specify the list of HDFS groups that PXF can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF. For example:
+3. Set the property `hadoop.proxyuser.<name>.groups` to specify the list of HDFS groups that PXF as Hadoop user `<name>` can impersonate. You should limit this list to only those groups that require access to HDFS data from PXF (a combined `core-site.xml` sketch follows this procedure). For example:
 
     ``` xml
     <property>
-        <name>hadoop.proxyuser.gpadmin.groups</name>
+        <name>hadoop.proxyuser.hdfsuser2.groups</name>
         <value>group1,group2</value>
     </property>
     ```
 
-4. After changing `core-site.xml`, you must restart Hadoop for your changes to take effect.
+4. You must restart Hadoop for your `core-site.xml` changes to take effect.
 
-5. Copy the updated `core-site.xml` file to the PXF Hadoop server configuration directory `$PXF_CONF/servers/` on the master and synchronize the configuration to the standby master and each Greenplum Database segment host.
+5. Copy the updated `core-site.xml` file to the PXF Hadoop server configuration directory `$PXF_CONF/servers/<server_name>` on the Greenplum Database master and synchronize the configuration to the standby master and each Greenplum Database segment host.
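+
+Taken together, a minimal `core-site.xml` proxy configuration for the hypothetical `hdfsuser2` proxy user from the steps above might look like the following sketch; the host and group values are the example placeholders from this procedure, so substitute your own:
+
+``` xml
+<configuration>
+    <!-- PXF hosts from which proxy requests are permitted -->
+    <property>
+        <name>hadoop.proxyuser.hdfsuser2.hosts</name>
+        <value>pxfhost1,pxfhost2,pxfhost3</value>
+    </property>
+    <!-- HDFS groups that PXF, as hdfsuser2, can impersonate -->
+    <property>
+        <name>hadoop.proxyuser.hdfsuser2.groups</name>
+        <value>group1,group2</value>
+    </property>
+</configuration>
+```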
 
 ## Hive User Impersonation
diff --git a/gpdb-doc/markdown/pxf/troubleshooting_pxf.html.md.erb b/gpdb-doc/markdown/pxf/troubleshooting_pxf.html.md.erb
index 7b1d20e91b..14d970f61b 100644
--- a/gpdb-doc/markdown/pxf/troubleshooting_pxf.html.md.erb
+++ b/gpdb-doc/markdown/pxf/troubleshooting_pxf.html.md.erb
@@ -33,7 +33,7 @@ The following table describes some errors you may encounter while using PXF:
 | NoSuchObjectException(message:\<schema\>.\<hivetable\> table not found) | **Cause**: The Hive table that you specified with \<schema\>.\<hivetable\> does not exist.<br>**Solution**: Provide the name of an existing Hive table. |
 | Failed to connect to \<segment-host\> port 5888: Connection refused (libchurl.c:944) (\<segment-id\> slice\<N\> \<segment-host\>:40000 pid=\<pid\>)<br>... | **Cause**: PXF is not running on \<segment-host\>.<br>**Solution**: Restart PXF on \<segment-host\>. |
 | *ERROR*: failed to acquire resources on one or more segments<br>*DETAIL*: could not connect to server: Connection refused<br>&nbsp;&nbsp;&nbsp;&nbsp;Is the server running on host "\<segment-host\>" and accepting<br>&nbsp;&nbsp;&nbsp;&nbsp;TCP/IP connections on port 40000?(seg\<N\> \<segment-host\>:40000) | **Cause**: The Greenplum Database segment host \<segment-host\> is down. |
-| org.apache.hadoop.security.AccessControlException: Permission denied: user=\<user\>, access=READ, inode="\<filepath\>":\<user\>:\<group\>:-rw------- | **Cause**: The Greenplum Database user that executed the PXF operation does not have permission to access the underlying Hadoop service (HDFS or Hive). See [Configuring User Impersonation and Proxying](pxfuserimpers.html). |
+| org.apache.hadoop.security.AccessControlException: Permission denied: user=\<user\>, access=READ, inode="\<filepath\>":\<user\>:\<group\>:-rw------- | **Cause**: The Greenplum Database user that executed the PXF operation does not have permission to access the underlying Hadoop service (HDFS or Hive). See [Configuring the Hadoop User, User Impersonation, and Proxying](pxfuserimpers.html). |
 
 ## PXF Logging
 Enabling more verbose logging may aid PXF troubleshooting efforts. PXF provides two categories of message logging: service-level and client-level.
diff --git a/gpdb-doc/markdown/pxf/upgrade_pxf_6x.html.md.erb b/gpdb-doc/markdown/pxf/upgrade_pxf_6x.html.md.erb
index d2cd7648ba..02a0f1701d 100644
--- a/gpdb-doc/markdown/pxf/upgrade_pxf_6x.html.md.erb
+++ b/gpdb-doc/markdown/pxf/upgrade_pxf_6x.html.md.erb
@@ -40,6 +40,18 @@ After you upgrade to the new version of Greenplum Database, perform the following
 
 2. Initialize PXF on each segment host as described in [Initializing PXF](init_pxf.html). You may choose to use your existing `$PXF_CONF` for the initialization.
 
+3. **If you are upgrading from Greenplum Database version 6.1.x or earlier** and you have configured any JDBC servers that access Kerberos-secured Hive, you must now set the `hadoop.security.authentication` property in the `jdbc-site.xml` file to explicitly identify use of the Kerberos authentication method. Perform the following for each of these server configurations (an example follows these steps):
+
+    1. Navigate to the server configuration directory.
+    2. Open the `jdbc-site.xml` file in the editor of your choice and uncomment or add the following property block to the file:
+
+        ``` xml
+        <property>
+            <name>hadoop.security.authentication</name>
+            <value>kerberos</value>
+        </property>
+        ```
+    3. Save the file and exit the editor.
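+
+    For example, for a hypothetical JDBC server configuration named `hivekrb` that accesses Kerberos-secured Hive, the commands might look like this:
+
+    ``` shell
+    gpadmin@gpmaster$ cd $PXF_CONF/servers/hivekrb
+    gpadmin@gpmaster$ vi jdbc-site.xml
+    ```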
 
-3. Synchronize the PXF configuration from the master host to the standby master and each Greenplum Database segment host. For example:
+4. Synchronize the PXF configuration from the master host to the standby master and each Greenplum Database segment host. For example:
-- 
GitLab