Commit 605f8209 authored by Lisa Owen, committed by David Yozie

docs - add content for PXF HBase profile (#3765)

* docs - add content for PXF HBase profile

* fix incorrect table name - indirect mapping now working w example

* use new page title in pxf overview xrefs

* incorporate edits from david

* note that hbase does not support predicate pushdown
Parent 5ebd8cdf
......@@ -29,7 +29,7 @@
type="topic"/>
</topicref>
</topicref>
<topicref href="pxf-overview.xml" navtitle="Accessing HDFS and Hive Data with PXF"/>
<topicref href="pxf-overview.xml" navtitle="Accessing External Data with PXF"/>
<topicref href="g-using-hadoop-distributed-file-system--hdfs--tables.xml" type="topic">
<topicref href="g-one-time-hdfs-protocol-installation.xml" type="topic"/>
<topicref href="g-grant-privileges-for-the-hdfs-protocol.xml" type="topic"/>
......
......@@ -7,6 +7,6 @@
<p>The PXF Extension Framework <codeph>pxf</codeph> protocol is packaged as a Greenplum Database extension. The <codeph>pxf</codeph> protocol supports reading HDFS file and Hive table data. The protocol does not yet support writing to HDFS or Hive data stores.</p>
<p>When you use the <codeph>pxf</codeph> protocol to query HDFS and Hive systems, you specify the HDFS file or Hive table you want to access. PXF requests the data from HDFS and delivers the relevant portions in parallel to each Greenplum Database segment instance serving the query.</p>
<p>You must explicitly initialize and start the PXF Extension Framework before you can use the <codeph>pxf</codeph> protocol to read external data. You must also grant permissions to the <codeph>pxf</codeph> protocol and enable PXF in each database in which you want to create external tables to access external data.</p>
<p>For detailed information about configuring and using the PXF Extension Framework and the <codeph>pxf</codeph> protocol, refer to <xref href="pxf-overview.xml" type="topic">Accessing HDFS and Hive Data with PXF</xref>.</p>
<p>For detailed information about configuring and using the PXF Extension Framework and the <codeph>pxf</codeph> protocol, refer to <xref href="pxf-overview.xml" type="topic">Accessing External Data with PXF</xref>.</p>
</body>
</topic>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_u14_wtd_dbb">
<title>Accessing HDFS and Hive Data with PXF </title>
<title>Accessing External Data with PXF</title>
<shortdesc>Data managed by your organization may already reside in external sources. The Greenplum Database PXF Extension Framework (PXF) provides access to this external data via built-in connectors that map an external data source to a Greenplum Database table definition.</shortdesc>
<body>
<p>PXF is installed with HDFS and Hive connectors. These connectors enable you to read external HDFS file system and Hive table data stored in text, Avro, RCFile, Parquet, SequenceFile, and ORC formats.</p>
<p>PXF is installed with HDFS, Hive, and HBase connectors. These connectors enable you to read external HDFS file system and Hive and HBase table data stored in text, Avro, RCFile, Parquet, SequenceFile, and ORC formats.</p>
<p>The PXF Extension Framework includes a protocol C library and a Java service. After you configure and initialize PXF, you start a single PXF JVM process on each Greenplum Database segment host. This long-running process concurrently serves multiple query requests.</p>
<p>For detailed information about the architecture of and using the PXF Extension Framework, refer to the <xref href="../../pxf/overview_pxf.html" type="topic" format="html">Using PXF with External Data</xref> documentation.</p>
</body>
......
---
title: Accessing HBase Data
---
Apache HBase is a distributed, versioned, non-relational database on Hadoop.
The PXF HBase connector reads data stored in an HBase table. This section describes how to use the PXF HBase connector.
**Note**: PXF does not yet support predicate pushdown to HBase.
## <a id="hbase_prereq"></a>Prerequisites
Before working with HBase table data, ensure that you have:
- Installed and configured an HBase client on each Greenplum Database segment host. Refer to [Installing and Configuring Clients for PXF](client_instcfg.html).
- Initialized PXF on your Greenplum Database segment hosts, and started PXF on each host. See [Configuring, Initializing, and Starting PXF](cfginitstart_pxf.html) for PXF initialization, configuration, and startup information.
## <a id="hbase_primer"></a>HBase Primer
This topic assumes that you have a basic understanding of the following HBase concepts:
- An HBase column includes two components: a column family and a column qualifier. These components are delimited by a colon `:` character, \<column-family\>:\<column-qualifier\>.
- An HBase row consists of a row key and one or more column values. A row key is a unique identifier for the table row.
- An HBase table is a multi-dimensional map composed of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table.
- An HBase cell is composed of a row (column family, column qualifier, column value) and a timestamp. The column value and timestamp in a given cell represent a version of the value.
For detailed information about HBase, refer to the [Apache HBase Reference Guide](http://hbase.apache.org/book.html).
## <a id="hbase_shell"></a>HBase Shell
The HBase shell is a command-line interface to HBase, similar to `psql` for Greenplum Database. To start the HBase shell:
``` shell
$ hbase shell
<hbase output>
hbase(main):001:0>
```
The default HBase namespace is named `default`.
### <a id="hbaseshell_example" class="no-quick-link"></a>Example: Creating an HBase Table
Create a sample HBase table.
1. Create an HBase table named `order_info` in the `default` namespace. `order_info` has two column families: `product` and `shipping_info`:
``` pre
hbase(main):> create 'order_info', 'product', 'shipping_info'
```
2. The `order_info` `product` column family has qualifiers named `name` and `location`. The `shipping_info` column family has qualifiers named `state` and `zipcode`. Add some data to the `order_info` table:
``` pre
put 'order_info', '1', 'product:name', 'tennis racquet'
put 'order_info', '1', 'product:location', 'out of stock'
put 'order_info', '1', 'shipping_info:state', 'CA'
put 'order_info', '1', 'shipping_info:zipcode', '12345'
put 'order_info', '2', 'product:name', 'soccer ball'
put 'order_info', '2', 'product:location', 'on floor'
put 'order_info', '2', 'shipping_info:state', 'CO'
put 'order_info', '2', 'shipping_info:zipcode', '56789'
put 'order_info', '3', 'product:name', 'snorkel set'
put 'order_info', '3', 'product:location', 'warehouse'
put 'order_info', '3', 'shipping_info:state', 'OH'
put 'order_info', '3', 'shipping_info:zipcode', '34567'
```
You will access the `order_info` HBase table directly via PXF in examples later in this topic.
3. Display the contents of the `order_info` table:
``` pre
hbase(main):> scan 'order_info'
ROW COLUMN+CELL
1 column=product:location, timestamp=1499074825516, value=out of stock
1 column=product:name, timestamp=1499074825491, value=tennis racquet
1 column=shipping_info:state, timestamp=1499074825531, value=CA
1 column=shipping_info:zipcode, timestamp=1499074825548, value=12345
2 column=product:location, timestamp=1499074825573, value=on floor
...
3 row(s) in 0.0400 seconds
```
## <a id="syntax3"></a>Querying External HBase Data
The PXF HBase connector supports a single profile named `HBase`.
Use the following syntax to create a Greenplum Database external table referencing an HBase table:
``` sql
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<hbase-table-name>?PROFILE=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```
HBase connector-specific keywords and values used in the [CREATE EXTERNAL TABLE](../ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) call are described below.
| Keyword | Value |
|-------|-------------------------------------|
| \<hbase-table-name\> | The name of the HBase table. |
| PROFILE | The `PROFILE` keyword must specify `HBase`. |
| FORMAT | The `FORMAT` clause must specify `'CUSTOM' (formatter='pxfwritable_import')`. |
## <a id="datatypemapping"></a>Data Type Mapping
HBase is byte-based; it stores all data types as an array of bytes. To represent HBase data in Greenplum Database, select a data type for your Greenplum Database column that matches the underlying content of the HBase column qualifier values.
**Note**: PXF does not support complex HBase objects.
## <a id="columnmapping"></a>Column Mapping
You can create a Greenplum Database external table that references all, or a subset of, the column qualifiers defined in an HBase table. PXF supports direct or indirect mapping between a Greenplum Database table column and an HBase table column qualifier.
### <a id="directmapping" class="no-quick-link"></a>Direct Mapping
When you use direct mapping to map Greenplum Database external table column names to HBase qualifiers, you specify column-family-qualified HBase qualifier names as quoted values. The PXF HBase connector passes these column names as-is to HBase as it reads the table data.
For example, to create a Greenplum Database external table accessing the following data:
- qualifier `name` in the column family named `product`
- qualifier `zipcode` in the column family named `shipping_info` 
from the `order_info` HBase table you created in [Example: Creating an HBase Table](#hbaseshell_example), use this `CREATE EXTERNAL TABLE` syntax:
``` sql
CREATE EXTERNAL TABLE orderinfo_hbase ("product:name" varchar, "shipping_info:zipcode" int)
LOCATION ('pxf://order_info?PROFILE=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```
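After you create the external table, you can query it as you would any Greenplum Database table. For example, this query sketch retrieves the directly mapped qualifiers for the sample data loaded earlier:

``` sql
SELECT * FROM orderinfo_hbase;
```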
### <a id="indirectmappingvialookuptable" class="no-quick-link"></a>Indirect Mapping via Lookup Table
When you use indirect mapping to map Greenplum Database external table column names to HBase qualifiers, you specify the mapping in a lookup table you create in HBase. The lookup table maps a \<column-family\>:\<column-qualifier\> to a column name alias that you specify when you create the Greenplum Database external table.
You must name the HBase PXF lookup table `pxflookup`. And you must define this table with a single column family named `mapping`. For example:
``` pre
hbase(main):> create 'pxflookup', 'mapping'
```
While the direct mapping method is fast and intuitive, indirect mapping allows you to create a shorter, character-based alias for the HBase \<column-family\>:\<column-qualifier\> name. Aliases reconcile HBase column qualifier names with Greenplum Database column naming rules:
- HBase qualifier names can be very long. Greenplum Database has a 63 character limit on the size of the column name.
- HBase qualifier names can include binary or non-printable characters. Greenplum Database column names are character-based.
When populating the `pxflookup` HBase table, add rows to the table such that the:
- row key specifies the HBase table name
- `mapping` column family qualifier identifies the Greenplum Database column name, and the value identifies the HBase `<column-family>:<column-qualifier>` for which you are creating the alias.
For example, to use indirect mapping with the `order_info` table, add these entries to the `pxflookup` table:
``` pre
hbase(main):> put 'pxflookup', 'order_info', 'mapping:pname', 'product:name'
hbase(main):> put 'pxflookup', 'order_info', 'mapping:zip', 'shipping_info:zipcode'
```
Then create a Greenplum Database external table using the following `CREATE EXTERNAL TABLE` syntax:
``` sql
CREATE EXTERNAL TABLE orderinfo_map (pname varchar, zip int)
LOCATION ('pxf://order_info?PROFILE=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```
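As with direct mapping, you reference the alias column names in your queries. For example, to list the orders shipping to zip codes greater than 30000 in the sample data:

``` sql
SELECT pname, zip FROM orderinfo_map WHERE zip > 30000;
```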
## <a id="rowkey"></a>Row Key
The HBase table row key is a unique identifier for the table row. PXF handles the row key in a special way.
To use the row key in the Greenplum Database external table query, define the external table using the PXF reserved column named `recordkey`. The `recordkey` column name instructs PXF to return the HBase table record key for each row.
Define the `recordkey` using the Greenplum Database data type `bytea`.
For example:
``` sql
CREATE EXTERNAL TABLE <table_name> (recordkey bytea, ... )
LOCATION ('pxf://<hbase_table_name>?PROFILE=HBase')
FORMAT 'CUSTOM' (formatter='pxfwritable_import');
```
After you have created the external table, you can use the `recordkey` in a `WHERE` clause to filter the HBase table on a range of row key values.
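For example, the following sketch defines an external table (named `orderinfo_key` here for illustration) on the sample `order_info` table and filters on a range of row key values:

``` sql
CREATE EXTERNAL TABLE orderinfo_key (recordkey bytea, "product:name" varchar)
  LOCATION ('pxf://order_info?PROFILE=HBase')
  FORMAT 'CUSTOM' (formatter='pxfwritable_import');

SELECT * FROM orderinfo_key WHERE recordkey BETWEEN '1' AND '2';
```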
......@@ -2,7 +2,7 @@
title: Installing and Configuring PXF
---
The PXF Extension Framework provides connectors to Hadoop, Hive, and HBase data stores. To use these PXF connectors, you must install Hadoop, Hive, and HBase clients on each Greenplum Database segment host as described in this one-time installation and configuration procedure:
The Greenplum Platform Extension Framework (PXF) provides connectors to Hadoop, Hive, and HBase data stores. To use these PXF connectors, you must install Hadoop, Hive, and HBase clients on each Greenplum Database segment host as described in this one-time installation and configuration procedure:
- **[Installing and Configuring Hadoop Clients for PXF](client_instcfg.html)**
......
......@@ -47,6 +47,10 @@ The PXF Extension Framework (PXF) provides parallel, high throughput data access
This topic describes how to use the PXF Hive connector and related profiles to read Hive tables stored in TextFile, RCFile, Parquet, and ORC storage formats.
- **[Accessing HBase Table Data](hbase_pxf.html)**
This topic describes how to use the PXF HBase connector to read HBase table data.
- **[Troubleshooting PXF](troubleshooting_pxf.html)**
This topic details the service- and database-level logging configuration procedures for PXF. It also identifies some common PXF errors.
......
......@@ -25,7 +25,7 @@ The PXF Extension Framework implements a protocol named `pxf` that you can use t
You must enable the PXF extension in each database in which you plan to use the framework to access external data. You must also explicitly `GRANT` permission to the `pxf` protocol to those users/roles who require access.
After the extension is registered and privileges are assigned, you can use the `CREATE EXTERNAL TABLE` command to create an external table using the `pxf` protocol. PXF provides built-in HDFS and Hive connectors. These connectors define profiles that support different file formats. You specify the profile name in the `CREATE EXTERNAL TABLE` command `LOCATION` URI.
After the extension is registered and privileges are assigned, you can use the `CREATE EXTERNAL TABLE` command to create an external table using the `pxf` protocol. PXF provides built-in HDFS, Hive, and HBase connectors. These connectors define profiles that support different file formats. You specify the profile name in the `CREATE EXTERNAL TABLE` command `LOCATION` URI.
## <a id="enable-pxf-ext"></a>Enabling/Disabling PXF
......@@ -87,12 +87,13 @@ GRANT SELECT ON PROTOCOL pxf TO bill;
## <a id="built-inprofiles"></a> PXF Profiles
PXF is installed with HDFS and Hive connectors that provide a number of built-in profiles. These profiles simplify and unify access to external data sources of varied formats. You provide the profile name when you specify the `pxf` protocol on a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table referencing an external data store.
PXF is installed with HDFS, Hive, and HBase connectors that provide a number of built-in profiles. These profiles simplify and unify access to external data sources of varied formats. You provide the profile name when you specify the `pxf` protocol on a `CREATE EXTERNAL TABLE` command to create a Greenplum Database external table referencing an external data store.
PXF provides the following built-in profiles:
| Data Source | Data Format | Profile Name(s) | Description |
|-------|---------|------------|----------------|
| HBase | Many | HBase | Any data type that can be converted to an array of bytes.|
| HDFS | Text | HdfsTextSimple | Delimited single line records from plain text files on HDFS.|
| HDFS | Text | HdfsTextMulti | Delimited single or multi-line records with quoted linefeeds from plain text files on HDFS. |
| HDFS | Avro | Avro | Avro format binary files (\<filename\>.avro). |
......@@ -157,7 +158,7 @@ Greenplum Database passes the parameters in the `LOCATION` string as headers to
| Keyword | Value and Description |
|-------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \<path\-to\-data\> | A directory, file name, wildcard pattern, table name, etc. The syntax of \<path-to-data\> is dependent upon the profile currently in use. |
| PROFILE | The profile PXF uses to access the data. PXF supports HDFS and Hive connectors that expose profiles named `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `Hive`, `HiveText`, `HiveRC`, `HiveORC`, and `HiveVectorizedORC`. |
| PROFILE | The profile PXF uses to access the data. PXF supports HBase, HDFS, and Hive connectors that expose profiles named `HBase`, `HdfsTextSimple`, `HdfsTextMulti`, `Avro`, `Hive`, `HiveText`, `HiveRC`, `HiveORC`, and `HiveVectorizedORC`. |
| \<custom-option\>=\<value\> | Additional options and values supported by the profile.  |
| FORMAT \<value\>| PXF profiles support the '`TEXT`', '`CSV`', and '`CUSTOM`' `FORMAT`s. |
| \<formatting-properties\> | Formatting properties supported by the profile; for example, the `formatter` or `delimiter`.   |
......