About Greenplum R
The Greenplum R Client (GreenplumR) is an interactive in-database data
analytics tool for Greenplum Database. The client provides an R interface
to tables and views, and requires no SQL knowledge to operate on these
database objects.
You can use GreenplumR with the Greenplum PL/R procedural language to
run an R function on data stored in Greenplum Database. GreenplumR parses
the R function and creates a user-defined function (UDF) for execution in
Greenplum. Greenplum runs the UDF in parallel on the segment hosts.
You can similarly use GreenplumR with Greenplum PL/Container 3 (Beta),
to run an R function against Greenplum data in a high-performance R
sandbox runtime environment.
No analytic data is loaded into R when you use GreenplumR, a key
requirement when dealing with large data sets. Only the R function and
minimal data is transferred between R and Greenplum.
Using the Greenplum R Client
You use GreenplumR to perform in-database analytics. Typical operations
that you may perform include:
- Loading the GreenplumR package.
- Connecting to and disconnecting from Greenplum Database.
- Examining database objects.
- Analyzing and manipulating data.
- Running R functions in Greenplum Database.
Loading GreenplumR
Use the R library() function to load GreenplumR:
user@clientsys$ R
> library("GreenplumR")
Connecting to Greenplum Database
The db.connect() and db.connect.dsn()
GreenplumR functions establish a connection to Greenplum Database. The
db.disconnect() function closes a database connection.
The GreenplumR connect and disconnect function signatures follow:
db.connect( host = "localhost", user = Sys.getenv("USER"),
dbname = user, password = "", port = 5432, conn.pkg = "RPostgreSQL",
default.schemas = NULL, verbose = TRUE, quick = FALSE )
db.connect.dsn( dsn.key, db.ini = "~/db.ini", default.schemas = NULL,
verbose = TRUE, quick = FALSE )
db.disconnect( conn.id = 1, verbose = TRUE, force = FALSE )
When you connect to Greenplum Database, you provide the master host,
port, database name, user name, password, and other information via
function arguments or a data source name (DSN) file. If you do not
specify an argument or value, GreenplumR uses the default.
The db.connect[.dsn]() functions return an integer
connection identifier. You specify this identifier when you operate
on tables or views in the database. You also specify this identifier
when you close the connection.
The db.disconnect() function returns a logical that
identifies whether or not the connection was successfully disconnected.
To list and display information about active Greenplum connections,
use the db.list() function.
Example:
## connect to Greenplum database named testdb on host gpmaster
> cid_to_testdb <- db.connect( host = "gpmaster", port=5432, dbname = "testdb" )
Loading required package: DBI
Created a connection to database with ID 1
[1] 1
> db.list()
Database Connection Info
## -------------------------------
[Connection ID 1]
Host : gpmaster
User : gpadmin
Database : testdb
DBMS : Greenplum 6
> db.disconnect( cid_to_testdb )
Connection 1 is disconnected!
[1] TRUE
Examining Database Obects
The db.objects() function lists the tables and views
in the database identified by a specific connection identifier. The
function signature is:
db.objects( search = NULL, conn.id = 1 )
If you choose, you can specify a filter string to narrow the returned
results. For example, to list the tables and views in the
public schema in the database identified by the
default connection identifier, invoke the function as follows:
> db.objects( search = "public." )
Analyzing and Manipulating Data
The fundamental data structure of R is the data.frame.
A data frame is a collection of variables represented as a list of
vectors. GreenplumR operates on db.data.frame objects,
and exposes functions to convert to and manipulate objects of this type:
- as.db.data.frame() - writes data in a file or a
data.frame into a Greenplum table. You can also use
the function to write the results of a query into a table, or to
create a local db.data.frame.
- db.data.frame() - creates a temporary R object
that references a view or table in a Greenplum database. No data is
loaded into R when you use this function.
Example:
## create a db.data.frame from the abalone example data set;
## abalone is a data.frame
> abdf1 <- as.db.data.frame(abalone, conn.id = cid_to_testdb, verbose = FALSE)
## sort on the id column and preview the first 5 rows
> lk( sort( abdf1, INDICES=abdf1$id ), 5 )
id sex length diameter height whole shucked viscera shell rings
1 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
2 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
3 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
4 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
5 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
## write the data frame to a Greenplum table named abalone_from_r;
## use most of the function defaults
> as.db.data.frame( abdf1, table.name = "public.abalone_from_r" )
The data contained in table "pg_temp_93"."gp_temp_5bdf4ec7_42f9_9f9799_a0d76231be8f" which is wrapped by abdf1 is c
opied into "public"."abalone_from_r" in database testdb on gpmaster !
Table : "public"."abalone_from_r"
Database : testdb
Host : gpmaster
Connection : 1
## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r"
Running R Functions in Greenplum Database
GreenplumR supports two functions that allow you to run an R function,
in-database, on every row of a Greenplum Database table:
db.gpapply() and
db.gptapply().
You can use the Greenplum PL/R or PL/Container procedural language as
the vehicle in which to run the function.
The function signatures follow:
db.gpapply( X, MARGIN = NULL, FUN = NULL, output.name = NULL, output.signature = NULL,
clear.existing = FALSE, case.sensitive = FALSE, output.distributeOn = NULL,
debugger.mode = FALSE, runtime.id = "plc_r_shared", language = "plcontainer",
input.signature = NULL, ... )
db.gptapply( X, INDEX, FUN = NULL, output.name = NULL, output.signature = NULL,
clear.existing = FALSE, case.sensitive = FALSE,
output.distributeOn = NULL, debugger.mode = FALSE,
runtime.id = "plc_r_shared", language = "plcontainer",
input.signature = NULL, ... )
Use the second variant of the function when the table data is indexed.
By default, db.gp[t]apply() passes a single data frame
input argument to the R function FUN. If you define
FUN to take a list of arguments, you must specify the
function argument name to Greenplum table column name mapping in
input.signature. You must specify this mapping
in table column order.
Example 1:
Create a Greenplum table named table1 in the
database named testdb. This table has a single
integer-type field. Populate the table with some data:
user@clientsys$ psql -h gpmaster -d testdb
testdb=# CREATE TABLE table1( id int );
testdb=# INSERT INTO table1 SELECT generate_series(1,13);
testdb=# \q
Create an R function that increments an integer. Run the function on
table1 in Greenplum
using the PL/R procedural language. Then write the new values to a
table named table1_r_inc:
user@clientsys$ R
> ## create a reference to table1
> t1 <- db.data.frame("public.table1")
> ## create an R function that increment an integer by 1
> fn.function_plus_one <- function(num)
{
return (num[[1]] + 1)
}
> ## create the function output signature
> .sig <- list( "num" = "int" )
> ## run the function in Greenplum and print
> x <- db.gpapply( t1, output.name = NULL, FUN = fn.function_plus_one,
output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE, language = "plr" )
> print(x)
num
1 2
2 6
3 12
4 13
5 3
6 4
7 5
8 7
9 8
10 9
11 10
12 11
13 14
> ## run the function in Greenplum and write to the output table
> db.gpapply(t1, output.name = "public.table1_r_inc", FUN = fn.function_plus_one,
output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE,
language = "plr" )
## list database objects, should display the newly created table
> db.objects( search = "public.")
[1] "public.abalone_from_r" "public.table1_r_inc"
Example 2:
Create a Greenplum table named table2 in the
database named testdb. This table has two
integer-type fields. Populate the table with some data:
user@clientsys$ psql -h gpmaster -d testdb
testdb=# CREATE TABLE table2( c1 int, c2 int );
testdb=# INSERT INTO table2 VALUES (1, 2);
testdb=#\q
Create an R function that takes two integer arguments, manipulates
the arguments, and returns both. Run the function on the data
in table2 in Greenplum using the PL/R procedural
language, writing the new values to a table named table2_r_upd:
user@clientsys$ R
> ## create a reference to table2
> t2 <- db.data.frame("public.table2")
> ## create the R function
> fn.func_with_two_args <- function(a, b)
{
a <- a * 20
b <- a + 66
c <- list(a, b)
return (as.data.frame(c))
}
> ## create the function input signature, mapping function argument name to
> ## table column name
> input.sig <- list('a' = 'c1', 'b' = 'c2')
> ## create the function output signature
> return.sig <- list('a' = 'int', 'b' = 'int')
> ## run the function in Greenplum and write to the output table
> db.gpapply(t2, output.name = "public.table2_r_upd", FUN = fn.func_with_two_args,
output.signature = return.sig, clear.existing = TRUE, case.sensitive = TRUE,
language = "plr", input.signature = input.sig )
View the contents of the Greenplum table named table2_r_upd:
user@clientsys$ psql -h gpmaster -d testdb
testdb=# SELECT * FROM table2_r_upd;
a | b
----+----
20 | 86
(1 row)
testdb=# \q