+ About Greenplum R
+
+ The Greenplum R Client (GreenplumR) is an interactive in-database data
+ analytics tool for Greenplum Database. The client provides an R interface
+ to tables and views, and requires no SQL knowledge to operate on these
+ database objects.
+ You can use GreenplumR with the Greenplum PL/R procedural language to
+ run an R function on data stored in Greenplum Database. GreenplumR parses
+ the R function and creates a user-defined function (UDF) for execution in
+ Greenplum. Greenplum runs the UDF in parallel on the segment hosts.
+ You can similarly use GreenplumR with Greenplum PL/Container 3 (Beta)
+ to run an R function against Greenplum data in a high-performance R
+ sandbox runtime environment.
+ No analytic data is loaded into R when you use GreenplumR, a key
+ requirement when dealing with large data sets. Only the R function and
+ minimal data are transferred between R and Greenplum.
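+ To make the idea of turning an R function into a UDF more concrete, the
+ following sketch deparses an R function to text and embeds it in a PL/R
+ CREATE FUNCTION statement. It is an illustration only:
+ make_plr_udf_sql() and plus_one_udf are hypothetical names, and the SQL
+ that GreenplumR actually generates is not shown in this topic and is
+ likely to differ.
+
```r
# Hypothetical helper (not part of GreenplumR): build the text of a PL/R
# CREATE FUNCTION statement that wraps an R function. Assumes a single
# argument and a scalar return type, purely for illustration.
make_plr_udf_sql <- function(udf_name, fun, arg_name, arg_type, return_type) {
  # deparse() turns the function object back into R source lines
  fun_src <- paste(deparse(fun), collapse = "\n")
  paste0(
    "CREATE OR REPLACE FUNCTION ", udf_name,
    "(", arg_name, " ", arg_type, ") RETURNS ", return_type, " AS $$\n",
    "f <- ", fun_src, "\n",
    "f(", arg_name, ")\n",
    "$$ LANGUAGE plr;"
  )
}

plus_one <- function(num) {
  num[[1]] + 1
}

# Only the function text ends up in the statement; no table data is involved
udf_sql <- make_plr_udf_sql("plus_one_udf", plus_one, "num", "int", "int")
cat(udf_sql, "\n")
```
+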
+
+
+
+ Using the Greenplum R Client
+
+ You use GreenplumR to perform in-database analytics. Typical operations
+ that you may perform include:
+
+ - Loading the GreenplumR package.
+ - Connecting to and disconnecting from Greenplum Database.
+ - Examining database objects.
+ - Analyzing and manipulating data.
+ - Running R functions in Greenplum Database.
+
+
+ Loading GreenplumR
+ Use the R library() function to load GreenplumR:
+ user@clientsys$ R
+> library("GreenplumR")
+
+
+ Connecting to Greenplum Database
+ The db.connect() and db.connect.dsn()
+ GreenplumR functions establish a connection to Greenplum Database. The
+ db.disconnect() function closes a database connection.
+ The GreenplumR connect and disconnect function signatures follow:
+ db.connect( host = "localhost", user = Sys.getenv("USER"),
+ dbname = user, password = "", port = 5432, conn.pkg = "RPostgreSQL",
+ default.schemas = NULL, verbose = TRUE, quick = FALSE )
+
+db.connect.dsn( dsn.key, db.ini = "~/db.ini", default.schemas = NULL,
+ verbose = TRUE, quick = FALSE )
+
+db.disconnect( conn.id = 1, verbose = TRUE, force = FALSE )
+
+ When you connect to Greenplum Database, you provide the master host,
+ port, database name, user name, password, and other information via
+ function arguments or a data source name (DSN) file. If you do not
+ specify an argument or value, GreenplumR uses the default.
+ The db.connect() and db.connect.dsn() functions return an integer
+ connection identifier. You specify this identifier when you operate
+ on tables or views in the database, and again when you close the
+ connection.
+ The db.disconnect() function returns a logical value that indicates
+ whether the connection was successfully closed.
+ To list and display information about active Greenplum connections,
+ use the db.list() function.
+ Example:
+ ## connect to Greenplum database named testdb on host gpmaster
+> cid_to_testdb <- db.connect( host = "gpmaster", port=5432, dbname = "testdb" )
+Loading required package: DBI
+Created a connection to database with ID 1
+[1] 1
+
+> db.list()
+Database Connection Info
+## -------------------------------
+[Connection ID 1]
+Host : gpmaster
+User : gpadmin
+Database : testdb
+DBMS : Greenplum 6
+
+> db.disconnect( cid_to_testdb )
+Connection 1 is disconnected!
+[1] TRUE
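+
+ Because every connection that you open should eventually be closed, you
+ may find it convenient to wrap the connect, work, disconnect sequence in
+ a helper. The following is a minimal sketch of that pattern, not a
+ GreenplumR function: with_connection(), fake_connect(), and
+ fake_disconnect() are hypothetical names, and the two fake functions
+ stand in for db.connect() and db.disconnect() so that the block runs
+ without a database.
+
```r
# Hypothetical helper (not part of GreenplumR): open a connection, run
# `work` with the connection identifier, and always close the connection.
# `connect` and `disconnect` are passed in so the pattern can be tried
# without a database.
with_connection <- function(connect, disconnect, work) {
  conn_id <- connect()
  # on.exit() runs even if `work` raises an error
  on.exit(disconnect(conn_id), add = TRUE)
  work(conn_id)
}

# Stand-ins for db.connect() and db.disconnect(). They only record
# which connection identifiers are currently open.
open_ids <- integer(0)
fake_connect <- function() {
  open_ids <<- c(open_ids, 1L)
  1L
}
fake_disconnect <- function(conn_id) {
  open_ids <<- setdiff(open_ids, conn_id)
  TRUE
}

result <- with_connection(
  fake_connect, fake_disconnect,
  function(conn_id) paste("working on connection", conn_id)
)
print(result)
print(length(open_ids))  # number of connections still open
```
+
+ With GreenplumR loaded, you could pass
+ function() db.connect( host = "gpmaster", dbname = "testdb" ) as the
+ connect argument and function(conn_id) db.disconnect( conn_id ) as the
+ disconnect argument.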
+
+
+ Examining Database Objects
+ The db.objects() function lists the tables and views
+ in the database identified by a specific connection identifier. The
+ function signature is:
+ db.objects( search = NULL, conn.id = 1 )
+ If you choose, you can specify a filter string to narrow the returned
+ results. For example, to list the tables and views in the
+ public schema in the database identified by the
+ default connection identifier, invoke the function as follows:
+ > db.objects( search = "public." )
+
+
+ Analyzing and Manipulating Data
+ The fundamental data structure of R is the data.frame.
+ A data frame is a collection of variables represented as a list of
+ vectors. GreenplumR operates on db.data.frame objects,
+ and exposes functions to convert to and manipulate objects of this type:
+
+ - as.db.data.frame() - writes data in a file or a
+ data.frame into a Greenplum table. You can also use
+ the function to write the results of a query into a table, or to
+ create a local db.data.frame.
+ - db.data.frame() - creates a temporary R object
+ that references a view or table in a Greenplum database. No data is
+ loaded into R when you use this function.
+
+ Example:
+ ## create a db.data.frame from the abalone example data set;
+## abalone is a data.frame
+> abdf1 <- as.db.data.frame(abalone, conn.id = cid_to_testdb, verbose = FALSE)
+
+## sort on the id column and preview the first 5 rows
+> lk( sort( abdf1, INDICES=abdf1$id ), 5 )
+ id sex length diameter height whole shucked viscera shell rings
+1 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
+2 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
+3 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
+4 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
+5 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
+
+## write the data frame to a Greenplum table named abalone_from_r;
+## use most of the function defaults
+> as.db.data.frame( abdf1, table.name = "public.abalone_from_r" )
+The data contained in table "pg_temp_93"."gp_temp_5bdf4ec7_42f9_9f9799_a0d76231be8f" which is wrapped by abdf1 is copied into "public"."abalone_from_r" in database testdb on gpmaster !
+Table : "public"."abalone_from_r"
+Database : testdb
+Host : gpmaster
+Connection : 1
+
+## list database objects, should display the newly created table
+> db.objects( search = "public.")
+[1] "public.abalone_from_r"
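+
+ The statement above that a data frame is a list of vectors can be
+ checked in plain R, without a database. The following sketch builds a
+ small local data.frame from two columns of the abalone preview; the
+ variable name df is introduced here for illustration only.
+
```r
# A small local data.frame with two of the abalone columns
# (values copied from the first three rows of the preview above)
df <- data.frame(id = 1:3, length = c(0.455, 0.350, 0.530))

print(is.list(df))           # a data.frame is a list ...
print(names(df))             # ... with one named element per variable ...
print(is.vector(df$length))  # ... and each element is a vector
print(nrow(df))              # all vectors share the same length
```
+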
+
+
+ Running R Functions in Greenplum Database
+ GreenplumR supports two functions that allow you to run an R function,
+ in-database, on every row of a Greenplum Database table:
+ db.gpapply() and
+ db.gptapply().
+ You can use the Greenplum PL/R or PL/Container procedural language as
+ the vehicle in which to run the function.
+ The function signatures follow:
+ db.gpapply( X, MARGIN = NULL, FUN = NULL, output.name = NULL, output.signature = NULL,
+ clear.existing = FALSE, case.sensitive = FALSE, output.distributeOn = NULL,
+ debugger.mode = FALSE, runtime.id = "plc_r_shared", language = "plcontainer", ... )
+
+db.gptapply( X, INDEX, FUN = NULL, output.name = NULL, output.signature = NULL,
+ clear.existing = FALSE, case.sensitive = FALSE,
+ output.distributeOn = NULL, debugger.mode = FALSE,
+ runtime.id = "plc_r_shared", language = "plcontainer", ... )
+ Use the second variant, db.gptapply(), when the table data is indexed;
+ you identify the index with the INDEX argument.
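+ As a rough local analogy for applying a function to indexed data, the
+ following base R sketch applies a function to each group of rows. It
+ uses tapply() on an ordinary data.frame, not db.gptapply(). The
+ assumption that INDEX in db.gptapply() plays a role similar to the INDEX
+ argument of tapply() is inferred from the function name and signature,
+ and the grp and value column names are made up for illustration.
+
```r
# Local data only; `grp` and `value` are hypothetical column names
local_df <- data.frame(
  grp   = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 10, 20, 30)
)

# Apply sum() to the `value` entries of each `grp` group
group_sums <- tapply(local_df$value, INDEX = local_df$grp, FUN = sum)
print(group_sums)
```
+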
+ Example:
+ Create a Greenplum table named table1 in the
+ database named testdb. This table has a single
+ integer-type field. Populate the table with some data:
+ user@clientsys$ psql -h gpmaster -d testdb
+testdb=# CREATE TABLE table1( id int );
+testdb=# INSERT INTO table1 SELECT generate_series(1,13);
+testdb=# \q
+ Create an R function that increments an integer. Run the function on
+ the table1 id column in Greenplum
+ using the PL/R procedural language. Then write the new values to a
+ table named table1_r_inc:
+ user@clientsys$ R
+> ## create a reference to table1
+> t1 <- db.data.frame("public.table1")
+
+> ## create an R function that increments an integer by 1
+> fn.function_plus_one <- function(num)
+{
+ return (num[[1]] + 1)
+}
+
+> ## create the output signature
+> .sig <- list( "num" = "int" )
+
+> ## run the function in Greenplum and print
+> x <- db.gpapply( t1, output.name = NULL, FUN = fn.function_plus_one,
+ output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE, language = "plr" )
+> print(x)
+ num
+1 2
+2 6
+3 12
+4 13
+5 3
+6 4
+7 5
+8 7
+9 8
+10 9
+11 10
+12 11
+13 14
+
+> ## run the function in Greenplum and write to the output table
+> db.gpapply(t1, output.name = "public.table1_r_inc", FUN = fn.function_plus_one,
+ output.signature = .sig, clear.existing = TRUE, case.sensitive = TRUE,
+ language = "plr" )
+
+## list database objects, should display the newly created table
+> db.objects( search = "public.")
+[1] "public.abalone_from_r" "public.table1_r_inc"
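+
+ Before you run a function in the database, it can help to check its
+ logic locally. The following sketch runs fn.function_plus_one on an
+ ordinary data.frame that mimics table1; local_t1, local_values, and
+ in_db_values are names introduced here for illustration. It assumes that
+ the function receives the table columns as a list-like object, which is
+ what the num[[1]] indexing in the example suggests. The comparison also
+ shows that the values printed by print(x) above match the local values,
+ only in a different row order.
+
```r
fn.function_plus_one <- function(num) {
  return(num[[1]] + 1)
}

# Local stand-in for table1: ids 1 through 13
local_t1 <- data.frame(id = 1:13)
local_values <- fn.function_plus_one(local_t1)

# Values copied from the print(x) output above, in the order shown there
in_db_values <- c(2, 6, 12, 13, 3, 4, 5, 7, 8, 9, 10, 11, 14)

# Compare the two sets of values after sorting both
print(all(sort(in_db_values) == sort(local_values)))
```
+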
+
+
+
+
+