diff --git a/gpdb-doc/dita/analytics/analytics.ditamap b/gpdb-doc/dita/analytics/analytics.ditamap
index 343ecd3d75c69f9d7615cd970bb3ffde23f4ce28..42470716d6ba9c2bbde97f457c20e8e0c59e5bea 100644
--- a/gpdb-doc/dita/analytics/analytics.ditamap
+++ b/gpdb-doc/dita/analytics/analytics.ditamap
@@ -11,6 +11,7 @@
Graph Extension is part of MADlib capabilities. This chapter includes the following information: Many modern business problems involve connections and relationships between
+ entities, and are not solely based on discrete data. Graphs are powerful at representing
+ complex interconnections, and graph data modeling is very effective and flexible when the
+ number and depth of relationships increase exponentially. The use cases for graph analytics are diverse: social networks, transportation
+ routes, autonomous vehicles, cyber security, criminal networks, fraud detection, health
+ research, epidemiology, and so forth. This chapter contains the following information: MADlib is an open-source library for scalable in-database analytics. With the Greenplum
- Database MADlib extension, you can use MADlib functionality in a Greenplum Database. MADlib provides data-parallel implementations of mathematical, statistical and
- machine-learning methods for structured and unstructured data. It provides an suite of
- SQL-based algorithms for machine learning, data mining and statistics that run at scale
- within a database engine, with no need for transferring data between Greenplum Database and
- other tools. MADlib requires the MADlib can be used with PivotalR, an R package that enables users to interact with data
- resident in Greenplum Database using the R client. See Graphs represent the interconnections between objects (vertices) and their relationships
+ (edges). Example objects could be people, locations, cities, computers, or components on a
+ circuit board. Example connections could be roads, circuits, cables, or interpersonal
+ relationships. Edges can have directions and weights, for example, the distance between
+ towns. Graphs can be small and easily traversed, as with a small group of friends, or extremely
+ large and complex, similar to contacts in a modern-day social network. Deep learning is a type of machine learning, originally inspired by biology of the brain,
- that uses a class of algorithms called artificial neural networks. Given the important use
- cases that can be effectively addressed with deep learning, it is starting to become a more
- important part of enterprise computing. Deep learning support for Keras and TensorFlow was added to Apache MADlib starting with the
- 1.16 release. Refer to the following documents for more information about deep learning on
- Greenplum using Apache MADlib:
-
+
-
-
To install MADlib on Greenplum Database, you first install a compatible Greenplum MADlib
- package and then install the MADlib function libraries on all databases that will use
- MADlib.
-The
Before you install the MADlib package, make sure that your Greenplum database is running,
- you have sourced
After installing the MADlib package, run the
For example, this command creates MADlib functions in the Greenplum database
-
After installing the functions, The Greenplum Database gpadmin superuser role should
- grant all privileges on the target schema (in the example
The madpack
You upgrade an installed MADlib package with the Greenplum Database
For information about the upgrade paths that MADlib supports, see the MADlib support and
- upgrade matrix in the
Efficient processing of very large graphs can be challenging. Greenplum offers a suitable
+ environment for this work for these key reasons:
+To upgrade MADlib, run the
After you upgrade the MADlib package from one major version to another, run
-
This example command upgrades the MADlib functions in the schema
When you remove MADlib support from a database, routines that you created in the database
- that use MADlib functionality will no longer work.
+Installing Graph
+ Modules
+To use the MADlib graph modules, install the version of MADlib corresponding to your
+ Greenplum Database version. To download the software, access the VMware Tanzu Network. For
+ Greenplum 6.x, see
Graph modules on MADlib support many algorithms.
+Creating a Graph in
+ Greenplum
+To represent a graph in Greenplum, create tables that represent the vertices, edges, and
+ their properties.
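As a rough illustration of the vertex-and-edge table layout, the following sketch builds two small tables and joins them. It uses Python's built-in `sqlite3` module as a stand-in for Greenplum so that it runs anywhere. The table names `vertex` and `edge`, the column names, and the sample rows are all hypothetical and are not taken from the Greenplum documentation.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for Greenplum (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per vertex, one row per directed, weighted edge.
conn.executescript("""
    CREATE TABLE vertex (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edge (src INTEGER, dest INTEGER, weight REAL);
""")

conn.executemany("INSERT INTO vertex VALUES (?, ?)",
                 [(0, "A"), (1, "B"), (2, "C")])
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                 [(0, 1, 1.0), (1, 2, 2.5)])

# Join edges to their source vertex to list each connection with its weight.
rows = conn.execute(
    "SELECT v.name, e.dest, e.weight "
    "FROM edge e JOIN vertex v ON v.id = e.src "
    "ORDER BY e.src"
).fetchall()
print(rows)
```

Keeping vertex properties and edge properties in separate tables means either side can gain columns without changing the other.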
+Using SQL, create the relevant tables in the database you want to use. This example uses
+
Create a table for vertices, called
Insert values related to your specific use case. For example:
+Now select the
Use the
If no databases use the MADlib functions, use the Greenplum
You can run the
After you uninstall the package, restart the database.
-Following are examples using the Greenplum MADlib extension:
-See the MADlib documentation for additional examples.
-This example runs a linear regression on the table
The following statements create the
The MADlib
The
Running this query against the
The model saved in the
This section lists the graph functions supported in MADlib. They include:
The all pairs shortest paths (APSP) algorithm finds the length (summed weights) of the
+ shortest paths between all pairs of vertices, such that the sum of the weights of the path
+ edges is minimized.
+The function is:
+For details on the parameters, with examples, see the
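To make the idea of all pairs shortest paths concrete, here is a minimal plain-Python sketch using the Floyd-Warshall method on a tiny directed graph. It illustrates the concept only and is not the MADlib function or its implementation. The function name, the edge-list format, and the sample data are assumptions, and the sketch assumes there are no negative-weight cycles.

```python
import math

def all_pairs_shortest_paths(vertices, edges):
    """Floyd-Warshall over a directed edge list of (src, dest, weight) tuples."""
    # Start with 0 for self-pairs and infinity for every other pair.
    dist = {u: {v: (0.0 if u == v else math.inf) for v in vertices}
            for u in vertices}
    for src, dest, weight in edges:
        dist[src][dest] = min(dist[src][dest], weight)
    # Allow each vertex k in turn to act as an intermediate stop.
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

vertices = [0, 1, 2, 3]
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)]
apsp = all_pairs_shortest_paths(vertices, edges)
print(apsp[0][2], apsp[0][3], apsp[3][0])
```

Unreachable pairs keep an infinite distance, which is why vertex 3 cannot reach vertex 0 in this sample.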
This example demonstrates the association rules data mining technique on a transactional
- data set. Association rule mining is a technique for discovering relationships between
- variables in a large data set. This example considers items in a store that are commonly
- purchased together. In addition to market basket analysis, association rules are also used
- in bioinformatics, web analytics, and other fields.
-The example analyzes purchase information for seven transactions that are stored in a
- table with the MADlib function
These commands create the table.
-This
The MADlib function
This
This is the output of the
To view the association rules, you can run this
This is the output. The
Based on the data, beer and diapers are often purchased together. To increase sales, you
- might consider placing beer and diapers closer together on the shelves.
+Given a graph and a source vertex, the breadth-first search (BFS) algorithm finds all
+ nodes reachable from the source vertex by searching / traversing the graph in a
+ breadth-first manner.
+The function is:
+For details on the parameters, with examples, see the
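The breadth-first traversal described above can be sketched in a few lines of plain Python. This is a conceptual illustration, not the MADlib function. The function name, the `directed` flag, and the sample edges are hypothetical.

```python
from collections import deque

def bfs_distances(edges, source, directed=True):
    """Return {vertex: hop count from source} for every reachable vertex."""
    adjacency = {}
    for src, dest in edges:
        adjacency.setdefault(src, []).append(dest)
        if not directed:
            adjacency.setdefault(dest, []).append(src)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, []):
            if neighbor not in distance:  # first visit is the shortest hop count
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance

edges = [(1, 2), (1, 3), (2, 4), (3, 4), (5, 6)]
reachable = bfs_distances(edges, source=1)
print(reachable)
```

Vertices 5 and 6 do not appear in the result because no path leads to them from the source.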
Naive Bayes analysis predicts the likelihood of an outcome of a class variable, or
- category, based on one or more independent variables, or attributes. The class variable is
- a non-numeric categorial variable, a variable that can have one of a limited number of
- values or categories. The class variable is represented with integers, each integer
- representing a category. For example, if the category can be one of "true", "false", or
- "unknown," the values can be represented with the integers 1, 2, or 3.
-The attributes can be of numeric types and non-numeric, categorical, types. The training
- function has two signatures – one for the case where all attributes are numeric and
- another for mixed numeric and categorical types. Additional arguments for the latter
- identify the attributes that should be handled as numeric values. The attributes are
- submitted to the training function in an array.
-The MADlib Naive Bayes training functions produce a features probabilities table and a
- class priors table, which can be used with the prediction function to provide the
- probability of a class for the set of attributes.
Given a graph, the PageRank algorithm outputs a probability distribution representing a
+ person’s likelihood to arrive at any particular vertex while randomly traversing the
+ graph.
+MADlib graph also includes a personalized PageRank, where a notion of importance provides
+ personalization to a query. For example, importance scores can be biased according to a
+ specified set of graph vertices that are of interest or special in some way.
+The function is:
+For details on the parameters, with examples, see the
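The following plain-Python sketch shows one common way to compute PageRank by power iteration, with an optional restart set to illustrate the personalized variant. It is not MADlib's implementation. The function name, the `personalization` parameter, the damping value of 0.85, the fixed iteration count, and the sample graph are all assumptions chosen for illustration.

```python
def pagerank(vertices, edges, damping=0.85, iterations=100, personalization=None):
    """Power-iteration PageRank over a directed edge list of (src, dest) pairs."""
    # Random restarts land uniformly on the personalization set, or on every
    # vertex when no set is given (the non-personalized case).
    restart_set = list(personalization) if personalization else list(vertices)
    restart = {v: (1.0 / len(restart_set) if v in restart_set else 0.0)
               for v in vertices}

    out_links = {v: [] for v in vertices}
    for src, dest in edges:
        out_links[src].append(dest)

    rank = dict(restart)
    for _ in range(iterations):
        # Rank held by vertices with no out-edges is redistributed via restart.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {v: (1.0 - damping + damping * dangling) * restart[v]
                    for v in vertices}
        for src in vertices:
            for dest in out_links[src]:
                new_rank[dest] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 0), (3, 0)]
ranks = pagerank(vertices, edges)
personalized = pagerank(vertices, edges, personalization=[3])
print(ranks)
print(personalized)
```

In the sample graph, vertex 3 has no incoming edges, so its score comes entirely from random restarts. Biasing the restarts toward vertex 3 raises its score.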
Given a graph and a source vertex, the single source shortest path (SSSP) algorithm finds
+ a path from the source vertex to every other vertex in the graph, such that the sum of the
+ weights of the path edges is minimized.
+The function is:
+For details on the parameters, with examples, see the
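One standard way to solve the single source shortest path problem is Bellman-Ford relaxation, sketched below in plain Python. This is an illustration of the technique, not a description of how MADlib computes its result. The function name and sample graph are hypothetical, and the sketch assumes the graph has no negative-weight cycles.

```python
import math

def single_source_shortest_paths(vertices, edges, source):
    """Bellman-Ford over a directed edge list of (src, dest, weight) tuples."""
    distance = {v: math.inf for v in vertices}
    parent = {v: None for v in vertices}
    distance[source] = 0.0

    # At most len(vertices) - 1 rounds are needed when no negative cycle exists.
    for _ in range(len(vertices) - 1):
        changed = False
        for src, dest, weight in edges:
            if distance[src] + weight < distance[dest]:
                distance[dest] = distance[src] + weight
                parent[dest] = src  # remember the previous hop on the best path
                changed = True
        if not changed:
            break
    return distance, parent

vertices = [0, 1, 2, 3, 4]
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
distance, parent = single_source_shortest_paths(vertices, edges, source=0)
print(distance)
print(parent)
```

Following the `parent` entries backward from any vertex reconstructs its shortest path to the source. Vertex 4 is isolated in the sample, so its distance stays infinite.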
Given a directed graph, a weakly connected component (WCC) is a subgraph of the
+ original graph where all vertices are connected to each other by some path, ignoring the
+ direction of edges.
+The function is:
+For details on the parameters, with examples, see the
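Weakly connected components can be found with a union-find structure that treats every edge as undirected, as in this plain-Python sketch. It is a conceptual illustration rather than the MADlib function. The function name, the choice to label each component by its smallest vertex id, and the sample data are assumptions.

```python
def weakly_connected_components(vertices, edges):
    """Map each vertex to a component label, ignoring edge direction."""
    parent = {v: v for v in vertices}

    def find(v):
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dest in edges:
        root_a, root_b = find(src), find(dest)
        if root_a != root_b:
            # Keep the smaller id as the root so it labels the component.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {v: find(v) for v in vertices}

vertices = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (3, 2), (4, 5)]
components = weakly_connected_components(vertices, edges)
print(components)
```

The edges 1→2 and 3→2 place vertices 1, 2, and 3 in the same component even though no directed path links 1 and 3.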
These algorithms relate to metrics computed on a graph and include:
This function computes the average shortest path length between pairs of vertices.
+ Average path length is based on "reachable target vertices", so it averages the path
+ lengths in each connected component and ignores infinite-length paths between unconnected
+ vertices. If the user requires the average path length of a particular component, the
+ weakly connected components function may be used to isolate the relevant vertices.
+The function is:
+This function uses the output of a previous APSP (All Pairs Shortest Path) run. For
+ details on the parameters, with examples, see the
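The averaging rule described above can be sketched in plain Python over rows shaped like an all pairs shortest path result. The `(src, dest, distance)` row format, the function name, and the exclusion of self-pairs are assumptions made for this illustration. This is not the MADlib function.

```python
import math

def average_path_length(apsp_rows):
    """Average the finite shortest-path lengths between distinct vertices."""
    finite = [dist for src, dest, dist in apsp_rows
              if src != dest and math.isfinite(dist)]
    # No reachable pairs means there is nothing to average.
    return sum(finite) / len(finite) if finite else None

# Hypothetical APSP rows for a three-vertex directed graph 0 -> 1 -> 2.
apsp_rows = [
    (0, 0, 0.0), (0, 1, 1.0), (0, 2, 3.0),
    (1, 0, math.inf), (1, 1, 0.0), (1, 2, 2.0),
    (2, 0, math.inf), (2, 1, math.inf), (2, 2, 0.0),
]
average = average_path_length(apsp_rows)
print(average)
```

Only the three reachable pairs contribute to the average. The infinite entries are skipped rather than treated as zero.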
The closeness centrality algorithm helps quantify how much information passes through a
+ given vertex. The function returns various closeness centrality measures and the k-degree
+ for a given subset of vertices.
+The function is:
+This function uses the output of a previous APSP (All Pairs Shortest Path) run. For
+ details on the parameters, with examples, see the
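As a hedged illustration of closeness measures, the sketch below derives several commonly used quantities for one vertex from rows shaped like an all pairs shortest path result. The row format, the function name, the dictionary keys, and the reading of k-degree as the number of reachable vertices are assumptions for this example and may differ from what MADlib returns. The sketch also assumes strictly positive distances between distinct vertices.

```python
import math

def closeness_measures(apsp_rows, vertex):
    """Closeness-style measures for one vertex from (src, dest, distance) rows."""
    reachable = [dist for src, dest, dist in apsp_rows
                 if src == vertex and dest != vertex and math.isfinite(dist)]
    k_degree = len(reachable)  # number of vertices reachable from `vertex`
    if k_degree == 0:
        return {"k_degree": 0, "inverse_sum_dist": None,
                "inverse_avg_dist": None, "sum_inverse_dist": 0.0}
    total = sum(reachable)
    return {
        "k_degree": k_degree,
        "inverse_sum_dist": 1.0 / total,        # 1 / (sum of distances)
        "inverse_avg_dist": k_degree / total,   # 1 / (average distance)
        "sum_inverse_dist": sum(1.0 / d for d in reachable),
    }

# Hypothetical APSP rows for a three-vertex directed graph 0 -> 1 -> 2.
apsp_rows = [
    (0, 0, 0.0), (0, 1, 1.0), (0, 2, 3.0),
    (1, 0, math.inf), (1, 1, 0.0), (1, 2, 2.0),
    (2, 0, math.inf), (2, 1, math.inf), (2, 2, 0.0),
]
measures = closeness_measures(apsp_rows, vertex=0)
print(measures)
```

Higher values indicate that the vertex sits closer, on average, to the vertices it can reach.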
Graph diameter is defined as the longest of all shortest paths in a graph. The
+ function is:
+This function uses the output of a previous APSP (All Pairs Shortest Path) run. For
+ details on the parameters, with examples, see the
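Given rows shaped like an all pairs shortest path result, the diameter is the largest finite distance, as this plain-Python sketch shows. The `(src, dest, distance)` row format, the function name, and the choice to also return the vertex pairs that achieve the diameter are assumptions for illustration. This is not the MADlib function.

```python
import math

def graph_diameter(apsp_rows):
    """Return (diameter, [(src, dest), ...]) using only finite distances."""
    finite = [(dist, src, dest) for src, dest, dist in apsp_rows
              if math.isfinite(dist)]
    if not finite:
        return None
    longest = max(dist for dist, _, _ in finite)
    # Several vertex pairs may share the same longest shortest path length.
    return longest, [(src, dest) for dist, src, dest in finite if dist == longest]

# Hypothetical APSP rows for a three-vertex directed graph 0 -> 1 -> 2.
apsp_rows = [
    (0, 0, 0.0), (0, 1, 1.0), (0, 2, 3.0),
    (1, 0, math.inf), (1, 1, 0.0), (1, 2, 2.0),
    (2, 0, math.inf), (2, 1, math.inf), (2, 2, 0.0),
]
diameter = graph_diameter(apsp_rows)
print(diameter)
```

Unreachable pairs are ignored, so a disconnected graph still yields a finite diameter in this sketch.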
This function computes the degree of each node. The node degree is the number
+ of edges adjacent to that node. The node in-degree is the number of edges pointing into
+ the node, and the node out-degree is the number of edges pointing out of the node.
+The function is:
+For details on the parameters, with examples, see the
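In-degree and out-degree can be counted directly from an edge list, as in this short plain-Python sketch. The function name, the output keys, and the sample edges are hypothetical. It illustrates the definitions above rather than the MADlib function.

```python
from collections import Counter

def vertex_degrees(vertices, edges):
    """Count in-degree, out-degree, and total degree from (src, dest) pairs."""
    in_degree = Counter(dest for _, dest in edges)
    out_degree = Counter(src for src, _ in edges)
    # Counter returns 0 for vertices that never appear on that side of an edge.
    return {v: {"in": in_degree[v],
                "out": out_degree[v],
                "total": in_degree[v] + out_degree[v]}
            for v in vertices}

vertices = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
degrees = vertex_degrees(vertices, edges)
print(degrees)
```

Passing the vertex list separately ensures that isolated vertices still appear in the result with a degree of zero.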
Naive Bayes Example 1 - Simple All-numeric Attributes
-In the first example, the
Actual
- data in production scenarios is more extensive than this example data and yields better
- results. Accuracy of classification improves significantly with larger training data
- sets.
Naive Bayes Example 2 – Weather and Outdoor Sports
-This example calculates the probability that the user will play an outdoor sport, such as
- golf or tennis, based on weather conditions.
-The table
The identification column for the table is
The
There are four attributes: outlook, temperature, humidity, and wind. These are categorical
- variables. The MADlib
The following table shows the training data, before encoding the variables.
-
-
The
- result is four rows, one for each record in the
MADlib web site is at
MADlib Apache web site and MADlib release notes are at
MADlib documentation is at
PivotalR is a first class R package that enables users to interact with data resident in
- Greenplum Database and MADLib using an R client.
-The R language is an open-source language that is used for statistical computing.
- PivotalR is an R package that enables users to interact with data resident in Greenplum
- Database using the R client. Using PivotalR requires that MADlib is installed on the
- Greenplum Database.
-PivotalR allows R users to leverage the scalability and performance of in-database
- analytics without leaving the R command line. The computational work is executed
- in-database, while the end user benefits from the familiar R interface. Compared with
- respective native R functions, there is an increase in scalability and a decrease in
- running time. Furthermore, data movement, which can take hours for very large data sets,
- is eliminated with PivotalR.
-Key features of the PivotalR package:
For information about PivotalR, including supported MADlib functionality, see
The R package for PivotalR can be found at