From ea9fea0af3baf03735f5a06abda1474528b58b34 Mon Sep 17 00:00:00 2001
From: Lisa Owen
Date: Thu, 16 May 2019 14:17:45 -0700
Subject: [PATCH] docs - add module landing page for gp_sparse_vector (#7573)

* docs - add module landing page for gp_sparse_vector

* edits requested by david
---
 .../ref_guide/modules/gp_sparse_vector.xml | 246 ++++++++++++++++++
 gpdb-doc/dita/ref_guide/modules/intro.xml  |  11 +-
 .../dita/ref_guide/modules/modules.ditamap |   2 +-
 3 files changed, 252 insertions(+), 7 deletions(-)
 create mode 100644 gpdb-doc/dita/ref_guide/modules/gp_sparse_vector.xml

diff --git a/gpdb-doc/dita/ref_guide/modules/gp_sparse_vector.xml b/gpdb-doc/dita/ref_guide/modules/gp_sparse_vector.xml
new file mode 100644
index 0000000000..579874c357
--- /dev/null
+++ b/gpdb-doc/dita/ref_guide/modules/gp_sparse_vector.xml
@@ -0,0 +1,246 @@

gp_sparse_vector

The gp_sparse_vector module implements a Greenplum Database data type and associated functions that use compressed storage of zeros to make vector computations on floating point numbers faster.


The gp_sparse_vector module is a Greenplum Database extension.

Installing and Registering the Module

The gp_sparse_vector module is installed when you install Greenplum Database. Before you can use any of the functions defined in the module, you must register the gp_sparse_vector extension in each database where you want to use the functions. Refer to Installing Additional Supplied Modules for more information.
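For example, to register the extension, connect to the target database and run the following command (a minimal sketch; this assumes the standard CREATE EXTENSION mechanism that Greenplum Database uses for supplied modules):

CREATE EXTENSION gp_sparse_vector;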

Using the gp_sparse_vector Module

When you use arrays of floating point numbers for various calculations, you will often have long runs of zeros. This is common in scientific, retail optimization, and text processing applications. Each floating point number takes 8 bytes of storage in memory and/or disk, so explicitly storing all of those zeros is often impractical. There are also many computations that benefit from skipping over the zeros.


For example, suppose the following array of doubles is stored as a float8[] in Greenplum Database:

'{0, 33, <40,000 zeros>, 12, 22 }'::float8[]

This type of array arises often in text processing, where a dictionary may have 40-100K terms and the number of words in a particular document is stored in a vector. This array would occupy slightly more than 320KB of memory/disk, most of it zeros. Any operation that you perform on this data works on 40,001 fields that are not important.
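You can observe the savings directly by comparing the stored size of the dense array with its svec equivalent. The following sketch assumes the array_fill() function (available in PostgreSQL 8.4 and later) and the float8[]-to-svec cast shown later in this topic; the exact byte counts depend on the svec implementation:

SELECT pg_column_size(arr) AS dense_bytes,        -- roughly 320KB of mostly zeros
       pg_column_size(arr::svec) AS sparse_bytes  -- far smaller after RLE compression of the zeros
FROM (SELECT '{0,33}'::float8[]
             || array_fill(0::float8, ARRAY[40000])
             || '{12,22}'::float8[] AS arr) t;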


The Greenplum Database built-in array datatype utilizes a bitmap for null values, but it is a poor choice for this use case because it is not optimized for float8[] or for long runs of zeros instead of nulls, and the bitmap is not run-length-encoding (RLE) compressed. Even if each zero were stored as a NULL in the array, the bitmap for nulls would use 5KB to mark the nulls, which is not nearly as efficient as it could be.


The Greenplum Database gp_sparse_vector module defines a data type and a simple RLE-based scheme that is biased toward being efficient for zero value bitmaps. This scheme uses only 6 bytes for bitmap storage.

The sparse vector data type defined by the gp_sparse_vector module is named svec. svec supports only float8 vector values.

You can construct an svec directly from a float array as follows:

SELECT ('{0, 13, 37, 53, 0, 71 }'::float8[])::svec;
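The text form of an svec pairs run lengths with values, which is where the compression comes from. A small sketch follows; the {counts}:{values} display format is assumed from the MADlib sparse vector documentation, and the exact output may vary by version:

SELECT ('{0,0,0,0,13,37,0,0,53}'::float8[])::svec;
-- {4,1,1,2,1}:{0,13,37,0,53}  (4 zeros, one 13, one 37, 2 zeros, one 53)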

The gp_sparse_vector module supports the vector operators <, >, *, **, /, =, +, sum(), vec_count_nonzero(), and so on. These operators take advantage of the efficient sparse storage format, making computations on svecs faster.
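For instance, the * and / operators work element-by-element when both operands are svecs. A small sketch; the results in the comments are what the element-wise definitions imply:

SELECT ('{1,2,3}'::float8[]::svec * '{4,5,6}'::float8[]::svec)::float8[];   -- {4,10,18}
SELECT ('{4,10,18}'::float8[]::svec / '{4,5,6}'::float8[]::svec)::float8[]; -- {1,2,3}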


The plus (+) operator adds each of the terms of two vectors of the same dimension together. For example, if vector a = {0,1,5} and vector b = {4,3,2}, you would compute the vector addition as follows:

SELECT ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];

 float8
---------
 {4,4,7}

A vector dot product (%*%) between vectors a and b returns a scalar result of type float8. Compute the dot product ((0*4+1*3+5*2)=13) as follows:

SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;

 ?column?
----------
       13

Special vector aggregate functions are also useful. sum() is self-explanatory. vec_count_nonzero() counts the non-zero terms found at each position in a set of svecs and returns an svec with the counts. For instance, for the set of vectors {0,1,5},{10,0,3},{0,0,3},{0,1,0}, the count of non-zero terms would be {1,2,3}. Use vec_count_nonzero() to compute the count of these vectors:

CREATE TABLE listvecs( a svec );

INSERT INTO listvecs VALUES ('{0,1,5}'::float8[]),
    ('{10,0,3}'::float8[]),
    ('{0,0,3}'::float8[]),
    ('{0,1,0}'::float8[]);

SELECT vec_count_nonzero( a )::float8[] FROM listvecs;

 count_vec
-----------
 {1,2,3}
(1 row)
Additional Module Documentation

Refer to the gp_sparse_vector READMEs in the Greenplum Database github repository for additional information about this module.


Apache MADlib includes an extended implementation of sparse vectors. See the MADlib Documentation for a description of this MADlib module.

Example

A text classification example that uses a dictionary and some documents follows. You will create Greenplum Database tables representing a dictionary and some documents. You then perform document classification using vector arithmetic on word counts and proportions of dictionary words in each document.


Suppose that you have a dictionary composed of words in a text array. Create a table to store the dictionary data and insert some data (words) into the table. For example:

CREATE TABLE features (dictionary text[][]) DISTRIBUTED RANDOMLY;

INSERT INTO features
    VALUES ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
            'one,really,second,the,third,this,until}');

You have a set of documents, each also defined as an array of words. Create a table to represent the documents and insert some data into the table:

CREATE TABLE documents(docnum int, document text[]) DISTRIBUTED RANDOMLY;

INSERT INTO documents VALUES
    (1,'{this,is,one,document,in,the,corpus}'),
    (2,'{i,am,the,second,document,in,the,corpus}'),
    (3,'{being,third,never,really,bothered,me,until,now}'),
    (4,'{the,document,before,me,is,the,third,document}');

Using the dictionary and document tables, find the dictionary words that are present in each document. To do this, you first prepare a Sparse Feature Vector, or SFV, for each document. An SFV is a vector of dimension N, where N is the number of dictionary words, and each SFV contains a count of each dictionary word in the document.


You can use the gp_extract_feature_histogram() function to create an SFV from a document. gp_extract_feature_histogram() outputs an svec for each document that contains the count of each of the dictionary words in the ordinal positions of the dictionary.

SELECT gp_extract_feature_histogram(
    (SELECT dictionary FROM features LIMIT 1), document)::float8[], document
    FROM documents ORDER BY docnum;

       gp_extract_feature_histogram      |                     document
-----------------------------------------+--------------------------------------------------
 {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
 {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
 {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
 {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}

SELECT * FROM features;

                                               dictionary
--------------------------------------------------------------------------------------------------------
 {am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}

The SFV of the second document, "i am the second document in the corpus", is {1,3*0,1,1,1,1,6*0,1,2,3*0}, where 3*0 denotes a run of three zeros. The word "am" is the first ordinate in the dictionary and appears once in the document, so the first entry of the SFV is 1. The word "before" does not appear in the document, so its value is 0; and so on.
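To inspect a single document's SFV, you can restrict the same call to one row. This usage sketch is built from the tables defined above:

SELECT gp_extract_feature_histogram(
    (SELECT dictionary FROM features LIMIT 1), document)::float8[]
FROM documents WHERE docnum = 2;
-- {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}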


gp_extract_feature_histogram() is optimized for speed; it is a single-routine version of a hash join that processes large numbers of documents into their SFVs in parallel at the highest possible speeds.


For the next part of the processing, generate a sparse vector of the dictionary dimension (19). The vectors that you generate for each document are referred to as the corpus.

CREATE TABLE corpus (docnum int, feature_vector svec) DISTRIBUTED RANDOMLY;

INSERT INTO corpus
    (SELECT docnum,
            gp_extract_feature_histogram(
                (SELECT dictionary FROM features LIMIT 1), document) FROM documents);

For each feature, count the number of documents in which it occurs at least once:

SELECT (vec_count_nonzero(feature_vector))::float8[] AS count_in_document FROM corpus;

            count_in_document
-----------------------------------------
 {1,1,1,1,2,3,1,2,2,2,1,1,1,1,1,3,2,1,1}

Count all occurrences of each term in all documents:

SELECT (sum(feature_vector))::float8[] AS sum_in_document FROM corpus;

             sum_in_document
-----------------------------------------
 {1,1,1,1,2,4,1,2,2,2,1,1,1,1,1,5,2,1,1}

The remainder of the classification process is vector math. The count is turned into a weight that reflects Term Frequency / Inverse Document Frequency (tf/idf). The calculation for a given term in a given document is:

#_times_term_appears_in_this_doc * log( #_docs / #_docs_the_term_appears_in )

#_docs is the total number of documents (4 in this case). Note that there is one divisor for each dictionary word; its value is the number of documents in which that word appears.


For example, the term "document" in document 1 would have a weight of 1 * log( 4/3 ). In document 4, the term would have a weight of 2 * log( 4/3 ). Terms that appear in every document would have weight 0.
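The tf_idf output further below confirms these numbers and shows that the module's log() applies the natural logarithm element-wise (0.287682… = ln(4/3), 0.693147… = ln(2)). You can check the arithmetic with the built-in ln() function:

SELECT ln(4.0/3.0) AS weight_doc1,      -- 0.287682..., "document" in document 1
       2 * ln(4.0/3.0) AS weight_doc4;  -- 0.575364..., "document" in document 4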


These log( #_docs / #_docs_the_term_appears_in ) values form a single idf vector for the whole corpus; multiplying it term-by-term with each document SFV produces that document's tf/idf weights.


Calculate the tf/idf:

SELECT docnum, (feature_vector*logidf)::float8[] AS tf_idf
    FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector)) AS logidf FROM corpus)
    AS foo, corpus ORDER BY docnum;

 docnum | tf_idf
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
      1 | {0,0,0,0,0.693147180559945,0.287682072451781,0,0.693147180559945,0.693147180559945,0,0,0,1.38629436111989,0,0,0.287682072451781,0,1.38629436111989,0}
      2 | {1.38629436111989,0,0,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,0,0,0,0,0,1.38629436111989,0.575364144903562,0,0,0}
      3 | {0,0,1.38629436111989,1.38629436111989,0,0,0,0,0,0.693147180559945,1.38629436111989,1.38629436111989,0,1.38629436111989,0,0,0.693147180559945,0,1.38629436111989}
      4 | {0,1.38629436111989,0,0,0,0.575364144903562,0,0,0.693147180559945,0.693147180559945,0,0,0,0,0,0.575364144903562,0.693147180559945,0,0}

You can determine the angular distance between one document and the rest of the documents using the ACOS of the normalized dot product (the cosine similarity) of the document vectors. First, save the tf/idf weights in a table:

CREATE TABLE weights AS
    (SELECT docnum, (feature_vector*logidf) tf_idf
     FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector))
           AS logidf FROM corpus) foo, corpus ORDER BY docnum)
    DISTRIBUTED RANDOMLY;

Calculate the angular distance between the first document and every other document:

SELECT docnum, trunc((180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(l2norm(tf_idf)*l2norm(testdoc))))/(4.*ATAN(1.))))::numeric,2)
    AS angular_distance FROM weights,
    (SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;

 docnum | angular_distance
--------+------------------
      1 |             0.00
      2 |            78.82
      3 |            90.00
      4 |            80.02

You can see that the angular distance between document 1 and itself is 0 degrees, and that between documents 1 and 3 it is 90 degrees because they share no features at all.
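In the query above, 4.*ATAN(1.) evaluates to π, so the outer expression converts radians to degrees, and dmin(1., …) clamps the cosine at 1 to guard against floating point rounding before ACOS is applied. The endpoints are easy to sanity-check with the built-in degrees() and acos() functions:

SELECT degrees(acos(1.0));  -- 0 degrees: a document compared with itself
SELECT degrees(acos(0.0));  -- 90 degrees: documents that share no features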

diff --git a/gpdb-doc/dita/ref_guide/modules/intro.xml b/gpdb-doc/dita/ref_guide/modules/intro.xml
index 40ecf6a75a..ba26fd8dc8 100644
--- a/gpdb-doc/dita/ref_guide/modules/intro.xml
+++ b/gpdb-doc/dita/ref_guide/modules/intro.xml
@@ -21,12 +21,11 @@
  • fuzzystrmatch - Determines similarities and differences between strings.
- • gp_sparse_vector - Implements a Greenplum Database data type that uses
-   compressed storage of zeros to help with floating point calculations.
+ • gp_sparse_vector - Implements a Greenplum Database data type that uses
+   compressed storage of zeros to make vector computations on floating point
+   numbers faster.
  • hstore - Provides a data type for storing sets of key/value pairs within a
    single PostgreSQL value.
--
GitLab