Commit ea9fea0a authored by Lisa Owen, committed by David Yozie

docs - add module landing page for gp_sparse_vector (#7573)

* docs - add module landing page for gp_sparse_vector

* edits requested by david
Parent 0dc423a5
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dita PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<dita>
<topic id="topic_gpsv">
<title>gp_sparse_vector</title>
<body>
<p>The <codeph>gp_sparse_vector</codeph> module implements a Greenplum Database
data type and associated functions that use compressed storage of zeros to
make vector computations on floating point numbers faster.</p>
<p>The <codeph>gp_sparse_vector</codeph> module is a Greenplum Database extension.</p>
</body>
<topic id="topic_reg">
<title>Installing and Registering the Module</title>
<body>
<p>The <codeph>gp_sparse_vector</codeph> module is installed when you install
Greenplum Database. Before you can use any of the functions defined in the
module, you must register the <codeph>gp_sparse_vector</codeph> extension in
each database where you want to use the functions.
<ph otherprops="pivotal">Refer to <xref href="../../install_guide/install_modules.xml"
format="dita" scope="peer">Installing Additional Supplied Modules</xref>
for more information.</ph></p>
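      <p>For example, a minimal sketch of registering the extension in the current
        database (assuming the standard <codeph>CREATE EXTENSION</codeph> mechanism
        used for supplied modules):</p>
      <codeblock>CREATE EXTENSION gp_sparse_vector;</codeblock>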
</body>
</topic>
<topic id="topic_doc">
<title>Using the gp_sparse_vector Module</title>
<body>
      <p>When you use arrays of floating point numbers for various calculations,
        you will often have long runs of zeros. This is common in scientific,
        retail optimization, and text processing applications. Each floating
        point number takes 8 bytes of storage in memory and/or disk, so storing
        all of those zeros explicitly is often impractical. Many computations
        also benefit from skipping over the zeros.</p>
<p>For example, suppose the following array of <codeph>double</codeph>s is
stored as a <codeph>float8[]</codeph> in Greenplum Database:</p>
<codeblock>'{0, 33, &lt;40,000 zeros>, 12, 22 }'::float8[]</codeblock>
      <p>This type of array arises often in text processing, where a dictionary may
        have 40-100K terms and the count of each dictionary word in a particular
        document is stored in a vector. This array would occupy slightly more than
        320KB of memory/disk, most of it zeros, and any operation that you perform
        on this data must process the 40,001 zero fields that carry no information.</p>
      <p>The Greenplum Database built-in <codeph>array</codeph> data type uses a
        bitmap to track null values, but it is a poor choice for this use case:
        it is not optimized for <codeph>float8[]</codeph>, it tracks nulls rather
        than long runs of zeros, and the null bitmap is not run-length-encoded
        (RLE). Even if each zero were stored as a <codeph>NULL</codeph> in the
        array, the bitmap that marks the nulls would occupy 5KB, which is not
        nearly as efficient as it could be.</p>
      <p>The Greenplum Database <codeph>gp_sparse_vector</codeph> module defines a
        data type and a simple RLE-based scheme that is biased toward efficient
        storage of zero-value bitmaps. For the example array above, this scheme
        requires only 6 bytes of bitmap storage.</p>
<note>The sparse vector data type defined by the <codeph>gp_sparse_vector</codeph>
module is named <codeph>svec</codeph>. <codeph>svec</codeph> supports only
<codeph>float8</codeph> vector values.</note>
<p>You can construct an <codeph>svec</codeph> directly from a float array as
follows:</p>
<codeblock>SELECT ('{0, 13, 37, 53, 0, 71 }'::float8[])::svec;</codeblock>
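      <p>The <codeph>svec</codeph> type also has a run-length-encoded literal format
        of the form <codeph>{counts}:{values}</codeph>, as in the related MADlib
        sparse vector implementation (this sketch assumes the two implementations
        share that literal syntax). For example, the following literal represents
        the vector <codeph>{0,0,13,7,7}</codeph>, a run of two zeros, one 13, and
        two 7s:</p>
      <codeblock>SELECT '{2,1,2}:{0,13,7}'::svec;</codeblock>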
<p>The <codeph>gp_sparse_vector</codeph> module supports the vector operators
<codeph>&lt;</codeph>, <codeph>&gt;</codeph>, <codeph>*</codeph>,
<codeph>**</codeph>, <codeph>/</codeph>, <codeph>=</codeph>,
<codeph>+</codeph>, <codeph>sum()</codeph>, <codeph>vec_count_nonzero()</codeph>,
and so on. These operators take advantage of the efficient sparse storage format,
making computations on <codeph>svec</codeph>s faster.</p>
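      <p>For example, the <codeph>*</codeph> operator multiplies two vectors of the
        same dimension element-wise, so the expected result below, computed by hand,
        is <codeph>{0*4, 1*3, 5*2} = {0,3,10}</codeph>:</p>
      <codeblock>SELECT ('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];</codeblock>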
<p>The plus (<codeph>+</codeph>) operator adds each of the terms of two
vectors of the same dimension together. For example, if vector
<codeph>a = {0,1,5}</codeph> and vector <codeph>b = {4,3,2}</codeph>,
you
would compute the vector addition as follows:</p>
<codeblock>SELECT ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
float8
---------
{4,4,7}</codeblock>
<p>A vector dot product (<codeph>%*%</codeph>) between vectors <codeph>a</codeph>
and <codeph>b</codeph> returns a scalar result of type <codeph>float8</codeph>.
Compute the dot product (<codeph>(0*4+1*3+5*2)=13</codeph>) as follows:</p>
<codeblock>SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
?column?
----------
13</codeblock>
      <p>Special vector aggregate functions are also useful. <codeph>sum()</codeph>
        is self-explanatory. <codeph>vec_count_nonzero()</codeph> counts, for each
        position, the non-zero terms found in a set of <codeph>svec</codeph>s and
        returns an <codeph>svec</codeph> of the counts. For instance, for the set of
        vectors <codeph>{0,1,5},{10,0,3},{0,0,3},{0,1,0}</codeph>, the count of
        non-zero terms in each position is <codeph>{1,2,3}</codeph>. Use
        <codeph>vec_count_nonzero()</codeph> to compute these counts:</p>
<codeblock>CREATE TABLE listvecs( a svec );
INSERT INTO listvecs VALUES ('{0,1,5}'::float8[]),
('{10,0,3}'::float8[]),
('{0,0,3}'::float8[]),
('{0,1,0}'::float8[]);
SELECT vec_count_nonzero( a )::float8[] FROM listvecs;
count_vec
-----------
{1,2,3}
(1 row)</codeblock>
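      <p>Similarly, the <codeph>sum()</codeph> aggregate adds the vectors in the set
        element-wise. For the <codeph>listvecs</codeph> table above, the result,
        computed by hand, is
        <codeph>{0+10+0+0, 1+0+0+1, 5+3+3+0} = {10,2,11}</codeph>:</p>
      <codeblock>SELECT sum(a)::float8[] FROM listvecs;</codeblock>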
</body>
</topic>
<topic id="topic_info">
<title>Additional Module Documentation</title>
<body>
      <p>Refer to the <codeph>gp_sparse_vector</codeph> README in the
<xref href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README"
format="html" scope="external">Greenplum Database github repository</xref>
for additional information about this module.</p>
<p>Apache MADlib includes an extended implementation of sparse vectors. See the
<xref href="http://madlib.apache.org/docs/latest/group__grp__svec.html"
format="html" scope="external">MADlib Documentation</xref> for a description
of this MADlib module.</p>
</body>
</topic>
<topic id="topic_examples">
<title>Example</title>
<body>
<p>A text classification example that describes a dictionary and some documents
follows. You will create Greenplum Database tables representing a dictionary
and some documents. You then perform document classification using vector
arithmetic on word counts and proportions of dictionary words in each
document.</p>
<p>Suppose that you have a dictionary composed of words in a text array. Create
a table to store the dictionary data and insert some data (words) into the table.
For example:</p>
<codeblock>CREATE TABLE features (dictionary text[][]) DISTRIBUTED RANDOMLY;
INSERT INTO features
VALUES ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
'one,really,second,the,third,this,until}');</codeblock>
      <p>You also have a set of documents, each defined as an array of words. Create a
        table to represent the documents and insert some data into the table:</p>
<codeblock>CREATE TABLE documents(docnum int, document text[]) DISTRIBUTED RANDOMLY;
INSERT INTO documents VALUES
(1,'{this,is,one,document,in,the,corpus}'),
(2,'{i,am,the,second,document,in,the,corpus}'),
(3,'{being,third,never,really,bothered,me,until,now}'),
(4,'{the,document,before,me,is,the,third,document}');</codeblock>
<p>Using the dictionary and document tables, find the dictionary words
that are present in each document. To do this, you first prepare a <i>Sparse
Feature Vector</i>, or SFV, for each document. An SFV is a vector of
dimension <i>N</i>, where <i>N</i> is the number of dictionary words, and
each SFV contains a count of each dictionary word in the document.</p>
<p>You can use the <codeph>gp_extract_feature_histogram()</codeph> function to
create an SFV from a document. <codeph>gp_extract_feature_histogram()</codeph>
outputs an <codeph>svec</codeph> for each document that contains the count of
each of the dictionary words in the ordinal positions of the dictionary.</p>
<codeblock>SELECT gp_extract_feature_histogram(
(SELECT dictionary FROM features LIMIT 1), document)::float8[], document
FROM documents ORDER BY docnum;
gp_extract_feature_histogram | document
-----------------------------------------+--------------------------------------------------
{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
SELECT * FROM features;
dictionary
--------------------------------------------------------------------------------------------------------
{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}</codeblock>
      <p>The SFV of the second document, "i am the second document in the corpus",
        is <codeph>{1,3*0,1,1,1,1,6*0,1,2,3*0}</codeph>, where <codeph>3*0</codeph>
        denotes a run of three zeros. The word "am" is the first ordinate in the
        dictionary and appears <codeph>1</codeph> time in the document, so the first
        entry of the SFV is <codeph>1</codeph>. The word "before" has no instances
        in the document, so its value is <codeph>0</codeph>; and so on.</p>
      <p><codeph>gp_extract_feature_histogram()</codeph> is highly optimized for
        speed; it is in effect a single-routine version of a hash join that
        processes large numbers of documents into their SFVs in parallel.</p>
      <p>For the next part of the processing, generate a sparse vector of the
        dictionary dimension (19) for each document and store it in a table. The
        set of vectors that you generate, one per document, is referred to as the
        <i>corpus</i>.</p>
<codeblock>CREATE table corpus (docnum int, feature_vector svec) DISTRIBUTED RANDOMLY;
INSERT INTO corpus
(SELECT docnum,
gp_extract_feature_histogram(
(select dictionary FROM features LIMIT 1), document) from documents);</codeblock>
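      <p>You can verify the loaded corpus; the feature vectors match the SFVs
        computed earlier:</p>
      <codeblock>SELECT docnum, feature_vector::float8[] FROM corpus ORDER BY docnum;
 docnum |             feature_vector
--------+------------------------------------------
      1 | {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
      2 | {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
      3 | {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
      4 | {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}
(4 rows)</codeblock>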
      <p>Count the number of documents in which each feature occurs at least
        once:</p>
<codeblock>SELECT (vec_count_nonzero(feature_vector))::float8[] AS count_in_document FROM corpus;
count_in_document
-----------------------------------------
{1,1,1,1,2,3,1,2,2,2,1,1,1,1,1,3,2,1,1}</codeblock>
<p>Count all occurrences of each term in all documents:</p>
<codeblock>SELECT (sum(feature_vector))::float8[] AS sum_in_document FROM corpus;
sum_in_document
-----------------------------------------
{1,1,1,1,2,4,1,2,2,2,1,1,1,1,1,5,2,1,1}</codeblock>
<p>The remainder of the classification process is vector math. The count is
turned into a weight that reflects <i>Term Frequency / Inverse Document Frequency</i>
(tf/idf). The calculation for a given term in a given document is:</p>
<codeblock>#_times_term_appears_in_this_doc * log( #_docs / #_docs_the_term_appears_in )</codeblock>
      <p><codeph>#_docs</codeph> is the total number of
        documents (4 in this case). Note that there is one divisor for each
        dictionary word; its value, <codeph>#_docs_the_term_appears_in</codeph>,
        is the number of documents in which that word appears.</p>
<p>For example, the term "document" in document 1 would have a weight of
<codeph>1 * log( 4/3 )</codeph>. In document 4, the term would have a weight
of <codeph>2 * log( 4/3 )</codeph>. Terms that appear in every document
would have weight <codeph>0</codeph>.</p>
      <p>A single vector of these logarithmic weights is computed for the whole
        corpus and is then multiplied element-wise by each document SFV to produce
        that document's tf/idf vector.</p>
<p>Calculate the tf/idf:</p>
<codeblock>SELECT docnum, (feature_vector*logidf)::float8[] AS tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector)) AS logidf FROM corpus)
AS foo, corpus ORDER BY docnum;
docnum | tf_idf
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
1 | {0,0,0,0,0.693147180559945,0.287682072451781,0,0.693147180559945,0.693147180559945,0,0,0,1.38629436111989,0,0,0.287682072451781,0,1.38629436111989,0}
2 | {1.38629436111989,0,0,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,0,0,0,0,0,1.38629436111989,0.575364144903562,0,0,0}
      3 | {0,0,1.38629436111989,1.38629436111989,0,0,0,0,0,0.693147180559945,1.38629436111989,1.38629436111989,0,1.38629436111989,0,0,0.693147180559945,0,1.38629436111989}
4 | {0,1.38629436111989,0,0,0,0.575364144903562,0,0,0.693147180559945,0.693147180559945,0,0,0,0,0,0.575364144903562,0.693147180559945,0,0}
</codeblock>
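      <p>The weight values in this output show that the <codeph>log()</codeph>
        function the module applies to an <codeph>svec</codeph> is the natural
        logarithm. You can sanity-check the two weights computed for the term
        "document" above with the built-in scalar <codeph>ln()</codeph> function:</p>
      <codeblock>SELECT 1 * ln(4/3::float8) AS weight_doc1, 2 * ln(4/3::float8) AS weight_doc4;
    weight_doc1    |    weight_doc4
-------------------+-------------------
 0.287682072451781 | 0.575364144903562</codeblock>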
      <p>You can determine the <i>angular distance</i> between one document and the
        rest of the documents by taking the arccosine (<codeph>ACOS</codeph>) of the
        dot product of the normalized document vectors:</p>
<codeblock>CREATE TABLE weights AS
(SELECT docnum, (feature_vector*logidf) tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector))
AS logidf FROM corpus) foo, corpus ORDER BY docnum)
DISTRIBUTED RANDOMLY;</codeblock>
<p>Calculate the angular distance between the first document and every other
document:</p>
<codeblock>SELECT docnum, trunc((180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(l2norm(tf_idf)*l2norm(testdoc))))/(4.*ATAN(1.))))::numeric,2)
AS angular_distance FROM weights,
(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;
docnum | angular_distance
--------+------------------
1 | 0.00
2 | 78.82
3 | 90.00
4 | 80.02</codeblock>
      <p>You can see that the angular distance between document 1 and itself is 0
        degrees, and that the distance between documents 1 and 3 is 90 degrees
        because they share no features at all.</p>
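      <p>Because documents 1 and 3 have no non-zero positions in common, their
        tf/idf dot product is zero, which you can confirm directly:</p>
      <codeblock>SELECT a.tf_idf %*% b.tf_idf AS dot_product
FROM weights a, weights b
WHERE a.docnum = 1 AND b.docnum = 3;</codeblock>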
</body>
</topic>
</topic>
</dita>
@@ -21,12 +21,11 @@
<li><codeph><xref href="fuzzystrmatch.xml"
format="dita">fuzzystrmatch</xref></codeph> - Determines similarities and differences
between strings.</li>
<li><codeph><xref
href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README"
format="html" scope="external">gp_sparse_vector</xref></codeph> - Implements
a Greenplum Database data type that uses compressed storage of zeros to help
with floating point calculations.</li>
<li><codeph><xref href="hstore.xml" format="dita"
<li><codeph><xref href="gp_sparse_vector.xml" scope="peer"
format="dita">gp_sparse_vector</xref></codeph> - Implements
a Greenplum Database data type that uses compressed storage of zeros to make
vector computations on floating point numbers faster.</li>
<li><codeph><xref href="hstore.xml" scope="peer" format="dita"
>hstore</xref></codeph> - Provides a data type for storing sets of key/value pairs
within a single PostgreSQL value.</li>
<li><codeph><xref href="orafce_ref.xml" format="dita"
@@ -5,7 +5,7 @@
<topicref href="citext.xml" linking="none"/>
<topicref href="dblink.xml" linking="none"/>
<topicref href="fuzzystrmatch.xml" linking="none"/>
<topicref href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README" format="html" scope="external" navtitle="gp_sparse_vector" linking="none"/>
<topicref href="gp_sparse_vector.xml" linking="none"/>
<topicref href="hstore.xml" linking="none"/>
<topicref href="orafce_ref.xml" linking="none"/>
<topicref href="pageinspect.xml" linking="none"/>