Commit ea9fea0a authored by Lisa Owen, committed by David Yozie

docs - add module landing page for gp_sparse_vector (#7573)

* docs - add module landing page for gp_sparse_vector

* edits requested by david
Parent 0dc423a5
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dita PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<dita>
<topic id="topic_gpsv">
<title>gp_sparse_vector</title>
<body>
<p>The <codeph>gp_sparse_vector</codeph> module implements a Greenplum Database
data type and associated functions that use compressed storage of zeros to
make vector computations on floating point numbers faster.</p>
<p>The <codeph>gp_sparse_vector</codeph> module is a Greenplum Database extension.</p>
</body>
<topic id="topic_reg">
<title>Installing and Registering the Module</title>
<body>
<p>The <codeph>gp_sparse_vector</codeph> module is installed when you install
Greenplum Database. Before you can use any of the functions defined in the
module, you must register the <codeph>gp_sparse_vector</codeph> extension in
each database where you want to use the functions.
<ph otherprops="pivotal">Refer to <xref href="../../install_guide/install_modules.xml"
format="dita" scope="peer">Installing Additional Supplied Modules</xref>
for more information.</ph></p>
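      <p>For example, a minimal sketch of registering the extension in the current
        database (assuming the standard <codeph>CREATE EXTENSION</codeph> mechanism
        used for supplied modules):</p>
      <codeblock>CREATE EXTENSION gp_sparse_vector;</codeblock>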
</body>
</topic>
<topic id="topic_doc">
<title>Using the gp_sparse_vector Module</title>
<body>
      <p>When you use arrays of floating point numbers for various calculations,
        you will often have long runs of zeros. This is common in scientific,
        retail optimization, and text processing applications. Each floating
        point number takes 8 bytes of storage in memory and/or disk, so storing
        all of those zeros explicitly is often impractical. Many computations
        also benefit from skipping over the zeros.</p>
<p>For example, suppose the following array of <codeph>double</codeph>s is
stored as a <codeph>float8[]</codeph> in Greenplum Database:</p>
<codeblock>'{0, 33, &lt;40,000 zeros>, 12, 22 }'::float8[]</codeblock>
      <p>This type of array arises often in text processing, where a dictionary may
        have 40-100K terms and the count of each dictionary word in a particular
        document is stored in a vector. This array would occupy slightly more than
        320KB of memory/disk, most of it zeros, and any operation that you perform
        on this data must process the 40,001 zero fields that carry no information.</p>
      <p>The Greenplum Database built-in <codeph>array</codeph> data type uses a
        bitmap to track null values, but it is a poor choice for this use case:
        it is not optimized for <codeph>float8[]</codeph>, it tracks nulls rather
        than long runs of zeros, and the null bitmap is not run-length-encoded
        (RLE). Even if each zero were stored as a <codeph>NULL</codeph> in the
        array, the bitmap that marks the nulls would occupy 5KB, which is not
        nearly as efficient as it could be.</p>
      <p>The Greenplum Database <codeph>gp_sparse_vector</codeph> module defines a
        data type and a simple RLE-based scheme that is biased toward efficient
        storage of zero-value bitmaps. For the example array above, this scheme
        requires only 6 bytes of bitmap storage.</p>
<note>The sparse vector data type defined by the <codeph>gp_sparse_vector</codeph>
module is named <codeph>svec</codeph>. <codeph>svec</codeph> supports only
<codeph>float8</codeph> vector values.</note>
<p>You can construct an <codeph>svec</codeph> directly from a float array as
follows:</p>
<codeblock>SELECT ('{0, 13, 37, 53, 0, 71 }'::float8[])::svec;</codeblock>
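      <p>The <codeph>svec</codeph> type also has a run-length-encoded literal format
        of the form <codeph>{counts}:{values}</codeph>, as in the related MADlib
        sparse vector implementation (this sketch assumes the two implementations
        share that literal syntax). For example, the following literal represents
        the vector <codeph>{0,0,13,7,7}</codeph>, a run of two zeros, one 13, and
        two 7s:</p>
      <codeblock>SELECT '{2,1,2}:{0,13,7}'::svec;</codeblock>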
<p>The <codeph>gp_sparse_vector</codeph> module supports the vector operators
<codeph>&lt;</codeph>, <codeph>&gt;</codeph>, <codeph>*</codeph>,
<codeph>**</codeph>, <codeph>/</codeph>, <codeph>=</codeph>,
<codeph>+</codeph>, <codeph>sum()</codeph>, <codeph>vec_count_nonzero()</codeph>,
and so on. These operators take advantage of the efficient sparse storage format,
making computations on <codeph>svec</codeph>s faster.</p>
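      <p>For example, the <codeph>*</codeph> operator multiplies two vectors of the
        same dimension element-wise, so the expected result below, computed by hand,
        is <codeph>{0*4, 1*3, 5*2} = {0,3,10}</codeph>:</p>
      <codeblock>SELECT ('{0,1,5}'::float8[]::svec * '{4,3,2}'::float8[]::svec)::float8[];</codeblock>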
<p>The plus (<codeph>+</codeph>) operator adds each of the terms of two
vectors of the same dimension together. For example, if vector
<codeph>a = {0,1,5}</codeph> and vector <codeph>b = {4,3,2}</codeph>,
you
would compute the vector addition as follows:</p>
<codeblock>SELECT ('{0,1,5}'::float8[]::svec + '{4,3,2}'::float8[]::svec)::float8[];
float8
---------
{4,4,7}</codeblock>
<p>A vector dot product (<codeph>%*%</codeph>) between vectors <codeph>a</codeph>
and <codeph>b</codeph> returns a scalar result of type <codeph>float8</codeph>.
Compute the dot product (<codeph>(0*4+1*3+5*2)=13</codeph>) as follows:</p>
<codeblock>SELECT '{0,1,5}'::float8[]::svec %*% '{4,3,2}'::float8[]::svec;
?column?
----------
13</codeblock>
      <p>Special vector aggregate functions are also useful. <codeph>sum()</codeph>
        is self-explanatory. <codeph>vec_count_nonzero()</codeph> counts, for each
        position, the non-zero terms found in a set of <codeph>svec</codeph>s and
        returns an <codeph>svec</codeph> of the counts. For instance, for the set of
        vectors <codeph>{0,1,5},{10,0,3},{0,0,3},{0,1,0}</codeph>, the count of
        non-zero terms in each position is <codeph>{1,2,3}</codeph>. Use
        <codeph>vec_count_nonzero()</codeph> to compute these counts:</p>
<codeblock>CREATE TABLE listvecs( a svec );
INSERT INTO listvecs VALUES ('{0,1,5}'::float8[]),
('{10,0,3}'::float8[]),
('{0,0,3}'::float8[]),
('{0,1,0}'::float8[]);
SELECT vec_count_nonzero( a )::float8[] FROM listvecs;
count_vec
-----------
{1,2,3}
(1 row)</codeblock>
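      <p>Similarly, the <codeph>sum()</codeph> aggregate adds the vectors in the set
        element-wise. For the <codeph>listvecs</codeph> table above, the result,
        computed by hand, is
        <codeph>{0+10+0+0, 1+0+0+1, 5+3+3+0} = {10,2,11}</codeph>:</p>
      <codeblock>SELECT sum(a)::float8[] FROM listvecs;</codeblock>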
</body>
</topic>
<topic id="topic_info">
<title>Additional Module Documentation</title>
<body>
      <p>Refer to the <codeph>gp_sparse_vector</codeph> README in the
<xref href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README"
format="html" scope="external">Greenplum Database github repository</xref>
for additional information about this module.</p>
<p>Apache MADlib includes an extended implementation of sparse vectors. See the
<xref href="http://madlib.apache.org/docs/latest/group__grp__svec.html"
format="html" scope="external">MADlib Documentation</xref> for a description
of this MADlib module.</p>
</body>
</topic>
<topic id="topic_examples">
<title>Example</title>
<body>
<p>A text classification example that describes a dictionary and some documents
follows. You will create Greenplum Database tables representing a dictionary
and some documents. You then perform document classification using vector
arithmetic on word counts and proportions of dictionary words in each
document.</p>
<p>Suppose that you have a dictionary composed of words in a text array. Create
a table to store the dictionary data and insert some data (words) into the table.
For example:</p>
<codeblock>CREATE TABLE features (dictionary text[][]) DISTRIBUTED RANDOMLY;
INSERT INTO features
VALUES ('{am,before,being,bothered,corpus,document,i,in,is,me,never,now,'
'one,really,second,the,third,this,until}');</codeblock>
      <p>You also have a set of documents, each defined as an array of words. Create a
        table to represent the documents and insert some data into the table:</p>
<codeblock>CREATE TABLE documents(docnum int, document text[]) DISTRIBUTED RANDOMLY;
INSERT INTO documents VALUES
(1,'{this,is,one,document,in,the,corpus}'),
(2,'{i,am,the,second,document,in,the,corpus}'),
(3,'{being,third,never,really,bothered,me,until,now}'),
(4,'{the,document,before,me,is,the,third,document}');</codeblock>
<p>Using the dictionary and document tables, find the dictionary words
that are present in each document. To do this, you first prepare a <i>Sparse
Feature Vector</i>, or SFV, for each document. An SFV is a vector of
dimension <i>N</i>, where <i>N</i> is the number of dictionary words, and
each SFV contains a count of each dictionary word in the document.</p>
<p>You can use the <codeph>gp_extract_feature_histogram()</codeph> function to
create an SFV from a document. <codeph>gp_extract_feature_histogram()</codeph>
outputs an <codeph>svec</codeph> for each document that contains the count of
each of the dictionary words in the ordinal positions of the dictionary.</p>
<codeblock>SELECT gp_extract_feature_histogram(
(SELECT dictionary FROM features LIMIT 1), document)::float8[], document
FROM documents ORDER BY docnum;
gp_extract_feature_histogram | document
-----------------------------------------+--------------------------------------------------
{0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0} | {this,is,one,document,in,the,corpus}
{1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0} | {i,am,the,second,document,in,the,corpus}
{0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1} | {being,third,never,really,bothered,me,until,now}
{0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0} | {the,document,before,me,is,the,third,document}
SELECT * FROM features;
dictionary
--------------------------------------------------------------------------------------------------------
{am,before,being,bothered,corpus,document,i,in,is,me,never,now,one,really,second,the,third,this,until}</codeblock>
      <p>The SFV of the second document, "i am the second document in the corpus",
        is <codeph>{1,3*0,1,1,1,1,6*0,1,2,3*0}</codeph>, where <codeph>3*0</codeph>
        denotes a run of three zeros. The word "am" is the first ordinate in the
        dictionary and appears <codeph>1</codeph> time in the document, so the first
        entry of the SFV is <codeph>1</codeph>. The word "before" has no instances
        in the document, so its value is <codeph>0</codeph>; and so on.</p>
      <p><codeph>gp_extract_feature_histogram()</codeph> is highly optimized for
        speed; it is in effect a single-routine version of a hash join that
        processes large numbers of documents into their SFVs in parallel.</p>
      <p>For the next part of the processing, generate a sparse vector of the
        dictionary dimension (19) for each document and store it in a table. The
        set of vectors that you generate, one per document, is referred to as the
        <i>corpus</i>.</p>
<codeblock>CREATE table corpus (docnum int, feature_vector svec) DISTRIBUTED RANDOMLY;
INSERT INTO corpus
(SELECT docnum,
gp_extract_feature_histogram(
(select dictionary FROM features LIMIT 1), document) from documents);</codeblock>
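      <p>You can verify the loaded corpus; the feature vectors match the SFVs
        computed earlier:</p>
      <codeblock>SELECT docnum, feature_vector::float8[] FROM corpus ORDER BY docnum;
 docnum |             feature_vector
--------+------------------------------------------
      1 | {0,0,0,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0}
      2 | {1,0,0,0,1,1,1,1,0,0,0,0,0,0,1,2,0,0,0}
      3 | {0,0,1,1,0,0,0,0,0,1,1,1,0,1,0,0,1,0,1}
      4 | {0,1,0,0,0,2,0,0,1,1,0,0,0,0,0,2,1,0,0}
(4 rows)</codeblock>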
      <p>Count the number of documents in which each feature occurs at least
        once:</p>
<codeblock>SELECT (vec_count_nonzero(feature_vector))::float8[] AS count_in_document FROM corpus;
count_in_document
-----------------------------------------
{1,1,1,1,2,3,1,2,2,2,1,1,1,1,1,3,2,1,1}</codeblock>
<p>Count all occurrences of each term in all documents:</p>
<codeblock>SELECT (sum(feature_vector))::float8[] AS sum_in_document FROM corpus;
sum_in_document
-----------------------------------------
{1,1,1,1,2,4,1,2,2,2,1,1,1,1,1,5,2,1,1}</codeblock>
<p>The remainder of the classification process is vector math. The count is
turned into a weight that reflects <i>Term Frequency / Inverse Document Frequency</i>
(tf/idf). The calculation for a given term in a given document is:</p>
<codeblock>#_times_term_appears_in_this_doc * log( #_docs / #_docs_the_term_appears_in )</codeblock>
      <p><codeph>#_docs</codeph> is the total number of
        documents (4 in this case). Note that there is one divisor for each
        dictionary word; its value, <codeph>#_docs_the_term_appears_in</codeph>,
        is the number of documents in which that word appears.</p>
<p>For example, the term "document" in document 1 would have a weight of
<codeph>1 * log( 4/3 )</codeph>. In document 4, the term would have a weight
of <codeph>2 * log( 4/3 )</codeph>. Terms that appear in every document
would have weight <codeph>0</codeph>.</p>
      <p>A single vector of these logarithmic weights is computed for the whole
        corpus and is then multiplied element-wise by each document SFV to produce
        that document's tf/idf vector.</p>
<p>Calculate the tf/idf:</p>
<codeblock>SELECT docnum, (feature_vector*logidf)::float8[] AS tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector)) AS logidf FROM corpus)
AS foo, corpus ORDER BY docnum;
docnum | tf_idf
--------+----------------------------------------------------------------------------------------------------------------------------------------------------------
1 | {0,0,0,0,0.693147180559945,0.287682072451781,0,0.693147180559945,0.693147180559945,0,0,0,1.38629436111989,0,0,0.287682072451781,0,1.38629436111989,0}
2 | {1.38629436111989,0,0,0,0.693147180559945,0.287682072451781,1.38629436111989,0.693147180559945,0,0,0,0,0,0,1.38629436111989,0.575364144903562,0,0,0}
      3 | {0,0,1.38629436111989,1.38629436111989,0,0,0,0,0,0.693147180559945,1.38629436111989,1.38629436111989,0,1.38629436111989,0,0,0.693147180559945,0,1.38629436111989}
4 | {0,1.38629436111989,0,0,0,0.575364144903562,0,0,0.693147180559945,0.693147180559945,0,0,0,0,0,0.575364144903562,0.693147180559945,0,0}
</codeblock>
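      <p>The weight values in this output show that the <codeph>log()</codeph>
        function the module applies to an <codeph>svec</codeph> is the natural
        logarithm. You can sanity-check the two weights computed for the term
        "document" above with the built-in scalar <codeph>ln()</codeph> function:</p>
      <codeblock>SELECT 1 * ln(4/3::float8) AS weight_doc1, 2 * ln(4/3::float8) AS weight_doc4;
    weight_doc1    |    weight_doc4
-------------------+-------------------
 0.287682072451781 | 0.575364144903562</codeblock>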
      <p>You can determine the <i>angular distance</i> between one document and the
        rest of the documents by taking the arccosine (<codeph>ACOS</codeph>) of the
        dot product of the normalized document vectors:</p>
<codeblock>CREATE TABLE weights AS
(SELECT docnum, (feature_vector*logidf) tf_idf
FROM (SELECT log(count(feature_vector)/vec_count_nonzero(feature_vector))
AS logidf FROM corpus) foo, corpus ORDER BY docnum)
DISTRIBUTED RANDOMLY;</codeblock>
<p>Calculate the angular distance between the first document and every other
document:</p>
<codeblock>SELECT docnum, trunc((180.*(ACOS(dmin(1.,(tf_idf%*%testdoc)/(l2norm(tf_idf)*l2norm(testdoc))))/(4.*ATAN(1.))))::numeric,2)
AS angular_distance FROM weights,
(SELECT tf_idf testdoc FROM weights WHERE docnum = 1 LIMIT 1) foo
ORDER BY 1;
docnum | angular_distance
--------+------------------
1 | 0.00
2 | 78.82
3 | 90.00
4 | 80.02</codeblock>
      <p>You can see that the angular distance between document 1 and itself is 0
        degrees, and that the distance between documents 1 and 3 is 90 degrees
        because they share no features at all.</p>
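      <p>Because documents 1 and 3 have no non-zero positions in common, their
        tf/idf dot product is zero, which you can confirm directly:</p>
      <codeblock>SELECT a.tf_idf %*% b.tf_idf AS dot_product
FROM weights a, weights b
WHERE a.docnum = 1 AND b.docnum = 3;</codeblock>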
</body>
</topic>
</topic>
</dita>
@@ -21,12 +21,11 @@
<li><codeph><xref href="fuzzystrmatch.xml"
format="dita">fuzzystrmatch</xref></codeph> - Determines similarities and differences
between strings.</li>
<li><codeph><xref
href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README"
format="html" scope="external">gp_sparse_vector</xref></codeph> - Implements
a Greenplum Database data type that uses compressed storage of zeros to help
with floating point calculations.</li>
<li><codeph><xref href="hstore.xml" format="dita"
<li><codeph><xref href="gp_sparse_vector.xml" scope="peer"
format="dita">gp_sparse_vector</xref></codeph> - Implements
a Greenplum Database data type that uses compressed storage of zeros to make
vector computations on floating point numbers faster.</li>
<li><codeph><xref href="hstore.xml" scope="peer" format="dita"
>hstore</xref></codeph> - Provides a data type for storing sets of key/value pairs
within a single PostgreSQL value.</li>
<li><codeph><xref href="orafce_ref.xml" format="dita"
@@ -5,7 +5,7 @@
<topicref href="citext.xml" linking="none"/>
<topicref href="dblink.xml" linking="none"/>
<topicref href="fuzzystrmatch.xml" linking="none"/>
<topicref href="https://github.com/greenplum-db/gpdb/tree/master/gpcontrib/gp_sparse_vector/README" format="html" scope="external" navtitle="gp_sparse_vector" linking="none"/>
<topicref href="gp_sparse_vector.xml" linking="none"/>
<topicref href="hstore.xml" linking="none"/>
<topicref href="orafce_ref.xml" linking="none"/>
<topicref href="pageinspect.xml" linking="none"/>