# Corpora and Vector Spaces
This tutorial is also available as a [Jupyter Notebook](https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb).
Don't forget to set
```
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
```
if you want to see logging events.
## [From Strings to Vectors](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors "Permalink to this headline")
This time, let's start from documents represented as strings:
```
>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
>>> "A survey of user opinion of computer system response time",
>>> "The EPS user interface management system",
>>> "System and human system engineering testing of EPS",
>>> "Relation of user perceived response time to error measurement",
>>> "The generation of random binary unordered trees",
>>> "The intersection graph of paths in trees",
>>> "Graph minors IV Widths of trees and well quasi ordering",
>>> "Graph minors A survey"]
```
This is a tiny corpus of nine documents, each consisting of only a single sentence.
First, let's tokenize the documents, remove common words (using a toy stoplist) as well as words that appear only once in the corpus:
```
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>> for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
>>> for token in text:
>>> frequency[token] += 1
>>>
>>> texts = [[token for token in text if frequency[token] > 1]
>>> for text in texts]
>>>
>>> from pprint import pprint # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
```
The way you process your documents will likely vary with the application and the language; here, I tokenize by splitting on whitespace, then lowercase each word. In fact, I use this particular (simplistic and inefficient) setup to mimic the experiment done in Deerwester et al.'s original LSA article [[1]](https://radimrehurek.com/gensim/tut1.html#id3).
The ways to process documents are so varied and application- and language-dependent that I decided to *not* constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its "surface" string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called *bag-of-words*), but keep in mind that different application domains call for different features, and, as always, it's [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)...
To convert documents to vectors, we'll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:
> "How many times does the word *system* appear in the document? Once."
It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and the ids is called a dictionary:
```
>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict') # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)
```
Here we assigned a unique integer id to all words appearing in the corpus with the [`gensim.corpora.dictionary.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary "gensim.corpora.dictionary.Dictionary") class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are twelve distinct words in the processed corpus, which means each document will be represented by twelve numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:
```
>>> print(dictionary.token2id)
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
```
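Since the dictionary was saved to disk above "for future reference", it can be loaded back later without re-scanning the texts; a minimal sketch:
```
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')  # counterpart of the save() call above
```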
To actually convert tokenized documents to vectors:
```
>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print(new_vec) # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]
```
The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id and returns the result as a sparse vector. The sparse vector `[(0, 1), (1, 1)]` therefore reads: in the document "Human computer interaction", the words computer (id 0) and human (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.
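To make those implicit zeros explicit, such a sparse vector can be expanded into a dense one. A minimal sketch using `gensim.matutils.sparse2full` (imported here purely for the demonstration):
```
>>> from gensim import matutils
>>> dense_vec = matutils.sparse2full(new_vec, len(dictionary))  # one float per dictionary entry
>>> print(dense_vec.shape)
(12,)
```
The same `doc2bow()` conversion, applied to the whole corpus at once: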
```
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus) # store to disk, for later use
>>> print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
```
By now it should be clear that the vector feature with `id=10` stands for the question "How many times does the word graph appear in the document?", and that the answer is "zero" for the first six documents and "one" for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example).
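To double-check which word a given feature id stands for, the dictionary supports reverse lookup by id:
```
>>> print(dictionary[10])  # which token has feature id 10?
graph
```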
## [Corpus Streaming – One Document at a Time](https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time "Permalink to this headline")
Note that the corpus above resides fully in memory, as a plain Python list. In this simple example, it doesn't matter much, but just to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. Gensim only requires that a corpus be able to return one document vector at a time:
```
>>> class MyCorpus(object):
>>> def __iter__(self):
>>> for line in open('mycorpus.txt'):
>>> # assume there's one document per line, tokens separated by whitespace
>>> yield dictionary.doc2bow(line.lower().split())
```
Download the sample [mycorpus.txt file here](https://radimrehurek.com/gensim/mycorpus.txt). The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is: walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via a dictionary to their ids and yield the resulting sparse vector inside `__iter__`, as in the sketch below.
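For instance, a minimal sketch of a streamed corpus that walks a directory of plain-text files (the directory name `corpus_dir` and the one-document-per-file layout are illustrative assumptions, not part of the tutorial's setup):
```
>>> import os
>>>
>>> class MyDirCorpus(object):
>>>     def __init__(self, top_dir):
>>>         self.top_dir = top_dir
>>>
>>>     def __iter__(self):
>>>         # assume each file under top_dir is one plain-text document
>>>         for root, dirs, files in os.walk(self.top_dir):
>>>             for fname in sorted(files):
>>>                 with open(os.path.join(root, fname)) as f:
>>>                     yield dictionary.doc2bow(f.read().lower().split())
```
Back to our one-document-per-line corpus: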
```
>>> corpus_memory_friendly = MyCorpus() # doesn't load the corpus into memory!
>>> print(corpus_memory_friendly)
```
The corpus is now an object. We didn't define any way to print it, so `print` just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time):
```
>>> for vector in corpus_memory_friendly: # load one vector into memory at a time
... print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
```
Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.
Similarly, to construct the dictionary without loading all texts into memory:
```
>>> from six import iteritems
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>> if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids) # remove stop words and words that appear only once
>>> dictionary.compactify() # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)
```
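As an aside, recent gensim versions can do similar pruning in a single call with `Dictionary.filter_extremes`. A rough sketch, with the caveat that `no_below` uses document frequency, only an approximation of the manual "appears only once" rule above, and stop words would still need to be removed separately:
```
>>> dictionary.filter_extremes(no_below=2, no_above=1.0, keep_n=None)  # drop tokens present in fewer than 2 documents
```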
And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequency of distinct words could be useful. As it turns out, it isn't, and we will need to apply a transformation on this simple representation first before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let's briefly turn our attention to *corpus persistency*.
## [Corpus Formats](https://radimrehurek.com/gensim/tut1.html#corpus-formats "Permalink to this headline")
There exist several file formats for serializing a Vector Space corpus (~sequence of vectors) to disk. Gensim implements them via the *streaming corpus interface* mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.
One of the more notable file formats is the [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html). To save a corpus in the Matrix Market format:
```
>>> # create a toy corpus of 2 documents, as a plain Python list
>>> corpus = [[(1, 0.5)], []] # make one document empty, for the heck of it
>>>
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
```
Other formats include [Joachim's SVMlight format](http://svmlight.joachims.org/), [Blei's LDA-C format](https://www.cs.princeton.edu/~blei/lda-c/) and the [GibbsLDA++ format](http://gibbslda.sourceforge.net/):
```
>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
>>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
```
Conversely, to load a corpus iterator from a Matrix Market file:
```
>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')
```
Corpus objects are streams, so typically you won't be able to print them directly:
```
>>> print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)
```
Instead, to view the contents of a corpus:
```
>>> # one way of printing a corpus: load it entirely into memory
>>> print(list(corpus)) # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]
```
or
```
>>> # another way of doing it: print one document at a time, making use of the streaming interface
>>> for doc in corpus:
... print(doc)
[(1, 0.5)]
[]
```
The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling `list(corpus)`.
To save the same Matrix Market document stream in Blei's LDA-C format,
```
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
```
In this way, gensim can also be used as a memory-efficient **I/O format conversion tool**: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy; check out the [code for the SVMlight corpus](https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py) for an example.
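Putting the two halves together, a conversion boils down to one load and one save; for instance, Matrix Market to SVMlight:
```
>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')
>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
```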
## [Compatibility with NumPy and SciPy](https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy "Permalink to this headline")
Gensim also contains [efficient utility functions](https://radimrehurek.com/gensim/matutils.html) to help converting from/to numpy matrices:
```
>>> import gensim
>>> import numpy as np
>>> numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
>>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix)  # by default, each column is one document
>>> number_of_corpus_features = numpy_matrix.shape[0]  # the matrix rows are the terms/features
>>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
```
And from/to `scipy.sparse` matrices:
```
>>> import scipy.sparse
>>> scipy_sparse_matrix = scipy.sparse.random(5,2) # random sparse matrix as example
>>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
```
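One detail worth noting: with no `num_terms` argument, `corpus2csc` infers the number of rows from the ids it actually sees in the stream, and a randomly generated sparse matrix this small may well contain no nonzero entries at all (`scipy.sparse.random` defaults to 1% density). Passing the dimension explicitly keeps the round trip shape-stable; a sketch:
```
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus, num_terms=5)
>>> print(scipy_csc_matrix.shape)  # terms x documents
(5, 2)
```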
---
For a complete reference (want to prune the dictionary to a smaller size? optimize the conversion between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue on to the next tutorial on [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html).
[[1]](https://radimrehurek.com/gensim/tut1.html#id1) This is the same corpus as used in [Deerwester et al. (1990): Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), Table 2.