# Distributed Computing

# Similarity Queries
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
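## [Similarity interface](https://radimrehurek.com/gensim/tut3.html#similarity-interface "Permalink to this headline")
In the previous tutorials on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html) and [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html), we covered what it means to create a corpus in the Vector Space Model and how to transform it between different vector spaces. A common reason for doing all this is that we want to determine the **similarity between pairs of documents**, or the **similarity between a specific document and a set of other documents** (such as a user query versus indexed documents).
To show how this can be done in gensim, let us consider the same corpus as in the previous examples (it originally comes from Deerwester et al.'s seminal 1990 article ["Indexing by Latent Semantic Analysis"](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf)):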
>>> from gensim import corpora, models, similarities
>>> dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
>>> corpus = corpora.MmCorpus('/tmp/deerwester.mm') # comes from the first tutorial, "From strings to vectors"
>>> print(corpus)
MmCorpus(9 documents, 12 features, 28 non-zero entries)
>>> lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
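Now suppose a user typed in the query "Human computer interaction". We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we only concentrate on a single aspect of possible similarities: the apparent semantic relatedness of their texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match: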
>>> doc = "Human computer interaction"
>>> vec_bow = dictionary.doc2bow(doc.lower().split())
>>> vec_lsi = lsi[vec_bow] # convert the query to LSI space
>>> print(vec_lsi)
[(0, -0.461821), (1, 0.070028)]
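The bag-of-words step above can be illustrated without gensim. The following is a minimal sketch, assuming a hypothetical `token2id` mapping (the ids below are made up for illustration and are not the ones stored in `/tmp/deerwester.dict`). It mimics what `dictionary.doc2bow` does for the query: known tokens are counted, and tokens missing from the vocabulary, such as "interaction", are silently dropped.

```python
from collections import Counter

# Hypothetical token -> id mapping for the 12-word Deerwester vocabulary.
# The ids are illustrative only; a real gensim Dictionary assigns its own.
token2id = {
    "human": 0, "interface": 1, "computer": 2, "user": 3,
    "system": 4, "response": 5, "time": 6, "eps": 7,
    "survey": 8, "trees": 9, "graph": 10, "minors": 11,
}

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

query = "Human computer interaction"
vec_bow = doc2bow_sketch(query.lower().split(), token2id)
print(vec_bow)  # [(0, 1), (2, 1)]
```

Only two of the three query words survive, so everything the LSI model learns about the query comes from "human" and "computer".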
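In addition, we will be using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to determine the similarity of two vectors. Cosine similarity is a standard measure in Vector Space Modeling, but when the vectors represent probability distributions, [different similarity measures](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence) may be more appropriate.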
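As a reminder of what the measure computes, here is a small plain-Python sketch of cosine similarity for dense vectors. It is an illustration only, not gensim's implementation, and the choice to return 0.0 for a zero vector is an assumption made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)

query = [-0.461821, 0.070028]            # the LSI query vector printed above
same_direction = [-0.923642, 0.140056]   # the query scaled by 2
opposite = [0.461821, -0.070028]         # the query negated

print(round(cosine_similarity(query, same_direction), 6))  # 1.0
print(round(cosine_similarity(query, opposite), 6))        # -1.0
```

The score depends only on direction, not on vector length, which is why a scaled copy of the query still scores 1.0.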
### [Initializing query structures](https://radimrehurek.com/gensim/tut3.html#initializing-query-structures "Permalink to this headline")
>>> index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
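> Warning
* The class `similarities.MatrixSimilarity` is only appropriate when the whole set of vectors fits into memory. For example, a corpus of one million documents would require 2GB of RAM in a 256-dimensional LSI space when used with this class.
* Without 2GB of free RAM, you would need to use the `similarities.Similarity` class. This class operates in fixed memory by splitting the index across multiple files on disk, called shards. It uses `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity` internally, so it is still fast, although slightly more complex.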
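The idea of sharding can be sketched in plain Python. This is not how `similarities.Similarity` is implemented; it is a toy illustration under simplifying assumptions (shards are in-memory lists here rather than files on disk, and `make_shards`, `query_shards` and `shard_size` are hypothetical names). It shows that only one shard has to be considered at a time while the results still line up with the original document order.

```python
import math

def unit(v):
    """Scale a vector to unit length so a dot product equals the cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def make_shards(vectors, shard_size):
    """Split the indexed vectors into consecutive chunks of at most shard_size."""
    normalized = [unit(v) for v in vectors]
    return [normalized[i:i + shard_size]
            for i in range(0, len(normalized), shard_size)]

def query_shards(shards, query):
    """Yield cosine similarities shard by shard, in original document order."""
    q = unit(query)
    for shard in shards:  # a disk-backed index would load each shard here
        for doc in shard:
            yield sum(a * b for a, b in zip(q, doc))

docs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.0], [0.5, 0.0]]
shards = make_shards(docs, shard_size=2)
sims = list(query_shards(shards, [2.0, 0.0]))
print(len(shards))                  # 3
print([round(s, 3) for s in sims])  # [1.0, 0.0, 0.707, -1.0, 1.0]
```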
>>> index.save('/tmp/deerwester.index')
>>> index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')
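Saving and loading with `save()` and `load()`, as shown above, works the same way for all similarity index classes (`similarities.Similarity`, `similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`). In what follows, `index` can likewise be an object of any of these classes. When in doubt, use `similarities.Similarity`, as it is the most scalable version, and it also supports adding more documents to the index later.
### [Performing queries](https://radimrehurek.com/gensim/tut3.html#performing-queries "Permalink to this headline")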
>>> sims = index[vec_lsi] # perform a similarity query against the corpus
>>> print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
[(0, 0.99809301), (1, 0.93748635), (2, 0.99844527), (3, 0.9865886), (4, 0.90755945),
(5, -0.12416792), (6, -0.1063926), (7, -0.098794639), (8, 0.05004178)]
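With some standard Python magic we sort these similarities into descending order and obtain the final answer to the query "Human computer interaction":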
>>> sims = sorted(enumerate(sims), key=lambda item: -item[1])
>>> print(sims) # print sorted (document number, similarity score) 2-tuples
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
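The thing to note here is that documents no. 2 ("The EPS user interface management system") and no. 4 ("Relation of user perceived response time to error measurement") would never be returned by a standard boolean full-text search, because they do not share any words with "Human computer interaction". However, after applying LSI, we can observe that both of them received quite high similarity scores (no. 2 is actually the most similar!), which corresponds better to our intuition that they share a "computer-human" related topic with the query. In fact, this semantic generalization is the reason why we apply transformations and do topic modelling in the first place.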
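The keyword-overlap claim can be checked directly. The sketch below uses the nine document texts quoted in the comments of the sorted output above and a naive whitespace tokenizer (an assumption made for this example; it is not the preprocessing used to build the dictionary) to list the words each document shares with the query.

```python
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
query_words = set("Human computer interaction".lower().split())

# For each document, collect the words it shares with the query.
shared = [sorted(query_words & set(doc.lower().split())) for doc in documents]

for doc_no, words in enumerate(shared):
    print(doc_no, words)

# Documents a plain keyword match would return (at least one shared word).
keyword_hits = [doc_no for doc_no, words in enumerate(shared) if words]
print(keyword_hits)  # [0, 1, 3]
```

Documents 2 and 4 have an empty overlap, yet LSI ranks them first and fifth with scores above 0.9.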
## [Where next?](https://radimrehurek.com/gensim/tut3.html#where-next "Permalink to this headline")
* Some parts could be implemented more efficiently (in C, for example) or make better use of parallelism (multiple machine cores).
* New algorithms are published all the time; help gensim keep up by [discussing them](https://groups.google.com/group/gensim) and [contributing code](https://github.com/piskvorky/gensim/wiki/Developer-page).
* Your **feedback is most welcome** and appreciated (and it's not just the code!): [idea contributions](https://github.com/piskvorky/gensim/wiki/Ideas-&-Features-proposals), [bug reports](https://github.com/piskvorky/gensim/issues), or just consider contributing [user stories and general questions](https://groups.google.com/group/gensim/topics).
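# Experiments on the English Wikipedia
This page describes the process of obtaining and processing Wikipedia, so that anyone can reproduce the results. It is assumed that you have gensim properly [installed](https://radimrehurek.com/gensim/install.html).
## [Preparing the corpus](https://radimrehurek.com/gensim/wiki.html#preparing-the-corpus "Permalink to this headline")
1. First, download the dump of all Wikipedia articles from [http://download.wikimedia.org/enwiki/](https://download.wikimedia.org/enwiki/) (you want the file enwiki-latest-pages-articles.xml.bz2, or enwiki-YYYYMMDD-pages-articles.xml.bz2 for date-specific dumps). This file is about 8GB in size and contains a compressed version of all articles from the English Wikipedia.
2. Convert the articles to plain text (processing the Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on the fly, and we don't even need to uncompress the whole archive to disk. There is a script included in gensim that does just that; run: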
`$ python -m gensim.scripts.make_wiki`
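> Note
* This pre-processing step makes two passes over the 8.2GB compressed wiki dump (one to extract the dictionary, one to create and store the sparse vectors) and takes about 9 hours on a laptop, so you may want to go have a coffee or two.
* Also, you will need about 35GB of free disk space to store the sparse output vectors. I recommend compressing these files immediately, e.g. with bzip2 (down to ~13GB). Gensim can work with compressed files directly, so this lets you save disk space.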
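Reading a compressed file on the fly, as mentioned in step 2, needs nothing beyond the Python standard library. The sketch below is not the code of the `make_wiki` script; it only illustrates the streaming idea on a tiny stand-in file (the file name `sample.xml.bz2` and its contents are made up), decompressing line by line so that the full uncompressed data is never written to disk.

```python
import bz2
import os
import tempfile

# Create a tiny compressed stand-in for the Wikipedia dump (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.xml.bz2")
with bz2.open(path, "wt", encoding="utf-8") as fout:
    fout.write("<page>first article</page>\n")
    fout.write("<page>second article</page>\n")

def stream_lines(bz2_path):
    """Yield decompressed lines one at a time; nothing is unpacked to disk."""
    with bz2.open(bz2_path, "rt", encoding="utf-8") as fin:
        for line in fin:
            yield line.rstrip("\n")

lines = list(stream_lines(path))
print(len(lines))  # 2
print(lines[0])    # <page>first article</page>
```

Because `stream_lines` is a generator, a consumer that makes two passes over the dump can simply call it twice instead of keeping the text in memory.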
## [Latent Semantic Analysis](https://radimrehurek.com/gensim/wiki.html#latent-semantic-analysis "Permalink to this headline")
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2') # use this if you compressed the TFIDF output (recommended)
>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
>>> # extract 400 LSI topics; use the default one-pass algorithm
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.print_topics(10)
topic #0(332.762): 0.425*"utc" + 0.299*"talk" + 0.293*"page" + 0.226*"article" + 0.224*"delete" + 0.216*"discussion" + 0.205*"deletion" + 0.198*"should" + 0.146*"debate" + 0.132*"be"
topic #1(201.852): 0.282*"link" + 0.209*"he" + 0.145*"com" + 0.139*"his" + -0.137*"page" + -0.118*"delete" + 0.114*"blacklist" + -0.108*"deletion" + -0.105*"discussion" + 0.100*"diff"
topic #2(191.991): -0.565*"link" + -0.241*"com" + -0.238*"blacklist" + -0.202*"diff" + -0.193*"additions" + -0.182*"users" + -0.158*"coibot" + -0.136*"user" + 0.133*"he" + -0.130*"resolves"
topic #3(141.284): -0.476*"image" + -0.255*"copyright" + -0.245*"fair" + -0.225*"use" + -0.173*"album" + -0.163*"cover" + -0.155*"resolution" + -0.141*"licensing" + 0.137*"he" + -0.121*"copies"
topic #4(130.909): 0.264*"population" + 0.246*"age" + 0.243*"median" + 0.213*"income" + 0.195*"census" + -0.189*"he" + 0.184*"households" + 0.175*"were" + 0.167*"females" + 0.166*"males"
topic #5(120.397): 0.304*"diff" + 0.278*"utc" + 0.213*"you" + -0.171*"additions" + 0.165*"talk" + -0.159*"image" + 0.159*"undo" + 0.155*"www" + -0.152*"page" + 0.148*"contribs"
topic #6(115.414): -0.362*"diff" + -0.203*"www" + 0.197*"you" + -0.180*"undo" + -0.180*"kategori" + 0.164*"users" + 0.157*"additions" + -0.150*"contribs" + -0.139*"he" + -0.136*"image"
topic #7(111.440): 0.429*"kategori" + 0.276*"categoria" + 0.251*"category" + 0.207*"kategorija" + 0.198*"kategorie" + -0.188*"diff" + 0.163*"категория" + 0.153*"categoría" + 0.139*"kategoria" + 0.133*"categorie"
topic #8(109.907): 0.385*"album" + 0.224*"song" + 0.209*"chart" + 0.204*"band" + 0.169*"released" + 0.151*"music" + 0.142*"diff" + 0.141*"vocals" + 0.138*"she" + 0.132*"guitar"
topic #9(102.599): -0.237*"league" + -0.214*"he" + -0.180*"season" + -0.174*"football" + -0.166*"team" + 0.159*"station" + -0.137*"played" + -0.131*"cup" + 0.131*"she" + -0.128*"utc"
Creating the LSI model of Wikipedia takes about 4 hours and 9 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). That is about **16,000 documents per minute, including all I/O**.
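The throughput figure follows directly from the corpus statistics printed by `print(mm)` and the running time quoted above. The short calculation below reproduces it and, as a side note, derives how sparse the TF-IDF matrix is.

```python
num_docs = 3931787        # documents reported by print(mm)
num_features = 100000     # features reported by print(mm)
num_nonzero = 756379027   # non-zero entries reported by print(mm)
lsi_minutes = 4 * 60 + 9  # 4 hours 9 minutes

docs_per_minute = num_docs / lsi_minutes
avg_nonzero_per_doc = num_nonzero / num_docs
density = num_nonzero / (num_docs * num_features)

print(round(docs_per_minute))         # 15790
print(round(avg_nonzero_per_doc, 1))  # 192.4
print(f"{density:.4%}")               # 0.1924%
```

An average document therefore has about 192 distinct weighted terms out of 100,000 possible features, which is why a sparse representation is used.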
## [Latent Dirichlet Allocation](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation "Permalink to this headline")
As with Latent Semantic Analysis above, first load the corpus iterator and the dictionary:
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2') # use this if you compressed the TFIDF output
>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
using serial LDA version on this node
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3931787 documents, updating model once every 10000 documents
>>> # print the most contributing words for 20 randomly selected topics
>>> lda.print_topics(20)
topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam
topic #1: 0.026*relay + 0.026*athletics + 0.025*metres + 0.023*freestyle + 0.022*hurdles + 0.020*ret + 0.017*divisão + 0.017*athletes + 0.016*bundesliga + 0.014*medals
topic #2: 0.002*were + 0.002*he + 0.002*court + 0.002*his + 0.002*had + 0.002*law + 0.002*government + 0.002*police + 0.002*patrolling + 0.002*their
topic #3: 0.040*courcelles + 0.035*centimeters + 0.023*mattythewhite + 0.021*wine + 0.019*stamps + 0.018*oko + 0.017*perennial + 0.014*stubs + 0.012*ovate + 0.011*greyish
topic #4: 0.039*al + 0.029*sysop + 0.019*iran + 0.015*pakistan + 0.014*ali + 0.013*arab + 0.010*islamic + 0.010*arabic + 0.010*saudi + 0.010*muhammad
topic #5: 0.020*copyrighted + 0.020*northamerica + 0.014*uncopyrighted + 0.007*rihanna + 0.005*cloudz + 0.005*knowles + 0.004*gaga + 0.004*zombie + 0.004*wigan + 0.003*maccabi
topic #6: 0.061*israel + 0.056*israeli + 0.030*sockpuppet + 0.025*jerusalem + 0.025*tel + 0.023*aviv + 0.022*palestinian + 0.019*ifk + 0.016*palestine + 0.014*hebrew
topic #7: 0.015*melbourne + 0.014*rovers + 0.013*vfl + 0.012*australian + 0.012*wanderers + 0.011*afl + 0.008*dinamo + 0.008*queensland + 0.008*tracklist + 0.008*brisbane
topic #8: 0.011*film + 0.007*her + 0.007*she + 0.004*he + 0.004*series + 0.004*his + 0.004*episode + 0.003*films + 0.003*television + 0.003*best
topic #9: 0.019*wrestling + 0.013*château + 0.013*ligue + 0.012*discus + 0.012*estonian + 0.009*uci + 0.008*hockeyarchives + 0.008*wwe + 0.008*estonia + 0.007*reign
topic #10: 0.078*edits + 0.059*notability + 0.035*archived + 0.025*clearer + 0.022*speedy + 0.021*deleted + 0.016*hook + 0.015*checkuser + 0.014*ron + 0.011*nominator
topic #11: 0.013*admins + 0.009*acid + 0.009*molniya + 0.009*chemical + 0.007*ch + 0.007*chemistry + 0.007*compound + 0.007*anemone + 0.006*mg + 0.006*reaction
topic #12: 0.018*india + 0.013*indian + 0.010*tamil + 0.009*singh + 0.008*film + 0.008*temple + 0.006*kumar + 0.006*hindi + 0.006*delhi + 0.005*bengal
topic #13: 0.047*bwebs + 0.024*malta + 0.020*hobart + 0.019*basa + 0.019*columella + 0.019*huon + 0.018*tasmania + 0.016*popups + 0.014*tasmanian + 0.014*modèle
topic #14: 0.014*jewish + 0.011*rabbi + 0.008*bgwhite + 0.008*lebanese + 0.007*lebanon + 0.006*homs + 0.005*beirut + 0.004*jews + 0.004*hebrew + 0.004*caligari
topic #15: 0.025*german + 0.020*der + 0.017*von + 0.015*und + 0.014*berlin + 0.012*germany + 0.012*die + 0.010*des + 0.008*kategorie + 0.007*cross
topic #16: 0.003*can + 0.003*system + 0.003*power + 0.003*are + 0.003*energy + 0.002*data + 0.002*be + 0.002*used + 0.002*or + 0.002*using
topic #17: 0.049*indonesia + 0.042*indonesian + 0.031*malaysia + 0.024*singapore + 0.022*greek + 0.021*jakarta + 0.016*greece + 0.015*dord + 0.014*athens + 0.011*malaysian
topic #18: 0.031*stakes + 0.029*webs + 0.018*futsal + 0.014*whitish + 0.013*hyun + 0.012*thoroughbred + 0.012*dnf + 0.012*jockey + 0.011*medalists + 0.011*racehorse
topic #19: 0.119*oblast + 0.034*uploaded + 0.034*uploads + 0.033*nordland + 0.025*selsoviet + 0.023*raion + 0.022*krai + 0.018*okrug + 0.015*hålogaland + 0.015*russiae + 0.020*manga + 0.017*dragon + 0.012*theme + 0.011*dvd + 0.011*super + 0.011*hunter + 0.009*ash + 0.009*dream + 0.009*angel
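Creating this LDA model of Wikipedia takes about 6 hours and 20 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). If you need your results faster, consider running [Distributed Latent Dirichlet Allocation](https://radimrehurek.com/gensim/dist_lda.html) on a cluster of computers.
Note two differences between the LDA and LSA runs. First, we asked LSA to extract 400 topics but LDA only 100 topics, so the difference in speed is in fact even greater. Second, the LSA implementation in gensim is truly online: if the nature of the input stream changes over time, LSA will re-orient itself to reflect these changes within a reasonably small number of updates. In contrast, LDA is not truly online (the name of the [[3]](https://radimrehurek.com/gensim/wiki.html#id8) article notwithstanding), because the impact of later updates on the model gradually diminishes. If there is topic drift in the input document stream, LDA will get confused and become increasingly slow at adjusting itself to the new state of affairs.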
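The diminishing influence of later updates can be made concrete with the step-size schedule from the online LDA paper [3], where update number t is blended into the model with weight (τ0 + t)^(-κ). The sketch below only evaluates that formula. The values `offset=1.0` (τ0) and `decay=0.5` (κ) are illustrative assumptions rather than settings taken from the run above, and the update count is derived from the corpus size and the chunk size shown in the training log.

```python
import math

num_docs = 3931787
chunksize = 10000
num_updates = math.ceil(num_docs / chunksize)  # one model update per chunk

def step_size(t, offset=1.0, decay=0.5):
    """Weight given to update number t (0-based): (offset + t) ** -decay."""
    return (offset + t) ** -decay

print(num_updates)                           # 394
print(round(step_size(0), 3))                # 1.0
print(round(step_size(99), 3))               # 0.1
print(round(step_size(num_updates - 1), 3))  # 0.05
```

Under these assumed values, the last chunk of the corpus moves the model about twenty times less than the first one did, so a change of topic late in the stream is absorbed slowly.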
>>> # extract 100 LDA topics, using 20 full passes, no online updates
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
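>>> # transform a new, unseen document into its LDA topic distribution;
>>> # doc_bow is that document's bag-of-words vector, as produced by a dictionary's doc2bow()
>>> doc_lda = lda[doc_bow]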
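[1] *([1](https://radimrehurek.com/gensim/wiki.html#id1), [2](https://radimrehurek.com/gensim/wiki.html#id4))* My laptop = MacBook Pro, Intel Core i7 2.3GHz, 16GB DDR3 RAM, OS X with libVec.
We are mostly interested in performance here, but it is also interesting to look at the retrieved LSA concepts. I am no expert on Wikipedia and have no insight into its internals, but Brian Mingus had this to say about the result: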
There appears to be a lot of noise in your dataset. The first three topics
in your list appear to be meta topics, concerning the administration and
cleanup of Wikipedia. These show up because you didn't exclude templates
such as these, some of which are included in most articles for quality
control: http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup
The fourth and fifth topics clearly shows the influence of bots that import
massive databases of cities, countries, etc. and their statistics such as
population, capita, etc.
The sixth shows the influence of sports bots, and the seventh of music bots.
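So the top ten concepts are apparently dominated by Wikipedia robots and expanded templates. This is a good reminder that LSA is a powerful tool for data analysis, but no silver bullet. As always, it's [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)... By the way, improvements to the Wiki markup parsing code are welcome :-)
[3] *([1](https://radimrehurek.com/gensim/wiki.html#id3), [2](https://radimrehurek.com/gensim/wiki.html#id5))* Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation [[pdf](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)] [[code](https://www.cs.princeton.edu/~mdhoffma/)]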
## [Why distributed computing?](https://radimrehurek.com/gensim/distributed.html#why-distributed-computing "Permalink to this headline")
