diff --git a/README.md b/README.md
index 8dc39f560b0a4df57aee3d0bfbb1f48727a79870..a4208140adc4699f1b09e39706645aae57f726bc 100644
--- a/README.md
+++ b/README.md
@@ -93,8 +93,8 @@ Dive into Deep Learning with PyTorch.
 [10.2 Approximate Training](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.2_approx-training.md)
 [10.3 Implementing word2vec](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.3_word2vec-pytorch.md)
 [10.4 Subword Embedding (fastText)](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.4_fasttext.md)
-[10.5 Word Embedding with Global Vectors (GloVe)](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.5_glove.md)
-
+[10.5 Word Embedding with Global Vectors (GloVe)](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.5_glove.md)
+[10.6 Finding Synonyms and Analogies](https://github.com/ShusenTang/Dive-into-DL-PyTorch/blob/master/docs/chapter10_natural-language-processing/10.6_similarity-analogy.md)
diff --git a/code/chapter10_natural-language-processing/10.6_similarity-analogy.ipynb b/code/chapter10_natural-language-processing/10.6_similarity-analogy.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..6c4faa293a22655d23781810102dbe320b43e786
--- /dev/null
+++ b/code/chapter10_natural-language-processing/10.6_similarity-analogy.ipynb
@@ -0,0 +1,352 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 10.6 Finding Synonyms and Analogies\n",
+    "## 10.6.1 Using Pretrained Word Vectors"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.0.0\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "dict_keys(['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d'])"
+      ]
+     },
+     "execution_count": 1,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import torch\n",
+    "import torchtext.vocab as vocab\n",
+    "\n",
+    "print(torch.__version__)\n",
+    "vocab.pretrained_aliases.keys()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['glove.42B.300d',\n",
+       " 'glove.840B.300d',\n",
+       " 'glove.twitter.27B.25d',\n",
+       " 'glove.twitter.27B.50d',\n",
+       " 'glove.twitter.27B.100d',\n",
+       " 'glove.twitter.27B.200d',\n",
+       " 'glove.6B.50d',\n",
+       " 'glove.6B.100d',\n",
+       " 'glove.6B.200d',\n",
+       " 'glove.6B.300d']"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "[key for key in vocab.pretrained_aliases.keys()\n",
+    " if \"glove\" in key]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# glove = vocab.pretrained_aliases[\"glove.6B.50d\"](cache=\"/Users/tangshusen/Datasets\")\n",
+    "glove = vocab.GloVe(name='6B', dim=50, cache=\"/Users/tangshusen/Datasets\") # equivalent to the line above"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "There are 400000 words in total.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"There are %d words in total.\" % len(glove.stoi))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(3366, 'beautiful')"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "glove.stoi['beautiful'], glove.itos[3366]"
+   ]
+  },
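+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick check of how `stoi`, `itos` and `vectors` fit together, here is a minimal sketch (assuming the 50-dimensional `glove.6B.50d` model loaded above) that fetches one word's vector directly:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Index the vectors Tensor with the word's index from stoi;\n",
+    "# the shape is torch.Size([50]) for the 50-dim model assumed here.\n",
+    "glove.vectors[glove.stoi['beautiful']].shape"
+   ]
+  },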
"execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3366, 'beautiful')" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "glove.stoi['beautiful'], glove.itos[3366]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 10.6.2 应用预训练词向量\n", + "### 10.6.2.1 求近义词" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def knn(W, x, k):\n", + " # 添加的1e-9是为了数值稳定性\n", + " cos = torch.matmul(W, x.view((-1,))) / (\n", + " (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())\n", + " _, topk = torch.topk(cos, k=k)\n", + " topk = topk.cpu().numpy()\n", + " return topk, [cos[i].item() for i in topk]" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def get_similar_tokens(query_token, k, embed):\n", + " topk, cos = knn(embed.vectors,\n", + " embed.vectors[embed.stoi[query_token]], k+1)\n", + " for i, c in zip(topk[1:], cos[1:]): # 除去输入词\n", + " print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cosine sim=0.856: chips\n", + "cosine sim=0.749: intel\n", + "cosine sim=0.749: electronics\n" + ] + } + ], + "source": [ + "get_similar_tokens('chip', 3, glove)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cosine sim=0.839: babies\n", + "cosine sim=0.800: boy\n", + "cosine sim=0.792: girl\n" + ] + } + ], + "source": [ + "get_similar_tokens('baby', 3, glove)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "cosine sim=0.921: lovely\n", + "cosine sim=0.893: gorgeous\n", + "cosine sim=0.830: wonderful\n" + ] + } + ], + "source": [ + "get_similar_tokens('beautiful', 3, glove)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 10.6.2.2 求类比词" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def get_analogy(token_a, token_b, token_c, embed):\n", + " vecs = [embed.vectors[embed.stoi[t]] \n", + " for t in [token_a, token_b, token_c]]\n", + " x = vecs[1] - vecs[0] + vecs[2]\n", + " topk, cos = knn(embed.vectors, x, 1)\n", + " return embed.itos[topk[0]]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'daughter'" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_analogy('man', 'woman', 'son', glove)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'japan'" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_analogy('beijing', 'china', 'tokyo', glove)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'biggest'" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_analogy('bad', 'worst', 'big', 
glove)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'went'" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "get_analogy('do', 'did', 'go', glove)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python [default]", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.8" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/chapter10_natural-language-processing/10.6_similarity-analogy.md b/docs/chapter10_natural-language-processing/10.6_similarity-analogy.md index 6f82b889a248ac1424f5ce76bfe54095f6c68ffb..43344f8e2c79b5d6de7daa0339e2c5fa44493f44 100644 --- a/docs/chapter10_natural-language-processing/10.6_similarity-analogy.md +++ b/docs/chapter10_natural-language-processing/10.6_similarity-analogy.md @@ -1,120 +1,174 @@ # 10.6 求近义词和类比词 -在[“word2vec的实现”](./word2vec-gluon.md)一节中,我们在小规模数据集上训练了一个word2vec词嵌入模型,并通过词向量的余弦相似度搜索近义词。实际中,在大规模语料上预训练的词向量常常可以应用到下游自然语言处理任务中。本节将演示如何用这些预训练的词向量来求近义词和类比词。我们还将在后面两节中继续应用预训练的词向量。 +在10.3节(word2vec的实现)中,我们在小规模数据集上训练了一个word2vec词嵌入模型,并通过词向量的余弦相似度搜索近义词。实际中,在大规模语料上预训练的词向量常常可以应用到下游自然语言处理任务中。本节将演示如何用这些预训练的词向量来求近义词和类比词。我们还将在后面两节中继续应用预训练的词向量。 ## 10.6.1 使用预训练的词向量 -MXNet的`contrib.text`包提供了跟自然语言处理相关的函数和类(更多参见GluonNLP工具包 [1])。下面查看它目前提供的预训练词嵌入的名称。 +基于PyTorch的关于自然语言处理的常用包有官方的[torchtext](https://github.com/pytorch/text)以及第三方的[pytorch-nlp](https://github.com/PetrochukM/PyTorch-NLP)等等。你可以使用`pip`很方便地按照它们,例如命令行执行 +``` +pip install torchtext +``` +详情请参见其README。 + -```{.python .input} -from mxnet import nd -from mxnet.contrib import text +本节我们使用torchtext进行练习。下面查看它目前提供的预训练词嵌入的名称。 -text.embedding.get_pretrained_file_names().keys() +``` python +import torch +import torchtext.vocab as vocab + +vocab.pretrained_aliases.keys() +``` +输出: +``` +dict_keys(['charngram.100d', 'fasttext.en.300d', 'fasttext.simple.300d', 'glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']) ``` -给定词嵌入名称,可以查看该词嵌入提供了哪些预训练的模型。每个模型的词向量维度可能不同,或是在不同数据集上预训练得到的。 +下面查看查看该`glove`词嵌入提供了哪些预训练的模型。每个模型的词向量维度可能不同,或是在不同数据集上预训练得到的。 -```{.python .input n=35} -print(text.embedding.get_pretrained_file_names('glove')) +``` python +[key for key in vocab.pretrained_aliases.keys() + if "glove" in key] +``` +输出: +``` +['glove.42B.300d', + 'glove.840B.300d', + 'glove.twitter.27B.25d', + 'glove.twitter.27B.50d', + 'glove.twitter.27B.100d', + 'glove.twitter.27B.200d', + 'glove.6B.50d', + 'glove.6B.100d', + 'glove.6B.200d', + 'glove.6B.300d'] ``` -预训练的GloVe模型的命名规范大致是“模型.(数据集.)数据集词数.词向量维度.txt”。更多信息可以参考GloVe和fastText的项目网站 [2,3]。下面我们使用基于维基百科子集预训练的50维GloVe词向量。第一次创建预训练词向量实例时会自动下载相应的词向量,因此需要联网。 +预训练的GloVe模型的命名规范大致是“模型.(数据集.)数据集词数.词向量维度”。更多信息可以参考GloVe和fastText的项目网站[1,2]。下面我们使用基于维基百科子集预训练的50维GloVe词向量。第一次创建预训练词向量实例时会自动下载相应的词向量到`cache`指定文件夹(默认为`.vector_cache`),因此需要联网。 -```{.python .input n=11} -glove_6b50d = text.embedding.create( - 'glove', pretrained_file_name='glove.6B.50d.txt') +``` python +# glove = 
vocab.pretrained_aliases["glove.6B.50d"](cache="/Users/tangshusen/Datasets") +glove = vocab.GloVe(name='6B', dim=50, cache="/Users/tangshusen/Datasets") # 与上面等价 ``` +返回的实例主要有以下三个属性: +* `stoi`: 词到索引的字典: +* `itos`: 索引到词的字典; +* `vectors`: 词向量。 -打印词典大小。其中含有40万个词和1个特殊的未知词符号。 +打印词典大小。其中含有40万个词。 -```{.python .input} -len(glove_6b50d) +``` python +print("一共包含%d个词。" % len(glove.stoi)) +``` +输出: +``` +一共包含400000个词。 ``` 我们可以通过词来获取它在词典中的索引,也可以通过索引获取词。 -```{.python .input n=12} -glove_6b50d.token_to_idx['beautiful'], glove_6b50d.idx_to_token[3367] +``` python +glove.stoi['beautiful'], glove.itos[3366] # (3366, 'beautiful') ``` + ## 10.6.2 应用预训练词向量 下面我们以GloVe模型为例,展示预训练词向量的应用。 ### 10.6.2.1 求近义词 -这里重新实现[“word2vec的实现”](./word2vec-gluon.md)一节中介绍过的使用余弦相似度来搜索近义词的算法。为了在求类比词时重用其中的求$k$近邻($k$-nearest neighbors)的逻辑,我们将这部分逻辑单独封装在`knn`函数中。 +这里重新实现10.3节(word2vec的实现)中介绍过的使用余弦相似度来搜索近义词的算法。为了在求类比词时重用其中的求$k$近邻($k$-nearest neighbors)的逻辑,我们将这部分逻辑单独封装在`knn`函数中。 -```{.python .input} +``` python def knn(W, x, k): # 添加的1e-9是为了数值稳定性 - cos = nd.dot(W, x.reshape((-1,))) / ( - (nd.sum(W * W, axis=1) + 1e-9).sqrt() * nd.sum(x * x).sqrt()) - topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32') - return topk, [cos[i].asscalar() for i in topk] + cos = torch.matmul(W, x.view((-1,))) / ( + (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt()) + _, topk = torch.topk(cos, k=k) + topk = topk.cpu().numpy() + return topk, [cos[i].item() for i in topk] ``` 然后,我们通过预训练词向量实例`embed`来搜索近义词。 -```{.python .input} +``` python def get_similar_tokens(query_token, k, embed): - topk, cos = knn(embed.idx_to_vec, - embed.get_vecs_by_tokens([query_token]), k+1) + topk, cos = knn(embed.vectors, + embed.vectors[embed.stoi[query_token]], k+1) for i, c in zip(topk[1:], cos[1:]): # 除去输入词 - print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[i]))) + print('cosine sim=%.3f: %s' % (c, (embed.itos[i]))) ``` 已创建的预训练词向量实例`glove_6b50d`的词典中含40万个词和1个特殊的未知词。除去输入词和未知词,我们从中搜索与“chip”语义最相近的3个词。 -```{.python .input} -get_similar_tokens('chip', 3, glove_6b50d) +``` python +get_similar_tokens('chip', 3, glove) +``` +输出: +``` +cosine sim=0.856: chips +cosine sim=0.749: intel +cosine sim=0.749: electronics ``` 接下来查找“baby”和“beautiful”的近义词。 -```{.python .input} -get_similar_tokens('baby', 3, glove_6b50d) +``` python +get_similar_tokens('baby', 3, glove) +``` +输出: +``` +cosine sim=0.839: babies +cosine sim=0.800: boy +cosine sim=0.792: girl ``` -```{.python .input} -get_similar_tokens('beautiful', 3, glove_6b50d) +``` python +get_similar_tokens('beautiful', 3, glove) +``` +输出: +``` +cosine sim=0.921: lovely +cosine sim=0.893: gorgeous +cosine sim=0.830: wonderful ``` ### 10.6.2.2 求类比词 除了求近义词以外,我们还可以使用预训练词向量求词与词之间的类比关系。例如,“man”(男人): “woman”(女人):: “son”(儿子) : “daughter”(女儿)是一个类比例子:“man”之于“woman”相当于“son”之于“daughter”。求类比词问题可以定义为:对于类比关系中的4个词 $a : b :: c : d$,给定前3个词$a$、$b$和$c$,求$d$。设词$w$的词向量为$\text{vec}(w)$。求类比词的思路是,搜索与$\text{vec}(c)+\text{vec}(b)-\text{vec}(a)$的结果向量最相似的词向量。 -```{.python .input} +``` python def get_analogy(token_a, token_b, token_c, embed): - vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c]) + vecs = [embed.vectors[embed.stoi[t]] + for t in [token_a, token_b, token_c]] x = vecs[1] - vecs[0] + vecs[2] - topk, cos = knn(embed.idx_to_vec, x, 1) - return embed.idx_to_token[topk[0]] + topk, cos = knn(embed.vectors, x, 1) + return embed.itos[topk[0]] ``` 验证一下“男-女”类比。 -```{.python .input n=18} -get_analogy('man', 'woman', 'son', glove_6b50d) +``` python +get_analogy('man', 'woman', 'son', glove) # 'daughter' ``` 

 Let's verify the "man-woman" analogy.

-```{.python .input n=18}
-get_analogy('man', 'woman', 'son', glove_6b50d)
+``` python
+get_analogy('man', 'woman', 'son', glove) # 'daughter'
 ```

 "Capital-country" analogy: "beijing" (Beijing) is to "china" (China) as "tokyo" (Tokyo) is to what? The answer should be "japan" (Japan).

-```{.python .input n=19}
-get_analogy('beijing', 'china', 'tokyo', glove_6b50d)
+``` python
+get_analogy('beijing', 'china', 'tokyo', glove) # 'japan'
 ```

 "Adjective-superlative adjective" analogy: "bad" is to "worst" as "big" is to what? The answer should be "biggest".

-```{.python .input n=20}
-get_analogy('bad', 'worst', 'big', glove_6b50d)
+``` python
+get_analogy('bad', 'worst', 'big', glove) # 'biggest'
 ```

 "Present tense-past tense" analogy: "do" is to "did" as "go" is to what? The answer should be "went".

-```{.python .input n=21}
-get_analogy('do', 'did', 'go', glove_6b50d)
+``` python
+get_analogy('do', 'did', 'go', glove) # 'went'
 ```

 ## Summary

@@ -126,8 +180,10 @@ get_analogy('do', 'did', 'go', glove_6b50d)

 ## References

-[1] GluonNLP toolkit. https://gluon-nlp.mxnet.io/
-[2] GloVe project website. https://nlp.stanford.edu/projects/glove/
+[1] GloVe project website. https://nlp.stanford.edu/projects/glove/
+
+[2] fastText project website. https://fasttext.cc/

-[3] fastText project website. https://fasttext.cc/
+-----------
+> Note: apart from the code, this section is essentially the same as the original book. [Link to the original section](https://zh.d2l.ai/chapter_natural-language-processing/similarity-analogy.html)
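+
+Since the next two sections continue to apply these pretrained vectors to downstream tasks, we close with a minimal sketch of one common pattern (an illustrative example, not code from the book): copying the GloVe matrix into a PyTorch embedding layer.
+
+``` python
+import torch.nn as nn
+
+# Copy the pretrained GloVe matrix into an embedding layer;
+# freeze=False would instead allow fine-tuning on the downstream task.
+embedding = nn.Embedding.from_pretrained(glove.vectors, freeze=True)
+idx = torch.tensor([glove.stoi['beautiful']])
+print(embedding(idx).shape)  # torch.Size([1, 50]) for glove.6B.50d
+```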