Commit 473f2eef authored by W wizardforcel

2021-01-16 22:37:26

Parent b6aab600
......@@ -258,7 +258,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
return
7. We can also define a **get_word_embedding()** function, which will allow us to extract embeddings for a given word after our model has been trained:
def get_word_embedding(self, word):
......@@ -266,7 +266,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
return self.embeddings(word).view(1, -1)
8. Now, we are ready to train our model. We first create an instance of our model and define the loss function and optimizer:
model = CBOW(corpus_length, embedding_length)
......@@ -274,7 +274,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
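A minimal sketch of this step, assuming the `CBOW` class and `corpus_length` (the vocabulary size) were defined in the earlier steps, and using the 20-dimensional embeddings mentioned later in this section. The choice of `NLLLoss` is an assumption that fits a forward pass ending in `log_softmax`:

```py
import torch
import torch.nn as nn

embedding_length = 20

model = CBOW(corpus_length, embedding_length)
# NLLLoss pairs with log-probability outputs (e.g. from log_softmax);
# treat this as an assumption about the model defined earlier.
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```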
9. We then create a helper function that takes our input context words, gets the word index for each of these, and transforms them into a tensor of length 4, which forms the input to our neural network:
def make_sentence_vector(sentence,word_dict):
......@@ -290,7 +290,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
Figure 3.11 – Tensor value
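A sketch of the helper described in step 9, assuming `word_dict` maps each word in the corpus to an integer index:

```py
import torch

def make_sentence_vector(sentence, word_dict):
    # look up the index of each context word and wrap them in a LongTensor
    idxs = [word_dict[word] for word in sentence]
    return torch.tensor(idxs, dtype=torch.long)

# e.g. the four context words around the target "deep":
# make_sentence_vector(["is", "a", "learning", "framework"], word_dict)
```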
10. Now, we train our network. We loop through 100 epochs and, for each pass, we loop through all our (context words, target word) pairs. For each of these pairs, we load the context sentence using **make_sentence_vector()** and use our current model state to obtain predictions. We evaluate these predictions against our actual target in order to obtain our loss. We backpropagate to calculate the gradients and step through our optimizer to update the weights. Finally, we sum all our losses for the epoch and print this out. Here, we can see that our loss is decreasing, showing that our model is learning:
for epoch in range(100):
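A sketch of how the loop above might be fleshed out, assuming `data` is the list of (context words, target word) pairs built earlier and that `model`, `loss_function`, `optimizer`, and `make_sentence_vector()` are defined as in the previous steps:

```py
import torch

for epoch in range(100):
    epoch_loss = 0
    for context, target in data:
        model.zero_grad()                                  # reset gradients
        context_vector = make_sentence_vector(context, word_dict)
        log_probs = model(context_vector)                  # forward pass
        target_index = torch.tensor([word_dict[target]], dtype=torch.long)
        loss = loss_function(log_probs, target_index)
        loss.backward()                                    # backpropagate
        optimizer.step()                                   # update weights
        epoch_loss += loss.item()
    print(f'Epoch: {epoch}, Loss: {epoch_loss:.4f}')
```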
......@@ -324,7 +324,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
Now that our model has been trained, we can make predictions. We define a couple of functions to allow us to do so: `get_predicted_result()` returns the predicted word from the array of predictions, while our `predicted_sentence()` function makes a prediction based on a set of context words.
11. We split our sentences into individual words and transform them into an input vector. We then create our prediction array by feeding this into our model and get our final predicted word by using the **get_predicted_result()** function. We also print the two words before and after the predicted target word for context. We can run a couple of predictions to validate that our model is working correctly:
def get_predicted_result(input,inverse_word_dict):
......@@ -354,7 +354,7 @@ X = ["is", "a", "learning", "framework"]; y = "deep"
Figure 3.13 – Predicted values
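A sketch of the two prediction helpers mentioned above. Here, `inverse_word_dict` is assumed to be the index-to-word inverse of `word_dict`; the exact signatures in the original notebook may differ:

```py
import torch

def get_predicted_result(prediction_array, inverse_word_dict):
    index = torch.argmax(prediction_array).item()    # highest-scoring vocabulary index
    return inverse_word_dict[index]

def predicted_sentence(sentence_text):
    sentence = sentence_text.split()                 # the four context words
    prediction = model(make_sentence_vector(sentence, word_dict))
    # print the two words before the gap, the predicted word, then the two after
    print(' '.join(sentence[:2]),
          get_predicted_result(prediction[0], inverse_word_dict),
          ' '.join(sentence[2:]))
```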
12. Now that we have a trained model, we are able to use the **get_word_embedding()** function in order to return the 20-dimensional word embedding for any word in our corpus. If we need our embeddings for another NLP task, we can actually extract the weights from the whole embedding layer and use these in our new model:
print(model.get_word_embedding('leap'))
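If the trained embeddings are needed elsewhere, the full weight matrix of the embedding layer can be pulled out directly. A minimal sketch, assuming the `nn.Embedding` layer is stored as `self.embeddings` inside the model (as the `get_word_embedding()` method above suggests):

```py
# one 20-dimensional row per word in the corpus
embedding_matrix = model.embeddings.weight.data
```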
......@@ -442,7 +442,7 @@ My favourite language is ___
Next, we will learn about tokenization for NLP, a way of pre-processing text so that it can be fed into our models. Tokenization splits our sentences into smaller parts. This could involve splitting a sentence into its individual words or breaking a whole document down into individual sentences. This is an essential pre-processing step for NLP that can be done fairly simply in Python:
1. We first take a basic sentence and split this up into individual words using the **word tokenizer** in NLTK:
text = 'This is a sentence.'
......@@ -456,7 +456,7 @@ My favourite language is ___
Figure 3.18 – Splitting the sentence
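A minimal, self-contained version of this step might look as follows (assuming the `punkt` tokenizer models have been downloaded via NLTK):

```py
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')

text = 'This is a sentence.'
tokens = word_tokenize(text)
print(tokens)   # ['This', 'is', 'a', 'sentence', '.']
```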
2. Note how a period (**.**) is considered a token as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or dispose of the punctuation:
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
......@@ -468,7 +468,7 @@ My favourite language is ___
Figure 3.19 – Removing punctuation
3. We can also tokenize documents into individual sentences using the **sentence tokenizer**:
text = "This is the first sentence. This is the second sentence. A document contains many sentences."
......@@ -480,7 +480,7 @@ My favourite language is ___
Figure 3.20 – Splitting multiple sentences into individual sentences
4. Alternatively, we can combine the two to split our document into individual sentences of words:
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])
......@@ -490,7 +490,7 @@ My favourite language is ___
Figure 3.21 – Splitting multiple sentences into words
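A short sketch covering steps 3 and 4 together (again assuming the `punkt` models are available):

```py
from nltk.tokenize import sent_tokenize, word_tokenize

text = ("This is the first sentence. This is the second sentence. "
        "A document contains many sentences.")

# Step 3: split the document into sentences
print(sent_tokenize(text))

# Step 4: split each sentence into its words
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])
```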
5. One other optional step in the process of tokenization is the removal of stopwords. Stopwords are very common words that do not contribute to the overall meaning of a sentence. These include words such as `a`, `I`, and `or`. We can print a complete list from NLTK using the following code:
stop_words = stopwords.words('english')
......@@ -502,7 +502,7 @@ My favourite language is ___
Figure 3.22 – Displaying stopwords
6. We can easily remove these stopwords from our words using a basic list comprehension:
text = 'This is a sentence.'
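The full removal step might be sketched like this, assuming the NLTK `stopwords` corpus has been downloaded:

```py
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

text = 'This is a sentence.'
stop_words = stopwords.words('english')

tokens = word_tokenize(text)
no_stopwords = [word for word in tokens if word.lower() not in stop_words]
print(no_stopwords)   # only the non-stopword tokens remain
```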
......@@ -648,7 +648,7 @@ This is a small giraffe
Here, we will implement TF-IDF on a dataset using the Emma corpus from the NLTK datasets. This dataset consists of sentences from Jane Austen's book *Emma*, and we wish to calculate an embedded vector representation for each of these sentences:
1. We start by importing our dataset and looping through each of the sentences, removing any punctuation and non-alphanumeric characters (such as asterisks). We choose to leave stopwords in our dataset to demonstrate how TF-IDF accounts for these, as these words appear in many documents and so have a very low IDF. We create a list of parsed sentences and a set of the distinct words in our corpus:
emma = nltk.corpus.gutenberg.sents('austen-emma.txt')
......@@ -668,7 +668,7 @@ This is a small giraffe
emma_word_set = set(emma_word_set)
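A sketch of this preprocessing step, assuming the NLTK Gutenberg corpus has been downloaded; the exact cleaning rules in the original notebook may differ slightly:

```py
import nltk

nltk.download('gutenberg')

emma = nltk.corpus.gutenberg.sents('austen-emma.txt')

emma_sentences = []
emma_word_set = []
for sentence in emma:
    # keep only alphanumeric tokens, lower-cased
    cleaned = [word.lower() for word in sentence if word.isalnum()]
    emma_sentences.append(cleaned)
    emma_word_set.extend(cleaned)

emma_word_set = set(emma_word_set)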
2. Next, we create a function that will return our Term Frequency for a given word in a given document. We take the length of the document to give us the number of words and count the occurrences of this word in the document before returning the ratio. Here, we can see that the word **ago** appears in the sentence once and that the sentence is 41 words long, giving us a Term Frequency of 0.024:
def TermFreq(document, word):
......@@ -686,7 +686,7 @@ This is a small giraffe
Figure 3.29 – TF-IDF score
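A sketch of the Term Frequency helper, following the description above:

```py
def TermFreq(document, word):
    doc_length = len(document)
    occurrences = len([w for w in document if w == word])
    return occurrences / doc_length

# e.g. a 41-word sentence containing 'ago' once gives 1/41 ≈ 0.024
```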
3. Next, we calculate our Document Frequency. In order to do this efficiently, we first need to pre-compute a Document Frequency dictionary. This loops through all the data and counts the number of documents each word in our corpus appears in. We pre-compute this so that we do not have to perform this loop every time we wish to calculate the Document Frequency for a given word:
def build_DF_dict():
......@@ -708,7 +708,7 @@ This is a small giraffe
df_dict['ago']
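A sketch of the pre-computed Document Frequency dictionary, built from the parsed sentences and word set of step 1:

```py
def build_DF_dict():
    output = {}
    for word in emma_word_set:
        output[word] = 0
        for doc in emma_sentences:
            if word in doc:
                output[word] += 1
    return output

df_dict = build_DF_dict()
df_dict['ago']   # number of sentences containing 'ago' (32 in the text above)
```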
4. Here, we can see that the word **ago** appears within 32 of our documents. Using this dictionary, we can very easily calculate our Inverse Document Frequency by dividing the total number of documents by our Document Frequency and taking the logarithm of this value. Note how we add one to the Document Frequency to avoid a divide-by-zero error when the word doesn't appear in the corpus:
def InverseDocumentFrequency(word):
......@@ -726,7 +726,7 @@ This is a small giraffe
InverseDocumentFrequency('ago')
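Following the description above, the IDF can be sketched as the log of the document count divided by the (incremented) Document Frequency:

```py
import numpy as np

def InverseDocumentFrequency(word):
    N = len(emma_sentences)                 # total number of documents
    try:
        df = df_dict[word] + 1              # +1 avoids division by zero
    except KeyError:
        df = 1
    return np.log(N / df)

InverseDocumentFrequency('ago')
```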
5. Finally, we simply combine the Term Frequency and Inverse Document Frequency to get the TF-IDF weighting for each word/document pair:
def TFIDF(doc,word):
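The combined weighting is then simply the product of the two helpers defined above:

```py
def TFIDF(doc, word):
    tf = TermFreq(doc, word)
    idf = InverseDocumentFrequency(word)
    return tf * idf

# e.g. TFIDF(some_sentence, 'ago') for any parsed sentence containing 'ago'
```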
......@@ -752,7 +752,7 @@ This is a small giraffe
Next, we can show how these TF-IDF weightings can be applied to embeddings:
1. We first load our pre-computed GLoVe embeddings to provide the initial embedding representation of words in our corpus:
def loadGlove(path):
......@@ -774,7 +774,7 @@ This is a small giraffe
glove = loadGlove('glove.6B.50d.txt')
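A sketch of a loader for the standard GLoVe text format, where each line holds a word followed by its 50 floating-point values:

```py
import numpy as np

def loadGlove(path):
    glove = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            vector = np.array(parts[1:], dtype='float32')
            glove[word] = vector
    return glove

glove = loadGlove('glove.6B.50d.txt')
```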
2. We then calculate an unweighted mean of all the individual embeddings in our document to get a vector representation of the sentence as a whole. We simply loop through all the words in our document, extract the embedding from the GLoVe dictionary, and calculate the average over all these vectors:
embeddings = []
......@@ -792,7 +792,7 @@ This is a small giraffe
Figure 3.31 – Mean embedding
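A sketch of the unweighted average, using an arbitrary sentence from the parsed corpus as the example document (and assuming every word in it appears in the GLoVe vocabulary):

```py
import numpy as np

sentence = emma_sentences[5]          # arbitrary example sentence

embeddings = []
for word in sentence:
    embeddings.append(glove[word])

mean_embedding = np.mean(embeddings, axis=0).reshape(1, -1)
print(mean_embedding)
```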
3. We repeat this process to calculate our TF-IDF weighted document vector, but this time, we multiply our vectors by their TF-IDF weighting before we average them:
embeddings = []
......@@ -812,7 +812,7 @@ This is a small giraffe
Figure 3.32 – TF-IDF embedding
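The weighted variant is almost identical; each word vector is scaled by its TF-IDF weight before averaging:

```py
import numpy as np

embeddings = []
for word in sentence:
    tfidf = TFIDF(sentence, word)
    embeddings.append(glove[word] * tfidf)

tfidf_weighted_embedding = np.mean(embeddings, axis=0).reshape(1, -1)
print(tfidf_weighted_embedding)
```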
4. We can then compare the TF-IDF weighted embedding with our average embedding to see how similar they are. We can do this using cosine similarity, as follows:
cosine_similarity(mean_embedding, tfidf_weighted_embedding)
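One way to compute this is with scikit-learn's `cosine_similarity`, which accepts the two row vectors produced above:

```py
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(mean_embedding, tfidf_weighted_embedding))
```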
......
......@@ -226,11 +226,11 @@ Cat -> Cats' (Plural possessive)
The **Porter Stemmer** is an algorithm with a large number of logical rules that can be used to return the stem of a word. Before we go on to discuss this algorithm, we will first show how to implement a Porter Stemmer in Python using NLTK:
1. First, we create an instance of the Porter Stemmer:
porter = PorterStemmer()
2. We then simply call this instance of the stemmer on individual words and print the results. Here, we can see an example of the stems returned by the Porter Stemmer:
word_list = ["see", "saw", "cat", "cats", "stem", "stemming", "lemma", "lemmatization", "known", "knowing", "time", "timing", "football", "footballers"]
......@@ -244,7 +244,7 @@ Cat -> Cats' (Plural possessive)
Figure 4.7 – Returned stems
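A self-contained sketch of this step:

```py
from nltk.stem import PorterStemmer

porter = PorterStemmer()

word_list = ["see", "saw", "cat", "cats", "stem", "stemming", "lemma",
             "lemmatization", "known", "knowing", "time", "timing",
             "football", "footballers"]

for word in word_list:
    print(word, '->', porter.stem(word))
```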
3. We can also apply stemming to an entire sentence, first by tokenizing the sentence and then by stemming each term individually:
def SentenceStemmer(sentence):
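A sketch of the sentence-level stemmer, assuming the NLTK `punkt` models are available for tokenization:

```py
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter = PorterStemmer()

def SentenceStemmer(sentence):
    tokens = word_tokenize(sentence)
    stems = [porter.stem(word) for word in tokens]
    return " ".join(stems)

print(SentenceStemmer('The cats and dogs are running'))
```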
......
......@@ -683,7 +683,7 @@ mkdir 模型
Within our `app.py` file, we can begin to build our API:
1. We first carry out all of our imports and create a **predict** route. This allows us to call our API with the **predict** argument in order to run a **predict()** method within our API:
import flask
......@@ -711,7 +711,7 @@ mkdir 模型
@app.route('/predict', methods=['GET'])
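A sketch of what the skeleton of `app.py` might look like at this point; the exact import list in the completed file may differ:

```py
import json

import flask
import numpy as np
import torch
from flask import Flask, request

app = Flask(__name__)

@app.route('/predict', methods=['GET'])
def predict():
    # the body is filled in over the following steps
    ...
```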
2. Next, we define our **predict()** method within our **app.py** file. This is largely a rehash of our model file, so to avoid repetition of code, it is advised that you look at the completed **app.py** file within the GitHub repository linked in the *Technical requirements* section of this chapter. You will see that there are a few additional lines. Firstly, within our **preprocess_review()** function, we will see the following lines:
with open('models/word_to_int_dict.json') as handle:
......@@ -719,13 +719,13 @@ mkdir 模型
This uses the `word_to_int` dictionary that we calculated within our main model notebook and loads it into our model, so that our word indexing is consistent with our trained model. We then use this dictionary to convert our input text into an encoded sequence. Be sure to take the `word_to_int_dict.json` file from the original notebook output and place it within the `models` directory.
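A minimal sketch of that loading step; the variable name `word_to_int_dict` is an assumption:

```py
import json

with open('models/word_to_int_dict.json') as handle:
    word_to_int_dict = json.load(handle)   # word -> index mapping saved from the notebook
```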
3. Similarly, we must also load the weights from our trained model. We first define our **SentimentLSTM** class and then load our weights using **torch.load**. We will use the **.pkl** file from our original notebook, so be sure to place this in the **models** directory as well:
model = SentimentLSTM(5401, 50, 100, 1, 2)
model.load_state_dict(torch.load("models/model_nlp.pkl"))
4. We must also define the inputs and outputs of our API. We want our model to take the input from our API and pass this to our **preprocess_review()** function. We do this using **request.get_json()**:
request_json = request.get_json()
......@@ -733,7 +733,7 @@ mkdir 模型
words = np.array([preprocess_review(review=i)])
5. To define our output, we return a JSON response consisting of the output from our model and a response code, `200`, which is what is returned by our predict function:
output = model(x)[0].item()
......@@ -741,7 +741,7 @@ mkdir 模型
return response, 200
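A hedged sketch of how steps 4 and 5 might fit together inside `predict()`. The `'input'` JSON field name and the tensor conversion are assumptions; `preprocess_review()`, the loaded model, and the `app` object (with its imports) are taken from the earlier steps:

```py
@app.route('/predict', methods=['GET'])
def predict():
    request_json = request.get_json()
    i = request_json['input']                        # hypothetical input field name
    words = np.array([preprocess_review(review=i)])

    x = torch.from_numpy(words).long()               # encoded review as a LongTensor
    output = model(x)[0].item()                      # model score for this review

    response = json.dumps({'response': output})
    return response, 200
```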
6. With the main body of our app complete, there are just two more things we must add in order to make our API run. We must first add the following to our **wsgi.py** file:
from app import app as application
......@@ -749,7 +749,7 @@ mkdir 模型
application.run()
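The complete `wsgi.py` file is only a few lines; a sketch, assuming the Flask object inside `app.py` is named `app`:

```py
from app import app as application

if __name__ == "__main__":
    application.run()
```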
7. Finally, add the following to our Procfile:
web: gunicorn app:app --preload
......