9.We then create a helper function that takes our input context words, gets the word indexes for each of these, and transforms them into a tensor of length 4, which forms the input to our neural network:
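The exact helper is in the accompanying notebook; a minimal sketch, assuming a `word_dict` dictionary that maps each word in our corpus to an index, might look like this:

```python
import torch

def make_sentence_vector(sentence, word_dict):
    # Look up each context word's index and wrap the four indices in a LongTensor
    idxs = [word_dict[word] for word in sentence]
    return torch.tensor(idxs, dtype=torch.long)
```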
Figure 3.11 – Tensor values
10.Now, we train our network. We loop through 100 epochs and, for each pass, we loop through all of our context word and target word pairs. For each of these pairs, we load the context sentence using **make_sentence_vector()** and use our current model state to obtain predictions. We evaluate these predictions against our actual target in order to obtain our loss. We backpropagate to calculate the gradients and step through our optimizer to update the weights. Finally, we sum all our losses for the epoch and print this out. Here, we can see that our loss is decreasing, showing that our model is learning:
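A sketch of such a training loop, assuming a CBOW-style model that returns log-probabilities, an NLLLoss criterion, and (context, target) pairs stored in `data` (all variable names here are illustrative):

```python
for epoch in range(100):
    epoch_loss = 0
    for context, target in data:
        model.zero_grad()
        # Encode the four context words and run the forward pass
        context_vector = make_sentence_vector(context, word_dict)
        log_probs = model(context_vector)
        # Compare the prediction against the true target word
        loss = loss_function(log_probs,
                             torch.tensor([word_dict[target]], dtype=torch.long))
        loss.backward()      # compute gradients
        optimizer.step()     # update the weights
        epoch_loss += loss.item()
    print(f'Epoch: {epoch}, Loss: {epoch_loss}')
```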
11.We split our sentences into individual words and transform them into an input vector. We then create our prediction array by feeding this into our model and obtain our final predicted word by using the **get_predicted_result()** function. We also print the two words before and after the predicted target word for context. We can run a couple of predictions to validate that our model is working correctly:
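The exact implementation is in the notebook; a sketch of how **get_predicted_result()** and a small prediction wrapper might look, assuming `ix_to_word` is the reverse mapping of `word_dict`:

```python
def get_predicted_result(prediction, ix_to_word):
    # Take the index with the highest log-probability and map it back to a word
    index = torch.argmax(prediction, dim=1).item()
    return ix_to_word[index]

def predict_sentence(sentence_text):
    words = sentence_text.split()
    prediction = model(make_sentence_vector(words, word_dict))
    print('Preceding words: {}'.format(words[:2]))
    print('Predicted word: {}'.format(get_predicted_result(prediction, ix_to_word)))
    print('Following words: {}'.format(words[2:]))
```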
Figure 3.13 – Predicted values
12.Now that we have a trained model, we are able to use the **get_word_embedding()** function in order to return the 20-dimensional word embedding for any word in our corpus. If we needed our embeddings for another NLP task, we could extract the weights from the whole embedding layer and use them in our new model:
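A sketch of how such a lookup could be implemented as a method on the model, assuming the embedding layer is stored as `self.embeddings` and `word_dict` is in scope:

```python
def get_word_embedding(self, word):
    # Return the learned 20-dimensional vector for a single word
    word_index = torch.tensor([word_dict[word]], dtype=torch.long)
    return self.embeddings(word_index).view(1, -1)

# To reuse the embeddings elsewhere, the full weight matrix can be extracted:
# embedding_matrix = model.embeddings.weight.data
```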
1.We first take a basic sentence and split this up into individual words using the **word tokenizer** in NLTK:
text = 'This is a sentence.'
...
...
Figure 3.18 – Splitting the sentence
2.Note how a period (**.**) is considered a token as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or dispose of the punctuation:
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
...
...
Figure 3.19 – Removing punctuation
3.We can also tokenize documents into individual sentences using the **sentence tokenizer**:
text = "This is the first sentence. This is the second sentence. A document contains many sentences."
...
...
Figure 3.20 – Splitting multiple sentences into individual sentences
4.Alternatively, we can combine the two to split into individual sentences of words:
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])
...
...
Figure 3.21 – Splitting multiple sentences into words
5.One other optional step in the process of tokenization is the removal of stopwords. Stopwords are very common words that do not contribute to the overall meaning of a sentence. These include words such as `a`, `I`, and `or`. We can print a complete list from NLTK using the following code:
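The listing is not reproduced here; printing NLTK's English stopword list can be done as follows (the stopwords corpus may need to be downloaded first):

```python
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')          # only needed once
stop_words = stopwords.words('english')
print(stop_words)
```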
6.We can easily remove these stopwords from our words using basic list comprehension:
text = 'This is a sentence.'
...
...
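The comprehension itself is elided above; a minimal sketch, reusing `stop_words` from the previous step, might be:

```python
from nltk.tokenize import word_tokenize

text = 'This is a sentence.'
no_stopwords = [word for word in word_tokenize(text)
                if word.lower() not in stop_words]
print(no_stopwords)
```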
Here, we will implement TF-IDF on a dataset using the Emma corpus from the NLTK datasets. This dataset consists of sentences from Jane Austen's book *Emma*, and we wish to calculate an embedded vector representation for each of these sentences:
1.We start by importing our dataset and looping through each of the sentences, removing any punctuation and non-alphanumeric characters (such as asterisks). We choose to leave stopwords in our dataset to demonstrate how TF-IDF accounts for these, as these words appear in many documents and so have a very low IDF. We create a list of parsed sentences and a set of the distinct words in our corpus:
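A sketch of this preprocessing step, assuming the Emma text is loaded from NLTK's Gutenberg collection (the variable names are illustrative):

```python
import nltk
from nltk.corpus import gutenberg

nltk.download('gutenberg')  # only needed once

# Each entry is one sentence, already tokenized into words
emma_sents = gutenberg.sents('austen-emma.txt')

# Keep only alphanumeric tokens, lowercased; drop sentences left empty
emma_parsed = [[word.lower() for word in sentence if word.isalnum()]
               for sentence in emma_sents]
emma_parsed = [sentence for sentence in emma_parsed if sentence]

# The set of distinct words in the corpus
emma_words = set(word for sentence in emma_parsed for word in sentence)
```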
2.Next, we create a function that will return our Term Frequencies for a given word in a given document. We take the length of the document to give us the number of words and count the occurrences of this word in the document before returning the ratio. Here, we can see that the word **ago** appears in the sentence once and that the sentence is 41 words long, giving us a Term Frequency of 0.024:
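A minimal sketch of such a Term Frequency function, treating each document as a list of word tokens:

```python
def TermFreq(document, word):
    doc_length = len(document)
    occurrences = len([w for w in document if w == word])
    return occurrences / doc_length

# For a 41-word sentence containing 'ago' exactly once, this gives 1/41 ≈ 0.024
```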
3.Next, we calculate our Document Frequency. In order to do this efficiently, we first need to pre-compute a Document Frequency dictionary. This loops through all the data and counts the number of documents each word in our corpus appears in. We pre-compute this so that we do not have to perform this loop every time we wish to calculate the Document Frequency for a given word:
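A sketch of how the Document Frequency dictionary could be pre-computed, continuing with the illustrative `emma_parsed` and `emma_words` names from above:

```python
def build_DF_dict(documents, vocabulary):
    # For every word, count the number of documents it appears in
    df_dict = {word: 0 for word in vocabulary}
    for document in documents:
        for word in set(document):   # each word counted at most once per document
            df_dict[word] += 1
    return df_dict

df_dict = build_DF_dict(emma_parsed, emma_words)
```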
4.Here, we can see that the word **ago** appears in 32 of the documents in our corpus. Using this dictionary, we can very easily calculate our Inverse Document Frequency by dividing the total number of documents by our Document Frequency and taking the logarithm of this value. Note how we add one to the Document Frequency to avoid a divide-by-zero error when the word doesn't appear in the corpus:
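A sketch of the Inverse Document Frequency calculation described above, reusing the illustrative `df_dict` and `emma_parsed`:

```python
import math

def InverseDocumentFrequency(word):
    N = len(emma_parsed)              # total number of documents
    # Add one to the Document Frequency to avoid dividing by zero
    return math.log(N / (df_dict.get(word, 0) + 1))
```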
5.Finally, we simply combine the Term Frequency and Inverse Document Frequency to get the TF-IDF weighting for each word/document pair:
def TFIDF(doc,word):
...
...
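The body of the function is elided above; combining the two helpers sketched earlier, it amounts to:

```python
def TFIDF(doc, word):
    return TermFreq(doc, word) * InverseDocumentFrequency(word)
```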
Next, we can show how these TF-IDF weightings can be applied to embeddings:
1.We first load our pre-computed GLoVe embeddings to provide the initial embedding representation of words in our corpus:
def loadGlove(path):
...
...
glove = loadGlove('glove.6B.50d.txt')
2.We then calculate an unweighted mean of all the individual word embeddings in our document to get a vector representation of the sentence as a whole. We simply loop through all the words in our document, extract the embedding from the GLoVe dictionary, and calculate the average over all these vectors:
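A sketch of such an unweighted document embedding, assuming `glove` maps each word to a NumPy vector as returned by **loadGlove()**:

```python
import numpy as np

def mean_embedding(document):
    # Average the GloVe vectors of every in-vocabulary word in the document
    vectors = [glove[word] for word in document if word in glove]
    return np.mean(vectors, axis=0)
```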
3.We repeat this process to calculate our TF-IDF weighted document vector, but this time, we multiply our vectors by their TF-IDF weighting before we average them:
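A sketch of the weighted variant, scaling each word vector by its TF-IDF weight before averaging:

```python
def tfidf_weighted_embedding(document):
    vectors = [glove[word] * TFIDF(document, word)
               for word in document if word in glove]
    return np.mean(vectors, axis=0)
```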
4.We can then compare the TF-IDF weighted embedding with our average embedding to see how similar they are. We can do this using cosine similarity, as follows:
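One way to compare the two representations, using scikit-learn's cosine similarity (the choice of sentence is illustrative):

```python
from sklearn.metrics.pairwise import cosine_similarity

doc = emma_parsed[100]   # any sentence from the corpus
similarity = cosine_similarity(
    mean_embedding(doc).reshape(1, -1),
    tfidf_weighted_embedding(doc).reshape(1, -1),
)
print(similarity)        # values close to 1 indicate very similar vectors
```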
1.First, we create an instance of the Porter Stemmer:
porter = PorterStemmer()
2.We then simply call this instance of the stemmer on individual words and print the results. Here, we can see an example of the stems returned by the Porter Stemmer:
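For example (the words chosen here are illustrative):

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ['running', 'runs', 'runner', 'ran']:
    print(porter.stem(word))
# 'running' and 'runs' both reduce to the stem 'run'
```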
1. We first carry out all of our imports and create a **predict** route. This allows us to call our API with the **predict** argument in order to run a **predict()** method within our API:
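A sketch of the imports and the route declaration; the exact set of imports in the repository's **app.py** may differ, and the POST method is an assumption:

```python
import torch
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # The body of this method is built up in the following steps
    ...
```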
2. Next, we define our **predict()** method within our **app.py** file. This is largely a rehash of our model file, so to avoid repetition of code, it is advised that you look at the completed **app.py** file within the GitHub repository linked in the *Technical requirements* section of this chapter. You will see that there are a few additional lines. Firstly, within our **preprocess_review()** function, we will see the following lines:
3. Similarly, we must also load the weights from our trained model. We first define our **SentimentLSTM** class and then load our weights using **torch.load**. We will use the **.pkl** file from our original notebook, so be sure to place this in the **models** directory as well:
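A sketch of the weight-loading step; the constructor arguments and the file name are assumptions and must match whatever was used in the original notebook:

```python
model = SentimentLSTM(n_vocab, n_embed, n_hidden, n_output, n_layers)  # values from the notebook
model.load_state_dict(torch.load('models/model.pkl'))                  # file name assumed
model.eval()                                                            # inference mode
```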
4. We must also define the input and outputs of our API. We want our model to take the input from our API and pass this to our **preprocess_review()** function. We do this using **request.get_json()**:
5. To define our output, we return a JSON response consisting of the output from our model and a response code, `200`, which is what our predict function returns:
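A sketch of how these input and output steps might fit together inside **predict()**; the JSON key, the preprocessing call, and the simplified forward pass are assumptions rather than the repository's exact code:

```python
@app.route('/predict', methods=['POST'])
def predict():
    request_json = request.get_json()          # read the JSON body of the request
    review = request_json['input']             # 'input' key is an assumption
    words = preprocess_review(review)          # encode and pad the raw text
    with torch.no_grad():
        output = model(torch.tensor([words]))  # forward pass (simplified)
    return jsonify({'prediction': output.item()}), 200
```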
6. With the main body of our app complete, there are just two more additional things we must add in order to make our API run. We must first add the following to our **wsgi.py** file:
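The exact contents are in the repository; a typical minimal **wsgi.py** for a Flask app defined in **app.py** looks like this (assumed setup):

```python
# wsgi.py -- entry point used by a WSGI server such as Gunicorn
from app import app

if __name__ == '__main__':
    app.run()
```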