epoch_loss = 0
4.We then loop through each batch within our training iterator and extract the sentence to be translated (**src**) and the correct translation of this sentence (**trg**). We then zero our gradients (to prevent gradient accumulation) and calculate the output of our model by passing our model function our inputs and outputs:
output = model(src, trg)
5.Next, we need to calculate the loss of our model’s prediction by comparing our predicted output to the true, correct translated sentence. We reshape our output data and our target data using the shape and view functions in order to create two tensors that can be compared to calculate the loss. We calculate the **loss** criterion between our output and **trg** tensors and then backpropagate this loss through the network:
loss.backward()
6.We then implement gradient clipping to prevent exploding gradients within our model, step our optimizer in order to perform the necessary parameter updates via gradient descent, and finally add the loss of the batch to the epoch loss. This whole process is repeated for all the batches within a single training epoch, whereby the final averaged loss per batch is returned:
return epoch_loss / len(iterator)
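Putting steps 4 to 6 together, the following is a minimal sketch of the complete **train()** function. It assumes a torchtext-style iterator whose batches expose **src** and **trg**, along with the **model**, **optimizer**, **criterion**, and **clip** values defined earlier in the chapter; it is a reconstruction for illustration rather than the exact listing:

```python
import torch

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src                      # sentence to translate
        trg = batch.trg                      # correct translation
        optimizer.zero_grad()                # step 4: prevent gradient accumulation
        output = model(src, trg)             # forward pass
        output_dims = output.shape[-1]
        output = output[1:].view(-1, output_dims)   # step 5: reshape for the loss
        trg = trg[1:].view(-1)
        loss = criterion(output, trg)
        loss.backward()                      # backpropagate the loss
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # step 6: clip gradients
        optimizer.step()                     # update the parameters
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)        # average loss per batch
```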
7.After, we create a similar function called **evaluate()**. This function will calculate the loss of our validation data across the network in order to evaluate how our model performs when translating data it hasn’t seen before. This function is almost identical to our **train()** function, with the exception of the fact that we switch to evaluation mode:
8.Since we don’t perform any updates for our weights, we need to make sure to implement **no_grad** mode:
with torch.no_grad():
9.The only other difference is that we need to make sure we turn off teacher forcing when in evaluation mode. We wish to assess our model’s performance on unseen data, and enabling teacher forcing would use our correct (target) data to help our model make better predictions. We want our model to be able to make perfect, unaided predictions:
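A sketch of the matching **evaluate()** function from steps 7 to 9 is shown below. It mirrors **train()**, but switches the model to evaluation mode, disables gradient tracking with **no_grad**, and passes a teacher-forcing ratio of 0 (the third argument is assumed to be how the model's forward method exposes teacher forcing):

```python
def evaluate(model, iterator, criterion):
    model.eval()                             # step 7: switch to evaluation mode
    epoch_loss = 0
    with torch.no_grad():                    # step 8: no weight updates, so no gradients
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output = model(src, trg, 0)      # step 9: teacher forcing turned off
            output_dims = output.shape[-1]
            output = output[1:].view(-1, output_dims)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)
```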
10.Finally, we need to create a training loop, within which our **train()** and **evaluate()** functions are called. We begin by defining how many epochs we wish to train for and our maximum gradient (for use with gradient clipping). We also set our lowest validation loss to infinity. This will be used later to select our best-performing model:
minimum_validation_loss = float('inf')
11.We then loop through each of our epochs and within each epoch, calculate our training and validation loss using our **train()** and **evaluate()** functions. We also time how long this takes by calling **time.time()** before and after the training process:
end_time = time.time()
12.Next, for each epoch, we determine whether the model we just trained is the best-performing model we have seen thus far. If our model performs the best on our validation data (if the validation loss is the lowest we have seen so far), we save our model:
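The outer loop described in steps 10 to 12 could then look like the following sketch, which assumes the **train()** and **evaluate()** functions above together with illustrative values for the number of epochs, the gradient clip, and the checkpoint filename:

```python
import time

epochs = 10                                   # illustrative values
grad_clip = 1
minimum_validation_loss = float('inf')

for epoch in range(epochs):
    start_time = time.time()                  # step 11: time each epoch
    train_loss = train(model, train_iterator, optimizer, criterion, grad_clip)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    if valid_loss < minimum_validation_loss:  # step 12: keep the best-performing model
        minimum_validation_loss = valid_loss
        torch.save(model.state_dict(), 'best_model.pt')  # filename is illustrative
    print(f'Epoch {epoch+1:02} | time: {end_time - start_time:.0f}s | '
          f'train loss: {train_loss:.3f} | valid loss: {valid_loss:.3f}')
```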
1.We start by creating a **translate()** function. This is functionally identical to the **evaluate()** function we created to calculate the loss over our validation set. However, this time, we are not concerned with the loss of our model, but rather the predicted output. We pass the model our source and target sentences and also make sure we turn teacher forcing off so that our model does not use these to make predictions. We then take our model’s predictions and use an **argmax** function to determine the index of the word that our model predicted for each word in our predicted output sentence:
2.Then, we can use this index to obtain the actual predicted word from our German vocabulary. Finally, we compare the English input to our model against both the correct German sentence and the predicted German sentence. Note that here, we use **[1:-1]** to drop the start and end tokens from our predictions and we reverse the order of our English input (since the input sentences were reversed before they were fed into the model):
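A hedged sketch of this **translate()** helper is shown below. It assumes the torchtext-style **SRC** and **TRG** fields built earlier in the chapter, whose vocabularies expose an **itos** (index-to-string) list, and a single-sentence batch:

```python
def translate(model, src, trg):
    model.eval()
    with torch.no_grad():
        output = model(src, trg, 0)           # teacher forcing off
    preds = output.argmax(dim=-1)             # index of the most likely word at each step
    pred_words = [TRG.vocab.itos[idx] for idx in preds.squeeze(1).tolist()]
    src_words = [SRC.vocab.itos[idx] for idx in src.squeeze(1).tolist()]
    # [1:-1] drops the start and end tokens; the source is reversed back for display
    print('English input:    ', ' '.join(reversed(src_words[1:-1])))
    print('Predicted German: ', ' '.join(pred_words[1:-1]))
    return pred_words
```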
1.We start by creating our **Vocabulary** class. We initialize this class with empty dictionaries—**word2index** and **word2count**. We also initialize the **index2word** dictionary with placeholders for our padding tokens, as well as our **Start-of-Sentence** (**SOS**) and **End-of-Sentence** (**EOS**) tokens. We keep a running count of the number of words in our vocabulary, too (which is 3 to start with as our vocabulary already contains the three tokens mentioned). These are the default values for an empty vocabulary; however, they will be populated as we read our data in:
2.Next, we create the functions that we will use to populate our vocabulary. **addWord** takes a word as input. If this is a new word that is not already in our vocabulary, we add this word to our indices, set the count of this word to 1, and increment the total number of words in our vocabulary by 1. If the word in question is already in our vocabulary, we simply increment the count of this word by 1:
4.To remove low-frequency words from our vocabulary, we can implement a **trim** function. The function first loops through the word count dictionary and if the occurrence of the word is greater than the minimum required count, it is appended to a new list:
5.Finally, our indices are rebuilt from the new **words_to_keep** list. We set all the indices to their initial empty values and then repopulate them by looping through our kept words with the **addWord** function:
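A condensed sketch of the **Vocabulary** class covering steps 1 to 5 follows. The token indices (PAD = 0, SOS = 1, EOS = 2) come from the description above; the rest is an illustrative reconstruction:

```python
PAD_token, SOS_token, EOS_token = 0, 1, 2

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3                     # PAD, SOS and EOS are already present

    def addWord(self, word):
        if word not in self.word2index:        # new word: index it and start its count
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:                                  # known word: just increment its count
            self.word2count[word] += 1

    def addSentence(self, sentence):           # convenience wrapper assumed here
        for word in sentence.split(' '):
            self.addWord(word)

    def trim(self, min_count):
        # step 4: collect the words that occur at least min_count times
        words_to_keep = [w for w, c in self.word2count.items() if c >= min_count]
        # step 5: reset the indices and rebuild them from the kept words
        self.word2index = {}
        self.word2count = {}
        self.index2word = {PAD_token: "PAD", SOS_token: "SOS", EOS_token: "EOS"}
        self.num_words = 3
        for word in words_to_keep:
            self.addWord(word)
```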
1.The first step for reading in our data is to perform any necessary steps to clean the data and make it more human-readable. We start by converting it from Unicode into ASCII format. We can easily use a function to do this:
2.Next, we want to process our input strings so that they are all in lowercase and do not contain any trailing whitespace or punctuation, except the most basic characters. We can do this by using a series of regular expressions:
3.Finally, we apply this function within a wider function—**readVocs**. This function reads our data file into lines and then applies the **cleanString** function to every line. It also creates an instance of the **Vocabulary** class that we created earlier, meaning this function outputs both our data and vocabulary:
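The helpers from steps 1 to 3 could be sketched as follows. The exact regular expressions, the tab delimiter, and the one-pair-per-line file layout are assumptions made for illustration:

```python
import re
import unicodedata

def unicodeToAscii(s):
    # decompose accented characters and drop the combining marks
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def cleanString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)        # space out basic punctuation
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)    # drop everything else
    s = re.sub(r"\s+", r" ", s).strip()      # collapse repeated whitespace
    return s

def readVocs(datafile, corpus_name):
    lines = open(datafile, encoding='utf-8').read().strip().split('\n')
    pairs = [[cleanString(s) for s in line.split('\t')] for line in lines]
    voc = Vocabulary(corpus_name)
    return voc, pairs
```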
4.To do this, we create a couple of filter functions. The first one, **filterPair**, returns a Boolean value based on whether the current line has an input and output length that is less than the maximum length. Our second function, **filterPairs**, simply applies this condition to all the pairs within our dataset, only keeping the ones that meet this condition:
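A sketch of these two filters, assuming a **max_length** cutoff defined earlier and pairs stored as **[input, output]** lists:

```python
def filterPair(pair, max_length):
    # keep the pair only if both sentences are shorter than the cutoff
    return len(pair[0].split(' ')) < max_length and \
           len(pair[1].split(' ')) < max_length

def filterPairs(pairs, max_length):
    return [pair for pair in pairs if filterPair(pair, max_length)]
```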
5.Now, we just need to create one final function that applies all the previous functions we have put together and run it to create our vocabulary and data pairs:
1.We first calculate the percentage of words that we will keep within our model:
def removeRareWords(voc, all_pairs, minimum):
...
...
Figure 8.9 – Percentage of words to keep
2.Within this same function, we loop through all the words in the input and output sentences. If for a given pair either the input or output sentence has a word that isn't in our new trimmed corpus, we drop this pair from our dataset. We print the output and see that even though we have dropped over half of our vocabulary, we only drop around 17% of our training pairs. This again reflects how our corpus of words is distributed over our individual training pairs:
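Steps 1 and 2 together could be sketched like this, matching the **removeRareWords(voc, all_pairs, minimum)** signature shown above; the exact print-out is illustrative:

```python
def removeRareWords(voc, all_pairs, minimum):
    voc.trim(minimum)                          # step 1: drop low-frequency words
    pairs_to_keep = []
    for pair in all_pairs:
        input_sentence, output_sentence = pair[0], pair[1]
        # step 2: keep the pair only if every word survives the trimmed vocabulary
        keep = all(word in voc.word2index for word in input_sentence.split(' ')) and \
               all(word in voc.word2index for word in output_sentence.split(' '))
        if keep:
            pairs_to_keep.append(pair)
    print('Kept {} pairs out of {} ({:.1%})'.format(
        len(pairs_to_keep), len(all_pairs), len(pairs_to_keep) / len(all_pairs)))
    return pairs_to_keep
```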
1.We start by creating several helper functions, which we can use to transform our pairs into tensors. We first create an **indexFromSentence** function, which grabs the index of each word in the sentence from the vocabulary and appends an EOS token to the end:
2.Secondly, we create a **zeroPad** function, which pads any tensors with zeroes so that all of the sentences within the tensor are effectively the same length:
3.Then, to generate our input tensor, we apply both of these functions. First, we get the indices of our input sentence, then apply padding, and then transform the output into **LongTensor**. We will also obtain the lengths of each of our input sentences and output this as a tensor:
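A sketch of these input-side helpers, assuming the **PAD_token** and **EOS_token** indices defined with the vocabulary; **itertools.zip_longest** performs the zero-padding by transposing the batch and filling the shorter sentences:

```python
import itertools
import torch

def indexFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

def zeroPad(batch, fill_value=PAD_token):
    return list(itertools.zip_longest(*batch, fillvalue=fill_value))

def inputVar(sentences, voc):
    indexed = [indexFromSentence(voc, s) for s in sentences]
    lengths = torch.tensor([len(indexes) for indexes in indexed])
    padded = torch.LongTensor(zeroPad(indexed))   # shape: (max_length, batch_size)
    return padded, lengths
```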
4.Within our network, our padded tokens should generally be ignored. We don't want to train our model on these padded tokens, so we create a Boolean mask to ignore these tokens. To do so, we use a **getMask** function, which we apply to our output tensor. This simply returns `1` if the output consists of a word and `0` if it consists of a padding token:
5.We then apply this to our **outputVar** function. This is identical to the **inputVar** function, except that along with the indexed output tensor and the tensor of lengths, we also return the Boolean mask of our output tensor. This Boolean mask just returns **True** when there is a word within the output tensor and **False** when there is a padding token. We also return the maximum length of sentences within our output tensor:
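The output-side helpers from steps 4 and 5 could then look like this sketch, built on the same **indexFromSentence** and **zeroPad** helpers:

```python
def getMask(padded_batch, value=PAD_token):
    # 1 marks a real word, 0 marks a padding token
    return [[0 if token == value else 1 for token in seq] for seq in padded_batch]

def outputVar(sentences, voc):
    indexed = [indexFromSentence(voc, s) for s in sentences]
    max_target_len = max(len(indexes) for indexes in indexed)
    padded = zeroPad(indexed)
    mask = torch.BoolTensor(getMask(padded))      # True for words, False for padding
    padded = torch.LongTensor(padded)
    return padded, mask, max_target_len
```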
6.Finally, in order to create our input and output batches concurrently, we loop through the pairs in our batch and create input and output tensors for both using the functions we created previously. We then return all the necessary variables:
7.This function should be all we need to transform our training pairs into tensors for training our model. We can validate that this is working correctly by performing a single iteration of our **batch2Train** function on a random selection of our data. We set our batch size to `5` and run this once:
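A sketch of **batch2Train** and the single sanity-check iteration described in step 7; sorting the batch by input length is assumed here so that the packed sequences in the encoder can be built later:

```python
import random

def batch2Train(voc, batch_of_pairs):
    batch_of_pairs.sort(key=lambda pair: len(pair[0].split(' ')), reverse=True)
    input_batch = [pair[0] for pair in batch_of_pairs]
    output_batch = [pair[1] for pair in batch_of_pairs]
    inp, lengths = inputVar(input_batch, voc)
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lengths, output, mask, max_target_len

# single test iteration on five random pairs
example_batch = batch2Train(voc, [random.choice(pairs) for _ in range(5)])
```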
1.As with all of our PyTorch models, we start by creating an **Encoder** class that inherits from **nn.Module**. All the elements here should look familiar to the ones used in previous chapters:
3.Next, we need to create a forward pass for our encoder. We do this by first embedding our input sentences and then using the **pack_padded_sequence** function on our embeddings. This function "packs" our padded sequences so that the GRU only processes the real tokens and skips over the padding. We then pass the packed sequences through our GRU to perform a forward pass:
4.After this, we unpack our padding and sum the GRU outputs. We can then return this summed output, along with our final hidden state, to complete our forward pass:
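A sketch of the encoder described in steps 1 to 4 follows. The embedding layer is assumed to be created outside the class and passed in, and the GRU is assumed to be bidirectional, which is why the two directions are summed at the end; hyperparameter names are illustrative:

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedding = embedding
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout),
                          bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        embedded = self.embedding(input_seq)                    # step 3: embed the inputs
        packed = pack_padded_sequence(embedded, input_lengths)  # skip padding in the GRU
        outputs, hidden = self.gru(packed, hidden)              # forward pass through the GRU
        outputs, _ = pad_packed_sequence(outputs)               # step 4: unpack the padding
        outputs = (outputs[:, :, :self.hidden_size] +
                   outputs[:, :, self.hidden_size:])            # sum the two GRU directions
        return outputs, hidden
```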
2.Then, create the **dot_score** function within this class. This function simply calculates the dot product of our encoder output with our current hidden state. While there are other ways of transforming these two tensors into a single representation, using a dot product is one of the simplest:
3.We then use this function within our forward pass. First, calculate the attention weights/energies based on the **dot_score** method, then transpose the results, and return the softmax transformed probability scores:
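A sketch of this dot-product attention module covering steps 2 and 3; the tensor shapes in the comments are assumptions based on the encoder above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attn(nn.Module):
    def dot_score(self, hidden, encoder_outputs):
        # hidden: (1, batch, hidden_size); encoder_outputs: (max_length, batch, hidden_size)
        return torch.sum(hidden * encoder_outputs, dim=2)

    def forward(self, hidden, encoder_outputs):
        attn_energies = self.dot_score(hidden, encoder_outputs)  # (max_length, batch)
        attn_energies = attn_energies.t()                        # transpose to (batch, max_length)
        # softmax turns the energies into probability weights over the source positions
        return F.softmax(attn_energies, dim=1).unsqueeze(1)
```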