2.We then create our layers within this module. We will create an embedding layer and a corresponding dropout layer. We use GRUs again for our decoder; however, this time, we do not need to make our GRU layer bidirectional as we will be decoding the output from our encoder sequentially. We will also create two linear layers—one regular layer for calculating our output and one layer that can be used for concatenation. This layer is twice the width of the regular hidden layer as it will be used on two concatenated vectors, each with a length of **hidden_size**. We also initialize an instance of our attention module from the last section in order to be able to use it within our **Decoder** class:
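The following is a minimal sketch of how these layers might be declared. The class and argument names are illustrative rather than taken verbatim from the original code, and **Attn** is assumed to be the attention module built in the previous section:

```python
import torch.nn as nn

class DecoderRNN(nn.Module):
    # Illustrative class name; Attn is assumed to be the attention module
    # from the previous section, taking the hidden size as its argument
    def __init__(self, embedding, hidden_size, output_size,
                 n_layers=1, dropout=0.1):
        super().__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout

        # Embedding layer and its corresponding dropout layer
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        # Unidirectional GRU, as we decode the output sequentially
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout))
        # Concatenation layer: twice the width, as it acts on two
        # concatenated vectors of length hidden_size
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        # Regular linear layer for calculating the output
        self.out = nn.Linear(hidden_size, output_size)
        # Instance of the attention module from the last section
        self.attn = Attn(hidden_size)
```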
3.After defining all of our layers, we need to create a forward pass for the decoder. Notice how the forward pass will be used one step (word) at a time. We start by getting the embedding of the current input word and making a forward pass through the GRU layer to get our output and hidden states:
4.Next, we use the attention module to get the attention weights from the GRU output. These weights are then multiplied by the encoder outputs to effectively give us a weighted sum of our attention weights and our encoder output:
5.We then concatenate our weighted context vector with the output of our GRU and apply a **tanh** function to get our final concatenated output:
rnn_output = rnn_output.squeeze(0)
context = context.squeeze(1)
concat_input = torch.cat((rnn_output, context), 1)
concat_output = torch.tanh(self.concat(concat_input))
6.For the final step within our decoder, we simply use this final concatenated output to predict the next word and apply a **softmax** function. The forward pass finally returns this output, along with the final hidden state. This forward pass will be iterated upon, with the next forward pass using the next word in the sentence and this new hidden state:
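Putting steps 3 to 6 together, the decoder's forward pass might look like the following sketch, which continues the class from step 2. The variable names and the exact interface of the attention module are assumptions (the attention module is assumed to return normalized weights that can be batch-multiplied with the encoder outputs), and **torch** and **torch.nn.functional** (as **F**) are assumed to be imported:

```python
    def forward(self, input_step, last_hidden, encoder_outputs):
        # input_step holds a single word per batch element
        embedded = self.embedding(input_step)
        embedded = self.embedding_dropout(embedded)
        # Forward pass through the GRU to get output and hidden states
        rnn_output, hidden = self.gru(embedded, last_hidden)
        # Attention weights from the GRU output and the encoder outputs
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Weighted sum of the encoder outputs (the context vector)
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        # Concatenate the weighted context vector with the GRU output
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict the next word and apply a softmax function
        output = self.out(concat_output)
        output = F.softmax(output, dim=1)
        return output, hidden
```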
1.In the following function, we can see that we calculate cross-entropy loss across the whole output tensors. However, to get the total loss, we only average over the elements of the tensor that are selected by the Boolean mask:
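A sketch of such a masked loss function is shown below; the function name and signature are assumptions based on this description, with **inp** holding the decoder's output probabilities, **target** the true word indices, and **mask** the Boolean padding mask:

```python
import torch

def maskNLLLoss(inp, target, mask):
    # Number of non-padded elements selected by the mask
    n_total = mask.sum()
    # Cross-entropy calculated across the whole output tensor
    cross_entropy = -torch.log(torch.gather(inp, 1,
                               target.view(-1, 1)).squeeze(1))
    # Average only over the elements selected by the Boolean mask
    loss = cross_entropy.masked_select(mask).mean()
    return loss, n_total.item()
```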
2.For the majority of our training, we need two main functions: **train()**, which performs training on a single batch of our training data, and **trainIters()**, which iterates through our whole dataset and calls **train()** on each of the individual batches. We start by defining **train()** in order to train on a single batch of data. Create the **train()** function, then zero the gradients, define the device options, and initialize the variables:
3.Then, perform a forward pass of the inputs and sequence lengths through the encoder to get the output and hidden states:
encoder_outputs, encoder_hidden = encoder(input_variable, lengths)
4.Next, we create our initial decoder input, starting with SOS tokens for each sentence. We then set the initial hidden state of our decoder to be equal to that of the encoder:
decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
decoder_hidden = encoder_hidden[:decoder.n_layers]
6.Then, if we do need to implement teacher forcing, run the following code. We pass each of our sequence batches through the decoder to obtain our output. We then set the next input as the true output (**target**). Finally, we calculate and accumulate the loss using our loss function and print this to the console:
7.If we do not implement teacher forcing on a given batch, the procedure is almost identical. However, instead of using the true output as the next input into the sequence, we use the one generated by the model:
8.Finally, as with all of our models, the final steps are to perform backpropagation, implement gradient clipping, and step through both of our encoder and decoder optimizers to update the weights using gradient descent. Remember that we clip our gradients in order to prevent the vanishing/exploding gradient problem, which was discussed in earlier chapters. Finally, our training step returns our average loss:
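Putting steps 2 to 8 together, a sketch of the **train()** function might look as follows. It assumes that **device**, **SOS_token**, and the masked loss function from the previous section are already defined, and that **random**, **torch**, and **torch.nn** (as **nn**) are imported; the exact argument list is illustrative:

```python
def train(input_variable, lengths, target_variable, mask, max_target_len,
          encoder, decoder, encoder_optimizer, decoder_optimizer,
          batch_size, clip, teacher_forcing_ratio=1.0):
    # Zero the gradients and move the tensors to the relevant device
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()
    input_variable = input_variable.to(device)
    lengths = lengths.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass of the inputs and sequence lengths through the encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Initial decoder input (SOS tokens) and initial hidden state
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Decide whether to use teacher forcing on this batch
    use_teacher_forcing = random.random() < teacher_forcing_ratio

    for t in range(max_target_len):
        decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden,
                                                 encoder_outputs)
        if use_teacher_forcing:
            # Teacher forcing: the next input is the true output
            decoder_input = target_variable[t].view(1, -1)
        else:
            # No teacher forcing: the next input is the model's own prediction
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0]
                                               for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
        # Calculate and accumulate the masked loss
        mask_loss, n_total = maskNLLLoss(decoder_output,
                                         target_variable[t], mask[t])
        loss += mask_loss
        print_losses.append(mask_loss.item() * n_total)
        n_totals += n_total

    # Backpropagation, gradient clipping, and optimizer steps
    loss.backward()
    nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    nn.utils.clip_grad_norm_(decoder.parameters(), clip)
    encoder_optimizer.step()
    decoder_optimizer.step()

    # Return the average loss over the batch
    return sum(print_losses) / n_totals
```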
9.Next, as previously stated, we need to create the **trainIters()** function, which repeatedly calls our training function on different batches of input data. We start by splitting our data into batches using the **batch2Train** function we created earlier:
10.We then create a few variables that will allow us to count iterations and keep track of the total loss over each epoch:
print('Starting...')
start_iteration = 1
print_loss = 0
if loadFilename:
    start_iteration = checkpoint['iteration'] + 1
11.Next, we define our training loop. For each iteration, we get a training batch from our list of batches. We then extract the relevant fields from our batch and run a single training iteration using these parameters. Finally, we add the loss from this batch to our overall loss:
12.On every iteration, we also make sure we print our progress so far, keeping track of how many iterations we have completed and what our loss was for each epoch:
13.For the sake of completeness, we also need to save our model state after every few epochs. This allows us to revisit any historical models we have trained; for example, if our model were to begin overfitting, we could revert to an earlier iteration:
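Steps 9 to 13 can be brought together in a sketch of **trainIters()** like the one below. The call to **batch2Train** and the checkpoint format are assumptions based on the surrounding description, and **os**, **random**, and **torch** are assumed to be imported:

```python
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer,
               decoder_optimizer, embedding, epochs, batch_size,
               print_every, save_every, clip, loadFilename=None,
               checkpoint=None):
    # Split the data into batches using the batch2Train function
    # (exact signature assumed)
    training_batches = [batch2Train(voc, [random.choice(pairs)
                                          for _ in range(batch_size)])
                        for _ in range(epochs)]

    # Variables for counting iterations and tracking the total loss
    print('Starting...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # The training loop itself
    for iteration in range(start_iteration, epochs + 1):
        training_batch = training_batches[iteration - 1]
        # Extract the relevant fields from the batch
        input_variable, lengths, target_variable, mask, max_target_len = \
            training_batch
        # Run a single training iteration and accumulate the loss
        loss = train(input_variable, lengths, target_variable, mask,
                     max_target_len, encoder, decoder, encoder_optimizer,
                     decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print our progress so far
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print('Iteration: {}; Percent complete: {:.1f}%; '
                  'Average loss: {:.4f}'.format(
                      iteration, iteration / epochs * 100, print_loss_avg))
            print_loss = 0

        # Periodically save the model state so earlier models can be revisited
        if iteration % save_every == 0:
            directory = os.path.join('checkpoints', model_name)
            os.makedirs(directory, exist_ok=True)
            torch.save({'iteration': iteration,
                        'en': encoder.state_dict(),
                        'de': decoder.state_dict(),
                        'en_opt': encoder_optimizer.state_dict(),
                        'de_opt': decoder_optimizer.state_dict(),
                        'loss': loss,
                        'voc_dict': voc.__dict__,
                        'embedding': embedding.state_dict()},
                       os.path.join(directory,
                                    '{}_checkpoint.tar'.format(iteration)))
```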
1.We will start by defining a class that will allow us to decode the encoded input and produce text. We do this by using what is known as a **greedy decoder**. This simply means that at each step of the decoder, our model takes the word with the highest predicted probability as the output. We start by initializing the **GreedySearchDecoder()** class with our pretrained encoder and decoder:
2.Next, define a forward pass for our decoder. We pass the input through our encoder to get our encoder's output and hidden state. We take the encoder's final hidden state to be the first hidden input to the decoder:
4.After that, iterate through the sequence, decoding one word at a time. We perform a forward pass through the decoder and add a **max** function to obtain the highest-scoring predicted word and its score, which we then append to the **all_tokens** and **all_scores** variables. Finally, we take this predicted token and use it as the next input to our decoder. After the whole sequence has been iterated over, we return the complete predicted sentence:
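A sketch of the full **GreedySearchDecoder** class, covering steps 1 to 4, is shown below; **device** and **SOS_token** are assumed to be defined earlier in the chapter, and the decoder is assumed to take (input, hidden state, encoder outputs), as in the sketches above:

```python
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Pass the input through the encoder
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # The encoder's final hidden state is the decoder's first hidden input
        decoder_hidden = encoder_hidden[:self.decoder.n_layers]
        # Begin decoding from an SOS token
        decoder_input = torch.ones(1, 1, device=device,
                                   dtype=torch.long) * SOS_token
        # Tensors to collect the decoded tokens and their scores
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Decode one word at a time, greedily taking the highest-scoring word
        for _ in range(max_length):
            decoder_output, decoder_hidden = self.decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Use the predicted token as the next input to the decoder
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return the complete predicted sentence and scores
        return all_tokens, all_scores
```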
5.We first define an **evaluate()** function, which takes our input sentence and returns the predicted output words. We start by transforming our input sentence into indices using our vocabulary. We then obtain a tensor of the lengths of each of these sentences and transpose it:
6.Then, we assign our lengths and input tensors to the relevant devices. Next, run the inputs through the searcher (**GreedySearchDecoder**) to obtain the word indices of the predicted output. Finally, we transform these word indices back into word tokens before returning them as the function output:
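A sketch of **evaluate()** covering steps 5 and 6 follows. The helper that converts a sentence into vocabulary indices and the **index2word** lookup on the vocabulary object are assumed from earlier in the chapter, as is a **max_length** cap on the generated response:

```python
def evaluate(searcher, voc, sentence, max_length=10):
    # Transform the input sentence into indices using our vocabulary
    # (indexesFromSentence is assumed from earlier in the chapter)
    indices_batch = [indexesFromSentence(voc, sentence)]
    # Tensor of sentence lengths, plus the transposed input tensor
    lengths = torch.tensor([len(indices) for indices in indices_batch])
    input_batch = torch.LongTensor(indices_batch).transpose(0, 1)
    # Assign the lengths and input tensors to the relevant devices
    input_batch = input_batch.to(device)
    lengths = lengths.to(device)
    # Run the inputs through the searcher (GreedySearchDecoder)
    tokens, scores = searcher(input_batch, lengths, max_length)
    # Transform the word indices back into word tokens
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words
```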
7.Finally, we create a **runchatbot** function, which acts as the interface with our chatbot. This function takes human-typed input and prints the chatbot's response. We create this function as a **while** loop that continues until we terminate the function or type **quit** as our input:
8.We then take the typed input and normalize it, before passing the normalized input to our **evaluate()** function, which returns the predicted words from the chatbot:
9.Finally, we take these output words and format them, ignoring the EOS and padding tokens, before printing the chatbot's response. Because this is a **while** loop, we can continue the conversation with the chatbot indefinitely:
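Steps 7 to 9 can be sketched as follows; the normalization helper is assumed to be the one defined earlier in the chapter (called **cleanString** here purely as a placeholder), and the EOS and padding tokens are assumed to be rendered as the strings 'EOS' and 'PAD':

```python
def runchatbot(encoder, decoder, searcher, voc):
    input_sentence = ''
    while True:
        try:
            # Take human-typed input; typing 'quit' ends the conversation
            input_sentence = input('> ')
            if input_sentence == 'quit':
                break
            # Normalize the input before passing it to evaluate()
            input_sentence = cleanString(input_sentence)
            output_words = evaluate(searcher, voc, input_sentence)
            # Ignore the EOS and padding tokens, then print the response
            output_words = [word for word in output_words
                            if word not in ('EOS', 'PAD')]
            print('Response:', ' '.join(output_words))
        except KeyError:
            print('Error: word not in vocabulary, please try again.')
```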
1.We first initialize our hyperparameters. While these are only suggested hyperparameters, our models have been set up in a way that will allow them to adapt to whatever hyperparameters they are passed. It is good practice to experiment with different hyperparameters to see which ones result in an optimal model configuration. Here, you could experiment with increasing the number of layers in your encoder and decoder, increasing or decreasing the size of the hidden layers, or increasing the batch size. All of these hyperparameters will have an effect on how well your model learns, as well as a number of other factors, such as the time it takes to train the model:
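The values below are purely illustrative suggestions for these hyperparameters; any of them can be changed to experiment with the model configuration:

```python
# Suggested starting hyperparameters (illustrative values only)
model_name = 'chatbot_model'
hidden_size = 500        # width of the hidden layers
encoder_n_layers = 2     # number of layers in the encoder
decoder_n_layers = 2     # number of layers in the decoder
dropout = 0.15
batch_size = 64
```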
2.After that, we can load our checkpoints. If we have previously trained a model, we can load the checkpoints and model states from previous iterations. This saves us from having to retrain our model each time:
3.After that, we can begin to build our models. We first load our embeddings from the vocabulary. If we have already trained a model, we can load the trained embeddings layer:
4.We then do the same for our encoder and decoder, creating model instances using the defined hyperparameters. Again, if we have already trained a model, we simply load the trained model states into our models:
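Steps 2 to 4 might look like the following sketch. The encoder and decoder class names, their constructor arguments, and the checkpoint dictionary keys are assumptions consistent with the sketches earlier in this section, and **voc** is assumed to expose the vocabulary size as **num_words**:

```python
# Optionally load checkpoints and model states from a previous run
loadFilename = None  # set to a saved checkpoint path to resume training
checkpoint = None
if loadFilename:
    checkpoint = torch.load(loadFilename)
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']

# Load our embeddings from the vocabulary
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)

# Create encoder and decoder instances using the defined hyperparameters
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = DecoderRNN(embedding, hidden_size, voc.num_words,
                     decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
encoder = encoder.to(device)
decoder = decoder.to(device)
```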
6.We also need to define the learning rates of our models and our decoder learning ratio. You will find that your model performs better when the decoder carries out larger parameter updates during gradient descent. Therefore, we introduce a decoder learning ratio to apply a multiplier to the learning rate so that the learning rate is greater for the decoder than it is for the encoder. We also define how often our model prints and saves the results, as well as how many epochs we want our model to run for:
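Again, the values here are only suggestions; the important point is that the decoder's learning rate is the base learning rate multiplied by the decoder learning ratio:

```python
# Suggested optimization settings (illustrative values only)
learning_rate = 0.0001
decoder_learning_ratio = 5.0  # decoder lr = learning_rate * this ratio
print_every = 1               # how often progress is printed
save_every = 500              # how often the model state is saved
epochs = 4000                 # how many training iterations to run
clip = 50.0                   # gradient clipping threshold
```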
7.Next, as always when training models in PyTorch, we switch our models to training mode to allow the parameters to be updated:
encoder.train()
decoder.train()
8.Next, we create optimizers for both the encoder and decoder. We initialize these as Adam optimizers, but other optimizers will work equally well. Experimenting with different optimizers may yield different levels of model performance. If you have trained a model previously, you can also load the optimizer states if required:
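A sketch of the optimizer setup, assuming **torch.optim** is imported as **optim** and reusing the settings defined above:

```python
# Adam optimizers for the encoder and decoder; the decoder uses a larger
# learning rate via the decoder learning ratio
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(),
                               lr=learning_rate * decoder_learning_ratio)
# Optionally restore the optimizer states from a previous checkpoint
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)
```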
9.The final step before running the training is to make sure CUDA is configured to be called if you wish to use GPU training. To do this, we simply loop through the optimizer states for both the encoder and decoder and enable CUDA across all of the states:
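This can be done with a simple loop over both optimizers' states, moving any tensors onto the GPU:

```python
# Enable CUDA across all optimizer states for GPU training
for state in encoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()

for state in decoder_optimizer.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor):
            state[k] = v.cuda()
```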
1.To begin the evaluation, we first switch our model into evaluation mode. As with all other PyTorch models, this is done to prevent any further parameter updates occurring within the evaluation process:
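For example, switching to evaluation mode and then launching the chatbot might look like this, reusing the searcher and interface functions defined earlier:

```python
# Switch both models into evaluation mode
encoder.eval()
decoder.eval()

# Build the greedy searcher and start chatting
searcher = GreedySearchDecoder(encoder, decoder)
runchatbot(encoder, decoder, searcher, voc)
```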