# Language Translation with `torchtext`

This tutorial shows how to use `torchtext` to preprocess data from a well-known dataset containing sentences in both English and German, and use it to train a sequence-to-sequence model that can translate German sentences into English.

It is based on a tutorial by PyTorch community member [Ben Trevett](https://github.com/bentrevett) and is used with Ben's permission. We have updated the tutorial by removing some legacy code.

By the end of this tutorial, you will be able to preprocess sentences into tensors for NLP modeling and use [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) to train and validate a model.

## Data Processing

`torchtext` has utilities for creating datasets that can be easily iterated over for the purpose of building a language translation model. In this example, we show how to tokenize raw text sentences, build a vocabulary, and numericalize the tokens into tensors.

Note: the tokenization in this tutorial requires [Spacy](https://spacy.io). We use Spacy because it provides strong support for tokenization in languages other than English. `torchtext` provides a `basic_english` tokenizer and supports other tokenizers for English (e.g. [Moses](https://bitbucket.org/luismsgomes/mosestokenizer/src/default/)), but for language translation, where multiple languages are required, Spacy is your best bet.

To run this tutorial, first install `spacy` using `pip` or `conda`. Next, download the raw data for the English and German Spacy tokenizers:

```sh
python -m spacy download en
python -m spacy download de
```

```py
import torchtext
import torch
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.utils import download_from_url, extract_archive
import io

url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')

train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

def build_vocab(filepath, tokenizer):
    counter = Counter()
    with io.open(filepath, encoding="utf8") as f:
        for string_ in f:
            counter.update(tokenizer(string_))
    return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(filepaths):
    raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
    raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
    data = []
    for (raw_de, raw_en) in zip(raw_de_iter, raw_en_iter):
        de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)],
                                  dtype=torch.long)
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)],
                                  dtype=torch.long)
        data.append((de_tensor_, en_tensor_))
    return data

train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)
```

## `DataLoader`

The last `torch`-specific feature we will use is the `DataLoader`, which is easy to use since it takes the data as its first argument. Specifically, as the docs say: `DataLoader` combines a dataset and a sampler, and provides an iterable over the given dataset. The `DataLoader` supports both map-style and iterable-style datasets with single- or multi-process loading, customizable loading order, and optional automatic batching (collation) and memory pinning.

Please pay attention to the optional `collate_fn`, which merges a list of samples to form a mini-batch of tensors. It is used when batched loading is done from a map-style dataset.

```py
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128
PAD_IDX = de_vocab['<pad>']
BOS_IDX = de_vocab['<bos>']
EOS_IDX = de_vocab['<eos>']

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def generate_batch(data_batch):
    # add <bos>/<eos> to each sequence, then pad every sequence in the batch
    # to the length of the longest one
    de_batch, en_batch = [], []
    for (de_item, en_item) in data_batch:
        de_batch.append(torch.cat([torch.tensor([BOS_IDX]), de_item, torch.tensor([EOS_IDX])], dim=0))
        en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
    de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
    en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
    return de_batch, en_batch

train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                       shuffle=True, collate_fn=generate_batch)
```
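As a quick sanity check (this snippet is not part of the original tutorial), you can pull one batch from `train_iter` and confirm that `generate_batch` returns sequence-first tensors padded with `PAD_IDX`:

```py
# Not in the original tutorial: inspect a single collated batch.
de_batch, en_batch = next(iter(train_iter))
print(de_batch.shape)  # (max_de_len_in_batch, BATCH_SIZE) -- pad_sequence is sequence-first by default
print(en_batch.shape)  # (max_en_len_in_batch, BATCH_SIZE)
print((de_batch == PAD_IDX).float().mean().item())  # fraction of source positions that are padding
```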
## Defining our `nn.Module` and `Optimizer`

That is mostly it from a `torchtext` perspective: with the dataset built and the iterators defined, the rest of this tutorial simply defines our model as an `nn.Module`, along with an `Optimizer`, and then trains it.

Specifically, our model follows the architecture described [here](https://arxiv.org/abs/1409.0473) (you can find a much more heavily commented version [here](https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb)).

Note: this model is just an example that can be used for language translation; we chose it because it is a standard model for the task, not because it is the recommended model to use. As you are likely aware, state-of-the-art models are currently based on Transformers; you can see PyTorch's capabilities for implementing [Transformer layers](https://pytorch.org/docs/stable/nn.html#transformer-layers). In particular, the "attention" used in the model below is different from the multi-headed self-attention present in a Transformer model.

```py
import random
from typing import Tuple

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch import Tensor

class Encoder(nn.Module):
    def __init__(self,
                 input_dim: int,
                 emb_dim: int,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 dropout: float):
        super().__init__()

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src: Tensor) -> Tuple[Tensor, Tensor]:
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        # concatenate the final forward and backward hidden states and project
        # them to the decoder's hidden dimension
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        return outputs, hidden


class Attention(nn.Module):
    def __init__(self,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 attn_dim: int):
        super().__init__()

        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
        self.attn = nn.Linear(self.attn_in, attn_dim)

    def forward(self,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tensor:
        # score each encoder position against the current decoder hidden state
        src_len = encoder_outputs.shape[0]
        repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((
            repeated_decoder_hidden,
            encoder_outputs),
            dim=2)))
        attention = torch.sum(energy, dim=2)
        return F.softmax(attention, dim=1)


class Decoder(nn.Module):
    def __init__(self,
                 output_dim: int,
                 emb_dim: int,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 dropout: float,
                 attention: nn.Module):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def _weighted_encoder_rep(self,
                              decoder_hidden: Tensor,
                              encoder_outputs: Tensor) -> Tensor:
        # attention-weighted sum of the encoder outputs
        a = self.attention(decoder_hidden, encoder_outputs)
        a = a.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted_encoder_rep = torch.bmm(a, encoder_outputs)
        weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
        return weighted_encoder_rep

    def forward(self,
                input: Tensor,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tuple[Tensor, Tensor]:
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden,
                                                          encoder_outputs)
        rnn_input = torch.cat((embedded, weighted_encoder_rep), dim=2)
        output, decoder_hidden = self.rnn(rnn_input, decoder_hidden.unsqueeze(0))

        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted_encoder_rep = weighted_encoder_rep.squeeze(0)

        output = self.out(torch.cat((output,
                                     weighted_encoder_rep,
                                     embedded), dim=1))
        return output, decoder_hidden.squeeze(0)


class Seq2Seq(nn.Module):
    def __init__(self,
                 encoder: nn.Module,
                 decoder: nn.Module,
                 device: torch.device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self,
                src: Tensor,
                trg: Tensor,
                teacher_forcing_ratio: float = 0.5) -> Tensor:
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        encoder_outputs, hidden = self.encoder(src)

        # first input to the decoder is the <bos> token
        output = trg[0, :]

        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs


INPUT_DIM = len(de_vocab)
OUTPUT_DIM = len(en_vocab)

# The commented-out values below are a larger configuration; the smaller
# values that follow keep this tutorial quick to train.
# ENC_EMB_DIM = 256
# DEC_EMB_DIM = 256
# ENC_HID_DIM = 512
# DEC_HID_DIM = 512
# ATTN_DIM = 64
# ENC_DROPOUT = 0.5
# DEC_DROPOUT = 0.5

ENC_EMB_DIM = 32
DEC_EMB_DIM = 32
ENC_HID_DIM = 64
DEC_HID_DIM = 64
ATTN_DIM = 8
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m: nn.Module):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

model.apply(init_weights)

optimizer = optim.Adam(model.parameters())

def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
```

Out:

```py
The model has 3,491,552 trainable parameters
```
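Before training, it can help to verify the tensor shapes flowing through the model. The quick check below is not part of the original tutorial; it runs one batch through `Seq2Seq.forward` and confirms that the output has shape `[trg_len, batch_size, OUTPUT_DIM]`:

```py
# Not in the original tutorial: run a single batch through the model to check shapes.
src_batch, trg_batch = next(iter(train_iter))
with torch.no_grad():
    out = model(src_batch.to(device), trg_batch.to(device))
print(out.shape)  # torch.Size([trg_len, BATCH_SIZE, OUTPUT_DIM])
```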
Note: when scoring the performance of a language translation model in particular, we have to tell the `nn.CrossEntropyLoss` function to ignore the indices where the target is simply padding.

```py
PAD_IDX = en_vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
```
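To see what `ignore_index` does in practice, here is a small illustrative snippet (not part of the original tutorial): positions whose target equals `PAD_IDX` contribute nothing to the averaged loss.

```py
# Not in the original tutorial: padded target positions are ignored by the loss.
logits = torch.randn(3, len(en_vocab))   # logits for 3 target positions
targets = torch.tensor([4, PAD_IDX, 7])  # the middle position is padding
loss = criterion(logits, targets)        # averaged over the two non-padding positions only
print(loss.item())
```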
Finally, we can train and evaluate the model:

```py
import math
import time

def train(model: nn.Module,
          iterator: torch.utils.data.DataLoader,
          optimizer: optim.Optimizer,
          criterion: nn.Module,
          clip: float):
    model.train()
    epoch_loss = 0

    for _, (src, trg) in enumerate(iterator):
        src, trg = src.to(device), trg.to(device)

        optimizer.zero_grad()

        output = model(src, trg)
        # skip the <bos> position and flatten for the loss
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)


def evaluate(model: nn.Module,
             iterator: torch.utils.data.DataLoader,
             criterion: nn.Module):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for _, (src, trg) in enumerate(iterator):
            src, trg = src.to(device), trg.to(device)

            output = model(src, trg, 0)  # turn off teacher forcing
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)


def epoch_time(start_time: int, end_time: int):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()

    train_loss = train(model, train_iter, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iter, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

test_loss = evaluate(model, test_iter, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
```

Out:

```py
Epoch: 01 | Time: 0m 59s
        Train Loss: 5.790 | Train PPL: 327.039
         Val. Loss: 5.250 |  Val. PPL: 190.532
Epoch: 02 | Time: 0m 59s
        Train Loss: 4.762 | Train PPL: 116.990
         Val. Loss: 5.037 |  Val. PPL: 153.939
Epoch: 03 | Time: 0m 59s
        Train Loss: 4.527 | Train PPL:  92.475
         Val. Loss: 4.924 |  Val. PPL: 137.525
Epoch: 04 | Time: 0m 59s
        Train Loss: 4.344 | Train PPL:  76.977
         Val. Loss: 4.801 |  Val. PPL: 121.673
Epoch: 05 | Time: 0m 59s
        Train Loss: 4.210 | Train PPL:  67.356
         Val. Loss: 4.758 |  Val. PPL: 116.536
Epoch: 06 | Time: 0m 59s
        Train Loss: 4.125 | Train PPL:  61.875
         Val. Loss: 4.691 |  Val. PPL: 109.004
Epoch: 07 | Time: 0m 59s
        Train Loss: 4.043 | Train PPL:  56.979
         Val. Loss: 4.639 |  Val. PPL: 103.446
Epoch: 08 | Time: 0m 59s
        Train Loss: 3.947 | Train PPL:  51.771
         Val. Loss: 4.589 |  Val. PPL:  98.396
Epoch: 09 | Time: 0m 59s
        Train Loss: 3.874 | Train PPL:  48.135
         Val. Loss: 4.514 |  Val. PPL:  91.324
Epoch: 10 | Time: 0m 59s
        Train Loss: 3.785 | Train PPL:  44.021
         Val. Loss: 4.467 |  Val. PPL:  87.126
| Test Loss: 4.433 | Test PPL:  84.168 |
```

## Next steps

* Check out the rest of [Ben Trevett](https://github.com/bentrevett/)'s tutorials that use `torchtext`.
* Stay tuned for a tutorial that uses other `torchtext` features along with `nn.Transformer` for language modeling via next-word prediction!

**Total running time of the script:** (10 minutes 13.398 seconds)

[Download Python source code: `torchtext_translation_tutorial.py`](../_downloads/96d6dc961c7477af88e16ca6c9592240/torchtext_translation_tutorial.py)

[Download Jupyter notebook: `torchtext_translation_tutorial.ipynb`](../_downloads/05baddac9b2f50d639a62ea5fa6e21e4/torchtext_translation_tutorial.ipynb)

[Gallery generated by Sphinx-Gallery](https://sphinx-gallery.readthedocs.io)