根据译文片段预测翻译作者.md 5.2 KB
Newer Older
M
MaoXianxin 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
本教程的目的是带领大家学会,根据译文片段预测翻译作者

本次用到的数据集是三个 txt 文本,分别是 cowper.txt、derby.txt、butler.txt ,该文本已经经过一些预处理,去除了表头,页眉等

接下来我们加载数据,这里我们使用 tf.data.TextLineDataset API,而不是之前使用的 text_dataset_from_directory,两者的区别是,前者加载 txt 文件里的每一行作为一个样本,后者是加载整个 txt 文件作为一个样本

```
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

def labeler(example, index):
  return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)
```

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722152638.png)

如上图所示,我们可以看到,txt 文件里的每一行确实是一个样本,其实上面的数据已经经过进一步处理了,变成 (example, label) pair 了

接下来我们需要对文本进行 standardize and tokenize,然后再使用 StaticVocabularyTable,建立 tokens 到 integers 的映射

这里我们使用 UnicodeScriptTokenizer 来 tokenize 数据集,代码如下所示

```
tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
  lower_case = tf_text.case_fold_utf8(text)
  return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)
```

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722153112.png)

上图是 tokenize 的结果展示

下一步,我们需要建立 vocabulary,根据 tokens 的频率做一个排序,并取排名靠前的 VOCAB_SIZE 个元素

```
tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
  for tok in toks:
    vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])
```

接下来,我们需要用 vocab 创建 StaticVocabularyTable,因为 0 被保留用于表明 padding,1 被保留用于表明 OOV token,所以我们的实际 map tokens 的integer 是 [2, vocab_size+2],代码如下所示

```
keys = vocab
values = range(2, len(vocab) + 2)  # reserve 0 for padding, 1 for OOV

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
```

最后我们要封装一个函数用于 standardize, tokenize and vectorize 数据集,通过 tokenizer and lookup table

```
def preprocess_text(text, label):
  standardized = tf_text.case_fold_utf8(text)
  tokenized = tokenizer.tokenize(standardized)
  vectorized = vocab_table.lookup(tokenized)
  return vectorized, label
```

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722153651.png)

上图是关于把 raw text 转化成 tokens 的展示结果

接下来,我们需要对数据集进行划分,然后再创建模型,最后就可以开始训练了,代码如下所示

```
all_encoded_data = all_labeled_data.map(preprocess_text)

train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

vocab_size += 2

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
```

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722154014.png)

上图是训练的结果展示,在验证集上的准确率达到了 84.18%

```
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
  print("Question: ", input)
  print("Predicted label: ", label.numpy())
```

最后我们用训练后的模型进行预测,结果如下图所示

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722154236.png)

预测结果和实际标签都对应上了

代码地址: https://codechina.csdn.net/csdn_codechina/enterprise_technology/-/blob/master/predict_translations_author.ipynb