diff --git "a/\346\240\271\346\215\256\350\257\221\346\226\207\347\211\207\346\256\265\351\242\204\346\265\213\347\277\273\350\257\221\344\275\234\350\200\205.md" "b/\346\240\271\346\215\256\350\257\221\346\226\207\347\211\207\346\256\265\351\242\204\346\265\213\347\277\273\350\257\221\344\275\234\350\200\205.md" new file mode 100644 index 0000000000000000000000000000000000000000..d8152ca76154b12e1ea2f3bf1c042e9d53e128f4 --- /dev/null +++ "b/\346\240\271\346\215\256\350\257\221\346\226\207\347\211\207\346\256\265\351\242\204\346\265\213\347\277\273\350\257\221\344\275\234\350\200\205.md" @@ -0,0 +1,142 @@ +本教程的目的是带领大家学会,根据译文片段预测翻译作者 + +本次用到的数据集是三个 txt 文本,分别是 cowper.txt、derby.txt、butler.txt ,该文本已经经过一些预处理,去除了表头,页眉等 + +接下来我们加载数据,这里我们使用 tf.data.TextLineDataset API,而不是之前使用的 text_dataset_from_directory,两者的区别是,前者加载 txt 文件里的每一行作为一个样本,后者是加载整个 txt 文件作为一个样本 + +``` +DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/' +FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt'] + +for name in FILE_NAMES: + text_dir = utils.get_file(name, origin=DIRECTORY_URL + name) + +parent_dir = pathlib.Path(text_dir).parent +list(parent_dir.iterdir()) + +def labeler(example, index): + return example, tf.cast(index, tf.int64) + +labeled_data_sets = [] + +for i, file_name in enumerate(FILE_NAMES): + lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name)) + labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i)) + labeled_data_sets.append(labeled_dataset) +``` + +![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722152638.png) + +如上图所示,我们可以看到,txt 文件里的每一行确实是一个样本,其实上面的数据已经经过进一步处理了,变成 (example, label) pair 了 + +接下来我们需要对文本进行 standardize and tokenize,然后再使用 StaticVocabularyTable,建立 tokens 到 integers 的映射 + +这里我们使用 UnicodeScriptTokenizer 来 tokenize 数据集,代码如下所示 + +``` +tokenizer = tf_text.UnicodeScriptTokenizer() + +def tokenize(text, unused_label): + lower_case = tf_text.case_fold_utf8(text) + return tokenizer.tokenize(lower_case) + +tokenized_ds = all_labeled_data.map(tokenize) +``` + +![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722153112.png) + +上图是 tokenize 的结果展示 + +下一步,我们需要建立 vocabulary,根据 tokens 的频率做一个排序,并取排名靠前的 VOCAB_SIZE 个元素 + +``` +tokenized_ds = configure_dataset(tokenized_ds) + +vocab_dict = collections.defaultdict(lambda: 0) +for toks in tokenized_ds.as_numpy_iterator(): + for tok in toks: + vocab_dict[tok] += 1 + +vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True) +vocab = [token for token, count in vocab] +vocab = vocab[:VOCAB_SIZE] +vocab_size = len(vocab) +print("Vocab size: ", vocab_size) +print("First five vocab entries:", vocab[:5]) +``` + +接下来,我们需要用 vocab 创建 StaticVocabularyTable,因为 0 被保留用于表明 padding,1 被保留用于表明 OOV token,所以我们的实际 map tokens 的integer 是 [2, vocab_size+2],代码如下所示 + +``` +keys = vocab +values = range(2, len(vocab) + 2) # reserve 0 for padding, 1 for OOV + +init = tf.lookup.KeyValueTensorInitializer( + keys, values, key_dtype=tf.string, value_dtype=tf.int64) + +num_oov_buckets = 1 +vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets) +``` + +最后我们要封装一个函数用于 standardize, tokenize and vectorize 数据集,通过 tokenizer and lookup table + +``` +def preprocess_text(text, label): + standardized = tf_text.case_fold_utf8(text) + tokenized = tokenizer.tokenize(standardized) + vectorized = vocab_table.lookup(tokenized) + return vectorized, label +``` + +![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722153651.png) + +上图是关于把 raw text 转化成 tokens 的展示结果 + 
Next, we split the dataset into training and validation sets, create the model, and start training. The code is as follows (`create_model` comes from the linked notebook and is not reproduced here; a sketch of it is given at the end of this article):

```
from tensorflow.keras import losses

VALIDATION_SIZE = 5000  # assumed number of validation examples
BATCH_SIZE = 64         # assumed batch size

all_encoded_data = all_labeled_data.map(preprocess_text)

# Hold out the first VALIDATION_SIZE examples for validation,
# shuffle the rest for training
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

# Pad each batch to the length of its longest sequence
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

vocab_size += 2  # account for the padding (0) and OOV (1) ids

train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
history = model.fit(train_data, validation_data=validation_data, epochs=3)
```

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722154014.png)

The figure above shows the training results: accuracy on the validation set reaches 84.18%.

Finally, we use the trained model to make predictions on a few raw sentences. `export_model` is the trained model wrapped with the raw-text preprocessing so that it accepts strings directly; it is also sketched at the end of this article.

```
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input_text, label in zip(inputs, predicted_labels):
    print("Question: ", input_text)
    print("Predicted label: ", label.numpy())
```

The prediction results are shown in the figure below.

![](https://maoxianxin1996.oss-accelerate.aliyuncs.com/codechina1/20210722154236.png)

The predicted labels all match the actual labels.

Code: https://codechina.csdn.net/csdn_codechina/enterprise_technology/-/blob/master/predict_translations_author.ipynb
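For completeness, here is a rough sketch of what the `create_model` and `export_model` helpers used above might look like. This is not the notebook's exact code: the layer types and sizes, MAX_SEQUENCE_LENGTH, and the use of a Keras TextVectorization layer to reproduce the preprocessing are all assumptions.

```
from tensorflow.keras import layers

# A small 1-D convolutional text classifier over the integer token ids.
# Layer choices and sizes here are assumptions, not the notebook's exact model.
def create_model(vocab_size, num_labels):
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, 64, mask_zero=True),
        layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_labels)  # logits; trained with from_logits=True
    ])
    return model

MAX_SEQUENCE_LENGTH = 250  # assumed fixed sequence length for the exported model

# Reproduce the same standardize/tokenize/lookup steps inside a Keras layer,
# then chain it with the trained model so the exported model takes raw strings.
preprocess_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
preprocess_layer.set_vocabulary(vocab)

export_model = tf.keras.Sequential([
    preprocess_layer,
    model,
    layers.Activation('softmax')  # turn logits into per-author probabilities
])
```

With a wrapper like this, `export_model.predict(inputs)` can be called on a plain list of strings, as in the prediction snippet above, without manually tokenizing or looking up token ids first.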