diff --git a/official/r1/transformer/README.md b/official/r1/transformer/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..319e58dbb790d5acda5a43ae048f2ca76dc8f9b1
--- /dev/null
+++ b/official/r1/transformer/README.md
@@ -0,0 +1,376 @@
+# Transformer Translation Model
+This is an implementation of the Transformer translation model as described in the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. It is based on the code provided by the authors: [Transformer code](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py) from [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor). Also, check out the [tutorial](https://www.tensorflow.org/beta/tutorials/text/transformer) on Transformer in TF 2.0.
+
+**Please follow the [README](https://github.com/tensorflow/models/blob/master/official/transformer/README.md) of the new Keras-based TF 2 implementation to walk through the new Transformer.**
+
+Transformer is a neural network architecture that solves sequence-to-sequence problems using attention mechanisms. Unlike traditional neural seq2seq models, Transformer does not involve recurrent connections. The attention mechanism learns dependencies between tokens in two sequences. Since attention weights apply to all tokens in the sequences, the Transformer model is able to easily capture long-distance dependencies.
+
+Transformer's overall structure follows the standard encoder-decoder pattern. The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and the previously generated tokens as inputs.
+
+The model also applies embeddings to the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token.
+
+## Contents
+  * [Contents](#contents)
+  * [Walkthrough](#walkthrough)
+  * [Benchmarks](#benchmarks)
+    * [Training times](#training-times)
+    * [Evaluation results](#evaluation-results)
+  * [Detailed instructions](#detailed-instructions)
+    * [Environment preparation](#environment-preparation)
+    * [Download and preprocess datasets](#download-and-preprocess-datasets)
+    * [Model training and evaluation](#model-training-and-evaluation)
+    * [Translate using the model](#translate-using-the-model)
+    * [Compute official BLEU score](#compute-official-bleu-score)
+    * [TPU](#tpu)
+  * [Export trained model](#export-trained-model)
+    * [Example translation](#example-translation)
+  * [Implementation overview](#implementation-overview)
+    * [Model Definition](#model-definition)
+    * [Model Estimator](#model-estimator)
+    * [Other scripts](#other-scripts)
+    * [Test dataset](#test-dataset)
+  * [Term definitions](#term-definitions)
+
+## Walkthrough
+
+Below are the commands for running the Transformer model. See the
+[Detailed instructions](#detailed-instructions) for more details on running the
+model.
+
+```
+cd /path/to/models/official/transformer
+
+# Ensure that PYTHONPATH is correctly defined as described in
+# https://github.com/tensorflow/models/tree/master/official#requirements
+# export PYTHONPATH="$PYTHONPATH:/path/to/models"
+
+# Export variables
+PARAM_SET=big
+DATA_DIR=$HOME/transformer/data
+MODEL_DIR=$HOME/transformer/model_$PARAM_SET
+VOCAB_FILE=$DATA_DIR/vocab.ende.32768
+
+# Download training/evaluation/test datasets
+python data_download.py --data_dir=$DATA_DIR
+
+# Train the model for 10 epochs, and evaluate after every epoch.
+python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
+    --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
+    --bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de
+
+# Run during training in a separate process to get continuous updates,
+# or after training is complete.
+tensorboard --logdir=$MODEL_DIR
+
+# Translate some text using the trained model
+python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
+    --param_set=$PARAM_SET --text="hello world"
+
+# Compute model's BLEU score using the newstest2014 dataset.
+python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
+    --param_set=$PARAM_SET --file=$DATA_DIR/newstest2014.en --file_out=translation.en
+python compute_bleu.py --translation=translation.en --reference=$DATA_DIR/newstest2014.de
+```
+
+## Benchmarks
+### Training times
+
+Currently, both the base and big parameter sets run on a single GPU. The measurements
+below were taken on a P100 GPU.
+
+Param Set | batches/sec | batches per epoch | time per epoch
+--- | --- | --- | ---
+base | 4.8 | 83244 | 4 hr
+big | 1.1 | 41365 | 10 hr
+
+### Evaluation results
+Below are the case-insensitive BLEU scores after 10 epochs.
+
+Param Set | Score
+--- | ---
+base | 27.7
+big | 28.9
+
+
+## Detailed instructions
+
+
+0. ### Environment preparation
+
+   #### Add models repo to PYTHONPATH
+   Follow the instructions described in the [Requirements](https://github.com/tensorflow/models/tree/master/official#requirements) section to add the models folder to the Python path.
+
+   #### Export variables (optional)
+
+   Export the following variables, or modify the values in each of the snippets below:
+   ```
+   PARAM_SET=big
+   DATA_DIR=$HOME/transformer/data
+   MODEL_DIR=$HOME/transformer/model_$PARAM_SET
+   VOCAB_FILE=$DATA_DIR/vocab.ende.32768
+   ```
+
+1. ### Download and preprocess datasets
+
+   [data_download.py](data_download.py) downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.
+
+   1.75GB of compressed data will be downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.
+
+   Command to run:
+   ```
+   python data_download.py --data_dir=$DATA_DIR
+   ```
+
+   Arguments:
+   * `--data_dir`: Path where the preprocessed TFRecord data and vocab file will be saved.
+   * Use the `--help` or `-h` flag to get a full list of possible arguments.
+
+2. ### Model training and evaluation
+
+   [transformer_main.py](transformer_main.py) creates a Transformer model and trains it using the TensorFlow Estimator API.
+
+   Command to run:
+   ```
+   python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
+       --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET
+   ```
+
+   Arguments:
+   * `--data_dir`: This should be set to the same directory given to `data_download.py`'s `data_dir` argument.
+   * `--model_dir`: Directory to save Transformer model training checkpoints.
+   * `--vocab_file`: Path to subtoken vocabulary file. If `data_download.py` was used, you may find the file in `data_dir`.
+   * `--param_set`: Parameter set to use when creating and training the model.
Options are `base` and `big` (default).
+   * Use the `--help` or `-h` flag to get a full list of possible arguments.
+
+   #### Customizing training schedule
+
+   By default, the model trains for 10 epochs and evaluates after every epoch. The training schedule may be defined through the flags:
+   * Training with epochs (default):
+     * `--train_epochs`: The total number of complete passes to make through the dataset.
+     * `--epochs_between_evals`: The number of epochs to train between evaluations.
+   * Training with steps:
+     * `--train_steps`: The total number of training steps to run.
+     * `--steps_between_evals`: Number of training steps to run between evaluations.
+
+   Only one of `train_epochs` or `train_steps` may be set. Since the default option is to evaluate the model after training for an epoch, it may take 4 or more hours between model evaluations. To get more frequent evaluations, use the flags `--train_steps=250000 --steps_between_evals=1000`.
+
+   Note: At the beginning of each training session, the training dataset is reloaded and shuffled. Stopping the training before completing an epoch may result in worse model quality, due to the chance that some examples may be seen more often than others. Therefore, it is recommended to use epochs when model quality is important.
+
+   #### Compute BLEU score during model evaluation
+
+   Use these flags to compute the BLEU score when the model evaluates:
+   * `--bleu_source`: Path to file containing text to translate.
+   * `--bleu_ref`: Path to file containing the reference translation.
+   * `--stop_threshold`: Train until the BLEU score reaches this lower bound. This setting overrides the `--train_steps` and `--train_epochs` flags.
+
+   When running `transformer_main.py`, use the flags: `--bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de`
+
+   #### TensorBoard
+   Training and evaluation metrics (loss, accuracy, approximate BLEU score, etc.) are logged, and can be displayed in the browser using TensorBoard.
+   ```
+   tensorboard --logdir=$MODEL_DIR
+   ```
+   The values are displayed at [localhost:6006](http://localhost:6006).
+
+3. ### Translate using the model
+   [translate.py](translate.py) uses the trained model to translate input text or a file. Each line in a file is translated separately.
+
+   Command to run:
+   ```
+   python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
+       --param_set=$PARAM_SET --text="hello world"
+   ```
+
+   Arguments for initializing the Subtokenizer and trained model:
+   * `--model_dir` and `--param_set`: These parameters are used to rebuild the trained model.
+   * `--vocab_file`: Path to subtoken vocabulary file. If `data_download.py` was used, you may find the file in `data_dir`.
+
+   Arguments for specifying what to translate:
+   * `--text`: Text to translate.
+   * `--file`: Path to file containing text to translate.
+   * `--file_out`: If `--file` is set, then this file will store the input file's translations.
+
+   To translate the newstest2014 data, run:
+   ```
+   python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
+       --param_set=$PARAM_SET --file=$DATA_DIR/newstest2014.en --file_out=translation.en
+   ```
+
+   Translating the file takes around 15 minutes on a GTX 1080, or 5 minutes on a P100.
+
+4. ### Compute official BLEU score
+   Use [compute_bleu.py](compute_bleu.py) to compute the BLEU score by comparing generated translations to the reference translation.
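+
+   For intuition, BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below is a deliberately simplified, single-sentence version of that idea; the real script uses the tensor2tensor implementation, which additionally tokenizes the text and aggregates n-gram counts over the whole corpus, so its scores will differ.
+
+   ```python
+   import math
+   from collections import Counter
+
+   def ngrams(tokens, n):
+       # Multiset of all n-grams in the token list.
+       return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
+
+   def toy_bleu(candidate, reference, max_n=4):
+       cand, ref = candidate.split(), reference.split()
+       log_precisions = []
+       for n in range(1, max_n + 1):
+           # Clipped n-gram precision: candidate counts are capped by the reference.
+           overlap = sum((ngrams(cand, n) & ngrams(ref, n)).values())
+           total = max(sum(ngrams(cand, n).values()), 1)
+           log_precisions.append(math.log(max(overlap, 1e-9) / total))
+       # Brevity penalty: penalize candidates shorter than the reference.
+       bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
+       return bp * math.exp(sum(log_precisions) / max_n)
+
+   print(toy_bleu("das ist ein kleiner Test .", "das ist ein kleiner Test ."))  # 1.0
+   ```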
+
+   Command to run:
+   ```
+   python compute_bleu.py --translation=translation.en --reference=$DATA_DIR/newstest2014.de
+   ```
+
+   Arguments:
+   * `--translation`: Path to file containing generated translations.
+   * `--reference`: Path to file containing reference translations.
+   * Use the `--help` or `-h` flag to get a full list of possible arguments.
+
+5. ### TPU
+   TPU support for this version of Transformer is experimental. Currently it is present for
+   demonstration purposes only, but will be optimized in the coming weeks.
+
+## Export trained model
+To export the model in the TensorFlow [SavedModel](https://www.tensorflow.org/guide/saved_model) format, use the argument `--export_dir` when running `transformer_main.py`. A folder named with the export timestamp will be created inside that directory (e.g. `$EXPORT_DIR/1526427396`).
+
+```
+EXPORT_DIR=$HOME/transformer/saved_model
+python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
+    --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET --export_dir=$EXPORT_DIR
+```
+
+To inspect the SavedModel, use `saved_model_cli`:
+```
+SAVED_MODEL_DIR=$EXPORT_DIR/{TIMESTAMP} # replace {TIMESTAMP} with the name of the folder created
+saved_model_cli show --dir=$SAVED_MODEL_DIR --all
+```
+
+### Example translation
+Let's translate **"hello world!"**, **"goodbye world."**, and **"Would you like some pie?"**.
+
+The SignatureDef for "translate" is:
+
+    signature_def['translate']:
+      The given SavedModel SignatureDef contains the following input(s):
+        inputs['input'] tensor_info:
+            dtype: DT_INT64
+            shape: (-1, -1)
+            name: Placeholder:0
+      The given SavedModel SignatureDef contains the following output(s):
+        outputs['outputs'] tensor_info:
+            dtype: DT_INT32
+            shape: (-1, -1)
+            name: model/Transformer/strided_slice_19:0
+        outputs['scores'] tensor_info:
+            dtype: DT_FLOAT
+            shape: (-1)
+            name: model/Transformer/strided_slice_20:0
+
+Follow the steps below to use the `translate` SignatureDef:
+
+1. #### Encode the inputs to integer arrays.
+   This can be done using `utils.tokenizer.Subtokenizer` and the vocab file in the SavedModel assets (`$SAVED_MODEL_DIR/assets.extra/vocab.txt`).
+
+   ```
+   from official.transformer.utils.tokenizer import Subtokenizer
+   s = Subtokenizer(PATH_TO_VOCAB_FILE)
+   print(s.encode("hello world!", add_eos=True))
+   ```
+
+   The encoded inputs are:
+   * `"hello world!" = [6170, 3731, 178, 207, 1]`
+   * `"goodbye world." = [15431, 13966, 36, 178, 3, 1]`
+   * `"Would you like some pie?" = [9092, 72, 155, 202, 19851, 102, 1]`
+
+2. #### Run `saved_model_cli` to obtain the predicted translations
+   The encoded inputs should be padded so that they are the same length. The padding token is `0`.
+   ```
+   ENCODED_INPUTS="[[6170, 3731, 178, 207, 1, 0, 0], \
+                   [15431, 13966, 36, 178, 3, 1, 0], \
+                   [9092, 72, 155, 202, 19851, 102, 1]]"
+   ```
+
+   Now, use the `run` command with `saved_model_cli` to get the outputs.
+
+   ```
+   saved_model_cli run --dir=$SAVED_MODEL_DIR --tag_set=serve --signature_def=translate \
+       --input_expr="input=$ENCODED_INPUTS"
+   ```
+
+   The outputs will look similar to:
+   ```
+   Result for output key outputs:
+   [[18744   145   297     1     0     0     0     0     0     0     0     0
+        0     0]
+    [ 5450  4642    21    11   297     3     1     0     0     0     0     0
+        0     0]
+    [25940    22    66   103 21713    31   102     1     0     0     0     0
+        0     0]]
+   Result for output key scores:
+   [-1.5493642 -1.4032784 -3.252089 ]
+   ```
+
+3. #### Decode the outputs to strings.
+   Use the `Subtokenizer` and vocab file as described in step 1 to decode the output integer arrays.
+   ```
+   from official.transformer.utils.tokenizer import Subtokenizer
+   s = Subtokenizer(PATH_TO_VOCAB_FILE)
+   print(s.decode([18744, 145, 297, 1]))
+   ```
+   The decoded outputs from above are:
+   * `[18744, 145, 297, 1] = "Hallo Welt"`
+   * `[5450, 4642, 21, 11, 297, 3, 1] = "Abschied von der Welt."`
+   * `[25940, 22, 66, 103, 21713, 31, 102, 1] = "Möchten Sie einen Kuchen?"`
+
+## Implementation overview
+
+A brief look at each component in the code:
+
+### Model Definition
+The [model](model) subdirectory contains the implementation of the Transformer model. The following files define the Transformer model and its layers:
+* [transformer.py](model/transformer.py): Defines the transformer model and its encoder/decoder layer stacks.
+* [embedding_layer.py](model/embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax logits from the decoder output.
+* [attention_layer.py](model/attention_layer.py): Defines the multi-headed attention and self-attention layers that are used in the encoder/decoder stacks.
+* [ffn_layer.py](model/ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers.
+
+Other files:
+* [beam_search.py](model/beam_search.py) contains the beam search implementation, which is used during model inference to find high-scoring translations.
+* [model_params.py](model/model_params.py) contains the parameters used for the big and base models.
+* [model_utils.py](model/model_utils.py) defines some helper functions used in the model (calculating padding, bias, etc.).
+
+
+### Model Estimator
+[transformer_main.py](transformer_main.py) creates an `Estimator` to train and evaluate the model.
+
+Helper functions:
+* [utils/dataset.py](utils/dataset.py): contains functions for creating a `dataset` that is passed to the `Estimator`.
+* [utils/metrics.py](utils/metrics.py): defines metrics functions used by the `Estimator` to evaluate the model.
+
+### Other scripts
+
+Aside from the main file to train the Transformer model, we provide other scripts for using the model or downloading the data:
+
+#### Data download and preprocessing
+
+[data_download.py](data_download.py) downloads and extracts data, then uses `Subtokenizer` to tokenize strings into arrays of int IDs. The int arrays are converted to `tf.Example` protos and saved in the TFRecord format.
+
+The data is downloaded from the Workshop on Machine Translation (WMT) [news translation task](http://www.statmt.org/wmt17/translation-task.html). The following datasets are used:
+
+* Europarl v7
+* Common Crawl corpus
+* News Commentary v12
+
+See the [download section](http://www.statmt.org/wmt17/translation-task.html#download) to explore the raw datasets. The parameters in this model are tuned to fit the English-German translation data, so the EN-DE texts are extracted from the downloaded compressed files.
+
+The text is transformed into arrays of integer IDs using the `Subtokenizer` defined in [`utils/tokenizer.py`](utils/tokenizer.py). During initialization of the `Subtokenizer`, the raw training data is used to generate a vocabulary list containing common subtokens.
+
+The target vocabulary size of the WMT dataset is 32,768. The set of subtokens is found through binary search on the minimum number of times a subtoken appears in the data. The actual vocabulary size is 33,708, and is stored in a 324kB file.
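+
+A minimal sketch of that binary search, for intuition only. Here `vocab_size_for` is a hypothetical stand-in for the vocabulary generation in `utils/tokenizer.py`: it returns the vocabulary size produced when every subtoken must appear at least `min_count` times in the training data. Raising `min_count` shrinks the vocabulary, so the search looks for the threshold whose vocabulary size lands closest to the target; since vocabulary sizes jump discretely, the result can miss the target slightly (33,708 vs. the 32,768 target above).
+
+```python
+def find_min_count(vocab_size_for, target_size, lo=1, hi=1000):
+    """Binary-search for the min_count threshold whose vocabulary size is
+    closest to target_size, assuming vocab_size_for(min_count) is
+    monotonically non-increasing in min_count."""
+    best_count, best_size = lo, vocab_size_for(lo)
+    while lo <= hi:
+        mid = (lo + hi) // 2
+        size = vocab_size_for(mid)
+        if abs(size - target_size) < abs(best_size - target_size):
+            best_count, best_size = mid, size
+        if size > target_size:
+            lo = mid + 1  # vocabulary too large; require more occurrences
+        else:
+            hi = mid - 1  # vocabulary small enough; try a lower threshold
+    return best_count
+```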
+
+#### Translation
+Translation is defined in [translate.py](translate.py). First, `Subtokenizer` tokenizes the input; the vocabulary file is the same one used to tokenize the training/eval files. Next, beam search is used to find the sequence of tokens that maximizes the probability output by the model decoder. The tokens are then converted back to strings with `Subtokenizer`.
+
+#### BLEU computation
+[compute_bleu.py](compute_bleu.py): Implementation from [https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py).
+
+### Test dataset
+The [newstest2014 files](https://storage.googleapis.com/tf-perf-public/official_transformer/test_data/newstest2014.tgz)
+are extracted from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data).
+The raw text files are converted from the SGM format of the
+[WMT 2016](http://www.statmt.org/wmt16/translation-task.html) test sets. The
+newstest2014 files are put into the `$DATA_DIR` when executing
+`data_download.py`.
+
+## Term definitions
+
+**Steps / Epochs**:
+* Step: unit for processing a single batch of data
+* Epoch: a complete run through the dataset
+
+Example: Consider a training dataset with 100 examples that is divided into 20 batches with 5 examples per batch. A single training step trains the model on one batch. After 20 training steps, the model will have trained on every batch in the dataset, or one epoch.
+
+**Subtoken**: Words are referred to as tokens, and parts of words are referred to as 'subtokens'. For example, the word 'inclined' may be split into `['incline', 'd_']`. The '\_' indicates the end of the token. The subtoken vocabulary list is guaranteed to contain the alphabet (including numbers and special characters), so all words can be tokenized.
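+
+A minimal sketch of how subtokens reassemble into words under the convention described above, where `'_'` marks the end of a token. This illustrates the convention only; it is not the actual decoding logic in `utils/tokenizer.py`, which also handles escaping of special characters:
+
+```python
+def join_subtokens(subtokens):
+    # "incline" + "d_" + "to_" + "agree_" -> "inclined_to_agree_"
+    text = "".join(subtokens)
+    # Token boundaries ('_') become spaces between words.
+    return text.replace("_", " ").strip()
+
+print(join_subtokens(["incline", "d_", "to_", "agree_"]))  # -> "inclined to agree"
+```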
diff --git a/official/transformer/model/attention_layer.py b/official/r1/transformer/attention_layer.py similarity index 100% rename from official/transformer/model/attention_layer.py rename to official/r1/transformer/attention_layer.py diff --git a/official/transformer/utils/dataset.py b/official/r1/transformer/dataset.py similarity index 100% rename from official/transformer/utils/dataset.py rename to official/r1/transformer/dataset.py diff --git a/official/transformer/model/embedding_layer.py b/official/r1/transformer/embedding_layer.py similarity index 100% rename from official/transformer/model/embedding_layer.py rename to official/r1/transformer/embedding_layer.py diff --git a/official/transformer/model/ffn_layer.py b/official/r1/transformer/ffn_layer.py similarity index 100% rename from official/transformer/model/ffn_layer.py rename to official/r1/transformer/ffn_layer.py diff --git a/official/transformer/utils/schedule.py b/official/r1/transformer/schedule.py similarity index 100% rename from official/transformer/utils/schedule.py rename to official/r1/transformer/schedule.py diff --git a/official/transformer/utils/schedule_test.py b/official/r1/transformer/schedule_test.py similarity index 98% rename from official/transformer/utils/schedule_test.py rename to official/r1/transformer/schedule_test.py index bb6c857275cc233983581c30d747a7966a0dd4bf..1b3a3f8e2579d7c470bce0a141cf7d1e54c23b62 100644 --- a/official/transformer/utils/schedule_test.py +++ b/official/r1/transformer/schedule_test.py @@ -16,7 +16,7 @@ import tensorflow as tf -from official.transformer.utils import schedule +from official.r1.transformer import schedule class ScheduleBaseTester(tf.test.TestCase): diff --git a/official/transformer/model/transformer.py b/official/r1/transformer/transformer.py similarity index 99% rename from official/transformer/model/transformer.py rename to official/r1/transformer/transformer.py index d45615b70902b3131773301bd90d956e60c0e9d7..1758ae16b6d95a999ad5d2949537a11606c98cc5 100644 --- a/official/transformer/model/transformer.py +++ b/official/r1/transformer/transformer.py @@ -24,10 +24,10 @@ from __future__ import print_function import tensorflow as tf # pylint: disable=g-bad-import-order -from official.transformer.model import attention_layer +from official.r1.transformer import attention_layer +from official.r1.transformer import embedding_layer +from official.r1.transformer import ffn_layer from official.transformer.model import beam_search -from official.transformer.model import embedding_layer -from official.transformer.model import ffn_layer from official.transformer.model import model_utils from official.transformer.utils.tokenizer import EOS_ID diff --git a/official/transformer/transformer_main.py b/official/r1/transformer/transformer_main.py similarity index 97% rename from official/transformer/transformer_main.py rename to official/r1/transformer/transformer_main.py index 2182ba848cadd3aa7ad26a3a7d8b32f690346f98..896a144edf54b9880baa97fba65a4c8d17811c0d 100644 --- a/official/transformer/transformer_main.py +++ b/official/r1/transformer/transformer_main.py @@ -34,13 +34,13 @@ import tensorflow as tf from official.r1.utils import export from official.r1.utils import tpu as tpu_util +from official.r1.transformer import translate +from official.r1.transformer import transformer +from official.r1.transformer import dataset +from official.r1.transformer import schedule from official.transformer import compute_bleu -from official.transformer import translate from 
official.transformer.model import model_params -from official.transformer.model import transformer -from official.transformer.utils import dataset from official.transformer.utils import metrics -from official.transformer.utils import schedule from official.transformer.utils import tokenizer from official.utils.flags import core as flags_core from official.utils.logs import hooks_helper @@ -115,8 +115,10 @@ def model_fn(features, labels, mode, params): metric_fn = lambda logits, labels: ( metrics.get_eval_metrics(logits, labels, params=params)) eval_metrics = (metric_fn, [logits, labels]) - return tf.contrib.tpu.TPUEstimatorSpec( - mode=mode, loss=loss, predictions={"predictions": logits}, + return tf.estimator.tpu.TPUEstimatorSpec( + mode=mode, + loss=loss, + predictions={"predictions": logits}, eval_metrics=eval_metrics) return tf.estimator.EstimatorSpec( mode=mode, loss=loss, predictions={"predictions": logits}, @@ -128,12 +130,14 @@ def model_fn(features, labels, mode, params): # in TensorBoard. metric_dict["minibatch_loss"] = loss if params["use_tpu"]: - return tf.contrib.tpu.TPUEstimatorSpec( - mode=mode, loss=loss, train_op=train_op, + return tf.estimator.tpu.TPUEstimatorSpec( + mode=mode, + loss=loss, + train_op=train_op, host_call=tpu_util.construct_scalar_host_call( - metric_dict=metric_dict, model_dir=params["model_dir"], - prefix="training/") - ) + metric_dict=metric_dict, + model_dir=params["model_dir"], + prefix="training/")) record_scalars(metric_dict) return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op) @@ -342,6 +346,7 @@ def run_loop( steps=schedule_manager.single_iteration_train_steps, hooks=train_hooks) + eval_results = None eval_results = estimator.evaluate( input_fn=dataset.eval_input_fn, steps=schedule_manager.single_iteration_eval_steps) @@ -534,25 +539,26 @@ def construct_estimator(flags_obj, params, schedule_manager): project=flags_obj.tpu_gcp_project ) - tpu_config = tf.contrib.tpu.TPUConfig( + tpu_config = tf.estimator.tpu.TPUConfig( iterations_per_loop=schedule_manager.single_iteration_train_steps, num_shards=flags_obj.num_tpu_shards) - run_config = tf.contrib.tpu.RunConfig( + run_config = tf.estimator.tpu.RunConfig( cluster=tpu_cluster_resolver, model_dir=flags_obj.model_dir, session_config=tf.ConfigProto( allow_soft_placement=True, log_device_placement=True), tpu_config=tpu_config) - return tf.contrib.tpu.TPUEstimator( + return tf.estimator.tpu.TPUEstimator( model_fn=model_fn, use_tpu=params["use_tpu"] and flags_obj.tpu != tpu_util.LOCAL, train_batch_size=schedule_manager.batch_size, eval_batch_size=schedule_manager.batch_size, params={ # TPUEstimator needs to populate batch_size itself due to sharding. - key: value for key, value in params.items() if key != "batch_size"}, + key: value for key, value in params.items() if key != "batch_size" + }, config=run_config) diff --git a/official/transformer/translate.py b/official/r1/transformer/translate.py similarity index 100% rename from official/transformer/translate.py rename to official/r1/transformer/translate.py diff --git a/official/transformer/README.md b/official/transformer/README.md index 83f759ccc500d557e9e244db98be92579e82ba5b..0b24dabf20905429c445ec84529a3f0092ce23f1 100644 --- a/official/transformer/README.md +++ b/official/transformer/README.md @@ -1,35 +1,19 @@ # Transformer Translation Model -This is an implementation of the Transformer translation model as described in the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. 
Based on the code provided by the authors: [Transformer code](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py) from [Tensor2Tensor](https://github.com/tensorflow/tensor2tensor). Also, check out the [tutorial](https://www.tensorflow.org/beta/tutorials/text/transformer) on Transformer in TF 2.0. - -**A new Keras-based TF 2.0 implementation is located inside v2 folder. Please follow the [README](v2/README.md) in v2 folder to walk through the new Transformer.** - -Transformer is a neural network architecture that solves sequence to sequence problems using attention mechanisms. Unlike traditional neural seq2seq models, Transformer does not involve recurrent connections. The attention mechanism learns dependencies between tokens in two sequences. Since attention weights apply to all tokens in the sequences, the Transformer model is able to easily capture long-distance dependencies. - -Transformer's overall structure follows the standard encoder-decoder pattern. The encoder uses self-attention to compute a representation of the input sequence. The decoder generates the output sequence one token at a time, taking the encoder output and previous decoder-outputted tokens as inputs. - -The model also applies embeddings on the input and output tokens, and adds a constant positional encoding. The positional encoding adds information about the position of each token. +This is an implementation of the Transformer translation model as described in +the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The +implementation leverages tf.keras and makes sure it is compatible with TF 2.0. ## Contents * [Contents](#contents) * [Walkthrough](#walkthrough) - * [Benchmarks](#benchmarks) - * [Training times](#training-times) - * [Evaluation results](#evaluation-results) * [Detailed instructions](#detailed-instructions) * [Environment preparation](#environment-preparation) * [Download and preprocess datasets](#download-and-preprocess-datasets) * [Model training and evaluation](#model-training-and-evaluation) - * [Translate using the model](#translate-using-the-model) - * [Compute official BLEU score](#compute-official-bleu-score) - * [TPU](#tpu) - * [Export trained model](#export-trained-model) - * [Example translation](#example-translation) * [Implementation overview](#implementation-overview) * [Model Definition](#model-definition) - * [Model Estimator](#model-estimator) - * [Other scripts](#other-scripts) + * [Model Trainer](#model-trainer) * [Test dataset](#test-dataset) - * [Term definitions](#term-definitions) ## Walkthrough @@ -38,11 +22,11 @@ Below are the commands for running the Transformer model. See the model. ``` -cd /path/to/models/official/transformer - # Ensure that PYTHONPATH is correctly defined as described in # https://github.com/tensorflow/models/tree/master/official#requirements -# export PYTHONPATH="$PYTHONPATH:/path/to/models" +export PYTHONPATH="$PYTHONPATH:/path/to/models" + +cd /path/to/models/official/transformer/v2 # Export variables PARAM_SET=big @@ -51,47 +35,25 @@ MODEL_DIR=$HOME/transformer/model_$PARAM_SET VOCAB_FILE=$DATA_DIR/vocab.ende.32768 # Download training/evaluation/test datasets -python data_download.py --data_dir=$DATA_DIR +python3 data_download.py --data_dir=$DATA_DIR -# Train the model for 10 epochs, and evaluate after every epoch. -python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \ +# Train the model for 100000 steps and evaluate every 5000 steps on a single GPU. 
+# Each train step uses a batch budget of 4096 tokens and a maximum
+# sequence length of 64.
+python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
     --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
-    --bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de
+    --train_steps=100000 --steps_between_evals=5000 \
+    --batch_size=4096 --max_length=64 \
+    --bleu_source=$DATA_DIR/newstest2014.en \
+    --bleu_ref=$DATA_DIR/newstest2014.de \
+    --num_gpus=1 \
+    --enable_time_history=false
 
 # Run during training in a separate process to get continuous updates,
 # or after training is complete.
 tensorboard --logdir=$MODEL_DIR
-
-# Translate some text using the trained model
-python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
-   --param_set=$PARAM_SET --text="hello world"
-
-# Compute model's BLEU score using the newstest2014 dataset.
-python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \
-   --param_set=$PARAM_SET --file=$DATA_DIR/newstest2014.en --file_out=translation.en
-python compute_bleu.py --translation=translation.en --reference=$DATA_DIR/newstest2014.de
 ```
 
-## Benchmarks
-### Training times
-
-Currently, both big and base parameter sets run on a single GPU. The measurements below
-are reported from running the model on a P100 GPU.
-
-Param Set | batches/sec | batches per epoch | time per epoch
---- | --- | --- | ---
-base | 4.8 | 83244 | 4 hr
-big | 1.1 | 41365 | 10 hr
-
-### Evaluation results
-Below are the case-insensitive BLEU scores after 10 epochs.
-
-Param Set | Score
---- | --- |
-base | 27.7
-big | 28.9
-
-
 ## Detailed instructions
@@ -103,7 +65,8 @@ big | 28.9
    #### Export variables (optional)
 
    Export the following variables, or modify the values in each of the snippets below:
-   ```
+
+   ```shell
    PARAM_SET=big
    DATA_DIR=$HOME/transformer/data
    MODEL_DIR=$HOME/transformer/model_$PARAM_SET
@@ -118,7 +81,7 @@ big | 28.9
    Command to run:
    ```
-   python data_download.py --data_dir=$DATA_DIR
+   python3 data_download.py --data_dir=$DATA_DIR
    ```
 
    Arguments:
@@ -127,11 +90,20 @@ big | 28.9
 
 2. ### Model training and evaluation
 
-   [transformer_main.py](transformer_main.py) creates a Transformer model, and trains it using Tensorflow Estimator.
+   [transformer_main.py](v2/transformer_main.py) creates a Keras Transformer model,
+   and trains it using `model.fit()`.
+
+   Users need to adjust `batch_size` and `num_gpus` to get good performance when
+   running on multiple GPUs.
+
+   **Note:** when using multiple GPUs or TPUs, `batch_size` is the global batch
+   size for all devices. For example, if the batch size is `4096*4` and there are 4 devices,
+   each device will take 4096 tokens as a batch budget.
 
    Command to run:
   ```
-   python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
+   python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
      --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET
   ```
 
@@ -140,28 +112,76 @@ big | 28.9
    * `--model_dir`: Directory to save Transformer model training checkpoints.
    * `--vocab_file`: Path to subtoken vocabulary file. If data_download was used, you may find the file in `data_dir`.
    * `--param_set`: Parameter set to use when creating and training the model. Options are `base` and `big` (default).
+   * `--enable_time_history`: Whether to add the TimeHistory callback. If so, `--log_steps` must be specified.
+   * `--batch_size`: The number of tokens to consider in a batch. Combined with
+     `--max_length`, it determines how many sequences are used per batch.
 * Use the `--help` or `-h` flag to get a full list of possible arguments.
 
+   #### Using multiple GPUs
+   You can train these models on multiple GPUs using the `tf.distribute.Strategy` API.
+   You can read more about it in this
+   [guide](https://www.tensorflow.org/guide/distribute_strategy).
+
+   In this example, we have made it easier to use with just a command-line flag
+   `--num_gpus`. By default this flag is 1 if TensorFlow is compiled with CUDA,
+   and 0 otherwise.
+
+   - --num_gpus=0: Uses tf.distribute.OneDeviceStrategy with CPU as the device.
+   - --num_gpus=1: Uses tf.distribute.OneDeviceStrategy with GPU as the device.
+   - --num_gpus=2+: Uses tf.distribute.MirroredStrategy to run synchronous
+     distributed training across the GPUs.
+
+   #### Using TPUs
+
+   Note: This model will **not** work with TPUs on Colab.
+
+   You can train the Transformer model on Cloud TPUs using
+   `tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
+   strongly recommended that you go through the
+   [quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
+   create a TPU and GCE VM.
+
+   To run the Transformer model on a TPU, you must set
+   `--distribution_strategy=tpu`, `--tpu=$TPU_NAME`, and `--use_ctl=True`, where
+   `$TPU_NAME` is the name of your TPU in the Cloud Console.
+
+   An example command to run Transformer on a v2-8 or v3-8 TPU would be:
+
+   ```bash
+   python transformer_main.py \
+     --tpu=$TPU_NAME \
+     --model_dir=$MODEL_DIR \
+     --data_dir=$DATA_DIR \
+     --vocab_file=$DATA_DIR/vocab.ende.32768 \
+     --bleu_source=$DATA_DIR/newstest2014.en \
+     --bleu_ref=$DATA_DIR/newstest2014.de \
+     --batch_size=6144 \
+     --train_steps=2000 \
+     --static_batch=true \
+     --use_ctl=true \
+     --param_set=big \
+     --max_length=64 \
+     --decode_batch_size=32 \
+     --decode_max_length=97 \
+     --padded_decode=true \
+     --distribution_strategy=tpu
+   ```
+   Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.
+
    #### Customizing training schedule
 
    By default, the model will train for 10 epochs, and evaluate after every epoch. The training schedule may be defined through the flags:
-   * Training with epochs (default):
-     * `--train_epochs`: The total number of complete passes to make through the dataset
-     * `--epochs_between_evals`: The number of epochs to train between evaluations.
+   * Training with steps:
     * `--train_steps`: sets the total number of training steps to run.
     * `--steps_between_evals`: Number of training steps to run between evaluations.
 
-   Only one of `train_epochs` or `train_steps` may be set. Since the default option is to evaluate the model after training for an epoch, it may take 4 or more hours between model evaluations. To get more frequent evaluations, use the flags `--train_steps=250000 --steps_between_evals=1000`.
-
-   Note: At the beginning of each training session, the training dataset is reloaded and shuffled. Stopping the training before completing an epoch may result in worse model quality, due to the chance that some examples may be seen more than others. Therefore, it is recommended to use epochs when the model quality is important.
-
   #### Compute BLEU score during model evaluation
 
   Use these flags to compute the BLEU when the model evaluates:
+
   * `--bleu_source`: Path to file containing text to translate.
   * `--bleu_ref`: Path to file containing the reference translation.
-  * `--stop_threshold`: Train until the BLEU score reaches this lower bound. This setting overrides the `--train_steps` and `--train_epochs` flags.
When running `transformer_main.py`, use the flags: `--bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de` @@ -172,205 +192,25 @@ big | 28.9 ``` The values are displayed at [localhost:6006](localhost:6006). -3. ### Translate using the model - [translate.py](translate.py) contains the script to use the trained model to translate input text or file. Each line in the file is translated separately. - - Command to run: - ``` - python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \ - --param_set=$PARAM_SET --text="hello world" - ``` - - Arguments for initializing the Subtokenizer and trained model: - * `--model_dir` and `--param_set`: These parameters are used to rebuild the trained model - * `--vocab_file`: Path to subtoken vocabulary file. If data_download was used, you may find the file in `data_dir`. - - Arguments for specifying what to translate: - * `--text`: Text to translate - * `--file`: Path to file containing text to translate - * `--file_out`: If `--file` is set, then this file will store the input file's translations. - - To translate the newstest2014 data, run: - ``` - python translate.py --model_dir=$MODEL_DIR --vocab_file=$VOCAB_FILE \ - --param_set=$PARAM_SET --file=$DATA_DIR/newstest2014.en --file_out=translation.en - ``` - - Translating the file takes around 15 minutes on a GTX1080, or 5 minutes on a P100. - -4. ### Compute official BLEU score - Use [compute_bleu.py](compute_bleu.py) to compute the BLEU by comparing generated translations to the reference translation. - - Command to run: - ``` - python compute_bleu.py --translation=translation.en --reference=$DATA_DIR/newstest2014.de - ``` - - Arguments: - * `--translation`: Path to file containing generated translations. - * `--reference`: Path to file containing reference translations. - * Use the `--help` or `-h` flag to get a full list of possible arguments. - -5. ### TPU - TPU support for this version of Transformer is experimental. Currently it is present for - demonstration purposes only, but will be optimized in the coming weeks. - -## Export trained model -To export the model as a Tensorflow [SavedModel](https://www.tensorflow.org/guide/saved_model) format, use the argument `--export_dir` when running `transformer_main.py`. A folder will be created in the directory with the name as the timestamp (e.g. $EXPORT_DIR/1526427396). - -``` -EXPORT_DIR=$HOME/transformer/saved_model -python transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \ - --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET --export_model=$EXPORT_DIR -``` - -To inspect the SavedModel, use saved_model_cli: -``` -SAVED_MODEL_DIR=$EXPORT_DIR/{TIMESTAMP} # replace {TIMESTAMP} with the name of the folder created -saved_model_cli show --dir=$SAVED_MODEL_DIR --all -``` - -### Example translation -Let's translate **"hello world!"**, **"goodbye world."**, and **"Would you like some pie?"**. - -The SignatureDef for "translate" is: - - signature_def['translate']: - The given SavedModel SignatureDef contains the following input(s): - inputs['input'] tensor_info: - dtype: DT_INT64 - shape: (-1, -1) - name: Placeholder:0 - The given SavedModel SignatureDef contains the following output(s): - outputs['outputs'] tensor_info: - dtype: DT_INT32 - shape: (-1, -1) - name: model/Transformer/strided_slice_19:0 - outputs['scores'] tensor_info: - dtype: DT_FLOAT - shape: (-1) - name: model/Transformer/strided_slice_20:0 - -Follow the steps below to use the translate signature def: - -1. #### Encode the inputs to integer arrays. 
- This can be done using `utils.tokenizer.Subtokenizer`, and the vocab file in the SavedModel assets (`$SAVED_MODEL_DIR/assets.extra/vocab.txt`). - - ``` - from official.transformer.utils.tokenizer import Subtokenizer - s = Subtokenizer(PATH_TO_VOCAB_FILE) - print(s.encode("hello world!", add_eos=True)) - ``` - - The encoded inputs are: - * `"hello world!" = [6170, 3731, 178, 207, 1]` - * `"goodbye world." = [15431, 13966, 36, 178, 3, 1]` - * `"Would you like some pie?" = [9092, 72, 155, 202, 19851, 102, 1]` - -2. #### Run `saved_model_cli` to obtain the predicted translations - The encoded inputs should be padded so that they are the same length. The padding token is `0`. - ``` - ENCODED_INPUTS="[[26228, 145, 178, 1, 0, 0, 0], \ - [15431, 13966, 36, 178, 3, 1, 0], \ - [9092, 72, 155, 202, 19851, 102, 1]]" - ``` - - Now, use the `run` command with `saved_model_cli` to get the outputs. - - ``` - saved_model_cli run --dir=$SAVED_MODEL_DIR --tag_set=serve --signature_def=translate \ - --input_expr="input=$ENCODED_INPUTS" - ``` - - The outputs will look similar to: - ``` - Result for output key outputs: - [[18744 145 297 1 0 0 0 0 0 0 0 0 - 0 0] - [ 5450 4642 21 11 297 3 1 0 0 0 0 0 - 0 0] - [25940 22 66 103 21713 31 102 1 0 0 0 0 - 0 0]] - Result for output key scores: - [-1.5493642 -1.4032784 -3.252089 ] - ``` - -3. #### Decode the outputs to strings. - Use the `Subtokenizer` and vocab file as described in step 1 to decode the output integer arrays. - ``` - from official.transformer.utils.tokenizer import Subtokenizer - s = Subtokenizer(PATH_TO_VOCAB_FILE) - print(s.decode([18744, 145, 297, 1])) - ``` - The decoded outputs from above are: - * `[18744, 145, 297, 1] = "Hallo Welt"` - * `[5450, 4642, 21, 11, 297, 3, 1] = "Abschied von der Welt."` - * `[25940, 22, 66, 103, 21713, 31, 102, 1] = "Möchten Sie einen Kuchen?"` - ## Implementation overview A brief look at each component in the code: ### Model Definition -The [model](model) subdirectory contains the implementation of the Transformer model. The following files define the Transformer model and its layers: -* [transformer.py](model/transformer.py): Defines the transformer model and its encoder/decoder layer stacks. -* [embedding_layer.py](model/embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax probabilities from the decoder output. -* [attention_layer.py](model/attention_layer.py): Defines the multi-headed and self attention layers that are used in the encoder/decoder stacks. -* [ffn_layer.py](model/ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers. +* [transformer.py](v2/transformer.py): Defines a tf.keras.Model: `Transformer`. +* [embedding_layer.py](v2/embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax probabilities from the decoder output. +* [attention_layer.py](v2/attention_layer.py): Defines the multi-headed and self attention layers that are used in the encoder/decoder stacks. +* [ffn_layer.py](v2/ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers. Other files: -* [beam_search.py](model/beam_search.py) contains the beam search implementation, which is used during model inference to find high scoring translations. 
-* [model_params.py](model/model_params.py) contains the parameters used for the big and base models.
-* [model_utils.py](model/model_utils.py) defines some helper functions used in the model (calculating padding, bias, etc.).
-
-
-### Model Estimator
-[transformer_main.py](model/transformer.py) creates an `Estimator` to train and evaluate the model.
-
-Helper functions:
-* [utils/dataset.py](utils/dataset.py): contains functions for creating a `dataset` that is passed to the `Estimator`.
-* [utils/metrics.py](utils/metrics.py): defines metrics functions used by the `Estimator` to evaluate the
-
-### Other scripts
-
-Aside from the main file to train the Transformer model, we provide other scripts for using the model or downloading the data:
-
-#### Data download and preprocessing
+* [beam_search.py](v2/beam_search.py) contains the beam search implementation, which is used during model inference to find high scoring translations.
 
-[data_download.py](data_download.py) downloads and extracts data, then uses `Subtokenizer` to tokenize strings into arrays of int IDs. The int arrays are converted to `tf.Examples` and saved in the `tf.RecordDataset` format.
-
- The data is downloaded from the Workshop of Machine Translation (WMT) [news translation task](http://www.statmt.org/wmt17/translation-task.html). The following datasets are used:
-
- * Europarl v7
- * Common Crawl corpus
- * News Commentary v12
-
- See the [download section](http://www.statmt.org/wmt17/translation-task.html#download) to explore the raw datasets. The parameters in this model are tuned to fit the English-German translation data, so the EN-DE texts are extracted from the downloaded compressed files.
-
-The text is transformed into arrays of integer IDs using the `Subtokenizer` defined in [`utils/tokenizer.py`](util/tokenizer.py). During initialization of the `Subtokenizer`, the raw training data is used to generate a vocabulary list containing common subtokens.
-
-The target vocabulary size of the WMT dataset is 32,768. The set of subtokens is found through binary search on the minimum number of times a subtoken appears in the data. The actual vocabulary size is 33,708, and is stored in a 324kB file.
-
-#### Translation
-Translation is defined in [translate.py](translate.py). First, `Subtokenizer` tokenizes the input. The vocabulary file is the same used to tokenize the training/eval files. Next, beam search is used to find the combination of tokens that maximizes the probability outputted by the model decoder. The tokens are then converted back to strings with `Subtokenizer`.
-
-#### BLEU computation
-[compute_bleu.py](compute_bleu.py): Implementation from [https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/bleu_hook.py).
+### Model Trainer
+[transformer_main.py](v2/transformer_main.py) creates a `TransformerTask` to train and evaluate the model using tf.keras.
 
 ### Test dataset
 The [newstest2014 files](https://storage.googleapis.com/tf-perf-public/official_transformer/test_data/newstest2014.tgz)
 are extracted from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data).
 The raw text files are converted from the SGM format of the
 [WMT 2016](http://www.statmt.org/wmt16/translation-task.html) test sets.
The -newstest2014 files are put into the `$DATA_DIR` when executing -`data_download.py` - -## Term definitions - -**Steps / Epochs**: -* Step: unit for processing a single batch of data -* Epoch: a complete run through the dataset - -Example: Consider a training a dataset with 100 examples that is divided into 20 batches with 5 examples per batch. A single training step trains the model on one batch. After 20 training steps, the model will have trained on every batch in the dataset, or one epoch. - -**Subtoken**: Words are referred to as tokens, and parts of words are referred to as 'subtokens'. For example, the word 'inclined' may be split into `['incline', 'd_']`. The '\_' indicates the end of the token. The subtoken vocabulary list is guaranteed to contain the alphabet (including numbers and special characters), so all words can be tokenized. +newstest2014 files are put into the `$DATA_DIR` when executing `data_download.py` diff --git a/official/transformer/model/__init__.py b/official/transformer/model/__init__.py index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..07d1285d3c1fb287d6ded2149f748edbee6bcea5 100644 --- a/official/transformer/model/__init__.py +++ b/official/transformer/model/__init__.py @@ -0,0 +1,20 @@ +# Copyright 2019 The TensorFlow Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== +"""Bring in the shared legacy Transformer modules into this module.""" + +from official.r1.transformer import transformer +from official.r1.transformer import ffn_layer +from official.r1.transformer import embedding_layer +from official.r1.transformer import attention_layer diff --git a/official/transformer/transformer_estimator_benchmark.py b/official/transformer/transformer_estimator_benchmark.py deleted file mode 100644 index f1b757b04197722895fff255a1cba0905c5511d0..0000000000000000000000000000000000000000 --- a/official/transformer/transformer_estimator_benchmark.py +++ /dev/null @@ -1,510 +0,0 @@ -# Copyright 2019 The TensorFlow Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-# ============================================================================== -"""Executes Transformer w/Estimator benchmark and accuracy tests.""" -from __future__ import absolute_import -from __future__ import division -from __future__ import print_function - -import os -import time - -from absl import flags -from absl.testing import flagsaver -import tensorflow as tf # pylint: disable=g-bad-import-order - -from official.transformer import transformer_main as transformer_main -from official.utils.flags import core as flags_core -from official.utils.logs import hooks - -TRANSFORMER_EN2DE_DATA_DIR_NAME = 'wmt32k-en2de-official' -EN2DE_2014_BLEU_DATA_DIR_NAME = 'newstest2014' -FLAGS = flags.FLAGS - - -class EstimatorBenchmark(tf.test.Benchmark): - """Methods common to executing transformer w/Estimator tests. - - Code under test for the Transformer Estimator models report the same data - and require the same FLAG setup. - """ - local_flags = None - - def __init__(self, output_dir=None, default_flags=None, flag_methods=None): - if not output_dir: - output_dir = '/tmp' - self.output_dir = output_dir - self.default_flags = default_flags or {} - self.flag_methods = flag_methods or {} - - def _get_model_dir(self, folder_name): - """Returns directory to store info, e.g. saved model and event log.""" - return os.path.join(self.output_dir, folder_name) - - def _setup(self): - """Sets up and resets flags before each test.""" - tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO) - if EstimatorBenchmark.local_flags is None: - for flag_method in self.flag_methods: - flag_method() - # Loads flags to get defaults to then override. List cannot be empty. - flags.FLAGS(['foo']) - # Overrides flag values with defaults for the class of tests. - for k, v in self.default_flags.items(): - setattr(FLAGS, k, v) - saved_flag_values = flagsaver.save_flag_values() - EstimatorBenchmark.local_flags = saved_flag_values - else: - flagsaver.restore_flag_values(EstimatorBenchmark.local_flags) - - def _report_benchmark(self, - stats, - wall_time_sec, - bleu_max=None, - bleu_min=None): - """Report benchmark results by writing to local protobuf file. - - Args: - stats: dict returned from estimator models with known entries. - wall_time_sec: the during of the benchmark execution in seconds. - bleu_max: highest passing level for bleu score. - bleu_min: lowest passing level for bleu score. - """ - examples_per_sec_hook = None - for hook in stats['train_hooks']: - if isinstance(hook, hooks.ExamplesPerSecondHook): - examples_per_sec_hook = hook - break - - eval_results = stats['eval_results'] - metrics = [] - if 'bleu_uncased' in stats: - metrics.append({'name': 'bleu_uncased', - 'value': stats['bleu_uncased'], - 'min_value': bleu_min, - 'max_value': bleu_max}) - - if examples_per_sec_hook: - exp_per_second_list = examples_per_sec_hook.current_examples_per_sec_list - # ExamplesPerSecondHook skips the first 10 steps. - exp_per_sec = sum(exp_per_second_list) / (len(exp_per_second_list)) - metrics.append({'name': 'exp_per_second', - 'value': exp_per_sec}) - - flags_str = flags_core.get_nondefault_flags_as_str() - self.report_benchmark(iters=eval_results['global_step'], - wall_time=wall_time_sec, - metrics=metrics, - extras={'flags': flags_str}) - - -class TransformerBigEstimatorAccuracy(EstimatorBenchmark): - """Benchmark accuracy tests for Transformer Big model w/Estimator.""" - - def __init__(self, output_dir=None, root_data_dir=None, **kwargs): - """Benchmark accuracy tests for Transformer Big model w/Estimator. 
- - Args: - output_dir: directory where to output, e.g. log files. - root_data_dir: directory under which to look for dataset. - **kwargs: arbitrary named arguments. This is needed to make the - constructor forward compatible in case PerfZero provides more - named arguments before updating the constructor. - """ - flag_methods = [transformer_main.define_transformer_flags] - - self.train_data_dir = os.path.join(root_data_dir, - TRANSFORMER_EN2DE_DATA_DIR_NAME) - - self.vocab_file = os.path.join(root_data_dir, - TRANSFORMER_EN2DE_DATA_DIR_NAME, - 'vocab.ende.32768') - - self.bleu_source = os.path.join(root_data_dir, - EN2DE_2014_BLEU_DATA_DIR_NAME, - 'newstest2014.en') - - self.bleu_ref = os.path.join(root_data_dir, - EN2DE_2014_BLEU_DATA_DIR_NAME, - 'newstest2014.de') - - super(TransformerBigEstimatorAccuracy, self).__init__( - output_dir=output_dir, flag_methods=flag_methods) - - def benchmark_graph_8_gpu(self): - """Benchmark graph mode 8 gpus. - - SOTA is 28.4 BLEU (uncased). - """ - self._setup() - FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref - FLAGS.param_set = 'big' - FLAGS.batch_size = 3072 * 8 - FLAGS.train_steps = 100000 - FLAGS.steps_between_evals = 5000 - FLAGS.model_dir = self._get_model_dir('benchmark_graph_8_gpu') - FLAGS.hooks = ['ExamplesPerSecondHook'] - self._run_and_report_benchmark() - - def benchmark_graph_8_gpu_static_batch(self): - """Benchmark graph mode 8 gpus. - - SOTA is 28.4 BLEU (uncased). - """ - self._setup() - FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref - FLAGS.param_set = 'big' - FLAGS.batch_size = 3072 * 8 - FLAGS.static_batch = True - FLAGS.max_length = 64 - FLAGS.train_steps = 100000 - FLAGS.steps_between_evals = 5000 - FLAGS.model_dir = self._get_model_dir('benchmark_graph_8_gpu') - FLAGS.hooks = ['ExamplesPerSecondHook'] - self._run_and_report_benchmark() - - def _run_and_report_benchmark(self, bleu_min=28.3, bleu_max=29): - """Run benchmark and report results. - - Args: - bleu_min: minimum expected uncased bleu. default is SOTA. - bleu_max: max expected uncased bleu. default is a high number. - """ - start_time_sec = time.time() - stats = transformer_main.run_transformer(flags.FLAGS) - wall_time_sec = time.time() - start_time_sec - self._report_benchmark(stats, - wall_time_sec, - bleu_min=bleu_min, - bleu_max=bleu_max) - - -class TransformerBaseEstimatorAccuracy(EstimatorBenchmark): - """Benchmark accuracy tests for Transformer Base model w/ Estimator.""" - - def __init__(self, output_dir=None, root_data_dir=None, **kwargs): - """Benchmark accuracy tests for Transformer Base model w/ Estimator. - - Args: - output_dir: directory where to output e.g. log files - root_data_dir: directory under which to look for dataset - **kwargs: arbitrary named arguments. This is needed to make the - constructor forward compatible in case PerfZero provides more - named arguments before updating the constructor. 
- """ - flag_methods = [transformer_main.define_transformer_flags] - - self.train_data_dir = os.path.join(root_data_dir, - TRANSFORMER_EN2DE_DATA_DIR_NAME) - - self.vocab_file = os.path.join(root_data_dir, - TRANSFORMER_EN2DE_DATA_DIR_NAME, - 'vocab.ende.32768') - - self.bleu_source = os.path.join(root_data_dir, - EN2DE_2014_BLEU_DATA_DIR_NAME, - 'newstest2014.en') - - self.bleu_ref = os.path.join(root_data_dir, - EN2DE_2014_BLEU_DATA_DIR_NAME, - 'newstest2014.de') - - super(TransformerBaseEstimatorAccuracy, self).__init__( - output_dir=output_dir, flag_methods=flag_methods) - - def benchmark_graph_2_gpu(self): - """Benchmark graph mode 2 gpus. - - The paper uses 8 GPUs and a much larger effective batch size, this is will - not converge to the 27.3 BLEU (uncased) SOTA. - """ - self._setup() - FLAGS.num_gpus = 2 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref - FLAGS.param_set = 'base' - FLAGS.batch_size = 4096 * 2 - FLAGS.train_steps = 100000 - FLAGS.steps_between_evals = 5000 - FLAGS.model_dir = self._get_model_dir('benchmark_graph_2_gpu') - FLAGS.hooks = ['ExamplesPerSecondHook'] - # These bleu scores are based on test runs after at this limited - # number of steps and batch size after verifying SOTA at 8xV100s. - self._run_and_report_benchmark(bleu_min=25.3, bleu_max=26) - - def benchmark_graph_8_gpu(self): - """Benchmark graph mode 8 gpus. - - SOTA is 27.3 BLEU (uncased). - Best so far is 27.2 with 4048*8 at 75,000 steps. - 27.009 with 4096*8 at 100,000 steps and earlier. - Other test: 2024 * 8 peaked at 26.66 at 100,000 steps. - """ - self._setup() - FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref - FLAGS.param_set = 'base' - FLAGS.batch_size = 4096 * 8 - FLAGS.train_steps = 100000 - FLAGS.steps_between_evals = 5000 - FLAGS.model_dir = self._get_model_dir('benchmark_graph_8_gpu') - FLAGS.hooks = ['ExamplesPerSecondHook'] - self._run_and_report_benchmark() - - def benchmark_graph_8_gpu_static_batch(self): - """Benchmark graph mode 8 gpus. - - SOTA is 27.3 BLEU (uncased). - Best so far is 27.2 with 4048*8 at 75,000 steps. - 27.009 with 4096*8 at 100,000 steps and earlier. - Other test: 2024 * 8 peaked at 26.66 at 100,000 steps. - """ - self._setup() - FLAGS.num_gpus = 8 - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. - FLAGS['bleu_source'].value = self.bleu_source - FLAGS['bleu_ref'].value = self.bleu_ref - FLAGS.param_set = 'base' - FLAGS.batch_size = 4096 * 8 - FLAGS.static_batch = True - FLAGS.max_length = 64 - FLAGS.train_steps = 100000 - FLAGS.steps_between_evals = 5000 - FLAGS.model_dir = self._get_model_dir('benchmark_graph_8_gpu') - FLAGS.hooks = ['ExamplesPerSecondHook'] - self._run_and_report_benchmark() - - def benchmark_graph_fp16_8_gpu(self): - """benchmark 8 gpus with fp16 mixed precision. - - SOTA is 27.3 BLEU (uncased). - """ - self._setup() - FLAGS.num_gpus = 8 - FLAGS.dtype = 'fp16' - FLAGS.data_dir = self.train_data_dir - FLAGS.vocab_file = self.vocab_file - # Sets values directly to avoid validation check. 
-    FLAGS['bleu_source'].value = self.bleu_source
-    FLAGS['bleu_ref'].value = self.bleu_ref
-    FLAGS.param_set = 'base'
-    FLAGS.batch_size = 4096 * 8
-    FLAGS.train_steps = 100000
-    FLAGS.steps_between_evals = 5000
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_fp16_8_gpu')
-    FLAGS.hooks = ['ExamplesPerSecondHook']
-    self._run_and_report_benchmark()
-
-  def _run_and_report_benchmark(self, bleu_min=27.3, bleu_max=28):
-    """Run benchmark and report results.
-
-    Args:
-      bleu_min: minimum expected uncased bleu. Default is SOTA.
-      bleu_max: max expected uncased bleu. Default is a high number.
-    """
-    start_time_sec = time.time()
-    stats = transformer_main.run_transformer(flags.FLAGS)
-    wall_time_sec = time.time() - start_time_sec
-    self._report_benchmark(stats,
-                           wall_time_sec,
-                           bleu_min=bleu_min,
-                           bleu_max=bleu_max)
-
-
-class TransformerEstimatorBenchmark(EstimatorBenchmark):
-  """Benchmarks for Transformer (Base and Big) using Estimator."""
-
-  def __init__(self, output_dir=None, default_flags=None, batch_per_gpu=4096):
-    """Initialize.
-
-    Args:
-      output_dir: Base directory for saving artifacts, e.g. checkpoints.
-      default_flags: default flags to use for all tests.
-      batch_per_gpu: batch size to use per gpu.
-    """
-    flag_methods = [transformer_main.define_transformer_flags]
-    self.batch_per_gpu = batch_per_gpu
-
-    super(TransformerEstimatorBenchmark, self).__init__(
-        output_dir=output_dir,
-        default_flags=default_flags,
-        flag_methods=flag_methods)
-
-  def benchmark_graph_1_gpu(self):
-    """Benchmark graph mode 1 gpu."""
-    self._setup()
-    FLAGS.num_gpus = 1
-    FLAGS.batch_size = self.batch_per_gpu
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_1_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_fp16_1_gpu(self):
-    """Benchmark graph mode fp16 1 gpu."""
-    self._setup()
-    FLAGS.num_gpus = 1
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = self.batch_per_gpu
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_fp16_1_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_2_gpu(self):
-    """Benchmark graph mode 2 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 2
-    FLAGS.batch_size = self.batch_per_gpu * 2
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_2_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_fp16_2_gpu(self):
-    """Benchmark graph mode fp16 2 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 2
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = self.batch_per_gpu * 2
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_fp16_2_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_4_gpu(self):
-    """Benchmark graph mode 4 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 4
-    FLAGS.batch_size = self.batch_per_gpu * 4
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_4_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_fp16_4_gpu(self):
-    """Benchmark graph mode fp16 4 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 4
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = self.batch_per_gpu * 4
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_fp16_4_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_8_gpu(self):
-    """Benchmark graph mode 8 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 8
-    FLAGS.batch_size = self.batch_per_gpu * 8
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_8_gpu')
-    self._run_and_report_benchmark()
-
-  def benchmark_graph_fp16_8_gpu(self):
-    """Benchmark graph mode fp16 8 gpus."""
-    self._setup()
-    FLAGS.num_gpus = 8
-    FLAGS.dtype = 'fp16'
-    FLAGS.batch_size = self.batch_per_gpu * 8
-    FLAGS.model_dir = self._get_model_dir('benchmark_graph_fp16_8_gpu')
-    self._run_and_report_benchmark()
-
-  def _run_and_report_benchmark(self):
-    start_time_sec = time.time()
-    stats = transformer_main.run_transformer(flags.FLAGS)
-    wall_time_sec = time.time() - start_time_sec
-    self._report_benchmark(stats, wall_time_sec)
-
-
-class TransformerBaseEstimatorBenchmarkSynth(TransformerEstimatorBenchmark):
-  """Transformer Base synthetic data benchmark tests."""
-
-  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
-    def_flags = {}
-    def_flags['param_set'] = 'base'
-    def_flags['use_synthetic_data'] = True
-    def_flags['train_steps'] = 200
-    def_flags['steps_between_evals'] = 200
-    def_flags['hooks'] = ['ExamplesPerSecondHook']
-
-    super(TransformerBaseEstimatorBenchmarkSynth, self).__init__(
-        output_dir=output_dir, default_flags=def_flags)
-
-
-class TransformerBaseEstimatorBenchmarkReal(TransformerEstimatorBenchmark):
-  """Transformer Base real data benchmark tests."""
-
-  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
-    train_data_dir = os.path.join(root_data_dir,
-                                  TRANSFORMER_EN2DE_DATA_DIR_NAME)
-    vocab_file = os.path.join(root_data_dir,
-                              TRANSFORMER_EN2DE_DATA_DIR_NAME,
-                              'vocab.ende.32768')
-
-    def_flags = {}
-    def_flags['param_set'] = 'base'
-    def_flags['vocab_file'] = vocab_file
-    def_flags['data_dir'] = train_data_dir
-    def_flags['train_steps'] = 200
-    def_flags['steps_between_evals'] = 200
-    def_flags['hooks'] = ['ExamplesPerSecondHook']
-
-    super(TransformerBaseEstimatorBenchmarkReal, self).__init__(
-        output_dir=output_dir, default_flags=def_flags)
-
-
-class TransformerBigEstimatorBenchmarkReal(TransformerEstimatorBenchmark):
-  """Transformer Big real data benchmark tests."""
-
-  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
-    train_data_dir = os.path.join(root_data_dir,
-                                  TRANSFORMER_EN2DE_DATA_DIR_NAME)
-    vocab_file = os.path.join(root_data_dir,
-                              TRANSFORMER_EN2DE_DATA_DIR_NAME,
-                              'vocab.ende.32768')
-
-    def_flags = {}
-    def_flags['param_set'] = 'big'
-    def_flags['vocab_file'] = vocab_file
-    def_flags['data_dir'] = train_data_dir
-    def_flags['train_steps'] = 200
-    def_flags['steps_between_evals'] = 200
-    def_flags['hooks'] = ['ExamplesPerSecondHook']
-
-    super(TransformerBigEstimatorBenchmarkReal, self).__init__(
-        output_dir=output_dir, default_flags=def_flags, batch_per_gpu=3072)
-
-
-class TransformerBigEstimatorBenchmarkSynth(TransformerEstimatorBenchmark):
-  """Transformer Big synthetic data benchmark tests."""
-
-  def __init__(self, output_dir=None, root_data_dir=None, **kwargs):
-    def_flags = {}
-    def_flags['param_set'] = 'big'
-    def_flags['use_synthetic_data'] = True
-    def_flags['train_steps'] = 200
-    def_flags['steps_between_evals'] = 200
-    def_flags['hooks'] = ['ExamplesPerSecondHook']
-
-    super(TransformerBigEstimatorBenchmarkSynth, self).__init__(
-        output_dir=output_dir, default_flags=def_flags, batch_per_gpu=3072)
diff --git a/official/transformer/v2/README.md b/official/transformer/v2/README.md
deleted file mode 100644
index 1b17f83386c9f02ef54a16c8e73828b38b5c1c1e..0000000000000000000000000000000000000000
--- a/official/transformer/v2/README.md
+++ /dev/null
@@ -1,216 +0,0 @@
-# Transformer Translation Model
-This is an implementation of the Transformer translation model as described in
-the [Attention is All You Need](https://arxiv.org/abs/1706.03762) paper. The
-implementation leverages tf.keras and ensures compatibility with TF 2.0.
-
-## Contents
-  * [Contents](#contents)
-  * [Walkthrough](#walkthrough)
-  * [Detailed instructions](#detailed-instructions)
-    * [Environment preparation](#environment-preparation)
-    * [Download and preprocess datasets](#download-and-preprocess-datasets)
-    * [Model training and evaluation](#model-training-and-evaluation)
-  * [Implementation overview](#implementation-overview)
-    * [Model Definition](#model-definition)
-    * [Model Trainer](#model-trainer)
-    * [Test dataset](#test-dataset)
-
-## Walkthrough
-
-Below are the commands for running the Transformer model. See the
-[Detailed instructions](#detailed-instructions) for more details on running the
-model.
-
-```
-# Ensure that PYTHONPATH is correctly defined as described in
-# https://github.com/tensorflow/models/tree/master/official#requirements
-export PYTHONPATH="$PYTHONPATH:/path/to/models"
-
-cd /path/to/models/official/transformer/v2
-
-# Export variables
-PARAM_SET=big
-DATA_DIR=$HOME/transformer/data
-MODEL_DIR=$HOME/transformer/model_$PARAM_SET
-VOCAB_FILE=$DATA_DIR/vocab.ende.32768
-
-# Download training/evaluation/test datasets
-python3 ../data_download.py --data_dir=$DATA_DIR
-
-# Train the model for 100000 steps and evaluate every 5000 steps on a single GPU.
-# Each train step takes 4096 tokens as the batch budget, with 64 as the maximal
-# sequence length.
-python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
-    --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET \
-    --train_steps=100000 --steps_between_evals=5000 \
-    --batch_size=4096 --max_length=64 \
-    --bleu_source=$DATA_DIR/newstest2014.en \
-    --bleu_ref=$DATA_DIR/newstest2014.de \
-    --num_gpus=1 \
-    --enable_time_history=false
-
-# Run during training in a separate process to get continuous updates,
-# or after training is complete.
-tensorboard --logdir=$MODEL_DIR
-```
-
-## Detailed instructions
-
-
-0. ### Environment preparation
-
-   #### Add models repo to PYTHONPATH
-   Follow the instructions described in the [Requirements](https://github.com/tensorflow/models/tree/master/official#requirements) section to add the models folder to the python path.
-
-   #### Export variables (optional)
-
-   Export the following variables, or modify the values in each of the snippets below:
-
-   ```shell
-   PARAM_SET=big
-   DATA_DIR=$HOME/transformer/data
-   MODEL_DIR=$HOME/transformer/model_$PARAM_SET
-   VOCAB_FILE=$DATA_DIR/vocab.ende.32768
-   ```
-
-1. ### Download and preprocess datasets
-
-   [data_download.py](../data_download.py) downloads and preprocesses the training and evaluation WMT datasets. After the data is downloaded and extracted, the training data is used to generate a vocabulary of subtokens. The evaluation and training strings are tokenized, and the resulting data is sharded, shuffled, and saved as TFRecords.
-
-   1.75GB of compressed data will be downloaded. In total, the raw files (compressed, extracted, and combined files) take up 8.4GB of disk space. The resulting TFRecord and vocabulary files are 722MB. The script takes around 40 minutes to run, with the bulk of the time spent downloading and ~15 minutes spent on preprocessing.
-
-   Command to run:
-   ```
-   python3 ../data_download.py --data_dir=$DATA_DIR
-   ```
-
-   Arguments:
-   * `--data_dir`: Path where the preprocessed TFRecord data and vocab file will be saved.
-   * Use the `--help` or `-h` flag to get a full list of possible arguments.
-
-2. ### Model training and evaluation
-
-   [transformer_main.py](transformer_main.py) creates a Transformer Keras model
-   and trains it using Keras `model.fit()`.
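-
-   In outline, training alternates chunks of `model.fit()` with evaluation,
-   driven by the `--train_steps` and `--steps_between_evals` flags. The sketch
-   below illustrates the pattern only; the helper name and signature are
-   hypothetical, and the actual logic lives in `transformer_main.py`:
-
-   ```python
-   import tensorflow as tf
-
-   def train_and_eval(model: tf.keras.Model, train_ds: tf.data.Dataset,
-                      train_steps: int, steps_between_evals: int):
-     """Illustrative only: trains in chunks, evaluating between chunks."""
-     current_step = 0
-     while current_step < train_steps:
-       num_steps = min(steps_between_evals, train_steps - current_step)
-       model.fit(train_ds, epochs=1, steps_per_epoch=num_steps)
-       current_step += num_steps
-       # Evaluation (e.g. the approximate BLEU computation) would run here.
-   ```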
-
-   Users need to adjust `batch_size` and `num_gpus` to get good performance
-   when running on multiple GPUs.
-
-   **Note that:**
-   when using multiple GPUs or TPUs, this is the global batch size for all
-   devices. For example, if the batch size is `4096*4` and there are 4 devices,
-   each device will take 4096 tokens as a batch budget.
-
-   Command to run:
-   ```
-   python3 transformer_main.py --data_dir=$DATA_DIR --model_dir=$MODEL_DIR \
-       --vocab_file=$VOCAB_FILE --param_set=$PARAM_SET
-   ```
-
-   Arguments:
-   * `--data_dir`: This should be set to the same directory given to the `data_download`'s `data_dir` argument.
-   * `--model_dir`: Directory to save Transformer model training checkpoints.
-   * `--vocab_file`: Path to subtoken vocabulary file. If data_download was used, you may find the file in `data_dir`.
-   * `--param_set`: Parameter set to use when creating and training the model. Options are `base` and `big` (default).
-   * `--enable_time_history`: Whether to add the TimeHistory callback. If so, `--log_steps` must be specified.
-   * `--batch_size`: The number of tokens to consider in a batch. Combined with
-     `--max_length`, this determines how many sequences are used per batch.
-   * Use the `--help` or `-h` flag to get a full list of possible arguments.
-
-   #### Using multiple GPUs
-   You can train these models on multiple GPUs using the `tf.distribute.Strategy` API.
-   You can read more about them in this
-   [guide](https://www.tensorflow.org/guide/distribute_strategy).
-
-   In this example, we have made it easier to use with just a command-line flag
-   `--num_gpus`. By default this flag is 1 if TensorFlow is compiled with CUDA,
-   and 0 otherwise.
-
-   - --num_gpus=0: Uses tf.distribute.OneDeviceStrategy with CPU as the device.
-   - --num_gpus=1: Uses tf.distribute.OneDeviceStrategy with GPU as the device.
-   - --num_gpus=2+: Uses tf.distribute.MirroredStrategy to run synchronous
-     distributed training across the GPUs.
-
-   #### Using TPUs
-
-   Note: This model will **not** work with TPUs on Colab.
-
-   You can train the Transformer model on Cloud TPUs using
-   `tf.distribute.TPUStrategy`. If you are not familiar with Cloud TPUs, it is
-   strongly recommended that you go through the
-   [quickstart](https://cloud.google.com/tpu/docs/quickstart) to learn how to
-   create a TPU and GCE VM.
-
-   To run the Transformer model on a TPU, you must set
-   `--distribution_strategy=tpu`, `--tpu=$TPU_NAME`, and `--use_ctl=True`, where
-   `$TPU_NAME` is the name of your TPU in the Cloud Console.
-
-   An example command to run Transformer on a v2-8 or v3-8 TPU would be:
-
-   ```bash
-   python transformer_main.py \
-     --tpu=$TPU_NAME \
-     --model_dir=$MODEL_DIR \
-     --data_dir=$DATA_DIR \
-     --vocab_file=$DATA_DIR/vocab.ende.32768 \
-     --bleu_source=$DATA_DIR/newstest2014.en \
-     --bleu_ref=$DATA_DIR/newstest2014.de \
-     --batch_size=6144 \
-     --train_steps=2000 \
-     --static_batch=true \
-     --use_ctl=true \
-     --param_set=big \
-     --max_length=64 \
-     --decode_batch_size=32 \
-     --decode_max_length=97 \
-     --padded_decode=true \
-     --distribution_strategy=tpu
-   ```
-   Note: `$MODEL_DIR` and `$DATA_DIR` must be GCS paths.
-
-   #### Customizing training schedule
-
-   By default, the model will train for 10 epochs, and evaluate after every epoch. The training schedule may be defined through the flags:
-
-   * Training with steps:
-     * `--train_steps`: sets the total number of training steps to run.
-     * `--steps_between_evals`: Number of training steps to run between evaluations.
-
-   #### Compute BLEU score during model evaluation
-
-   Use these flags to compute the BLEU score when the model evaluates:
-
-   * `--bleu_source`: Path to file containing text to translate.
-   * `--bleu_ref`: Path to file containing the reference translation.
-
-   When running `transformer_main.py`, use the flags: `--bleu_source=$DATA_DIR/newstest2014.en --bleu_ref=$DATA_DIR/newstest2014.de`
-
-   #### Tensorboard
-   Training and evaluation metrics (loss, accuracy, approximate BLEU score, etc.) are logged, and can be displayed in the browser using Tensorboard.
-   ```
-   tensorboard --logdir=$MODEL_DIR
-   ```
-   The values are displayed at [localhost:6006](http://localhost:6006).
-
-## Implementation overview
-
-A brief look at each component in the code:
-
-### Model Definition
-* [transformer.py](transformer.py): Defines a tf.keras.Model: `Transformer`.
-* [embedding_layer.py](embedding_layer.py): Contains the layer that calculates the embeddings. The embedding weights are also used to calculate the pre-softmax probabilities from the decoder output.
-* [attention_layer.py](attention_layer.py): Defines the multi-headed attention and self-attention layers that are used in the encoder/decoder stacks.
-* [ffn_layer.py](ffn_layer.py): Defines the feedforward network that is used in the encoder/decoder stacks. The network is composed of 2 fully connected layers (a minimal sketch appears at the end of this README).
-
-Other files:
-* [beam_search.py](beam_search.py) contains the beam search implementation, which is used during model inference to find high scoring translations.
-
-### Model Trainer
-[transformer_main.py](transformer_main.py) creates a `TransformerTask` to train and evaluate the model using tf.keras.
-
-### Test dataset
-The [newstest2014 files](https://storage.googleapis.com/tf-perf-public/official_transformer/test_data/newstest2014.tgz)
-are extracted from the [NMT Seq2Seq tutorial](https://google.github.io/seq2seq/nmt/#download-data).
-The raw text files are converted from the SGM format of the
-[WMT 2016](http://www.statmt.org/wmt16/translation-task.html) test sets. The
-newstest2014 files are put into the `$DATA_DIR` when executing `data_download.py`
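-
-For reference, the two-layer feed-forward block mentioned under
-[Model Definition](#model-definition) follows the standard Transformer pattern
-from the paper: a ReLU-activated expansion followed by a linear projection. A
-minimal illustrative sketch with hypothetical names; see
-[ffn_layer.py](ffn_layer.py) for the actual implementation:
-
-```python
-import tensorflow as tf
-
-class FeedForwardNetwork(tf.keras.layers.Layer):
-  """Two fully connected layers, as described above (illustrative only)."""
-
-  def __init__(self, hidden_size, filter_size):
-    super(FeedForwardNetwork, self).__init__()
-    self.filter_layer = tf.keras.layers.Dense(filter_size, activation='relu')
-    self.output_layer = tf.keras.layers.Dense(hidden_size)
-
-  def call(self, x):
-    # x: [batch, length, hidden_size] -> [batch, length, hidden_size]
-    return self.output_layer(self.filter_layer(x))
-```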