8866fa47 · 4f2f6b7d · 4f2f6b7d · 4f2f6b7d · 4f2f6b7d · 4f2f6b7d
59 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -3,4 +3,17 @@ en/site/
 site/

 .idea
-.DS_Store
\ No newline at end of file
+.DS_Store
+
+log/
+core.*
+*.pyc
+*.ipynb
+/.vscode
+/.idea
+/manylinux*
+wheelhouse/
+wheelhouse*
+.clangd
+.cache
+/tmp
\ No newline at end of file
--- a/cn/docs/adv_examples/alexnet.md
+++ b/cn/docs/adv_examples/alexnet.md
--- a/cn/docs/adv_examples/bert.md
+++ b/cn/docs/adv_examples/bert.md
-
-## Model Overview
-BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model for NLP. This example implements a OneFlow version of BERT based on the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
-
-### Model Architecture
-| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |
-|:---------:|:----------:|:----:|:---:|:--------:|:---:|:----:|
-|BERTBASE |12 encoder| 768| 12|4 x  768|512|110M|
-
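-In practice, BERT is usually applied in two steps:
-
-* First, pre-train to obtain a BERT language model;
-
-* Then, to serve a downstream application, add one more network layer on top of the BERT language model and fine-tune it.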
-
-
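-## Quick Start
-### Get the Datasets
-We provide the [OFRecord dataset and related data files](https://oneflow-static.oss-cn-beijing.aliyuncs.com/oneflow-tutorial-attachments/bert_squad_dataset.zip) needed for BERT pre-training and SQuAD fine-tuning. Download and unzip them with the following commands: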
-
-```bash
-wget https://oneflow-static.oss-cn-beijing.aliyuncs.com/oneflow-tutorial-attachments/bert_squad_dataset.zip
-unzip bert_squad_dataset.zip
-```
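-The unzipped directory contains:
-
-* bert_config.json, vocab.txt: files needed to produce the prediction json file, taken from [google bert](https://github.com/google-research/bert)
-
-* dev-v1.1/, dev-v1.1.json: the SQuAD dev set, used for scoring
-
-* part-0: pre-training samples (40 samples)
-
-* train-v1.1: the SQuAD training set, already converted to the OFRecord dataset format
-
-These files are used in the pre-training task and the SQuAD fine-tuning described below.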
-
-### Train the BERT Model
-First, clone the `OneFlow-Benchmark` repository.
-
-```bash
-git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
-cd OneFlow-Benchmark/LanguageModeling/BERT/
-```
-
-Then run the following command to start BERT pre-training with our pre-trained model and the small sample set, and check the result:
-```bash
-python ./run_pretraining.py\
-    --gpu_num_per_node=1 \
-    --learning_rate=3e-5 \
-    --batch_size_per_device=1 \
-    --iter_num=3 \
-    --loss_print_every_n_iter=50 \
-    --seq_length=128 \
-    --max_predictions_per_seq=20 \
-    --num_hidden_layers=12 \
-    --num_attention_heads=12 \
-    --max_position_embeddings=512 \
-    --type_vocab_size=2 \
-    --vocab_size=30522 \
-    --attention_probs_dropout_prob=0.0 \
-    --hidden_dropout_prob=0.0 \
-    --hidden_size_per_head=64 \
-    --use_boxing_v2=True \
-    --data_dir=./dataset/ \
-    --data_part_num=1 \
-    --log_dir=./bert_regresssioin_test/of \
-    --loss_print_every_n_iter=5 \
-    --model_save_dir=./bert_regresssioin_test/of \
-    --warmup_batches 831 \
-    --save_last_snapshot True
-```
-You will see output similar to the following:
-```text
-==================================================================
-Running bert: num_gpu_per_node = 1, num_nodes = 1.
-==================================================================
-gpu_num_per_node = 1
-node_num = 1
-node_list = None
-learning_rate = 3e-05
-weight_decay_rate = 0.01
-batch_size_per_device = 1
-iter_num = 20
-warmup_batches = 831
-log_every_n_iter = 1
-data_dir = ./dataset/
-data_part_num = 1
-use_fp16 = None
-use_boxing_v2 = True
-loss_print_every_n_iter = 5
-model_save_every_n_iter = 10000
-model_save_dir = ./bert_regresssioin_test/of
-save_last_snapshot = True
-model_load_dir = None
-log_dir = ./bert_regresssioin_test/of
-seq_length = 128
-max_predictions_per_seq = 20
-num_hidden_layers = 12
-num_attention_heads = 12
-max_position_embeddings = 512
-type_vocab_size = 2
-vocab_size = 30522
-attention_probs_dropout_prob = 0.0
-hidden_dropout_prob = 0.0
-hidden_size_per_head = 64
------------------------------------------------------------------
-Time stamp: 2020-07-06-19:09:29
-I0706 19:09:29.605840639   34801 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used
-Init model on demand
-iter 4, total_loss: 11.032, mlm_loss: 10.281, nsp_loss: 0.751, speed: 33.086(sec/batch), 0.151(sentences/sec)
-iter 9, total_loss: 11.548, mlm_loss: 10.584, nsp_loss: 0.965, speed: 0.861(sec/batch), 5.806(sentences/sec)
-iter 14, total_loss: 10.697, mlm_loss: 10.249, nsp_loss: 0.448, speed: 0.915(sec/batch), 5.463(sentences/sec)
-iter 19, total_loss: 10.685, mlm_loss: 10.266, nsp_loss: 0.419, speed: 1.087(sec/batch), 4.602(sentences/sec)
-Saving model to ./bert_regresssioin_test/of/last_snapshot.
------------------------------------------------------------------
-average speed: 0.556(sentences/sec)
------------------------------------------------------------------
-```
-
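-## Details
-### Scripts
-| **Script** | **Description** | **Belongs to** |
-|:---------:|:----------:|:----------:|
-|pretrain.py, bert.py|Define the BERT network model|BERT|
-|run_pretraining.py|User script that launches BERT training. The training environment and hyperparameters are configured through command-line arguments; each argument is described under **Script Arguments** below.|BERT|
-|squad.py|Defines the SQuAD network|SQuAD|
-|run_squad.py|Launches SQuAD training|SQuAD|
-|run_squad_predict.py|Runs prediction with a trained SQuAD model|SQuAD|
-|npy2json.py|Script required to convert OneFlow prediction results into the prediction json format|SQuAD|
-|convert_tf_ckpt_to_of.py|Converts a TensorFlow model into the OneFlow model format|BERT/SQuAD|
-
-
-
-### Script Arguments
-`run_pretraining.py` configures the training environment, including hyperparameters, through command-line arguments. Run
-`run_pretraining.py --help` to list them. Each argument is described below: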
-
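-* gpu_num_per_node: number of GPUs on each node; OneFlow requires every node to have the same number of GPUs
-
-* node_num: number of nodes, i.e. the number of hosts used in distributed training
-
-* node_list: list of nodes, required when the node count is greater than 1. It is a comma-separated string, e.g. `--node_num=2 --node_list="192.168.1.12,192.168.1.14"`
-
-* learning_rate: learning rate
-
-* weight_decay_rate: weight decay rate
-
-* batch_size_per_device: batch size on each device in distributed training
-
-* iter_num: total number of training iterations
-
-* warmup_batches: number of warm-up iterations, default 10000
-
-* data_dir: path to the OFRecord dataset
-
-* data_part_num: number of data files in the OFRecord dataset directory
-
-* use_fp16: whether to use fp16
-
-* use_boxing_v2: whether to use boxing v2
-
-* loss_print_every_n_iter: print training information (loss) every this many iterations
-
-* model_save_every_n_iter: save the model every this many iterations
-
-* model_save_dir: path where the model is saved
-
-* save_last_snapshot: save the model after the final training iteration completes (passed as `--save_last_snapshot True` in the commands in this document)
-
-* model_load_dir: path the model is loaded from
-
-* log_dir: log path
-
-* seq_length: BERT sequence length, default 512
-
-* max_predictions_per_seq: default 80
-
-* num_hidden_layers: number of hidden layers, default 24
-
-* num_attention_heads: number of attention heads, default 16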
-
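-### Use the Full Wikipedia + BookCorpus Dataset
-Pre-training BERT from scratch requires a larger training set.
-
-If you are interested, download the dataset in TFRecord format from the [google-research BERT](https://github.com/google-research/bert) page, then convert the TFRecord data into an OFRecord dataset following the method in [Loading and Preparing OFRecord Datasets](../extended_topics/how_to_make_ofdataset.md).
-
-### Convert a TensorFlow BERT Model to the OneFlow Model Format
-If you want to use an already trained model directly for a fine-tuning task (such as the SQuAD task shown below), consider downloading a trained BERT model from the [google-research BERT](https://github.com/google-research/bert) page.
-
-Then use the `convert_tf_ckpt_to_of.py` script we provide to convert it to the OneFlow model format. The conversion works as follows:
-
-First, download and unzip a version of the BERT model, for example `uncased_L-12_H-768_A-12`.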
-```
-wget https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
-unzip uncased_L-12_H-768_A-12.zip -d uncased_L-12_H-768_A-12
-```
-
-Then run the following commands:
-```
-cd uncased_L-12_H-768_A-12/
-cat > checkpoint <<ONEFLOW
-model_checkpoint_path: "bert_model.ckpt"
-all_model_checkpoint_paths: "bert_model.ckpt"
-ONEFLOW
-```
-
-This creates a `checkpoint` file in the unzipped directory with the following content:
-```
-model_checkpoint_path: "bert_model.ckpt"
-all_model_checkpoint_paths: "bert_model.ckpt"
-```
-
-The TensorFlow model directory to be converted is now ready. Its structure is:
-```
-uncased_L-12_H-768_A-12
-├── bert_config.json
-├── bert_model.ckpt.data-00000-of-00001
-├── bert_model.ckpt.index
-├── checkpoint
-└── vocab.txt
-```
-
-Next, use `convert_tf_ckpt_to_of.py` to convert the TensorFlow model to a OneFlow model:
-```bash
-python convert_tf_ckpt_to_of.py \
-  --tf_checkpoint_path ./uncased_L-12_H-768_A-12 \
-  --of_dump_path ./uncased_L-12_H-768_A-12-oneflow
-```
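-The command above saves the converted OneFlow-format model in the `./uncased_L-12_H-768_A-12-oneflow` directory for later fine-tuning (for example, SQuAD).
-
-## Fine-tuning: the SQuAD Question Answering Task
-### Turn the Pre-trained Model into a SQuAD Model
-We only need to add an `output` layer on top of the BERT backbone and change the loss expression. The complete code is in the `squad.py` script; the key changes are: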
-```python
-def SQuADTrain():
-    #...
-    backbone = bert_util.BertBackbone()
-
-    # Add a fully connected layer on top of BERT
-    with flow.name_scope("cls-squad"):
-        final_hidden = backbone.sequence_output()
-        final_hidden_matrix = flow.reshape(final_hidden, [-1, hidden_size])
-        logits = bert_util._FullyConnected(
-                    final_hidden_matrix,
-                    hidden_size,
-                    units=2,
-                    weight_initializer=bert_util.CreateInitializer(initializer_range),
-                    name='output')
-        logits = flow.reshape(logits, [-1, seq_length, 2])
-
-        start_logits = flow.slice(logits, [None, None, 0], [None, None, 1])
-        end_logits = flow.slice(logits, [None, None, 1], [None, None, 1])
-
-    # Redefine the loss for the SQuAD task
-        start_loss = _ComputeLoss(start_logits, start_positions_blob, seq_length)
-        end_loss = _ComputeLoss(end_logits, end_positions_blob, seq_length)
-
-        total_loss = 0.5*(start_loss + end_loss)
-
-    return total_loss
-```
-
-To obtain an initialized SQuAD model, launch SQuAD training with the following script and save the model.
-
-```
-python ./run_squad.py\
-    --gpu_num_per_node=1\
-    --learning_rate=3e-5\
-    --batch_size_per_device=2\
-    --iter_num=50\
-    --loss_print_every_n_iter=50\
-    --seq_length=384\
-    --max_predictions_per_seq=20\
-    --num_hidden_layers=12\
-    --num_attention_heads=12\
-    --max_position_embeddings=512\
-    --type_vocab_size=2\
-    --vocab_size=30522\
-    --attention_probs_dropout_prob=0.0\
-    --hidden_dropout_prob=0.0\
-    --hidden_size_per_head=64\
-    --use_boxing_v2=True\
-    --data_dir=./dataset/train-v1.1\
-    --data_part_num=1\
-    --log_dir=./bert_regresssioin_test/of\
-    --model_save_dir=./bert_regresssioin_test/of\
-    --warmup_batches 831\
-    --save_last_snapshot True
-```
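-After training finishes, the initialized SQuAD model is saved in `./bert_regresssioin_test/of/last_snapshot`. We merge it with the trained BERT model and then run fine-tuning.
-
-### Merge the Pre-trained Model into the SQuAD Model
-The SQuAD model extends the pre-trained model. Following the "partial initialization and partial import" method in [Model Loading and Saving](../basics_topics/model_load_save.md), merge the trained BERT pre-trained model with the initialized SQuAD model.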
-
-```
-cp -R ./bert_regresssioin_test/of/last_snapshot ./squadModel
-cp -R --remove-destination ./dataset/uncased_L-12_H-768_A-12_oneflow/* ./squadModel/
-```
-
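-### The Training Step Count in OneFlow Pre-trained Models
-The model directory generated by OneFlow contains a subdirectory named `System-Train-TrainStep-xxx` (xxx is the name of the job function). The out file in that subdirectory stores the total number of training iterations, and this count is used to dynamically adjust the `learning rate` during training.
-
-To keep the saved iteration count from affecting fine-tuning, zero out the binary data in the out file: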
-```
-cd System-Train-TrainStep-xxx
-xxd -r > out <<ONEFLOW
-00000000: 0000 0000 0000 0000
-ONEFLOW
-```
-
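-If you are using a pre-trained model converted from TensorFlow, you can skip this step.
-
-### Start SQuAD Training
-Start training the SQuAD model with the `run_squad.py` script. The main settings are:
-
-* Use the merged SQuAD model `./squadModel` obtained above
-
-* Use SQuAD v1.1 as the training set
-
-* epoch = 3 (`iternum = 88641*3/(4*8) = 8310`, rounded down, with 4 GPUs and a per-device batch size of 8)
-
-* learning rate = 3e-5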
-
-```
-python ./run_squad.py\
-    --gpu_num_per_node=4\
-    --learning_rate=3e-5\
-    --batch_size_per_device=8\
-    --iter_num=8310\
-    --loss_print_every_n_iter=50\
-    --seq_length=384\
-    --max_predictions_per_seq=20\
-    --num_hidden_layers=12\
-    --num_attention_heads=12\
-    --max_position_embeddings=512\
-    --type_vocab_size=2\
-    --vocab_size=30522\
-    --attention_probs_dropout_prob=0.0\
-    --hidden_dropout_prob=0.0\
-    --hidden_size_per_head=64\
-    --use_boxing_v2=True\
-    --data_dir=./dataset/train-v1.1\
-    --data_part_num=8\
-    --log_dir=./bert_regresssioin_test/of\
-    --model_save_dir=./bert_regresssioin_test/of\
-    --warmup_batches 831\
-    --save_last_snapshot True\
-    --model_load_dir=./squadModel
-```
-
-### Prediction and Scoring
-To generate a json file in the [Prediction File](https://rajpurkar.github.io/SQuAD-explorer/) format, we first save the prediction results as an npy file and then use the `write_predictions` function from [google BERT's run_squad.py](https://github.com/google-research/bert/blob/master/run_squad.py) to convert them to json.
-
-Use `run_squad_predict.py` to generate the `all_results.npy` file:
-```bash
-python run_squad_predict.py \
-  --gpu_num_per_node=1 \
-  --batch_size_per_device=4 \
-  --iter_num=2709 \
-  --seq_length=384 \
-  --max_predictions_per_seq=20 \
-  --num_hidden_layers=12 \
-  --num_attention_heads=12 \
-  --max_position_embeddings=512 \
-  --type_vocab_size=2 \
-  --vocab_size=30522 \
-  --attention_probs_dropout_prob=0.0 \
-  --hidden_dropout_prob=0.0 \
-  --hidden_size_per_head=64 \
-  --use_boxing_v2=True \
-  --data_part_num=1 \
-  --data_dir=./dataset/dev-v1.1 \
-  --log_dir=./bert_regresssioin_test/of \
-  --model_load_dir=path/to/squadModel \
-  --warmup_batches 831
-```
-Make sure to change `model_load_dir` above to the **trained** squadModel.
-
-Once you have `all_results.npy`, run the `npy2json.py` script we provide (adapted from run_squad.py in google bert) in the [google bert](https://github.com/google-research/bert/) repository directory (note that this repository uses **tensorflow v1**):
-```
-python npy2json.py\
-  --vocab_file=./dataset/vocab.txt \
-  --bert_config_file=./dataset/bert_config.json \
-  --do_train=False \
-  --do_predict=True \
-  --all_results_file=./all_results.npy \
-  --predict_file=./dataset/dev-v1.1.json \
-  --max_seq_length=384 \
-  --doc_stride=128 \
-  --output_dir=./squad_base/
-```
-
-Make sure to change `all_results_file` to the path of the `all_results.npy` obtained in the previous step.
-
-Finally, you get the `predictions.json` file, which can be scored with [evaluate-v1.1.py](https://rajpurkar.github.io/SQuAD-explorer/).
-
-```bash
-python evaluate-v1.1.py \
-./dataset/dev-v1.1.json \
-path/to/squad_base/predictions.json
-```
-
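-## Distributed Training
-As described in the script arguments above, distributed training only requires adding the `node_num` option (the number of hosts) and the `node_list` option when launching the training script: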
-
-```bash
-python run_squad_predict.py \
-  --gpu_num_per_node=1 \
-  --batch_size_per_device=4 \
-  --iter_num=2709 \
-  --seq_length=384 \
-  --max_predictions_per_seq=20 \
-  --num_hidden_layers=12 \
-  --num_attention_heads=12 \
-  --max_position_embeddings=512 \
-  --type_vocab_size=2 \
-  --vocab_size=30522 \
-  --attention_probs_dropout_prob=0.0 \
-  --hidden_dropout_prob=0.0 \
-  --hidden_size_per_head=64 \
-  --use_boxing_v2=True \
-  --data_part_num=1 \
-  --data_dir=./dataset/dev-v1.1 \
-  --log_dir=./bert_regresssioin_test/of \
-  --model_load_dir=path/to/squadModel \
-  --warmup_batches 831 \
-  --node_num=2 \
-  --node_list="192.168.1.12,192.168.1.14"
-```
--- a/cn/docs/adv_examples/dcgan.md
+++ b/cn/docs/adv_examples/dcgan.md
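-# DCGAN tutorial
-
-
-
-## Introduction
-
-Generative adversarial networks (GANs) are a kind of generative network that learns a particular data distribution by having two networks compete against each other. DCGAN is a GAN built on convolution and transposed-convolution operations and is widely used for image generation.
-
-This example mainly demonstrates how to run a DCGAN network in OneFlow and does not focus on the principles and details of GANs. If you are interested, see:
-
- [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)
-
- [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/abs/1701.00160)
-
-
-
-## Alignment Test
-
-The core code of this example is in `dcgan.py`. Its model structure and parameters follow the TensorFlow [official example](https://www.tensorflow.org/tutorials/generative/dcgan).
-
-The following code runs a simple alignment test to check that the OneFlow model produces the same results as TensorFlow: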
-
-```python
-dcgan = DCGAN()
-dcgan.compare_with_tensorflow()
-```
-
-
-
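-## Dataset Preparation
-
-The example provides a dataset download script. Run `download.py` to download the mnist dataset; by default it is saved in the `./data/minst` directory.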
-
-```bash
-python download.py mnist
-```
-
-
-
-## Training
-
-Once the dataset is ready, train DCGAN with the `train` method of the DCGAN instance:
-
-```python
-dcgan.train(epochs=2)
-```
-
-Training outputs the generated images every `self.eval_interval` batches.
-
-![1](imgs/1.png)
-
-## Export an Animated GIF
-
-After training finishes, export the images as an animated GIF with the `save_to_gif` method of the DCGAN instance:
-
-```python
-dcgan.save_to_gif()
-```
\ No newline at end of file
--- a/cn/docs/adv_examples/imgs/1.png
+++ b/cn/docs/adv_examples/imgs/1.png
--- a/cn/docs/adv_examples/imgs/big_vocab_table_2x1024.png
+++ b/cn/docs/adv_examples/imgs/big_vocab_table_2x1024.png
--- a/cn/docs/adv_examples/imgs/big_vocab_table_7x1024.png
+++ b/cn/docs/adv_examples/imgs/big_vocab_table_7x1024.png
--- a/cn/docs/adv_examples/imgs/detected_000004.jpg
+++ b/cn/docs/adv_examples/imgs/detected_000004.jpg
--- a/cn/docs/adv_examples/imgs/detected_kite.jpg
+++ b/cn/docs/adv_examples/imgs/detected_kite.jpg
--- a/cn/docs/adv_examples/imgs/eval_auc_loss_500iters.png
+++ b/cn/docs/adv_examples/imgs/eval_auc_loss_500iters.png
--- a/cn/docs/adv_examples/imgs/fish.jpg
+++ b/cn/docs/adv_examples/imgs/fish.jpg
--- a/cn/docs/adv_examples/imgs/fixed_batch_size_latency.png
+++ b/cn/docs/adv_examples/imgs/fixed_batch_size_latency.png
--- a/cn/docs/adv_examples/imgs/fixed_batch_size_memory.png
+++ b/cn/docs/adv_examples/imgs/fixed_batch_size_memory.png
--- a/cn/docs/adv_examples/imgs/resnet50_validation_acuracy.png
+++ b/cn/docs/adv_examples/imgs/resnet50_validation_acuracy.png
--- a/cn/docs/adv_examples/imgs/scaled_batch_size_latency.png
+++ b/cn/docs/adv_examples/imgs/scaled_batch_size_latency.png
--- a/cn/docs/adv_examples/imgs/scaled_batch_size_latency_1gpu.png
+++ b/cn/docs/adv_examples/imgs/scaled_batch_size_latency_1gpu.png
--- a/cn/docs/adv_examples/imgs/scaled_batch_size_memory.png
+++ b/cn/docs/adv_examples/imgs/scaled_batch_size_memory.png
--- a/cn/docs/adv_examples/imgs/tiger.jpg
+++ b/cn/docs/adv_examples/imgs/tiger.jpg
--- a/cn/docs/adv_examples/imgs/train_eval_auc_loss.png
+++ b/cn/docs/adv_examples/imgs/train_eval_auc_loss.png
--- a/cn/docs/adv_examples/mask_rcnn.md
+++ b/cn/docs/adv_examples/mask_rcnn.md
--- a/cn/docs/adv_examples/resnet.md
+++ b/cn/docs/adv_examples/resnet.md
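-## Introduction
-
-### Image Classification and CNNs
-
-**Image classification** is an image-processing method that distinguishes objects of different categories by the different features reflected in the image. It is the foundation of higher-level vision tasks in computer vision, such as object detection, semantic segmentation, and face recognition.
-
-The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), commonly called the ImageNet competition, covers tasks such as image classification, object localization, and object detection. It is one of the most important competitions driving the field of computer vision.
-
-In the 2012 ImageNet competition, the deep convolutional network AlexNet made its debut and won ImageNet 2012 with a top-5 accuracy more than 10% ahead of the runner-up. Since then, deep learning methods represented by **CNNs (convolutional neural networks)** have flourished in computer vision, and more and deeper CNNs have been proposed, such as VGGNet from ImageNet 2014 and ResNet, the winner of ImageNet 2015.
-
-
-
-### ResNet
-
-[ResNet](https://arxiv.org/abs/1512.03385) won the 2015 ImageNet competition. Compared with traditional machine-learning classification algorithms, ResNet already performs remarkably well, and many later detection, segmentation, and recognition tasks have been built on top of it.
-
-The [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark) repository provides a OneFlow implementation of ResNet50 v1.5. After 90 epochs of training on the ImageNet-2012 dataset, it reaches 77.318% (top1) and 93.622% (top5) accuracy on the validation set.
-
-For the detailed parameter-alignment work, see the [cnns section of OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns).
-
-![resnet50_validation_acuracy](imgs/resnet50_validation_acuracy.png)
-
-
-
-**About ResNet50 v1.5:**
-
-> ResNet50 v1.5 is an improved version of the original [ResNet50 v1](https://arxiv.org/abs/1512.03385). Compared with the original model, accuracy is slightly higher (~0.5% top1). See [here](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5) for details.
->
-
-
-
-Ready to reproduce the results above yourself?
-
-
-
-Below, taking the ResNet50 above as the example, this article shows step by step how to train ResNet50 and run prediction with it in OneFlow.
-
-The main contents are:
-
-- Requirements
-  - Project installation and preparation
-
-- Quick Start
-  - Prediction / inference
-  - Training and validation
-  - Evaluation
-- Details
-  - Distributed training
-  - Mixed-precision training and prediction
-- Advanced
-  - Parameter alignment
-  - Dataset preparation (ImageNet2012)
-  - Converting a OneFlow model to an ONNX model
-
-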
-
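-## Requirements
-
-Don't worry, using OneFlow is easy. Complete the three steps below and you can start your image recognition journey with OneFlow.
-
-- Install OneFlow; see the [OneFlow project homepage](https://github.com/Oneflow-Inc/oneflow) for installation instructions.
-
-- Clone or download the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark) repository.
-
-  `git clone git@github.com:Oneflow-Inc/OneFlow-Benchmark.git`
-
-  `cd  OneFlow-Benchmark/Classification/cnns`
-
-- Prepare the dataset (optional)
-
-  - Use the synthetic dataset directly
-  - Download the ImageNet (2012) [mini dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip) we prepared and unzip it into the data directory
-  - Or build the full ImageNet dataset in OFRecord format (see the Advanced section below)
-
-We provide the general-purpose scripts `train.sh` and `inference.sh`, which work for training, validation, and inference of every CNN model in this repository. You can train or run inference with different models and datasets by changing their arguments.
-
-**About the models:**
-
-> By default we use resnet50. You can specify another model by changing the --model argument in the script, for example `--model="resnet50"` or `--model="vgg"`.
-
-**About the datasets:**
-
-
-> 1) To help readers get started quickly, we provide synthetic data. "Synthetic data" means that no data is loaded from disk; random data is generated directly in memory and used as the input source for the neural network.
->
-> 2) We also provide a small mini example dataset. Download it and unzip it into the data directory of the cnn project to start training right away. Once you are familiar with the workflow, you can build the full ImageNet2012 dataset by following the dataset preparation part.
->
-> 3) Using a dataset in OFRecord format improves data-loading efficiency (but this is not required; see [Data Input](../basics_topics/data_input.md), OneFlow can load numpy data directly).
-
-
-
-## Quick Start
-
-Now let's start the image recognition journey with OneFlow!
-
-First, switch to the directory: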
-
-```
-cd OneFlow-Benchmark/Classification/cnns
-```
-
-### Pre-trained Model
-
-#### resnet50
-
-[resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz ) (validation accuracy: 77.318% top1, 93.622% top5)
-
-### Prediction / Inference
-
-After downloading the pre-trained model, unzip it into the current directory and run:
-
-```
-sh inference.sh
-```
-
-This script calls the model to classify this goldfish picture:
-
-<div align="center">
-    <img src="imgs/fish.jpg" align='center'/>
-</div>
-
-If you see the following output, the prediction succeeded:
-
-```
-data/fish.jpg
-0.87059885 goldfish, Carassius auratus
-```
-
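-As you can see, the model judges that this picture is a goldfish with a probability of about 87.06%.
-
-### Training and Validation (Train & Validation)
-
-- Training is just as simple. Just run: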
-
-  ```
-  sh train.sh
-  ```
-
-  This starts training the model, and you will see output like the following:
-
-  ```
-  Loading synthetic data.
-  Loading synthetic data.
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_initial_model.
-  Init model on demand.
-  train: epoch 0, iter 10, loss: 7.197278, top_1: 0.000000, top_k: 0.000000, samples/s: 61.569
-  train: epoch 0, iter 20, loss: 6.177684, top_1: 0.000000, top_k: 0.000000, samples/s: 122.555
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_epoch_0.
-  train: epoch 0, iter 30, loss: 3.988656, top_1: 0.525000, top_k: 0.812500, samples/s: 120.337
-  train: epoch 1, iter 10, loss: 1.185733, top_1: 1.000000, top_k: 1.000000, samples/s: 80.705
-  train: epoch 1, iter 20, loss: 1.042017, top_1: 1.000000, top_k: 1.000000, samples/s: 118.478
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_epoch_1.
-  ...
-  ```
-
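-  >  To make the demo easy to run, we use the synthetic dataset by default so that you can quickly see the model running.
-
-  You can also use the [mini example dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip). Download it, unzip it into the data directory of the cnn project, and then change the training script as follows: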
-
-  ```
-  rm -rf core.*
-  rm -rf ./output/snapshots/*
-
-  DATA_ROOT=data/imagenet/ofrecord
-
-  python3 of_cnn_train_val.py \
-      --train_data_dir=$DATA_ROOT/train \
-      --num_examples=50 \
-      --train_data_part_num=1 \
-      --val_data_dir=$DATA_ROOT/validation \
-      --num_val_examples=50 \
-      --val_data_part_num=1 \
-      --num_nodes=1 \
-      --gpu_num_per_node=1 \
-      --model_update="momentum" \
-      --learning_rate=0.001 \
-      --loss_print_every_n_iter=1 \
-      --batch_size_per_device=16 \
-      --val_batch_size_per_device=10 \
-      --num_epoch=10 \
-      --model="resnet50"
-  ```
-
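-  Running this script trains a classification model on a mini ImageNet dataset of only 50 goldfish pictures. You can then use it to classify goldfish pictures.
-
-  If you need to train on the full ImageNet2012 dataset, see the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository.
-
-
-
-### Evaluation (Evaluate)
-
-You can evaluate the accuracy of the resnet50 model with a model you trained yourself or with the [resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz ) we provide (unzip it into the current directory).
-
-Just run: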
-
-```
-sh evaluate.sh
-```
-
-This reports the accuracy of the trained model on the 50,000-image validation set:
-
-```
-Time stamp: 2020-07-27-09:28:28
-Restoring model from resnet_v15_of_best_model_val_top1_77318.
-I0727 09:28:28.773988162    8411 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used
-Loading data from /dataset/ImageNet/ofrecord/validation
-validation: epoch 0, iter 195, top_1: 0.773277, top_k: 0.936058, samples/s: 1578.325
-validation: epoch 0, iter 195, top_1: 0.773237, top_k: 0.936078, samples/s: 1692.303
-validation: epoch 0, iter 195, top_1: 0.773297, top_k: 0.936018, samples/s: 1686.896
-```
-
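-> Before running `sh evaluate.sh`, make sure the ImageNet (2012) validation set is ready. For how to build the validation set, see the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository.
-
-Judging from the three evaluation runs, our model reaches a top1 accuracy of over 77.32% on ImageNet (2012).
-
-Finally, congratulations! You have completed training/validation, inference, and evaluation of the ResNet model on ImageNet. Give yourself a round of applause!
-
-
-
-## Details
-
-### Distributed Training
-
-**Simple, easy-to-use distributed training is one of OneFlow's signature features.**
-
-The OneFlow framework natively supports efficient distributed training from its underlying design. For distributed data parallelism in particular, users do not need to worry about how data is partitioned and synchronized when an algorithm scales from a single machine with a single GPU to multiple machines with multiple GPUs. In other words, with OneFlow, code written from a single-machine, single-GPU point of view **automatically gains multi-machine, multi-GPU distributed data parallelism.**
-
-
-#### How to Configure and Run Distributed Training
-
-Take the code shown in the "Quick Start" section above as the example again. In `train.sh`, use `--num_nodes` to specify the number of nodes (machines), `--node_ips` to specify the IP addresses of the nodes, and `--gpu_num_per_node` to specify the number of GPUs used on each node. That completes the distributed configuration.
-
-For example, to run distributed training on 2 machines with 8 GPUs in total, configure it like this: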
-
-```
-# train.sh
-python3 of_cnn_train_val.py \
-    --num_nodes=2 \
-    --node_ips="192.168.1.1, 192.168.1.2" \
-    --gpu_num_per_node=4 \
-    ...
-    --model="resnet50"
-```
-
-Then run the following on both machines at the same time:
-
-```
-./train.sh
-```
-
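-After the program starts, `watch -n 0.1 nvidia-smi` shows that the GPUs on both machines are working. After a while, output is printed on the screen of the first machine listed in `--node_ips`.
-
-
-### Mixed-Precision Training and Prediction
-
-OneFlow natively supports float16/float32 mixed-precision training. During training, the model parameters (weights) are trained in float16, while float32 is kept for gradient updates and computation. Because parameter storage is halved, training becomes faster.
-
-With float16/float32 mixed-precision training enabled in OneFlow, ResNet50 training can in theory reach a `1.7`x speedup.
-
-
-#### How to Enable float16 / float32 Mixed-Precision Training
-
-Just add the argument `--use_fp16=True` to the `train.sh` script.
-
-#### Mixed-Precision Model
-
-We provide a mixed-precision model trained for a full 90 epochs on ImageNet2012, Top_1: 77.33%
-
-You can download and use it directly: [resnet50_v15_fp16](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_fp16_of_best_model_val_top1_77330.zip)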
-
-
-
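-## Advanced
-
-### Parameter Alignment
-
-To keep OneFlow's ResNet50 implementation aligned with [NVIDIA's MXNet implementation](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5), we carefully aligned nearly everything: the learning rate, the choice of optimizer, the image parameters for data augmentation, and, at a finer level, the shape of every network layer and the bias and weight initialization. For the details of the alignment work, see the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository.
-
-
-
-### Dataset Preparation
-
-#### Overview of Image Classification Datasets
-
-Public datasets for image classification include CIFAR, ImageNet, and others. These datasets provide the original pictures in jpeg format.
-
-- [CIFAR](http://www.cs.toronto.edu/~kriz/cifar.html)
-  A small dataset for recognizing common objects, compiled by Hinton's students Alex Krizhevsky and Ilya Sutskever. It includes CIFAR-10 and CIFAR-100.
-
-- [ImageNet](http://image-net.org/index)
-  The ImageNet dataset generally refers to the datasets used in the Large Scale Visual Recognition Challenge (ILSVRC) between 2010 and 2017. The ImageNet data has changed slightly since 2010. The commonly used ImageNet-2012 dataset contains 1000 categories. Its training set contains 1,281,167 pictures, with 732 to 1300 pictures per category, and its validation set contains 50,000 pictures, 50 per category on average.
-
-For the complete ImageNet (2012) preparation process, see the [README](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/tools/README.md) in the tools directory.
-
-
-
-### Converting a OneFlow Model to an ONNX Model
-
-#### Introduction
-
-**ONNX (Open Neural Network Exchange)** is a widely used intermediate format for neural networks. Through the ONNX format, OneFlow models can be used by many deployment frameworks (such as OpenVINO, ONNX Runtime, and the mobile frameworks ncnn, tnn, and TEngine). This section describes how to convert a trained ResNet50 v1.5 model to an ONNX model and verify its correctness.
-
-#### Getting Started
-
-We provide complete code, [resnet\_to\_onnx.py](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/resnet_to_onnx.py), to help you convert and test the model.
-
-**Step 1:** Download the pre-trained model [resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz ) and unzip it into the current directory
-
-**Step 2:** Run `python3 resnet_to_onnx.py`
-
-This code converts the OneFlow model to an ONNX model, then loads the converted model with ONNX Runtime and tests it on a single picture. The test picture is: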
-
-<div align="center">
-    <img src="imgs/tiger.jpg" align='center'/>
-</div>
-> Image source: https://en.wikipedia.org/wiki/Tiger
-
-Output:
-
-```python
-Convert to onnx success! >>  onnx/model/resnet_v15_of_best_model_val_top1_77318.onnx
-data/tiger.jpg
-Are the results equal? Yes
-Class: tiger, Panthera tigris; score: 0.8112028241157532
-```
-
-
-
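-#### How to Generate an ONNX Model
-
-The example code above shows how to convert OneFlow's ResNet model to an ONNX model and gives an example of running prediction with ONNX Runtime. You can follow the steps below to convert a ResNet you trained yourself, or another model.
-
-**Step 1: Save the model weights locally**
-
-First specify the path of the OneFlow model to convert, then the path where the converted ONNX model will be stored, as in the example: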
-
-```python
-#set up your model path
-flow_weights_path = 'resnet_v15_of_best_model_val_top1_77318'
-onnx_model_dir = 'onnx/model'
-```
-
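-**Step 2: Create a job function for inference**
-
-Next, create a job function for inference. It contains only the network structure itself, without the operators that read OFRecord, and it accepts input directly as numpy arrays. See `InferenceNet` in `resnet_to_onnx.py` for reference.
-
-**Step 3: Call the `flow.onnx.export` method**
-
-The code then calls the `oneflow_to_onnx()` method, which wraps the core model conversion method: `flow.onnx.export()`
-
-**`flow.onnx.export`** produces an ONNX model from a OneFlow network. Its first argument is the inference-only job function described above, its second argument is the OneFlow model path, and its third argument is the path where the converted ONNX model is stored.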
-
-```python
-onnx_model = oneflow_to_onnx(InferenceNet, flow_weights_path, onnx_model_dir, external_data=False)
-```
-
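-#### Verify the Correctness of the ONNX Model
-
-After generating the ONNX model, you can run it with ONNX Runtime to verify that the OneFlow model and the ONNX model produce the same results for the same input. The corresponding code is `check_equality` in `resnet_to_onnx.py`.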
--- a/cn/docs/adv_examples/wide_deep.md
+++ b/cn/docs/adv_examples/wide_deep.md
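-# Wide & Deep
-
-[HugeCTR](https://github.com/NVIDIA/HugeCTR) is an efficient GPU framework from NVIDIA, designed for click-through rate (CTR) estimation training.
-
-OneFlow built a Wide & Deep Learning network (WDL) to benchmark against HugeCTR. The OneFlow-WDL network implements model parallelism and sparse updates. On a server with 8 TitanV GPUs (12 GB each), it supports a vocabulary size of over 400 million with no performance loss: performance is on par with a small vocabulary.
-
-This article describes how to train with the OneFlow-WDL network and presents some training results and analysis.
-
-## Environment and Preparation
-Running OneFlow-WDL requires a Python environment with OneFlow installed, plus [scikit-learn](https://scikit-learn.org/stable/install.html).
-### Software Requirements
-- python 3.x (recommended)
-- OneFlow 0.x
-- scikit-learn
-
-### Data Preparation
-We prepared a small [sample dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/datasets/wdl_ofrecord_examples.tgz) that you can download for a simple test.
-
-Alternatively, follow the steps in [Creating a WDL Dataset with Spark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/ClickThroughRate/WideDeepLearning/how_to_make_ofrecord_for_wdl.md) to download the original dataset from the CriteoLabs website and turn it into the OFRecord-format dataset that OneFlow needs.
-
-### The OneFlow-WDL Script
-The OneFlow-WDL script is a single file, `wdl_train_eval.py`. Download it from [here](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/ClickThroughRate/WideDeepLearning/wdl_train_eval.py).
-
-## Run the OneFlow-WDL Script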
-```
-EMBD_SIZE=1603616
-DATA_ROOT=/path/to/wdl/ofrecord
-python3 wdl_train_eval.py \
-  --train_data_dir $DATA_ROOT/train \
-  --train_data_part_num 256 \
-  --train_part_name_suffix_length=5 \
-  --eval_data_dir $DATA_ROOT/val \
-  --eval_data_part_num 256 \
-  --max_iter=300000 \
-  --loss_print_every_n_iter=1000 \
-  --eval_interval=1000 \
-  --batch_size=16384 \
-  --wide_vocab_size=$EMBD_SIZE \
-  --deep_vocab_size=$EMBD_SIZE \
-  --gpu_num 1
-```
-Usually, once the dataset location `DATA_ROOT` is set, the shell script above can be run. If the screen shows output similar to the following, it is running correctly.
-```
-1000 time 2020-07-08 00:28:08.066281 loss 0.503295350909233
-1000 eval_loss 0.4846755236387253 eval_auc 0.7616240146992771
-2000 time 2020-07-08 00:28:11.613961 loss 0.48661992555856703
-2000 eval_loss 0.4816856697201729 eval_auc 0.765256583562705
-3000 time 2020-07-08 00:28:15.149135 loss 0.48245503094792364
-3000 eval_loss 0.47835959643125536 eval_auc 0.7715609382514008
-4000 time 2020-07-08 00:28:18.686327 loss 0.47975033831596375
-4000 eval_loss 0.47925308644771575 eval_auc 0.7781267916810946
-```
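-## Test Results and Notes
-We ran a set of tests on OneFlow-WDL on a server with 8 TitanV GPUs with 12 GB of memory each, and ran tests with the same parameters using the docker container provided by HugeCTR.
-
-### Multi-GPU Performance Test
-The main goal is to measure the average latency per batch with different numbers of GPUs at batch size = 16384.
-The test was configured with 7 hidden layers of 1024 units each.
-
-The results are shown below:
-
-![image](imgs/fixed_batch_size_latency.png)
-
-We also recorded the actual peak GPU memory usage during the test. The results are shown below:
-
-![image](imgs/fixed_batch_size_memory.png)
-
-Taken together, the results show that from 1 to 8 GPUs, OneFlow-WDL is faster than HugeCTR while using less GPU memory.
-
-### Multi-GPU Performance Test with batch size = 16384 per GPU
-The main goal is to measure the average latency per batch when training on 1 to 8 GPUs while keeping a batch size of 16384 on each GPU.
-The test was configured with 7 hidden layers of 1024 units each.
-
-The results are shown below:
-
-![image](imgs/scaled_batch_size_latency.png)
-
-We also recorded the actual peak GPU memory usage during the test. The results are shown below:
-
-![image](imgs/scaled_batch_size_memory.png)
-
-Taken together, the results show that latency increases as the number of GPUs increases, and that OneFlow-WDL is faster than HugeCTR while using less GPU memory. Because each GPU keeps a batch size of 16384, the memory OneFlow uses per GPU does not change significantly.
-
-### Single-GPU Performance Test with Different Batch Sizes
-The main goal is to measure the average latency per batch at different batch sizes on a single GPU.
-The test was configured with 2 hidden layers of 1024 units each.
-
-The results are shown below:
-
-![image](imgs/scaled_batch_size_latency_1gpu.png)
-
-### Very Large Vocabulary Test
-OneFlow-WDL is configured with two embedding tables:
-- `wide_embedding`, of size vocab_size x 1
-- `deep_embedding`, of size vocab_size x 16
-
-The vocabulary size (vocab_size) in HugeCTR is 1603616. We started testing at 3200000 and went all the way up to a vocabulary size of 400 million. The results are shown below:
-
-![image](imgs/big_vocab_table_2x1024.png)
-![image](imgs/big_vocab_table_7x1024.png)
-
-In the figures above, the blue bars are the average latency per training batch and the red curve is GPU memory usage.
-
-Conclusion: as the vocabulary grows, GPU memory usage grows with it, but latency does not change noticeably.
-
-### Convergence Test 1
-We tested convergence with batch size = 512.
-
-The figure below shows the results for the first 500 steps. After every training step, 20 records from the validation set were used for validation. The curves in the figure are loss and AUC:
-![image](imgs/eval_auc_loss_500iters.png)
-
-Conclusion: AUC quickly rises above 0.75.
-
-### Convergence Test 2
-The setup is the same as in Convergence Test 1, but this time the average training loss is printed every 1000 training steps, followed by validation on 20 validation-set records. Training ran for 300,000 steps in total. The results are:
-
-![image](imgs/train_eval_auc_loss.png)
-
-Conclusions and analysis:
-1. The blue train loss curve shows clear downward steps. The whole training set has 36674623 records, so with batch_size=512 one pass over the dataset (one epoch) takes about 71630 steps, and 300,000 steps go through the training set a little more than 4 times. The steps in the blue curve confirm this. OneFlow supports shuffling data during training: each time the dataset has been fully consumed, the data is reshuffled to reduce overfitting.
-2. The orange curve is the validation loss. It keeps decreasing during the first two epochs and starts to rise from the third epoch, which indicates overfitting.
-3. The gray curve is the validation AUC. AUC also peaks in the second epoch, above 0.8, and declines in the following epochs.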
--- a/cn/docs/adv_examples/yolov3.md
+++ b/cn/docs/adv_examples/yolov3.md
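-## YoloV3
-
-## 1. Introduction
-
-The [YOLO](https://pjreddie.com/darknet/yolo/) family of algorithms (the classic v1 to v3) pioneered single-stage object detection networks. YOLO stands for "You only look once", which reflects its single-stage nature. The network is simple and the single-stage design is efficient, which sets it apart from two-stage detectors represented by Faster-RCNN. Since its release it has been popular in object detection for its speed and fairly high accuracy, and it is widely used and well regarded.
-
-Yolov3 is the classic and most complete of these (Yolov4 has also been released recently). It uses Darknet-53, which incorporates residual networks, as its backbone, and combines multi-scale features, three feature-map outputs, and upsampling, which greatly improves both model accuracy and small-object detection.
-
-
-
-<div align="center">
-    <img src="imgs/detected_000004.jpg" align='center'/>
-</div>
-
-
-
-In this article we provide a OneFlow implementation of Yolov3. Unlike other implementations, the NMS step on the output features is written in C++ and invoked through a custom user op. We also support doing NMS directly in Python code.
-
-
-
-## 2. Quick Start
-
-Before you start, make sure [oneflow](https://github.com/Oneflow-Inc/oneflow) is installed correctly and that `import oneflow` succeeds in a python3 environment.
-
-1. git clone [this repository](https://github.com/Oneflow-Inc/oneflow_yolov3) locally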
-
-```
-git clone --recursive https://github.com/Oneflow-Inc/oneflow_yolov3.git
-```
-2. 安装 python 依赖库
-
-```
-   pip install -r requirements.txt
-```
-3. 在项目 root 目录下，执行:
-
-```
-./scripts/build.sh
-```
-执行此脚本，将 cpp 代码中自定义的 op 算子编译成可调用执行的 .so 文件，您将在项目路径下看到：
-
- libdarknet.so
-
- liboneflow_yolov3.so
-
-
-
-### 预训练模型
-
-我们使用了 Yolov3 原作者提供的预训练模型—[yolov3.weight](https://pjreddie.com/media/files/yolov3.weights) ，经转换后生成了 OneFlow 格式的模型。下载预训练模型：[of_model_yolov3.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/of_model_yolov3.zip)  ，并将解压后的 `of_model` 文件夹放置在项目 root 目录下，即可使用。
-
-
-
-## 3. 预测/推理
-
-运行：
-
-```
-sh yolo_predict.sh
-```
-或者：
-```
-sh yolo_predict_python_data_preprocess.sh
-```
-
-运行脚本后，将在 `data/result` 下生成检测后带 bbox 标记框的图片：
-
-<div align="center">
-    <img src="imgs/detected_kite.jpg" align='center'/>
-</div>
-
-
-
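-Parameters:
-
-- `--pretrained_model`: path to the pretrained model
-
-- `--label_path`: path to the COCO class-label file (coco.name)
-
-- `--input_dir`: directory containing the images to detect
-
-- `--output_dir`: directory where the detection results are written
-
-- `--image_paths`: path(s) of one or more images to detect, for example:
-
-  --image_paths  'data/images/000002.jpg'  'data/images/000004.jpg'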
-
-训练同样很简单，准备好数据集后，只需要执行：`sh yolo_train.sh`即可，数据集制作过程见下文【数据集制作】部分。
-
-
-
-## 4. 数据集制作
-
-Yolov3 支持任意目标检测数据集，下面我们以 [COCO2014](http://cocodataset.org/#download) 制作过程为例，介绍训练/验证所需的数据集制作，其它数据集如 [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) 或自定义数据集等，都可以采用相同格式。
-
-### 资源文件
-
-下载 COCO2014 训练集和验证集图片，将解压后的 `train2014` 和 `val2014` 放在 `data/COCO/images` 目录下
-
-（如果本地已下载过 COCO2014 数据集，可以 ln 软链接 images 至本地 `train2014` 和 `val2014` 的父目录）
-
-准备资源文件：`labels`，`5k.part`，`trainvalno5k.part`
-
-```
-wget -c https://pjreddie.com/media/files/coco/5k.part
-wget -c https://pjreddie.com/media/files/coco/trainvalno5k.part
-wget -c https://pjreddie.com/media/files/coco/labels.tgz
-```
-
-### 脚本
-
-在 `data/COCO` 目录下执行脚本：
-
-```
-# get label file
-tar xzf labels.tgz
-
-# set up image list
-paste <(awk "{print \"$PWD\"}" <5k.part) 5k.part | tr -d '\t' > 5k.txt
-paste <(awk "{print \"$PWD\"}" <trainvalno5k.part) trainvalno5k.part | tr -d '\t' > trainvalno5k.txt
-
-# copy label txt to image dir
-find labels/train2014/ -name "*.txt"  | xargs -i cp {} images/train2014/
-find labels/val2014/   -name "*.txt"  | xargs -i cp {} images/val2014/
-```
-
-执行脚本将自动解压缩 `labels.tgz` 文件，并在当前目录下生成 `5k.txt` 和 `trainvalno5k.txt`，然后将 `labels/train2014` 和 `labels/val2014` 的所有 `label.txt` 文件复制到对应的训练集和验证集文件夹中( **保证图片和 label 在同一目录** )。
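The `paste`/`awk`/`tr` pipeline above prefixes every entry of a `.part` file with the current directory. A minimal Python sketch of the same step follows. It assumes each entry is a relative path that already starts with `/`; `build_image_list` and the two demo entries are hypothetical and are not part of the repository.

```python
import os
import tempfile


def build_image_list(part_path, list_path, root):
    """Write one absolute image path per line: root + relative entry."""
    with open(part_path) as src, open(list_path, "w") as dst:
        for line in src:
            rel = line.strip()
            if rel:  # skip blank lines
                dst.write(root + rel + "\n")


# Demo in a throwaway directory with a hypothetical two-entry part file.
workdir = tempfile.mkdtemp()
part_file = os.path.join(workdir, "5k.part")
with open(part_file, "w") as f:
    f.write("/images/val2014/a.jpg\n/images/val2014/b.jpg\n")

list_file = os.path.join(workdir, "5k.txt")
build_image_list(part_file, list_file, root=workdir)

with open(list_file) as f:
    lines = f.read().splitlines()
print(lines)
```

In the real layout `root` would be the `data/COCO` directory, which is what `$PWD` expands to in the shell version.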
-
-至此，完成整个数据集的准备过程。
-
-
-
-## 5.训练
-
-修改 `yolo_train.sh` 脚本中的参数，令：`--image_path_file="data/COCO/trainvalno5k.txt"` 并执行：
-
-```
-sh yolo_train.sh
-```
-
-即可开始训练过程，更详细的参数介绍如下：
-
- --gpu_num_per_node    每台机器使用的gpu数量
- --batch_size                     批大小
- --base_lr                           初始学习率
- --classes                           目标类别数量（COCO 80；VOC 20）
- --model_save_dir            模型存放文件夹路径
- --dataset_dir                    训练/验证集文件夹路径
- --num_epoch                   迭代总轮数
- --save_frequency            指定模型保存的epoch间隔
-
-
-## 说明
-
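-At present, when prediction is run through `yolo_predict.sh`, the data preprocessing depends on `darknet`.
-
-Specifically:
-
-- `predict decoder` calls the `load_image_color` and `letterbox_image` functions
-- `train decoder` calls the `load_data_detection` function
-
-These mainly involve the following operations, which will be replaced by `OneFlow decoder ops` in a later release: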
-
- image read
- nhwc -> nchw
- image / 255
- bgr2rgb
- resize_image
- fill_image
- random_distort_image
- clip image
- random flip image and box
- randomize_boxes
- correct_boxes
--- a/cn/docs/arch_design/overview.md
+++ b/cn/docs/arch_design/overview.md
-# OneFlow 架构设计概览
-
-在本文中，我们将简要介绍：
-
-* OneFlow 的核心架构
-
-* OneFlow 训练任务从 Python 层到运行时的流程
-
-
-通过阅读本文，可以对 OneFlow 的架构有一个初步的了解；本文末尾附上 OneFlow 各个模块技术特色深入介绍的文章索引，读者可以根据兴趣和需要自行选择。
-
-## OneFlow 的架构层次图解
-![summary of oneflow](imgs/design_overview.png)
-
-如上图所示，如果暂时略去 OneFlow 的上层模型库、底层支撑库，集中关注 OneFlow 内部架构中与神经网络训练直接相关的部分，总体上可分为三层：
-
-* Python层：用户通过调用Python接口来配置超参，并编写 OneFlow 的作业函数来定义网络，这一切的信息，最终会在 OneFlow 中序列化为字节流，传递给下一层-- **编译时层**；
-
-* 编译时层：OneFlow 实现的编译器，将接受 Python 层传递的字节流，并将字节流中所承载的作业函数的信息，经分析、优化后，编译、链接为 OneFlow 中的 **执行计划** (Execution Plan)，最后将 `Execution Plan` 传递给下一层-- **运行时层** ；
-
-* 运行时层：OneFlow 的执行引擎接收上一层传递来的执行计划(`Plan`)，执行计划由多个更小单元的任务描述(`Task Proto`)结构组成，OneFlow 的执行引擎会解析 `Plan`，并为每个 `Task Proto` 分配一个执行单元 **actor**，众多 `actor` 一起运作，完成 OneFlow 的 **去中心化、分布式、流式计算** 。
-
-有了以上的基本层次概念后，我们将在下文中，结合具体的数据结构与代码，向大家介绍 OneFlow 的Python层、编译时、运行时的整个流程是如何运行的。
-
-
-本文讨论对象为 OneFlow 脚本编程所对应的 `lazy` 模式， OneFlow 交互式编程所对应的 `eager` 模式不在本文讨论范围。
-
-### OneFlow 任务是如何跑起来的
-如果想结合 OneFlow 的源码研究 OneFlow 的设计，建议重点关注 OneFlow 源码目录下的 [protobuf](https://developers.google.cn/protocol-buffers/) 文件，OneFlow 中控制面的数据结构、协议，都是使用 `protobuf` 定义的，结合这些数据结构，可以更快理解 OneFlow 的内部设计。
-
-以下，我们将针对通常情况下 OneFlow 脚本执行过程(如[3分钟快速上手](../quick_start/quickstart_in_3_min.md))，逐层分析 OneFlow 在Python层、编译时和运行时到底都做了哪些工作。
-
-## Python 层次
-我们在使用 OneFlow 的过程中已经知道，OneFlow 需要使用`@oneflow.global_function`装饰器来修饰一个python编写的“作业函数”。
-比如：
-```python
-@flow.global_function(get_train_config())
-def train_job():
-  # ...
-```
-
-在`oneflow/python/framework/function_util.py`中可以找到`global_function`装饰器对应的内部代码：
-```python
-    def Decorator(job_func):
-        #...
-        sess = session_ctx.GetDefaultSession()
-
-        @functools.wraps(job_func)
-        def Func(*args, **kwargs):
-            return _RunLazyJob(sess, job_func, *args, **kwargs)
-
-        sess.AddJob(_CloneFunctionDesc(function_config.function_desc, job_func))
-        #...
-        return Func
-```
-可以看到，装饰器返回的是 `Func` 函数，我们在训练过程中调用的作业函数，其实真正执行的是此处的 `Func`。
-
-装饰器的主要作用有：
-
-* 通过调用`sess.AddJob`，将训练的环境配置及作业函数的信息，添加到当前 session 上下文中，我们将看到，这些信息在编译时会被用到
-
-* 通过修饰器，使得作业函数的调用被导向`_RunLazyJob`，我们将看到在`_RunLazyJob`中包括了编译 `job_func` 的代码
-
-以下，我们来展开讨论`sess.AddJob`与`_RunLazyJob`的细节。
-
-### 作业函数的序列化
-
-在`/oneflow/python/framework/session_util.py`中可以看到`AddJob`的实现：
-```python
-class Session(object):
-    #...
-    def AddJob(self, function_desc):
-        #...
-        self.job_name2function_desc_[function_desc.job_func.__name__] = function_desc
-```
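-As shown, `session` holds a dictionary named `job_name2function_desc_`. `AddJob` stores the name of the job function as the key and its configuration (`function_desc`) as the value. The configuration fields can be found in `oneflow/core/job/job.proto`.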
-
-将训练配置信息加入到 `session` 中的主要原因，是 OneFlow 在编译时需要这些信息来进行推理、优化。接下来我们来分析 OneFlow 在 Python层次是如何触发编译过程的。
-
-我们观察 `_RunLazyJob` 的内部实现，可以找到 OneFlow 进行序列化并触发 OneFlow C++ 层编译的代码位置：
-```python
-def _RunLazyJob(session, job_func, *args, **kwargs):
-    return session.TryInit().LazyRun(job_func, *args, **kwargs)
-```
-跟进 `session` 对象的 `TryInit` 方法，可以发现，`session.TryInit` 会根据当前 session 的状态，决定是否触发编译：
-```python
-class Session(object):
-    #...
-    def TryInit(self):
-        if self.status_ is SessionStatus.OPEN:
-            self.Init()
-        return self
-
-    def Init(self):
-        assert self.status_ is SessionStatus.OPEN
-        self.status_ = SessionStatus.RUNNING
-        #...
-        _TryCompleteConfigProto(self.config_proto)
-        #... (the enclosing statement is elided in this excerpt)
-        for job_name, func_desc in self.job_name2function_desc_.items():
-            compiler.Compile(self, func_desc, self.config_proto)
-        #...
-        c_api_util.StartGlobalSession()
-        return self
-```
-从以上代码可以看到，如果当前 Session 处于 "OPEN" 状态，那么 session 会调用 `Init`， 遍历之前通过 `AddJob` 设置在 session 中的 `job_name2function_desc_` 中的各个 job ，并且调用 `compiler.Compile` 编译，`compiler.Compile`的内部实现为：
-```python
-def Compile(session, function_desc, config_proto):
-    with InterpretScope(session, function_desc, config_proto):
-        _CompileJob(function_desc)
-        c_api_util.CurJobBuildAndInferCtx_Complete()
-```
-
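-`_CompileJob` serializes the job function described by `function_desc` and calls into the C++ layer to build and optimize the graph. `c_api_util.CurJobBuildAndInferCtx_Complete` then notifies the C++ layer that serialization is complete.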
-
-完成`compiler.Compile`的工作后，将通过`c_api_util.StartGlobalSession()` 触发 C++ 层，创建 session，开始 C++ 层的编译 Plan 的工作。
-
-
-### 作业函数的调用
-Recall the `_RunLazyJob` code shown earlier:
-```python
-def _RunLazyJob(session, job_func, *args, **kwargs):
-    return session.TryInit().LazyRun(job_func, *args, **kwargs)
-```
-As we have seen, `TryInit()` serializes the job function and notifies the compile-time layer to build the graph.
-
-而`LazyRun`内部，就对应了用户调用作业函数时，Python层如何运行作业函数。
-
-```python
-    def LazyRun(self, job_func, *arg):
-        #...
-        remote_blobs = self.LaunchUserJob(job_func, *arg)
-        #...
-        return LazyFutureRemoteBlobs(self).SetResult(remote_blobs).Inited()
-```
-其中 `LaunchUserJob` 接受的参数 `job_func` 与 `arg` 就分别是用户调用作业函数时的作业函数以及传递的参数。
-
-`LaunchUserJob` traverses the computation units that `job_func` needs to run, and finally executes the computation by calling `c_api_util.LaunchJob(job_instance)` inside `session.LaunchJob` (`/oneflow/python/framework/session_util.py`).
-
-值得一提的是，因为当用户调用作业函数时，OneFlow 已经完成了作业函数的编译构图，得到了执行计划(Execution Plan)，1个 Plan 由多个描述任务的`TaskProto`组成。以上`c_api_util.LaunchJob(job_instance)`所接受的参数`job_instance`，并不是作业函数本身，而是 Plan中的 `Task` 实例化对象，一个作业函数，将对应多个`job_instance`。
-
-## 编译期阶段
-上文提到的 Python 层的 `c_api_util.StartGlobalSession()` 会触发 C++ 代码中的 `StartGlobalSession` 并最终触发 OneFlow 编译时的入口函数 `Oneflow::Init` (`/oneflow/core/job/oneflow.cpp`)：
-```c++
-Maybe<void> Oneflow::Init(const oneflow::JobSet& job_set) {
-  // Runtime
-  JUST(CompileAndMergePlanOnMaster(job_set.job(), &plan_));
-  // ...
-}
-```
-可以看到，OneFlow 通过 `CompileAndMergePlanOnMaster` 完成编译构图，其中的`job_set.job()`是这个阶段的输入，它是包含了由 Python 接口所定义的神经网络结构及超参配置信息的序列化字节流，而 `plan_` 是输出，称为 **执行计划** (Execution Plan)。
-
-执行计划由一系列对于任务的描述(`oneflow/core/job/task.proto`)组成，每个任务自身都是一个图结构，描述了内部计算类型、内存配额、上游生产者和下游消费者等信息。这些信息包含了 OneFlow 运行时所需要的一切信息。
-
-以下是一个编译后的得到的 `Execution Plan`的图示([点击查看大图](imgs/plan_illustration.svg))：
-
-![Execution Plan](imgs/plan_illustration.svg)
-
-### 执行计划的生成过程
-进入到 `CompileAndMergePlanOnMaster` 中可以看到，首先，会调用一系列的 `MakeXXXJob(s)` 整合序列化后的作业函数信息， 加入到 `jobs` 中：
-```cpp
-Maybe<void> CompileAndMergePlanOnMaster(const PbRpf<Job>& conf_jobs, Plan* plan) {
-  std::vector<std::shared_ptr<Job>> jobs(conf_jobs.size());
-  //...	
-    if (/*...*/) {
-      MakeModelIoV2Jobs(jobs, var_op_name2parallel_blob_conf, AppendJob);
-    } else {
-      MakeModelIoJobs(jobs, var_op_name2parallel_blob_conf, AppendJob);
-    }
-  }
-  //...
-    for (const auto& pair : push_op_name2parallel_blob_conf) {
-	  //...
-      MakePushJob(std::string("System-Push-") + pair.first, 
-      //...
-    }
-    for (const auto& pair : pull_op_name2parallel_blob_conf) {
-	  //...
-      MakePullJob(std::string("System-Pull-") + pair.first, pair.first, pair.second,
-                  pull_job.get());
-    }
-  //...
-}
-```
-然后通过 `CompileCurJobOnMaster` 将 `jobs` 编译为 Plan，值得注意的是 `AddJobName2JobId` 会为每个 `job` 分配一个全局唯一的ID，用于运行时区分任务：
-```c++
-  FOR_RANGE(int64_t, i, 0, jobs.size()) {
-    AddJobName2JobId(jobs.at(i)->job_conf().job_name(), i);
-    //...
-    JUST(CompileCurJobOnMaster(jobs.at(i).get(), &sub_plans.at(i), true));
-  }
-```
-以上的编译过程，最终会调用 `Compiler::Compile`，在其内部完成 `TaskProto`的构建，并添加到 Plan 中(`oneflow/core/job/compiler.cpp`)：
-```c++
-  task_gph->ForEachNode([&](TaskNode* task_node) {
-    if (task_node->IsMeaningLess()) { return; }
-    task_node->ToProto(plan->mutable_task()->Add());
-  });
-```
-
-不过，以上步骤完成后，得到的 Plan 还不是最终完整的 Plan，OneFlow 还会增加 `main_plan`， 它对应了本节开始 Plan 图示中的 "System-Main-Tick-CriticalSection" 系列节点，具有同步与调度功能，将作为各项任务的入口：
-```c++
-    Plan main_plan;
-    //...
-    {
-      //...
-      MakeMainJob(&main_job, /*...*/);
-      //...
-      JUST(CompileMainJob(&main_job, /*...*/, &main_plan));
-    }
-```
-以上一切完成后，通过调用 `LinkMainPlan` 将各个 Plan 链接起来，得到这节开始的图片所示 Execution Plan：
-```c++
-LinkMainPlan(plan, main_plan, identity_tick_op_names);
-```
-
-执行计划是编译阶段与运行时的分界线，在得到执行计划后，OneFlow 将启动运行时，并根据执行计划中的信息执行任务。
-
-## 运行时阶段
-完成`CompileAndMergePlanOnMaster`后，OneFlow 会实例化`Runtime`，按照 `plan` 中的信息执行任务：
-
-```c++
-Maybe<void> Oneflow::Init(const oneflow::JobSet& job_set) {
-  // Runtime
-  JUST(CompileAndMergePlanOnMaster(job_set.job(), &plan_));
-  if (Global<MachineCtx>::Get()->IsThisMachineMaster()) {
-    runtime_buffers_scope_.reset(new RuntimeBuffersScope(plan_));
-  }
-  runtime_.reset(new Runtime(plan_, GetMaxVal<size_t>(), false));
-  //...
-}
-```
-在 `Runtime` (`oneflow/core/job/runtime.cpp`)的构造中，将 `Plan` 中的 task 分成了三类：
-```c++
-  std::vector<const TaskProto*> mdupdt_tasks;
-  std::vector<const TaskProto*> source_tasks;
-  std::vector<const TaskProto*> other_tasks;
-  int64_t this_machine_task_num = 0;
-  for (const TaskProto& task : plan.task()) {
-    if (task.machine_id() != Global<MachineCtx>::Get()->this_machine_id()) { continue; }
-    if (IsMdUpdtTaskType(task.task_type())) {
-      mdupdt_tasks.push_back(&task);
-    } else if (!HasNonCtrlConsumedRegstDescId(task)) {
-      source_tasks.push_back(&task);
-    } else {
-      other_tasks.push_back(&task);
-    }
-    this_machine_task_num += 1;
-  }
-```
-
-* mdupdt_tasks：【……】
-
-* source_tasks：【……】
-
-* other_tasks：【……】
-
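-As described earlier, a task contains its computation type, memory quota, upstream producers, downstream consumers and so on, which together make up **all the information needed at runtime**. OneFlow can therefore parse each Task and start threads to execute it.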
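The three-way split performed in the `Runtime` constructor above can be sketched in Python as follows. `ToyTask` and its fields are hypothetical simplifications of `TaskProto`; in particular, an empty `consumed_data_regsts` list stands in for `!HasNonCtrlConsumedRegstDescId(task)`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    """Hypothetical, simplified stand-in for a TaskProto."""
    task_id: int
    machine_id: int
    is_model_update: bool = False
    consumed_data_regsts: list = field(default_factory=list)


def partition_tasks(tasks, this_machine_id):
    """Split the tasks of one machine into the three groups used above."""
    mdupdt_tasks, source_tasks, other_tasks = [], [], []
    for task in tasks:
        if task.machine_id != this_machine_id:
            continue  # tasks of other machines are ignored
        if task.is_model_update:
            mdupdt_tasks.append(task)
        elif not task.consumed_data_regsts:
            source_tasks.append(task)  # consumes no data: a source
        else:
            other_tasks.append(task)
    return mdupdt_tasks, source_tasks, other_tasks


tasks = [
    ToyTask(0, 0, is_model_update=True, consumed_data_regsts=[1]),
    ToyTask(1, 0),
    ToyTask(2, 0, consumed_data_regsts=[7]),
    ToyTask(3, 1),  # belongs to another machine
]
mdupdt, source, other = partition_tasks(tasks, this_machine_id=0)
print([t.task_id for t in mdupdt],
      [t.task_id for t in source],
      [t.task_id for t in other])
```

Note that the model-update check comes first, so a model-update task is never classified as a source or "other" task even if it consumes data.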
-
-OneFlow 使用 `Actor` 执行线程，在 OneFlow 中 **数据是一等公民** ，编译阶段产生的 `Plan` 中的每个 `Task`，记录了自己数据的上游与下游，执行引擎会根据 `Task` 的记录，为每个 `Task` 实例化对应的 `Actor`， `Actor` 负责执行 `Task` 规定的数据处理或数据搬运任务。
-
-以下代码根据 `Task` 构建 `Actor`：
-
-```c++
-  RuntimeCtx* runtime_ctx = Global<RuntimeCtx>::Get();
-  runtime_ctx->NewCounter("constructing_actor_cnt", this_machine_task_num);
-  HandoutTasks(mdupdt_tasks);
-  HandoutTasks(source_tasks);
-  HandoutTasks(other_tasks);
-  runtime_ctx->WaitUntilCntEqualZero("constructing_actor_cnt");
-```
-
-OneFlow 执行引擎采用去中心化调度机制，每个 `Actor` 只需要与自己的上下游进行通信， **不需要** 所谓的 master 节点进行中转，actor之间使用消息(message)来实现生产者和消费者之间的握手协议。
-
-
-## OneFlow 各模块的技术特色
-以上，我们只是结合 OneFlow 的 Python 接口，简要介绍了 OneFlow 框架的运行流程。以下文章，分专题更深入介绍 OneFlow 框架内部的各个模块：
-
-
-### [OneFlow 的并行观](link)
-
-OneFlow 在Python接口层次提供了 `consistent_view`，在框架内部，为了提供给用户逻辑上统一的视角，将 `op` 在物理上的实现划分为多个 `kernel`，并且提出了 **SBP 并行签名机制** ，在严谨的数学基石上进行OneFlow 的工程实践。
-
-并且，OneFlow 的 `boxing` 机制，将 `SBP` 过程中的数据操作变为了透明黑盒，保证了用户使用 OneFlow 进行分布式训练时可保持逻辑单卡视角。
-
-### [自动并行](link)
-【一鹏的……】
-
-### [构图与优化](link)
-OneFlow 基于数据流模型描述计算任务，神经网络由一系列算子(Operator)构成的有向无环图(DAG)表示。并且通过注册一系列的 `PASS` 在构图与推导过程中进行优化。 
-
-### [Actor 机制](link)
-
-OneFlow 的执行引擎，采用 Actor 流水机制，实现了去中心化分布式流式计算，在统一的设计框架内解决了长期困扰深度学习的各类问题，如磁盘 IO、 copyHD、 去中心计算等。
-
-### [网络控制平面的协议设计](link)
-
-控制平面主要实现分布式系统的控制协议，包括节点初始化，集群发现，分布式锁等功能，通常这类网络通信需求只发生在系统初始化或退出阶段，编程易用性比追求苛刻性能更重要， OneFlow 基于 GRPC 实现了该模块。
-
-### [网络数据平面的网络通信实现](link)
-
-分布式深度学习系统在训练过程中 `Actor` 之间的消息及中间运算结果，具有高频、吞吐量大的特点，对网络通信要求高。
-
-OneFlow 自底层定制了网络传输模块用于数据平面的通信。并且有 RDMA 及 epoll 两套方案，可以做到在网络传输层次不依赖 `nccl`，扩大芯片选择范围。
-
-### [内存管理](link)
-
-OneFlow 的训练采用了 **纯静态内存分配方案**，在编译生成 `Plan` 的过程中，就已经确定了所有 `Task` 的内存使用情况，在整个运行时阶段，不再有内存的动态申请、释放，最大限度降低内存分配与回收带来的性能损耗。
-
-此外……【成城的内存方案】
-
-### [eager模式的实现](link)
-
-OneFlow 开发的 eager 模式，通过实现定制的虚拟机使得用户可以采用交互式的方式进行训练。
-
-
--- a/cn/docs/basics_topics/data_input.md
+++ b/cn/docs/basics_topics/data_input.md
@@ -7,7 +7,7 @@

 直接使用 NumPy 数据的方式简单方便，但仅适合小数据量的情况。因为当数据量过大时，可能在准备 NumPy 数据上遭遇效率瓶颈。因此，这种方式比较适合项目的初始阶段，快速验证和改进算法；

-OneFlow 的 DataLoader 内部采用了多线程和数据流水线等技术使得数据加载、数据预处理等效率更高。但是，需要为已经支持的格式[准备数据集](../extended_topics/how_to_make_ofdataset.md)或为 OneFlow 暂时还不支持的格式[开发自己的 DataLoader](../extended_topics/implement_data_loader.md)。因此，推荐在成熟的项目中使用。
+OneFlow 的 DataLoader 内部采用了多线程和数据流水线等技术使得数据加载、数据预处理等效率更高。但是，需要为已经支持的格式[准备数据集](../extended_topics/how_to_make_ofdataset.md)。因此，推荐在成熟的项目中使用。


 ## 使用 Numpy 数据作为输入
@@ -135,4 +135,4 @@ DataLoader 的返回值，如果是简单的基本数据类型，那么可以直
 `OFRecordImageDecoderRandomCrop` 负责图片解码并随机做了裁剪，`OFRecordRawDecoder` 负责从 ofrecord 对象中直接解码出标签， `image.Resize` 把裁剪后的图片调整成224x224的大小， `CropMirrorNormalize` 把图片进行了正则化。

 ## 支持更多格式的 DataLoader
-OneFlow 提供了一些 DataLoader 和预处理的算子，详细请参考 [oneflow.data](https://oneflow.readthedocs.io/en/master/data.html)。未来会不断丰富和优化这些算子，用户也可以参考 [这篇文章](../extended_topics/implement_data_loader.md) 自定义 DataLoader 满足特定的需求。
+OneFlow 提供了一些 DataLoader 和预处理的算子，详细请参考 [oneflow.data](https://oneflow.readthedocs.io/en/master/data.html)。
--- a/cn/docs/extended_topics/debug_by_vscode.md
+++ b/cn/docs/extended_topics/debug_by_vscode.md
-本文介绍如何配置 VS Code，搭建 OneFlow 的 GUI 开发环境。
-
-如果对于 VS Code 及其插件系统还不熟悉，可以参阅[官方文档](https://code.visualstudio.com/docs)。
-
-本文包括：
-
- 如何编译 `Debug` 版本的 OneFlow
- 远程调试所必需的 VS Code 插件的安装配置
-
-### 编译 Debug 版本的 OneFlow
-
-如果使用 `Release` 版本的 OneFlow，可能会因为编译器优化，导致在调试过程中程序实际运行位置与源码行不对应。
-
-因此我们需要编译 `Debug` 版本的 OneFlow，并且需要生成 clangd 所需要的 json 文件。
-
-在运行 cmake 的时候需要加上 `Debug` 及 `CMAKE_EXPORT_COMPILE_COMMANDS` 的 flag。
-
-```
-cmake .. \
--DCMAKE_BUILD_TYPE=Debug \
--DCMAKE_EXPORT_COMPILE_COMMANDS=1
-```
-
-- The `-DCMAKE_BUILD_TYPE=Debug` option builds the Debug version
-- The `-DCMAKE_EXPORT_COMPILE_COMMANDS` option generates, in the `build` directory, the `compile_commands.json` file needed later when configuring clangd
-
-### Remote - SSH
-本节内容仅为那些需要远程开发的人员准备，在本地主机上进行开发的人员 **可以略过此节**。
-
-通过 VS Code 的 Remote SSH 插件，可以通过 SSH 的方式连接远程服务器。
-
-![RemoteSSH](imgs/plugin-remote-ssh.png)
-
-我们的被调试对象 OneFlow 可以运行在远程主机上，然后通过 Remote SSH 将远程的情况和本地的 VS Code 用户操作连接起来， **像调试本地程序一样调试远程主机上的程序**。
-
-安装完成 Remote - SSH 后，按 F1，在弹出的搜索栏中选择 `Remote-SSH: Connect to Host...`，即可设置 SSH 的连接信息，连接远程主机。
-
-Remote - SSH 连接远程主机后，在插件一栏，会自动分类“远程”与“本地”，如果检测到需要在远程电脑上安装的插件，会显示为灰色，并带有 **Install in SSH:远程主机名** 的按钮，点击即可将对应插件安装在远程主机。
-
-![remotePlugin](imgs/plugin-remote-ssh-install.png)
-
-如上图，我们已经在远程主机安装 Python、clangd、Native Debug 插件，用于支持远程调试 OneFlow。
-
-The remote host, however, does not have the Go and HTML CSS Support plugins that are installed on the local host.
-
-
-### clangd
-经过简单的配置，clangd可以为我们提供代码补全、符号跳转等便利。
-
-在配置 clangd 之前，需要确认：
-
- 已经通过编译，生成了`compile_commands.json`文件
- 已经通过 Remote - SSH 在远程主机上安装了 clangd 插件
- **不要** 安装 VS Code 默认推荐的 ms-vscode.cpptools C/C++ 插件，因为 clangd 与之有冲突
-
-#### 配置 VS Code 中的 clangd 插件
-
-将 build 目录下的 `compile_commands.json` 文件软链接到 OneFlow 的源码根目录下，在 OneFlow 的源码根目录下：
-
-```
-ln -s ./build/compile_commands.json compile_commands.json
-```
-
-然后 `Ctrl+Shift+P` (macOS 下 `command+shift+p`)，找到 `Open Remote Settings` 选项，打开 `settings.json` 配置文件，在其中加入以下配置：
-
-```json
-    "clangd.path": "/path/to/bin/clangd",
-    "clangd.arguments": [
-        "-j",
-        "12",
-        "-clang-tidy"
-    ]
-```
-`clangd.arguments`的意义及更多参数选项，可查阅`clangd --help`。
-
-#### 使用 clangd
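-In the View->Output panel of VS Code, select "Clang Language Server" from the drop-down menu to see the parsing output of clangd. Once parsing has finished, selecting a symbol in the C/C++ source lets you jump to it.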
-
-按`Ctrl+P` (macOS 下 `command+P`) 后通过`@符号名`或`#符号名`可以分别实现当前文件内查找符号，或工程范围内查找符号。
-
-### native debug
-`Ctrl + Shift + D` (macOS 下 `command+shift+D`) 或者点击 activity bar 的 Run 按钮，进入到 Run 视图。
-
-![Run View](imgs/run-view.png)
-
-选择 `Create a launch.json file`，选择 gdb 模板。
-![gdb](imgs/gdb-select.png)
-
-然后设置相关参数：
-```json
-{
-    "version": "0.2.0",
-    "configurations": [
-        {
-            "name": "lenet", //自定义任务名
-            "type": "gdb",
-            "request": "launch",
-            "target": "/home/yaochi/.conda/envs/ycof/bin/python3", //python路径
-            "arguments": "lenet_train.py", //脚本
-            "cwd": "/home/yaochi/of_example", //脚本所在路径
-            "valuesFormatting": "parseText"
-        }
-    ]
-}
-```
-
-设置断点后，F5 启动调试：
-![调试截图](imgs/debug_snapshot.png)
-
-### 其它
-
-* 如果 VS Code 下载插件速度过慢，可以按照[官方文档](https://code.visualstudio.com/docs/setup/network)的步骤切换 `hostname` 或者设置代理。
-
-* 关于 clangd 安装配置的[官方介绍](https://clang.llvm.org/extra/clangd/Installation.html)
-
-* 关于 VS Code 的调试设置的[官方介绍](https://code.visualstudio.com/docs/editor/debugging)
-
-* clangd 的最新版本可能对 glibc 版本要求过高，导致报缺少库的错误。
-
-```
-./bin/clangd: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by ./bin/clangd)
-```
-
-此时可以下载其它更低 clangd 的版本（本文推荐版本为 9.0.0），早期版本的 clangd 需要到 [LLVM官网](https://releases.llvm.org/download.html) 下载整个LLVM工具链，其中包含有 clangd。
--- a/cn/docs/extended_topics/how_to_make_ofdataset.md
+++ b/cn/docs/extended_topics/how_to_make_ofdataset.md
@@ -106,9 +106,7 @@ def ofrecord_reader(

 对于与业务逻辑耦合的特定操作（如解码、解压等），我们还可以为 `ofrecord_reader` 定义预处理 op，让程序拥有很高的灵活性和扩展性。

-* 关于 DataLoader 及相关算子使用可以参考[数据输入](../basics_topics/data_input.md#dataloader)
-
-* 关于自定义 Op 可以参考[用户自定义 op](user_op.md)
+- 关于 DataLoader 及相关算子使用可以参考[数据输入](../basics_topics/data_input.md#dataloader)

 ## 其它格式数据与 OFRecord 数据集的相互转化
 参考[OFrecord数据格式](ofrecord.md)中 OFRecord 文件的存储格式及本文开头介绍的 OFRecord 数据集的文件名格式约定，我们完全可以自己制作 OFRecord 数据集。

--- a/cn/docs/extended_topics/implement_data_loader.md
+++ b/cn/docs/extended_topics/implement_data_loader.md
--- a/cn/docs/extended_topics/python_kernel_op.md
+++ b/cn/docs/extended_topics/python_kernel_op.md
-# 使用 Python 扩展 Op
-**注意** ：本文涉及的 Python Kernel 仅在 `gcc 4.8.5` 编译环境下充分测试，进一步的完善计划见 [Issue 3951](https://github.com/Oneflow-Inc/oneflow/issues/3951)。
-
-## 背景介绍
-OneFlow 将各种对于数据的处理都抽象成了算子（operator），简称 op。 op 是作用在输入 tensor 上的操作，并将操作的结果写到输出 tensor 上。OneFlow 内部已经提供了比较完备的 op 算子，可以在 [ops 目录](https://github.com/Oneflow-Inc/oneflow/tree/master/oneflow/python/ops)下找到。
-
-当 OneFlow 已有的 Python 算子及其组合无法满足构建神经网络的需求，或者 Python 层次的算子无法满足性能需求时，我们可以开发自定义 op。OneFlow 提供了两类开发自定义 Op 的途径，一类是以 Python 为主的 `Python Kernel` 开发，另外一类是[使用 C++ 扩展 Op](user_op.md)一文介绍的 `C++ Kernel` 开发。
-
-`Python Kernel` 因为主要采用 Python 进行扩展，开发流程较简单，适用于快速预研、算法验证等场景。`C++ Kernel` 效率高，适用于开发已经验证稳健性并追求性能的算子。
-
-This article covers the background and basic concepts of operator development, and shows how to develop a `Python Kernel`.
-
-### 基本概念
-在进行 OneFlow 算子开发前，需要了解 `op_type_name`、`Op` 以及 `Kernel` 这几个概念：
-
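-- op_type_name: the globally unique ID of an op type. OneFlow looks up and identifies the op type by op_type_name and then instantiates the op to build the computation graph. The relationship between an op type and an op is similar to that between a class and an object.
-- op: the logical operator. It carries information such as the input and output shapes used during graph construction and inference, but no concrete data-processing logic.
-- kernel: for a logical op, the processing logic at runtime differs depending on the physical device and the data type. This concrete runtime logic is implemented by a kernel. In short, op and kernel have a one-to-many relationship. The computation can be implemented in Python, in which case the kernel is called a `Python Kernel`, or [in C++](./user_op.md).
-- The core of OneFlow is implemented in C++ while its user interface is Python, so a `Python Wrapper` must be written according to the conventions so that the Python Op interface can interact with the C++ core.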
-
-### 开发步骤
-使用 Python 扩展 Op，应该准备一个以 `op_type_name` 命名的目录，在该目录下，按照约定放置必需的文件，以 [oneflow/python/test/custom_ops/user_sigmoid](https://github.com/Oneflow-Inc/oneflow/tree/master/oneflow/python/test/custom_ops/user_sigmoid) 为例：
-
-```text
-user_sigmoid
-├── user_sigmoid_cpp_def.cpp
-├── user_sigmoid_py_api.py
-└── user_sigmoid_py_kernel.py
-```
-
-其中：
-
- `op_type_name_cpp_def.cpp`(以上的 `user_sigmoid_cpp_def.cpp`) 文件中放置 Op 定义信息
- `op_type_name_py_api.py`(以上的 `user_sigmoid_py_api.py`)文件中放置 `Python Wrapper`，通过 `oneflow.user_op_builder` 将实现的 `Python Kernel` 导出给用户使用
- `op_type_name_py_kernel.py`(以上的 `user_sigmoid_py_kernel.py`)文件中放置 Python 实现的自定义算子的前向计算逻辑和后向计算逻辑
-
-下文中，我们将介绍如何用 Python 实现一个自定义的 user_relu Op，它包括：
-
- 如何编写 `op_type_name_cpp_def.cpp` 文件，定义 Op 信息
- 如何编写 `op_type_name_py_api.py` 文件，封装 Op 的 Python 接口
- 如何编写 `op_type_name_py_kernel.py` 文件，使用 Python 实现 Op 的计算 Kernel
- 在 OneFlow 中如何使用 `Python Kernel` 类型的自定义 Op
-
-
-
-## Op 的实现与注册
-首先，我们在 `user_relu_cpp_def.cpp` 中定义 op 并完成注册：
-```cpp
-#include "oneflow/core/framework/framework.h"
-
-namespace oneflow {
-namespace {
-
-REGISTER_USER_OP("user_relu_forward")
-  .Attr<std::string>("device_sub_tag", "py")
-  .Input("in")
-  .Output("out")
-  .SetTensorDescInferFn(
-      [](user_op::InferContext *ctx) -> Maybe<void> {
-        *ctx->Shape4ArgNameAndIndex("out", 0) =
-            *ctx->Shape4ArgNameAndIndex("in", 0);
-        *ctx->Dtype4ArgNameAndIndex("out", 0) =
-            *ctx->Dtype4ArgNameAndIndex("in", 0);
-        return Maybe<void>::Ok();
-      });
-}  // namespace
-}  // namespace oneflow
-```
-
-分析以上代码：
-
- `oneflow/core/framework/framework.h` 中包含了我们创建一个 op 所需要的所有接口
- `.Attr<std::string>("device_sub_tag", "py")` 是必需的，它告知 OneFlow 在使用该 Op 时默认调用Python Kernel
- 与自定义 op 有关的接口集中在 `oneflow::user_op` 中，使用名称空间 `oneflow` 可以简化类型名称
- 宏 `REGISTER_USER_OP` 用于注册 op，其接受的参数 `user_relu_forward` 是 `op_type_name`。
- 使用 `REGISTER_USER_OP` 注册后，其实会返回一个 `OpRegistry` 类（位于[user_op_registry.h](https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/framework/user_op_registry.h))，通过调用该类方法，完成对自定义 op 的设置：
-    1. `Input("in")` 表示其有一个名为 "in" 的输入
-    2. `Output("out")` 表示其有一个名为 "out" 的输出
-    3. `SetTensorDescInferFn` 用于设置形状及数据类型推导函数，描述该算子的输出的形状及类型与输入的关系。以上代码中，输出的形状、数据类型与输入的一致
-
-`op_type_name_cpp_def.cpp` 文件是实现 `Python Kernel` 过程中唯一会使用到的 C++ 文件，它用于设置 Op 的信息，在现阶段，还无法将使用 C++ 配置 Op 的步骤省略（因为设置分布式等高级信息时必需），不过可以看到，该文件并不涉及具体的运算，仅仅是用于描述 Op，即使不熟悉 C++，根据我们的示例，也可以很轻松地掌握。
-
-## 封装 Op 的 Python 接口
-为了用户可以在 Python 层使用刚刚设置并注册的 `user_relu` Op，我们需要创建一个 `user_relu_py_api.py` 文件，其内容如下：
-
-```python
-import oneflow as flow
-
-def user_relu_forward(x):
-    op = (
-        flow.user_op_builder("myrelu")
-        .Op("user_relu_forward")
-        .Input("in", [x])
-        .Output("out")
-        .Build()
-    )
-    return op.InferAndTryRun().SoleOutputBlob()
-```
-
-`flow.user_op_builder("myrelu")` returns a `UserOpConfBuilder` object named `myrelu`.
-
-该对象包含 `Op`、`Input` 等方法，用于封装自定义 op，具体解释如下：
-
- `Op("user_relu_forward")`：参数必须为之前在 C++ 注册时的 `op_type_name`，OneFlow 通过它找到已经注册的 op 类型，并实例化 op 对象。
- `Input("in", [input_blob])`：对应了 C++ 中 op 注册时的 `Input`，第一个参数字符串必须与 C++ 注册 op 时的 `Input` 设置的字符串一致。第二个参数为输入的张量，是一个 `list`，因为一个 op 允许有多个输入。
- `Output("out")`：对应了 C++ 中 op 注册时的 `Output`。
- `Build`：以上设置完成后，调用 `Build` 可以得到自定义 op 的 Python wrapper
-
-以下代码，将获取自定义 op 的输出：
-```python
-return op.InferAndTryRun().SoleOutputBlob()
-```
-
-其中的 `InferAndTryRun` 完成推导，返回 `UserOp`，如果返回结果只有一个输出，则使用 `SoleOutputBlob` 即可获取该唯一输出，否则，可以使用 `RemoteBlobList` 获取包含多个输出的列表。
-
-## 使用 Python 实现 Kernel
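-As described at the beginning of this article, an Op is only a logical concept; the actual computation is done by a Kernel. In OneFlow a Kernel can be implemented in either C++ or Python. This article covers only the Python Kernel, which is the easiest to get started with. For the C++ approach, see [Developing a Kernel in C++](./user_op.md).
-
-To provide a Python Kernel for the `user_relu` Op set up above, create a `user_relu_py_kernel.py` file with the following content: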
-
-```python
-import numpy as np
-
-def forward(args):
-    (x,) = args
-    y = (x>0)*x
-    return y
-```
-
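-The `forward` method above must be implemented; it is the Python Kernel of our Op. Its conventions are:
-
-- The method must be named `forward`
-- It takes a single argument of type `tuple`; the number and order of its elements correspond to the `Input`s registered for the Op. Since we registered `Input("in")` for `user_relu`, `x` in `(x, ) = args` above receives the value of `in`
-- The return value corresponds to the `Output` registered for the Op
-- Arguments and return values are all `numpy` objects; they cannot (and will not) be strings, integers, or other types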
-
-## 使用自定义 Op
-完成以上工作后，我们得到了一个名为 `user_relu` 的目录，包含三个文件，它们的结构如下：
-
-```text
-user_relu/
-├── user_relu_cpp_def.cpp
-├── user_relu_py_api.py
-└── user_relu_py_kernel.py
-```
-
-我们可以在 `user_relu` 文件夹所在的路径，创建一个测试文件，调用刚刚实现的自定义 Op，内容如下：
-
-```python
-import oneflow as flow
-import numpy as np
-import os
-import oneflow.typing as tp
-
-# 根据指定的路径与 op_type_name 创建 module 对象
-module_path = os.path.dirname(os.path.abspath(__file__))
-user_relu_op = flow.experimental.custom_op_module("user_relu", module_path)
-
-# 使 Op, Python API, Python Kernel 生效
-user_relu_op.py_api().cpp_def().py_kernel().build_load()
-
-@flow.global_function()
-def MyJob(x: tp.Numpy.Placeholder((5,), dtype=flow.float32)) -> tp.Numpy:
-    with flow.scope.placement("cpu", "0:0"):
-        return user_relu_op.api.user_relu_forward(x)
-
-if __name__ == "__main__":
-    input = np.array([-2, -1, 0, 1, 2], dtype=np.float32)
-    output = MyJob(input)
-    print(input)
-    print(output)
-```
-
-以上代码中，先通过 `flow.experimental.custom_op_module` 创建 module 对象，它接收两个参数，第一个参数为 `op_type_name`， 第二个参数为 `user_relu` 文件夹所在的路径。返回的 `module` 对象，代表了我们自定义的 Op。
-
-Next, `user_relu_op.py_api().cpp_def().py_kernel().build_load()` makes the custom Op take effect. The Python interface of the Op is the function defined in `user_relu_py_api.py` (`user_relu_forward`), which is placed in the `api` namespace of the `module` object. We therefore call it as follows:
-
-```python
-user_relu_op.api.user_relu_forward(x)
-```
-
-且因为 Python Kernel 只能运行在 CPU 设备上，因此需要指定计算设备为 CPU：
-```python
-with flow.scope.placement("cpu", "0:0"):
-```
-
-## 为自定义 Op 提供反向计算
-我们通过上述工作，已经完成了 `user_relu` 算子的正向计算过程，可以用于 `type="predict"` 的作业函数。但是，如果想支持 `type="train"` 类型的训练作业函数，我们就还需要为自定义 Op 提供反向计算。
-
-为自定义 Op 提供反向计算的代码，需要写在 `op_type_name_cpp_def.cpp` 文件中，通过宏 `REGISTER_USER_OP_GRAD` 进行注册。
-
-从数学角度上看，注册过程就是我们为自定义的 op，指定后向求梯度的计算方法。从编程角度看，就是为自定义 op 设置一个后向生成函数，在该函数中，编写代码，指定这个 op 的输入梯度的计算方法。
-
-以下，我们将专门实现一个 Op，名为 `user_relu_backward`。我们将在为 `user_relu` 注册后向梯度时，用到这个“专门定制”的 Op。
-
-### 实现 `user_relu_backward` Op
-实现 `user_relu_backward` Op 的过程与实现 `user_relu` 的前向几乎是一样的。首先，在 `user_relu_cpp_def.cpp` 中设置并注册该 Op：
-
-```cpp
-REGISTER_USER_OP("user_relu_backward")
-    .Input("y")
-    .Input("dy")
-    .Output("dx")
-    .Attr<std::string>("device_sub_tag", "py")
-    .SetTensorDescInferFn([](user_op::InferContext* ctx) -> Maybe<void> {
-      const Shape* dy_shape = ctx->Shape4ArgNameAndIndex("dy", 0);
-      Shape* dx_shape = ctx->Shape4ArgNameAndIndex("dx", 0);
-      *dx_shape = *dy_shape;
-      return Maybe<void>::Ok();
-    });
-```
-
-值得注意的是，同前向类似，以上代码中 `.Attr<std::string>("device_sub_tag", "py")` 必不可少，它告知 OneFlow 在使用该 Op 时，默认调用 Python Kernel。
-
-同理，因为不需要用户直接调用这个 `user_relu_backward` Op，因此我们不需要在 `user_relu_py_api.py` 为 `user_relu_backward` 封装 Python 接口。可以直接实现它的 Python Kernel。
-
-在 `user_relu_py_kernel.py` 中，实现 `backward` 方法：
-
-```python
-def backward(args):
-    (y, dy) = args
-    dx = (y>0)*dy
-    return dx
-```
-它的参数是一个 `tuple`，数目和顺序对应了 Op 注册时的 `Input`，输出对应了 Op 注册时的 Output。
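As a sanity check on the `backward` rule above, the analytic gradient can be compared with a finite-difference estimate. The sketch below uses plain Python lists instead of `numpy` arrays so that it has no dependencies; `numeric_grad` and the sample values are hypothetical and are not part of the OneFlow example.

```python
def forward(x):
    """ReLU on a list of floats, mirroring y = (x > 0) * x."""
    return [v if v > 0 else 0.0 for v in x]


def backward(y, dy):
    """Gradient w.r.t. the input, mirroring dx = (y > 0) * dy."""
    return [g if out > 0 else 0.0 for out, g in zip(y, dy)]


def numeric_grad(x, dy, eps=1e-6):
    """Central finite differences of sum(dy * forward(x)) w.r.t. each x[i]."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        diff = sum(g * (a - b) for g, a, b in zip(dy, forward(hi), forward(lo)))
        grads.append(diff / (2 * eps))
    return grads


# Sample points away from 0, where ReLU is not differentiable.
x = [-2.0, -1.0, 0.5, 1.0, 2.0]
dy = [0.1, 0.2, 0.3, 0.4, 0.5]

analytic = backward(forward(x), dy)
numeric = numeric_grad(x, dy)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic)
print(max_error < 1e-6)
```

The rule takes the forward output `y` rather than the input `x`; for ReLU this is equivalent, because `y > 0` exactly when `x > 0`.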
-
-### 为 Op 注册反向梯度
-我们需要在 `user_relu_cpp_def.cpp` 中，通过宏 `REGISTER_USER_OP_GRAD` 为我们的正向 Op (`user_relu_forward`) 注册反向。
-
-其代码如下：
-```c++
-REGISTER_USER_OP_GRAD("user_relu_forward")
-    .SetBackwardOpConfGenFn([](user_op::BackwardOpConfContext* ctx) {
-      const auto grad_op_name = ctx->FwOp().op_name() + "_grad";
-      const auto& grad_op_func = [&ctx](user_op::BackwardOpBuilder& builder) {
-        return builder.OpTypeName("user_relu_backward")
-            // "out" and "in" are the names registered by the forward op above
-            .InputBind("y", ctx->FwOp().output("out", 0))
-            .InputBind("dy", ctx->FwOp().output_grad("out", 0))
-            .Output("dx")
-            .Build();
-      };
-      ctx->DefineOp(grad_op_name, grad_op_func);
-
-      const auto& dx_get_func = [&ctx, &grad_op_name]() -> const std::string& {
-        return ctx->GetOp(grad_op_name).output("dx", 0);
-      };
-      ctx->FwOp().InputGradBind(user_op::OpArg("in", 0), dx_get_func);
-    });
-```
-
-To explain the code above: `REGISTER_USER_OP_GRAD("user_relu_forward")` registers the backward gradient rule for the forward Op. The macro takes one argument, the `op_type_name` of the **forward** Op.
-
-然后通过 `SetBackwardOpConfGenFn` 设置后向求梯度规则，同 Op 类似，在 `op_type_name_cpp_def.cpp` 中注册后向，其实不涉及真正的运算，而是设置后向计算与前向的对应关系，告诉 OneFlow 框架：
-
- 用什么 Op 求后向梯度
- 该 Op 的输入来自哪里，和前向 Op 什么关系
-
-So, in the code above:
-
-```c++
-      const auto& grad_op_func = [&ctx](user_op::BackwardOpBuilder& builder) {
-        return builder.OpTypeName("user_relu_backward")
-            .InputBind("y", ctx->FwOp().output("out", 0))
-            .InputBind("dy", ctx->FwOp().output_grad("out", 0))
-            .Output("dx")
-            .Build();
-      };
-```
-
-defines how the gradient of the Op is computed: it uses the `user_relu_backward` operator, binds the forward output `out` to the `y` input of `user_relu_backward`, binds the gradient of the forward output `out` to its `dy` input, and finally outputs `dx`.
-
-After defining how the gradient is computed, call
-```cpp
-ctx->DefineOp(grad_op_name, grad_op_func);
-```
-使之生效。
-
-The code that follows:
-```cpp
-      const auto& dx_get_func = [&ctx, &grad_op_name]() -> const std::string& {
-        return ctx->GetOp(grad_op_name).output("dx", 0);
-      };
-      ctx->FwOp().InputGradBind(user_op::OpArg("in", 0), dx_get_func);
-```
-
-binds the forward input `in` to the output (`dx`) of the gradient computation just defined, so that OneFlow can differentiate automatically during training.
-
-## 其它
-
- 本文涉及的代码可以在 [这里](https://github.com/Oneflow-Inc/oneflow-documentation/tree/master/cn/docs/code/extended_topics/python_op) 查看
- Op 注册的更多高级设置可以参考 [这里](user_op.md#opregistry)
- 注册反向梯度时，也可以使用已有的 Op，而无需专门定制反向 Op，可以参考 [这里](./user_op.md#opgradregistry)
--- a/cn/docs/extended_topics/user_op.md
+++ b/cn/docs/extended_topics/user_op.md
--- a/cn/docs/index.md
+++ b/cn/docs/index.md
@@ -37,6 +37,5 @@ OneFlow 是开源的、采用全新架构设计，世界领先的工业级通用

 在[扩展专题](extended_topics/job_function_define_call.md)中我们介绍了具有 OneFlow 自身特点的话题，如 OneFlow 数据集格式、OneFlow 的并行观、开发者如何使用 VS Code 调试 OneFlow 框架等。

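-The articles in [Advanced Examples](adv_examples/resnet.md) correspond to the models in the [OneFlow Model Zoo repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark) and help readers understand the model scripts and related details.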

 最后，期待广大开发者、机器学习爱好者参与 [OneFlow 开源计划](contribute/intro.md)，共创、共享，一起打造迈向完美的深度学习框架。
--- a/cn/mkdocs.yml
+++ b/cn/mkdocs.yml
@@ -124,17 +124,8 @@ nav:
      - Loading and Preparing OFRecord Datasets: extended_topics/how_to_make_ofdataset.md
      - Converting Image Files to an OFRecord Dataset: extended_topics/how_to_convert_image_to_ofrecord.md
      - Getting Runtime Data: extended_topics/watch_watch_diff.md
-      - Debugging OneFlow with VS Code: extended_topics/debug_by_vscode.md
-      - Extending Ops with Python: extended_topics/python_kernel_op.md
-      - Extending Ops with C++: extended_topics/user_op.md
-      - Custom DataLoader: extended_topics/implement_data_loader.md
      - OneFlow and ONNX Interaction: extended_topics/oneflow_convert_tools.md

-    - Advanced Examples:
-      - ResNet: adv_examples/resnet.md
-      - YoloV3: adv_examples/yolov3.md
-      - BERT: adv_examples/bert.md
-      - Wide & Deep: adv_examples/wide_deep.md
    - API:
      - API: https://oneflow.readthedocs.io/en/master/


--- a/en/docs/adv_examples/bert.md
+++ b/en/docs/adv_examples/bert.md
-
-## Summary
-BERT (Bidirectional Encoder Representations from Transformers) is a pre-training technique for NLP. In this example, we implement BERT in OneFlow, based on the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
-
-### Model
-| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |
-|:---------:|:-----------------:|:--------------------:|:-------------------:|:---------------------------:|:-----------------------:|:--------------:|
-| BERTBASE  |    12 encoder     |         768          |         12          |          4 x  768           |           512           |      110M      |
-
-BERT is typically applied in two steps:
-
-* First, a BERT language model is obtained by pre-training;
-
-* Then, an additional network layer is added on top of the pre-trained model and fine-tuned to obtain the downstream application.
-
-
-## Quickstart
-### Get dataset
-We provide the [OFRecord dataset and related files](https://oneflow-static.oss-cn-beijing.aliyuncs.com/oneflow-tutorial-attachments/bert_squad_dataset.zip) needed for BERT pre-training and SQuAD fine-tuning. Download and unzip them with the commands below:
-
-```bash
-wget https://oneflow-static.oss-cn-beijing.aliyuncs.com/oneflow-tutorial-attachments/bert_squad_dataset.zip
-unzip bert_squad_dataset.zip
-```
-The extracted files are as follows:
-
-* bert_config.json, vocab.txt: files needed to generate the prediction JSON file, taken from [google bert](https://github.com/google-research/bert)
-
-* dev-v1.1/, dev-v1.1.json: SQuAD dev set, used for evaluation
-
-* part-0: pre-training dataset (40 samples)
-
-* train-v1.1: SQuAD training set, already converted to OFRecord format
-
-These files are used in the pre-training and SQuAD fine-tuning tasks below.
-
-### BERT pre-training
-First, clone the `OneFlow-Benchmark` repository:
-
-```bash
-git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
-cd OneFlow-Benchmark/LanguageModeling/BERT/
-```
-
-Then run the following command to start BERT pre-training on the small sample dataset and check the result:
-```bash
-python ./run_pretraining.py\
-    --gpu_num_per_node=1 \
-    --learning_rate=3e-5 \
-    --batch_size_per_device=1 \
-    --iter_num=20 \
-    --seq_length=128 \
-    --max_predictions_per_seq=20 \
-    --num_hidden_layers=12 \
-    --num_attention_heads=12 \
-    --max_position_embeddings=512 \
-    --type_vocab_size=2 \
-    --vocab_size=30522 \
-    --attention_probs_dropout_prob=0.0 \
-    --hidden_dropout_prob=0.0 \
-    --hidden_size_per_head=64 \
-    --use_boxing_v2=True \
-    --data_dir=./dataset/ \
-    --data_part_num=1 \
-    --log_dir=./bert_regresssioin_test/of \
-    --loss_print_every_n_iter=5 \
-    --model_save_dir=./bert_regresssioin_test/of \
-    --warmup_batches 831 \
-    --save_last_snapshot True
-```
-
-The output should look similar to the following:
-```text
-==================================================================
-Running bert: num_gpu_per_node = 1, num_nodes = 1.
-==================================================================
-gpu_num_per_node = 1
-node_num = 1
-node_list = None
-learning_rate = 3e-05
-weight_decay_rate = 0.01
-batch_size_per_device = 1
-iter_num = 20
-warmup_batches = 831
-log_every_n_iter = 1
-data_dir = ./dataset/
-data_part_num = 1
-use_fp16 = None
-use_boxing_v2 = True
-loss_print_every_n_iter = 5
-model_save_every_n_iter = 10000
-model_save_dir = ./bert_regresssioin_test/of
-save_last_snapshot = True
-model_load_dir = None
-log_dir = ./bert_regresssioin_test/of
-seq_length = 128
-max_predictions_per_seq = 20
-num_hidden_layers = 12
-num_attention_heads = 12
-max_position_embeddings = 512
-type_vocab_size = 2
-vocab_size = 30522
-attention_probs_dropout_prob = 0.0
-hidden_dropout_prob = 0.0
-hidden_size_per_head = 64
------------------------------------------------------------------
-Time stamp: 2020-07-06-19:09:29
-I0706 19:09:29.605840639   34801 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used
-Init model on demand
-iter 4, total_loss: 11.032, mlm_loss: 10.281, nsp_loss: 0.751, speed: 33.086(sec/batch), 0.151(sentences/sec)
-iter 9, total_loss: 11.548, mlm_loss: 10.584, nsp_loss: 0.965, speed: 0.861(sec/batch), 5.806(sentences/sec)
-iter 14, total_loss: 10.697, mlm_loss: 10.249, nsp_loss: 0.448, speed: 0.915(sec/batch), 5.463(sentences/sec)
-iter 19, total_loss: 10.685, mlm_loss: 10.266, nsp_loss: 0.419, speed: 1.087(sec/batch), 4.602(sentences/sec)
-Saving model to ./bert_regresssioin_test/of/last_snapshot.
-------------------------------------------------------------------
-average speed: 0.556(sentences/sec)
------------------------------------------------------------------
-```
-
-## Detailed description
-### Scripts
-| **Files** | **Description** | **Belongs to**|
-|:---------:|:----------:|:----------:|
-|pretrain.py, bert.py|Define the BERT model|BERT|
-|run_pretraining.py|Starts BERT pre-training. The training environment and parameters are configured through command line options, which are described in the **Options** section below.|BERT|
-|squad.py|Defines the SQuAD network|SQuAD|
-|run_squad.py|Runs SQuAD training|SQuAD|
-|run_squad_predict.py|Runs prediction with the trained SQuAD model|SQuAD|
-|npy2json.py|Converts OneFlow's prediction results to JSON|SQuAD|
-|convert_tf_ckpt_to_of.py|Converts a model from TensorFlow to OneFlow|BERT/SQuAD|
-
-
-
-### Options
-The script `run_pretraining.py` runs pre-training and is configured through command line options. Run `run_pretraining.py --help` to list them. Each option is described below:
-
-* gpu_num_per_node: number of devices on each node; must be the same on every machine
-
-* node_num: number of nodes, that is, the number of hosts in the distributed system
-
-* node_list: list of nodes. When there is more than one node, specify the nodes with node_list, a comma-separated string, for example `--node_num=2 --node_list="192.168.1.12,192.168.1.14"`
-
-* learning_rate: learning rate
-
-* weight_decay_rate: decay rate of weight
-
-* batch_size_per_device: batch size on each device
-
-* iter_num: number of training iterations
-
-* warmup_batches: number of warmup batches, defaults to 10000
-
-* data_dir: path to OFRecord dataset
-
-* data_part_num: number of files in the folder of OFRecord dataset
-
-* use_fp16: use float16 or not
-
-* use_boxing_v2: use boxing v2 or not
-
-* loss_print_every_n_iter: print loss every n iterations
-
-* model_save_every_n_iter: save the model every n iterations
-
-* model_save_dir: path to save the model
-
-* save_last_snapshot: whether to save the model when training finishes
-
-* model_load_dir: path to load the model from
-
-* log_dir: path for log output
-
-* seq_length: sequence length, defaults to 512
-
-* max_predictions_per_seq: defaults to 80
-
-* num_hidden_layers: number of hidden layers, defaults to 24
-
-* num_attention_heads: number of attention heads, defaults to 16
-
-### Use Wikipedia + BookCorpus dataset
-Pre-training BERT from scratch requires a large dataset.
-
-If needed, download the TFRecord dataset from [google-research BERT](https://github.com/google-research/bert) and convert it to an OFRecord dataset using the methods in the article [Loading and preparing OFRecord dataset](../extended_topics/how_to_make_ofdataset.md).
-
-### Converting a TensorFlow BERT model to OneFlow
-If you want to use an existing pre-trained model directly for fine-tuning tasks (such as the SQuAD task shown below), you can download one from [google-research BERT](https://github.com/google-research/bert) and convert it to a OneFlow model with the script `convert_tf_ckpt_to_of.py` that we provide.
-
-The conversion process is as follows:
-
-First, download and unzip a specific version of the pre-trained BERT model, for example `uncased_L-12_H-768_A-12`:
-```
-wget https://storage.googleapis.com/bert_models/2020_02_20/uncased_L-12_H-768_A-12.zip
-unzip uncased_L-12_H-768_A-12.zip -d uncased_L-12_H-768_A-12
-```
-
-Then run the commands below:
-```
-cd uncased_L-12_H-768_A-12/
-cat > checkpoint <<ONEFLOW
-model_checkpoint_path: "bert_model.ckpt"
-all_model_checkpoint_paths: "bert_model.ckpt"
-ONEFLOW
-```
-
-This creates a file named `checkpoint` in the directory with the following content:
-```
-model_checkpoint_path: "bert_model.ckpt"
-all_model_checkpoint_paths: "bert_model.ckpt"
-```
-
-The TensorFlow model directory is now ready for conversion. Its layout is:
-```
-uncased_L-12_H-768_A-12
-├── bert_config.json
-├── bert_model.ckpt.data-00000-of-00001
-├── bert_model.ckpt.index
-├── checkpoint
-└── vocab.txt
-```
-
-Then use `convert_tf_ckpt_to_of.py` to convert the model to OneFlow format:
-```bash
-python convert_tf_ckpt_to_of.py \
-  --tf_checkpoint_path ./uncased_L-12_H-768_A-12 \
-  --of_dump_path ./uncased_L-12_H-768_A-12-oneflow
-```
-The command above saves the converted OneFlow model in the `./uncased_L-12_H-768_A-12-oneflow` directory for later use (for example, in SQuAD fine-tuning).
-
-## Finetune task: SQuAD
-### Extend to SQuAD model
-To extend BERT to SQuAD, we only need to add an `output` layer on top of the BERT backbone and change the loss expression. The full code is in `squad.py`; the key modifications are:
-```python
-def SQuADTrain():
-    # ...
-    backbone = bert_util.BertBackbone()
-
-    # add a fully connected layer on top of BERT
-    with flow.name_scope("cls-squad"):
-        final_hidden = backbone.sequence_output()
-        final_hidden_matrix = flow.reshape(final_hidden, [-1, hidden_size])
-        logits = bert_util._FullyConnected(
-                    final_hidden_matrix,
-                    hidden_size,
-                    units=2,
-                    weight_initializer=bert_util.CreateInitializer(initializer_range),
-                    name='output')
-        logits = flow.reshape(logits, [-1, seq_length, 2])
-
-        start_logits = flow.slice(logits, [None, None, 0], [None, None, 1])
-        end_logits = flow.slice(logits, [None, None, 1], [None, None, 1])
-
-        # redefine the loss for SQuAD
-        start_loss = _ComputeLoss(start_logits, start_positions_blob, seq_length)
-        end_loss = _ComputeLoss(end_logits, end_positions_blob, seq_length)
-
-        total_loss = 0.5*(start_loss + end_loss)
-
-    return total_loss
-```
-
-Run the script below to start SQuAD training, which produces and saves an initialized model.
-
-```
-python ./run_squad.py\
-    --gpu_num_per_node=1\
-    --learning_rate=3e-5\
-    --batch_size_per_device=2\
-    --iter_num=50\
-    --loss_print_every_n_iter=50\
-    --seq_length=384\
-    --max_predictions_per_seq=20\
-    --num_hidden_layers=12\
-    --num_attention_heads=12\
-    --max_position_embeddings=512\
-    --type_vocab_size=2\
-    --vocab_size=30522\
-    --attention_probs_dropout_prob=0.0\
-    --hidden_dropout_prob=0.0\
-    --hidden_size_per_head=64\
-    --use_boxing_v2=True\
-    --data_dir=./dataset/train-v1.1\
-    --data_part_num=1\
-    --log_dir=./bert_regresssioin_test/of\
-    --model_save_dir=./bert_regresssioin_test/of\
-    --warmup_batches 831\
-    --save_last_snapshot True
-```
-An initialized model is saved at `./bert_regresssioin_test/of/last_snapshot`. We will merge it with the pre-trained BERT model and fine-tune it.
-
-### Merge pretrained model into SQuAD
-The SQuAD model extends the pre-trained BERT model. Merge the pre-trained model into the SQuAD model using the method described in the article [Loading and saving of model](../basics_topics/model_load_save.md).
-
-```
-cp -R ./bert_regresssioin_test/of/last_snapshot ./squadModel
-cp -R --remove-destination ./dataset/uncased_L-12_H-768_A-12_oneflow/* ./squadModel/
-```
-
-### Resetting the iteration count
-The pre-trained model folder contains a folder named `System-Train-TrainStep-xxx`, and the file named `out` inside it stores the number of iterations already trained. The learning rate changes dynamically with the iteration count.
-
-To keep the saved iteration count from affecting fine-tuning, the binary data in the `out` file should be reset to zero.
-```
-cd System-Train-TrainStep-xxx
-xxd -r > out <<ONEFLOW
-00000000: 0000 0000 0000 0000
-ONEFLOW
-```
-
-If you are using a pre-trained model converted from TensorFlow, you can skip this step.
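-
-The same reset can be sketched in Python. This is only an illustration: it assumes, as the `xxd` input above suggests, that `out` holds a single 8-byte counter, and the helper name `reset_train_step` is hypothetical rather than part of OneFlow-Benchmark.
-
-```python
-import os
-import tempfile
-
-
-def reset_train_step(out_path, num_bytes=8):
-    """Overwrite the file at out_path with num_bytes zero bytes (assumed 8-byte counter)."""
-    with open(out_path, "wb") as f:
-        f.write(bytes(num_bytes))
-    # Return the resulting file size so the caller can sanity-check it.
-    return os.path.getsize(out_path)
-
-
-if __name__ == "__main__":
-    # Demonstrate on a throwaway file instead of a real snapshot.
-    with tempfile.TemporaryDirectory() as tmp:
-        demo_path = os.path.join(tmp, "out")
-        with open(demo_path, "wb") as f:
-            f.write((8310).to_bytes(8, "little"))
-        print(reset_train_step(demo_path))
-```
-
-For a real snapshot, the path would be the `out` file inside the `System-Train-TrainStep-xxx` folder.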
-
-### Start SQuAD training
-Start SQuAD training by running the script `run_squad.py` with the configuration below:
-
-* use SQuAD model `./squadModel`
-
-* use SQuAD v1.1 as training set
-
-* epochs = 3 (`iter_num = 88641*3/(4*8) ≈ 8310`)
-
-* learning rate = 3e-5
-
-```
-python ./run_squad.py\
-    --gpu_num_per_node=4\
-    --learning_rate=3e-5\
-    --batch_size_per_device=8\
-    --iter_num=8310\
-    --loss_print_every_n_iter=50\
-    --seq_length=384\
-    --max_predictions_per_seq=20\
-    --num_hidden_layers=12\
-    --num_attention_heads=12\
-    --max_position_embeddings=512\
-    --type_vocab_size=2\
-    --vocab_size=30522\
-    --attention_probs_dropout_prob=0.0\
-    --hidden_dropout_prob=0.0\
-    --hidden_size_per_head=64\
-    --use_boxing_v2=True\
-    --data_dir=./dataset/train-v1.1\
-    --data_part_num=8\
-    --log_dir=./bert_regresssioin_test/of\
-    --model_save_dir=./bert_regresssioin_test/of\
-    --warmup_batches 831\
-    --save_last_snapshot True\
-    --model_load_dir=./squadModel
-```
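-
-The `iter_num` value follows from the formula in the configuration list: samples times epochs, divided by the global batch size (devices times batch size per device). A minimal sketch of that arithmetic, where the function name `iterations_for` is hypothetical:
-
-```python
-def iterations_for(num_samples, epochs, gpus, batch_per_gpu):
-    """Number of whole iterations needed to cover `epochs` passes over the data."""
-    global_batch = gpus * batch_per_gpu
-    # Integer division drops the final partial batch.
-    return num_samples * epochs // global_batch
-
-
-if __name__ == "__main__":
-    # 88641 training samples, 3 epochs, 4 GPUs, batch size 8 per device
-    print(iterations_for(88641, 3, 4, 8))  # 8310
-```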
-
-### Prediction and evaluation
-To generate the [prediction file](https://rajpurkar.github.io/SQuAD-explorer/), we first generate an npy file and then use the `write_predictions` function in [google BERT's run_squad.py](https://github.com/google-research/bert/blob/master/run_squad.py) to convert it to JSON format.
-
-Run the script `run_squad_predict.py` to generate `all_results.npy`:
-```bash
-python run_squad_predict.py \
-  --gpu_num_per_node=1 \
-  --batch_size_per_device=4 \
-  --iter_num=2709 \
-  --seq_length=384 \
-  --max_predictions_per_seq=20 \
-  --num_hidden_layers=12 \
-  --num_attention_heads=12 \
-  --max_position_embeddings=512 \
-  --type_vocab_size=2 \
-  --vocab_size=30522 \
-  --attention_probs_dropout_prob=0.0 \
-  --hidden_dropout_prob=0.0 \
-  --hidden_size_per_head=64 \
-  --use_boxing_v2=True \
-  --data_part_num=1 \
-  --data_dir=./dataset/dev-v1.1 \
-  --log_dir=./bert_regresssioin_test/of \
-  --model_load_dir=path/to/squadModel \
-  --warmup_batches 831
-```
-Note: `model_load_dir` should point to the trained SQuAD model.
-
-After obtaining `all_results.npy`, run the script `npy2json.py` inside the [google bert](https://github.com/google-research/bert/) repository (TensorFlow v1 is required). The `npy2json.py` we provide is adapted from google bert's `run_squad.py`:
-```
-python npy2json.py\
-  --vocab_file=./dataset/vocab.txt \
-  --bert_config_file=./dataset/bert_config.json \
-  --do_train=False \
-  --do_predict=True \
-  --all_results_file=./all_results.npy \
-  --predict_file=./dataset/dev-v1.1.json \
-  --max_seq_length=384 \
-  --doc_stride=128 \
-  --output_dir=./squad_base/
-```
-
-Remember to set `all_results_file` to the path of the `all_results.npy` obtained in the previous step.
-
-This produces `predictions.json`, which can be evaluated with [evaluate-v1.1.py](https://rajpurkar.github.io/SQuAD-explorer/).
-
-```bash
-python evaluate-v1.1.py \
-./dataset/dev-v1.1.json \
-path/to/squad_base/predictions.json
-```
-
-## Distributed training
-As described in the command line options above, distributed runs are enabled simply by adding the options `node_num` and `node_list`:
-
-```bash
-python run_squad_predict.py \
-  --gpu_num_per_node=1 \
-  --batch_size_per_device=4 \
-  --iter_num=2709 \
-  --seq_length=384 \
-  --max_predictions_per_seq=20 \
-  --num_hidden_layers=12 \
-  --num_attention_heads=12 \
-  --max_position_embeddings=512 \
-  --type_vocab_size=2 \
-  --vocab_size=30522 \
-  --attention_probs_dropout_prob=0.0 \
-  --hidden_dropout_prob=0.0 \
-  --hidden_size_per_head=64 \
-  --use_boxing_v2=True \
-  --data_part_num=1 \
-  --data_dir=./dataset/dev-v1.1 \
-  --log_dir=./bert_regresssioin_test/of \
-  --model_load_dir=path/to/squadModel \
-  --warmup_batches 831 \
-  --node_num=2 \
-  --node_list="192.168.1.12,192.168.1.14"
-```
--- a/en/docs/adv_examples/dcgan.md
+++ b/en/docs/adv_examples/dcgan.md
-# DCGAN tutorial
-
-
-
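-## Introduction
-
-Generative adversarial networks (GANs) are a class of generative models that learn a specific data distribution by having two networks compete against each other. DCGAN is a GAN built on convolution and transposed-convolution (deconvolution) operations and is widely used for image generation.
-
-This example mainly demonstrates how to run a DCGAN in OneFlow; it does not focus on the principles and details of GANs. If you are interested, see:
-
-- [Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)
-
-- [NIPS 2016 Tutorial: Generative Adversarial Networks](https://arxiv.org/abs/1511.06434)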
-
-
-
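-## Alignment test
-
-The core code of this example is in `dcgan.py`; the model structure and parameters follow TensorFlow's [official example](https://www.tensorflow.org/tutorials/generative/dcgan).
-
-The following code runs a simple alignment test to check that the OneFlow model's results are consistent with TensorFlow's: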
-
-```python
-dcgan = DCGAN()
-dcgan.compare_with_tensorflow()
-```
-
-
-
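-## Preparing the dataset
-
-The example provides a dataset download script. Run `download.py` to download the MNIST dataset, which is saved in the `./data/minst` directory by default: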
-
-```bash
-python download.py mnist
-```
-
-
-
-## Training
-
-Once the dataset is ready, train the DCGAN with the `train` method of the DCGAN instance:
-
-```python
-dcgan.train(epochs=2)
-```
-
-During training, generated images are output every `self.eval_interval` batches:
-
-![1](imgs/1.png)
-
-## Exporting an animated GIF
-
-After training completes, export the images as an animated GIF with the `save_to_gif` method of the DCGAN instance:
-
-```python
-dcgan.save_to_gif()
-```
\ No newline at end of file
--- a/en/docs/adv_examples/imgs/1.png
+++ b/en/docs/adv_examples/imgs/1.png
--- a/en/docs/adv_examples/imgs/big_vocab_table_2x1024.png
+++ b/en/docs/adv_examples/imgs/big_vocab_table_2x1024.png
--- a/en/docs/adv_examples/imgs/big_vocab_table_7x1024.png
+++ b/en/docs/adv_examples/imgs/big_vocab_table_7x1024.png
--- a/en/docs/adv_examples/imgs/detected_000004.jpg
+++ b/en/docs/adv_examples/imgs/detected_000004.jpg
--- a/en/docs/adv_examples/imgs/detected_kite.jpg
+++ b/en/docs/adv_examples/imgs/detected_kite.jpg
--- a/en/docs/adv_examples/imgs/eval_auc_loss_500iters.png
+++ b/en/docs/adv_examples/imgs/eval_auc_loss_500iters.png
--- a/en/docs/adv_examples/imgs/fish.jpg
+++ b/en/docs/adv_examples/imgs/fish.jpg
--- a/en/docs/adv_examples/imgs/fixed_batch_size_latency.png
+++ b/en/docs/adv_examples/imgs/fixed_batch_size_latency.png
--- a/en/docs/adv_examples/imgs/fixed_batch_size_memory.png
+++ b/en/docs/adv_examples/imgs/fixed_batch_size_memory.png
--- a/en/docs/adv_examples/imgs/resnet50_validation_acuracy.png
+++ b/en/docs/adv_examples/imgs/resnet50_validation_acuracy.png
--- a/en/docs/adv_examples/imgs/scaled_batch_size_latency.png
+++ b/en/docs/adv_examples/imgs/scaled_batch_size_latency.png
--- a/en/docs/adv_examples/imgs/scaled_batch_size_latency_1gpu.png
+++ b/en/docs/adv_examples/imgs/scaled_batch_size_latency_1gpu.png
--- a/en/docs/adv_examples/imgs/scaled_batch_size_memory.png
+++ b/en/docs/adv_examples/imgs/scaled_batch_size_memory.png
--- a/en/docs/adv_examples/imgs/tiger.jpg
+++ b/en/docs/adv_examples/imgs/tiger.jpg
--- a/en/docs/adv_examples/imgs/train_eval_auc_loss.png
+++ b/en/docs/adv_examples/imgs/train_eval_auc_loss.png
--- a/en/docs/adv_examples/resnet.md
+++ b/en/docs/adv_examples/resnet.md
-## Introduction
-
-### Image classification and CNN
-
-**Image classification** is an image processing method that assigns images to different categories according to the features reflected in the image information. It is the basis of other computer vision tasks, such as detection, semantic segmentation, face recognition, and other high-level visual tasks.
-
-The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), often called the ImageNet competition, includes image classification, object localization, object detection, and other tasks. It is one of the most important competitions driving the development of computer vision.
-
-In the 2012 ImageNet competition, the deep convolutional network AlexNet was born. It won the competition with a top-5 accuracy more than 10% higher than the runner-up. Since then, deep learning methods represented by the **CNN (convolutional neural network)** have been widely applied in computer vision, and more and deeper CNNs have been proposed, such as VGGNet, a top entry in the 2014 ImageNet competition, and ResNet, the winner of the 2015 ImageNet competition.
-
-
-
-### ResNet
-
-[ResNet](https://arxiv.org/abs/1512.03385) won the 2015 competition. Compared with traditional machine learning classification algorithms, ResNet achieves excellent results, and a large number of detection, segmentation, classification, and other tasks have since been built on top of it.
-
-In the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark) repository, we provide a OneFlow implementation of ResNet50 v1.5. After 90 epochs of training on the ImageNet-2012 dataset, its evaluation accuracy reaches 77.318% (Top 1) and 93.622% (Top 5).
-
-For more details on network parameter alignment, refer to the [cnns part of OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/Classification/cnns).
-
-![resnet50_validation_acuracy](imgs/resnet50_validation_acuracy.png)
-
-
-
-**Some notes on ResNet50 v1.5**
-
-> ResNet50 v1.5 is an improved version of the original [ResNet50 v1](https://arxiv.org/abs/1512.03385). Compared with the original model, its Top-1 accuracy is slightly higher (~0.5%). See [here](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5) for more details.
-
-Next, we use the ResNet50 network above as an example to show, step by step, how to train and run inference with OneFlow.
-
-The main contents include:
-
-- Preparation
-  - Installation and project setup
-
-- Quick start
-  - Predict / Inference
-  - Train / Validation
-  - Evaluation
-- More details
-  - Distributed training
-  - Hybrid precision training and prediction
-- Advanced
-  - Parameter alignment
-  - Preparing dataset (ImageNet 2012)
-  - Convert OneFlow model to ONNX model
-
-
-
-## Requirements
-
-
-> Don't worry, OneFlow is easy to use. You can start your image recognition journey with OneFlow in the following three steps.
->
-> - Install OneFlow; refer to the [OneFlow project home page](https://github.com/Oneflow-Inc/oneflow) for installation instructions.
->
-> - Clone / Download [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark) repository.
->
->   `git clone git@github.com:Oneflow-Inc/OneFlow-Benchmark.git`
->
->   `cd  OneFlow-Benchmark/Classification/cnns`
->
-> - Prepare the dataset (optional)
->
->   - Use synthetic virtual dataset directly.
->   - Download the ImageNet 2012 [mini-dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip) we created and unzip it into the data directory
->   - Or: Make a complete OFRecord format ImageNet dataset (see the advanced section below)
->
-> We provide the general scripts `train.sh` and `inference.sh`, which apply to training, validation, and inference for all CNN networks in this repository. You can train different models on different datasets by setting the parameters in these scripts.
->
-> **Notes on the model**
->
-> > By default we use ResNet50. You can select another model with the `--model` parameter, for example `--model="resnet50"` or `--model="vgg"`.
->
-> **Description of dataset**
->
->
-> > 1) To help readers get started quickly, we provide a synthetic dataset: data generated directly in memory and used as a random input source for the neural network.
-> >
-> > 2) We also provide a mini-dataset. Download and unzip it into the data directory to start training quickly. Once familiar with the process, readers can refer to the dataset preparation section to build the complete ImageNet 2012 dataset.
-> >
-> > 3) Using the OFRecord dataset format improves data loading efficiency (this is not required, though; see [Data Input](../basics_topics/data_input.md), as OneFlow also supports loading numpy data directly).
-
-
-
-## Quick Start
-
-So, let's start the image classification journey with OneFlow!
-
-First, switch to the directory:
-
-```
-cd OneFlow-Benchmark/Classification/cnns
-```
-
-### Pretrained Model
-
-#### resnet50
-
-[resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz ) (validation accuracy: 77.318% top1, 93.622% top5)
-
-### Predict / Inference
-
-After downloading pretrained model, unzip it and put it into the current directory. Then execute:
-
-```
-sh inference.sh
-```
-
-This script will call the model to classify the goldfish picture:
-
-<div align="center">
-    <img src="imgs/fish.jpg" align='center'/>
-</div>
-
-The prediction is successful if the following is output.
-
-```
-data/fish.jpg
-0.87059885 goldfish, Carassius auratus
-```
-
-As you can see, the model classifies this picture as a goldfish with a probability of about 87.06%.
-
-### Train & Validation
-
-- Training the model is just as easy; simply run:
-
-  ```
-  sh train.sh
-  ```
-
-  Training starts and you will see output like the following:
-
-  ```
-  Loading synthetic data.
-  Loading synthetic data.
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_initial_model.
-  Init model on demand.
-  train: epoch 0, iter 10, loss: 7.197278, top_1: 0.000000, top_k: 0.000000, samples/s: 61.569
-  train: epoch 0, iter 20, loss: 6.177684, top_1: 0.000000, top_k: 0.000000, samples/s: 122.555
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_epoch_0.
-  train: epoch 0, iter 30, loss: 3.988656, top_1: 0.525000, top_k: 0.812500, samples/s: 120.337
-  train: epoch 1, iter 10, loss: 1.185733, top_1: 1.000000, top_k: 1.000000, samples/s: 80.705
-  train: epoch 1, iter 20, loss: 1.042017, top_1: 1.000000, top_k: 1.000000, samples/s: 118.478
-  Saving model to ./output/snapshots/model_save-20200723124215/snapshot_epoch_1.
-  ...
-  ```
-
-  > To make the demonstration easy to run, we use the synthetic dataset by default so that you can quickly see the model in action.
-
-  You can also use the [mini-dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/dataset/imagenet/mini-imagenet.zip): download it, unzip it into the data directory, and then modify the training script as follows:
-
-  ```
-  rm -rf core.*
-  rm -rf ./output/snapshots/*
-
-  DATA_ROOT=data/imagenet/ofrecord
-
-  python3 of_cnn_train_val.py \
-      --train_data_dir=$DATA_ROOT/train \
-      --num_examples=50 \
-      --train_data_part_num=1 \
-      --val_data_dir=$DATA_ROOT/validation \
-      --num_val_examples=50 \
-      --val_data_part_num=1 \
-      --num_nodes=1 \
-      --gpu_num_per_node=1 \
-      --model_update="momentum" \
-      --learning_rate=0.001 \
-      --loss_print_every_n_iter=1 \
-      --batch_size_per_device=16 \
-      --val_batch_size_per_device=10 \
-      --num_epoch=10 \
-      --model="resnet50"
-  ```
-
-  Running this script trains a classification model on the mini-ImageNet dataset, which contains only 50 goldfish images. The resulting model can be used to classify the goldfish image.
-
-  If you need to train the model on the complete ImageNet2012 dataset, refer to the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository.
-
-### Evaluate
-
-You can evaluate the accuracy of the ResNet50 model using either your own trained model or the [resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz ) that we provide (unzip it and put it in the current directory).
-
-Run this script:
-
-```
-sh evaluate.sh
-```
-
-The accuracy of the trained model on validation dataset with 50000 images can be obtained:
-
-```
-Time stamp: 2020-07-27-09:28:28
-Restoring model from resnet_v15_of_best_model_val_top1_77318.
-I0727 09:28:28.773988162    8411 ev_epoll_linux.c:82]        Use of signals is disabled. Epoll engine will not be used
-Loading data from /dataset/ImageNet/ofrecord/validation
-validation: epoch 0, iter 195, top_1: 0.773277, top_k: 0.936058, samples/s: 1578.325
-validation: epoch 0, iter 195, top_1: 0.773237, top_k: 0.936078, samples/s: 1692.303
-validation: epoch 0, iter 195, top_1: 0.773297, top_k: 0.936018, samples/s: 1686.896
-```
-
-> Before executing `sh evaluate.sh`, make sure you have prepared the validation dataset of ImageNet 2012. Please refer to [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository to learn how to make validation dataset.
-
-The three evaluation rounds show that our model achieves a Top-1 accuracy of over 77.32%.
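-
-As a quick check, the three `top_1` and `top_k` values in the log above can be averaged. This is a small illustrative sketch; the numbers are copied from the log, and the variable names are our own.
-
-```python
-from statistics import mean
-
-# top_1 / top_k values copied from the three validation rounds in the log above
-top_1_rounds = [0.773277, 0.773237, 0.773297]
-top_k_rounds = [0.936058, 0.936078, 0.936018]
-
-mean_top_1 = mean(top_1_rounds)
-mean_top_k = mean(top_k_rounds)
-
-if __name__ == "__main__":
-    print(f"mean top-1: {mean_top_1:.4%}")
-    print(f"mean top-k: {mean_top_k:.4%}")
-```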
-
-Finally, congratulations! You have completed training, validation, inference, and evaluation of the ResNet model on the ImageNet dataset. Applause for yourself!
-
-
-
-## Details
-
-### Distributed training
-
-**Simple, easy-to-use distributed training is one of OneFlow's main features.**
-
-OneFlow is designed to support efficient distributed training natively. For distributed data parallelism in particular, users do not have to worry about how to split and synchronize data when an algorithm scales from a single machine to multiple machines. In other words, users only need to write the algorithm from a single-machine point of view, and the code automatically gains the ability to train in a distributed way.
-
-
-#### How to configure and run distributed training?
-
-We continue with the code from the "Quick Start" section. In `train.sh`, distributed training is configured by specifying the number of nodes (machines) with `--num_nodes`, the IP addresses of the nodes with `--node_ips`, and the number of devices used on each node with `--gpu_num_per_node`.
-
-For example, to run distributed training on 2 machines with 4 devices each (8 devices in total), configure it like this:
-
-```
-# train.sh
-python3 of_cnn_train_val.py \
-    --num_nodes=2 \
-    --node_ips="192.168.1.1, 192.168.1.2" \
-    --gpu_num_per_node=4 \
-    ...
-    --model="resnet50"
-```
-
-Then execute the following script on the two machines at the same time:
-
-```
-./train.sh
-```
-
-After the program starts, you can run `watch -n 0.1 nvidia-smi` to see that the devices on both machines are working. After a while, the output is printed on the screen of the first machine listed in `--node_ips`.
-
-### Hybrid precision training and predicting
-
-OneFlow currently supports float16/float32 hybrid (mixed) precision training. During training, the model parameters are computed in float16, while float32 is retained for the gradient update and related calculations. Because parameter storage is halved, training speed improves.
-
-With hybrid precision training enabled in OneFlow, ResNet50 training can theoretically be accelerated by a factor of `1.7`.
-
-
-#### How to turn on the hybrid precision training mode?
-
-Just add the parameter `--use_fp16=True` in the `train.sh` script.
-
-#### Hybrid precision model
-
-We provide a hybrid precision model trained for 90 epochs on the ImageNet2012 dataset, with a Top-1 accuracy of 77.33%.
-
-You can download and use it directly: [resnet50_v15_fp16](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_fp16_of_best_model_val_top1_77330.zip)
-
-
-
-## Advanced
-
-### Parameters alignment
-
-OneFlow's ResNet50 implementation is aligned with NVIDIA's MXNet version. We have carefully aligned nearly everything, from the learning rate, optimizer and image augmentation to finer per-layer network configuration, bias, weight initialization, and more. For the detailed parameter alignment, please refer to the [OneFlow-Benchmark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns) repository.
-
-
-
-### Preparing dataset
-
-#### Introduction of image classification dataset
-
-Public datasets used for image classification include CIFAR and ImageNet. These datasets provide original images in JPEG format.
-
-- [CIFAR](http://www.cs.toronto.edu/~kriz/cifar.html)
-
-  Hinton's students Alex Krizhevsky and Ilya Sutskever collected a small dataset for classifying common objects. It includes CIFAR-10 and CIFAR-100.
-
-- [ImageNet](http://image-net.org/index)
-
-  The ImageNet dataset generally refers to the dataset used in the Large Scale Visual Recognition Challenge (ILSVRC) between 2010 and 2017. The ImageNet data has changed slightly since 2010. The commonly used ImageNet-2012 dataset includes 1000 categories. Its training set contains 1281167 images, ranging from 732 to 1300 per category, and its validation set contains 50000 images, with an average of 50 per category.
-
-For the complete process of preparing ImageNet-2012 dataset, please refer to [README](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/tools/README.md) in the tools directory.
-
-### Convert OneFlow model to ONNX model
-
-#### Introduction
-
-**ONNX (Open Neural Network Exchange)** is a widely used neural network intermediate format. With the ONNX format, a OneFlow model can be used by many serving frameworks (such as OpenVINO and ONNX Runtime) and some mobile frameworks (ncnn, tnn, Tengine, etc.). In this section, we introduce how to convert the trained ResNet50 v1.5 model to an ONNX model and evaluate it.
-
-#### Quick Start
-
-We provide complete code, [resnet\_to\_onnx.py](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/Classification/cnns/resnet_to_onnx.py), which helps you convert and test the model.
-
-**Step 1:** Download the pretrained model [resnet50_v1.5_model](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/resnet_v15_of_best_model_val_top1_77318.tgz), extract it, and put it in the current directory.
-
-**Step 2:** Execute `python3 resnet_to_onnx.py`.
-
-This code converts the OneFlow model to an ONNX model, and then uses ONNX Runtime to load the converted model and test it on a single image. The test image is as follows:
-
-<div align="center">
-    <img src="imgs/tiger.jpg" align='center'/>
-</div>
-
-> Image source: https://en.wikipedia.org/wiki/Tiger
-
-Output:
-
-```
-Convert to onnx success! >>  onnx/model/resnet_v15_of_best_model_val_top1_77318.onnx
-data/tiger.jpg
-Are the results equal? Yes
-Class: tiger, Panthera tigris; score: 0.8112028241157532
-```
-
-
-
-#### How to generate ONNX model
-
-The example above showed how to convert OneFlow's ResNet model to an ONNX model and use ONNX Runtime to make predictions. Similarly, you can follow the steps below to convert your own trained ResNet model or other models.
-
-**Step1: Save the model's weight**
-
-First, specify the OneFlow model path and the path where the converted ONNX model will be stored, as in the following example:
-
-```python
-#set up your model path
-flow_weights_path = 'resnet_v15_of_best_model_val_top1_77318'
-onnx_model_dir = 'onnx/model'
-```
-
-**Step2: Create a new job function for inference**
-
-Then, we create a new job function for inference. It contains only the network structure, without the operator that reads OFRecord, and accepts input in the form of a numpy array. You can refer to `InferenceNet` in `resnet_to_onnx.py`.
-
-**Step3: Call `flow.onnx.export` method**
-
-In the following code, we call the `oneflow_to_onnx()` method, which wraps the core model conversion method `flow.onnx.export()`.
-
-**`flow.onnx.export`** obtains an ONNX model from a OneFlow network. Its first parameter is the job function used for inference, the second is the OneFlow model path, and the third is the save path of the ONNX model.
-
-```python
-onnx_model = oneflow_to_onnx(InferenceNet, flow_weights_path, onnx_model_dir, external_data=False)
-```
-
-#### Evaluate the correctness of ONNX model
-
-After the ONNX model is generated, we can run it with ONNX Runtime to verify that the OneFlow model and the ONNX model give the same results for the same inputs. The corresponding code is `check_equality` in `resnet_to_onnx.py`.
--- a/en/docs/adv_examples/wide_deep.md
+++ b/en/docs/adv_examples/wide_deep.md
-[TOC]
-
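-[HugeCTR](https://github.com/NVIDIA/HugeCTR) is an efficient GPU framework provided by NVIDIA, designed for click-through rate (CTR) estimation training.
-
-OneFlow builds a Wide & Deep Learning (WDL) network benchmarked against HugeCTR. The OneFlow-WDL network implements model parallelism and sparse updates. On a server with eight 12 GB TitanV GPUs, it supports a vocabulary size of more than 400 million, with no performance loss compared to a small vocabulary.
-
-This article describes how to train with the OneFlow-WDL network, along with some training results and analysis.
-
-## Environment and preparation
-Running OneFlow-WDL requires a Python environment with OneFlow installed, as well as [scikit-learn](https://scikit-learn.org/stable/install.html).
-### Software requirements
-- Python 3.x (recommended)
-- OneFlow 0.x
-- scikit-learn
-
-### Data preparation
-We have prepared a small [sample dataset](https://oneflow-public.oss-cn-beijing.aliyuncs.com/datasets/wdl_ofrecord_examples.tgz) that you can download for a quick test.
-
-Alternatively, follow the steps in [Creating a WDL dataset with Spark](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/ClickThroughRate/WideDeepLearning/how_to_make_ofrecord_for_wdl.md) to download the original dataset from the CriteoLabs website and convert it into the OFRecord format required by OneFlow.
-
-### OneFlow-WDL script
-OneFlow-WDL consists of a single script, `wdl_train_eval.py`, which you can download [here](https://github.com/Oneflow-Inc/OneFlow-Benchmark/blob/master/ClickThroughRate/WideDeepLearning/wdl_train_eval.py).
-
-## Running the OneFlow-WDL script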
-```
-EMBD_SIZE=1603616
-DATA_ROOT=/path/to/wdl/ofrecord
-python3 wdl_train_eval.py \
-  --train_data_dir $DATA_ROOT/train \
-  --train_data_part_num 256 \
-  --train_part_name_suffix_length=5 \
-  --eval_data_dir $DATA_ROOT/val \
-  --eval_data_part_num 256 \
-  --max_iter=300000 \
-  --loss_print_every_n_iter=1000 \
-  --eval_interval=1000 \
-  --batch_size=16384 \
-  --wide_vocab_size=$EMBD_SIZE \
-  --deep_vocab_size=$EMBD_SIZE \
-  --gpu_num 1
-```
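-Usually, once the dataset location `DATA_ROOT` is configured, the shell script above can be executed. If output similar to the following appears on the screen, the script is running correctly.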
-```
-1000 time 2020-07-08 00:28:08.066281 loss 0.503295350909233
-1000 eval_loss 0.4846755236387253 eval_auc 0.7616240146992771
-2000 time 2020-07-08 00:28:11.613961 loss 0.48661992555856703
-2000 eval_loss 0.4816856697201729 eval_auc 0.765256583562705
-3000 time 2020-07-08 00:28:15.149135 loss 0.48245503094792364
-3000 eval_loss 0.47835959643125536 eval_auc 0.7715609382514008
-4000 time 2020-07-08 00:28:18.686327 loss 0.47975033831596375
-4000 eval_loss 0.47925308644771575 eval_auc 0.7781267916810946
-```
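-## Test results and notes
-We ran a set of tests of OneFlow-WDL on a server with eight TitanV GPUs (12 GB of memory each), and ran tests with the same parameters using the docker container provided by HugeCTR.
-
-### Multi-GPU performance test
-The main purpose is to measure the average latency per batch for different numbers of GPUs with batch size = 16384. The test was configured with 7 hidden layers of 1024 units each.
-
-The results are shown below: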
-
-![image](imgs/fixed_batch_size_latency.png)
-
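-We also recorded the actual peak GPU memory usage during the test. The results are shown below:
-
-![image](imgs/fixed_batch_size_memory.png)
-
-Taken together, the results show that from 1 to 8 GPUs, OneFlow-WDL is faster than HugeCTR while using less GPU memory.
-
-### Multi-GPU performance test with batch size = 16384 per GPU
-The main purpose is to measure the average latency per batch when training with 1 to 8 GPUs, with each GPU processing a batch size of 16384. The test was configured with 7 hidden layers of 1024 units each.
-
-The results are shown below: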
-
-![image](imgs/scaled_batch_size_latency.png)
-
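-We also recorded the actual peak GPU memory usage during the test. The results are shown below:
-
-![image](imgs/scaled_batch_size_memory.png)
-
-Taken together, the results show that latency increases as the number of GPUs increases, and that OneFlow-WDL is faster than HugeCTR while using less GPU memory. Because each GPU keeps a batch size of 16384, the memory used per GPU by OneFlow does not change significantly.
-
-### Single-GPU performance test with different batch sizes
-The main purpose is to measure the average latency per batch for different batch sizes on a single GPU. The test was configured with 2 hidden layers of 1024 units each.
-
-The results are shown below: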
-
-![image](imgs/scaled_batch_size_latency_1gpu.png)
-
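-### Very large vocabulary test
-Two embedding tables are configured in OneFlow-WDL:
-- `wide_embedding`, of size vocab_size x 1
-- `deep_embedding`, of size vocab_size x 16
-
-The vocabulary size (vocab_size) in HugeCTR is 1603616. We started testing from 3200000 and went up to a vocabulary size of 400 million. The results are shown below: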
-
-![image](imgs/big_vocab_table_2x1024.png) ![image](imgs/big_vocab_table_7x1024.png)
-
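-In the figures above, the blue bars show the average latency per training batch, and the red curve shows GPU memory usage.
-
-Conclusion: as the vocabulary size grows, memory usage grows with it, but latency does not change noticeably.
-
-### Convergence test 1
-We chose batch size = 512 for the convergence test.
-
-The figure below shows the results of the first 500 steps. After each training step, 20 records are selected from the validation set for evaluation. The curves in the figure are loss and AUC: ![image](imgs/eval_auc_loss_500iters.png)
-
-Conclusion: AUC quickly rises above 0.75.
-
-### Convergence test 2
-With the same setup as convergence test 1, this time the average training loss is printed every 1000 steps, and then 20 records from the validation set are selected for evaluation. Training runs for 300,000 steps in total. The results are as follows: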
-
-![image](imgs/train_eval_auc_loss.png)
-
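-Conclusions and analysis:
-1. The blue train loss curve has clear downward steps. The whole training set has 36674623 records, so with batch_size=512 one pass over the dataset (one epoch) takes about 71630 steps, and 300,000 steps go through the training set a little more than 4 times. The steps in the blue curve confirm this. OneFlow supports data shuffling during training: each time the dataset has been fully consumed, the data is reshuffled to reduce overfitting.
-2. The orange curve is the validation loss. It keeps a generally downward trend during the first two epochs and starts to rise from the third epoch, indicating overfitting.
-3. The gray curve is the validation AUC. It also peaks in the second epoch, exceeding 0.8, and starts to decline in the following epochs.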
--- a/en/docs/adv_examples/yolov3.md
+++ b/en/docs/adv_examples/yolov3.md
-## YoloV3
-
-## 1. Introduction
-
-The [YOLO](https://pjreddie.com/darknet/yolo/) series of algorithms (v1~v3) is the first single-stage object detection network; the name YOLO (You Only Look Once) reflects its single-stage nature. Because the network is simple and single-stage detection is fast, it stands apart from two-stage detectors represented by Faster-RCNN. Since its release, it has become popular in object detection for its speed and accuracy, and has been widely used and praised.
-
-Yolov3 is the classic and most comprehensive version (the official Yolov4 was also released recently). It uses Darknet-53, a residual network, as the backbone, and integrates features such as multi-scale, 3-way output feature maps and upsampling, which greatly improves model accuracy and small-object detection capability.
-
-![detected_kite](imgs/detected_000004.jpg)
-
-In this article, we provide a OneFlow implementation of Yolov3. The difference is that we handle the NMS process in C++ and call it through a custom user op. We also support handling the NMS process in Python.
-
-
-
-## 2. Quick Start
-
-Before we start, please make sure you have installed [OneFlow](https://github.com/Oneflow-Inc/oneflow) properly.
-
-1. Git clone [this repository](https://github.com/Oneflow-Inc/oneflow_yolov3)
-
-```
-git clone https://github.com/Oneflow-Inc/oneflow_yolov3.git
-```
-2. Install the Python dependencies
-
-```
-   pip install -r requirements.txt
-```
-3. Execute this script in project's root directory
-
-```
-bash scripts/test.sh
-```
-This script compiles the operators defined in the C++ code into callable .so files. You will then see the following files in the project path:
-
-- libdarknet.so
-
-- liboneflow_yolov3.so
-
-
-
-### Pretrain Model
-
-We use the pretrained model [yolov3.weight](https://pjreddie.com/media/files/yolov3.weights) provided by the Yolov3 author and convert it into the OneFlow format. Download the pretrained model [of_model_yolov3.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/model_zoo/of_model_yolov3.zip), extract the `of_model` folder, and put it in the project's root directory.
-
-
-
-## 3. Predict/inference
-
-Execute the following script:
-
-```
-sh yolo_predict.sh
-```
-Or:
-```
-sh yolo_predict_python_data_preprocess.sh
-```
-
-After the script finishes, images with bounding boxes are generated under `data/result`.
-
-![detected_kite](imgs/detected_kite.jpg)
-
-Parameters description:
-
-- `--pretrained_model`: Pretrained model path
-- `--label_path`: COCO label path
-- `--input_dir`: The path of the folder of images to be detected
-- `--output_dir`: The output path of the detection results
-- `--image_paths`: One or more paths of images to be detected, for example:
-
-  `--image_paths 'data/images/000002.jpg' 'data/images/000004.jpg'`
-
-Training is also simple. After preparing the dataset, we only need to execute `sh yolo_train.sh`. The process of preparing the dataset is described in the Preparing Dataset section.
-
-
-
-## 4. Preparing Dataset
-
-Yolov3 supports arbitrary object detection datasets. Below, we use [COCO2014](http://cocodataset.org/#download) as an example to create the training/validation dataset. Other datasets, such as [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/) or custom datasets, can be created in the same format.
-
-### Resource file
-
-Download the COCO2014 training and validation datasets, unzip them, and put `train2014` and `val2014` under the `data/COCO/images` directory.
-
-(If you have already downloaded the COCO2014 dataset locally, you can make `images` a soft link to the parent directory of `train2014` and `val2014`.)
-
-Prepare the resource files: `labels`, `5k.part`, `trainvalno5k.part`
-
-```
-wget -c https://pjreddie.com/media/files/coco/5k.part
-wget -c https://pjreddie.com/media/files/coco/trainvalno5k.part
-wget -c https://pjreddie.com/media/files/coco/labels.tgz
-```
-
-### Scripts
-
-Execute the following commands in the `data/COCO` directory:
-
-```
-# get label file
-tar xzf labels.tgz
-
-# set up image list
-paste <(awk "{print \"$PWD\"}" <5k.part) 5k.part | tr -d '\t' > 5k.txt
-paste <(awk "{print \"$PWD\"}" <trainvalno5k.part) trainvalno5k.part | tr -d '\t' > trainvalno5k.txt
-
-# copy label txt to image dir
-find labels/train2014/ -name "*.txt"  | xargs -i cp {} images/train2014/
-find labels/val2014/   -name "*.txt"  | xargs -i cp {} images/val2014/
-```
-
-This script unpacks the `labels.tgz` file and generates `5k.txt` and `trainvalno5k.txt` in the current directory. It then copies all label `.txt` files in `labels/train2014` and `labels/val2014` to the corresponding training and validation image folders (make sure images and labels are in the same directory).
-
-At this point, the preparation of the whole dataset is completed.
-
-
-
-## 5. Training
-
-Modify the parameters in the `yolo_train.sh` script, set `--image_path_file="data/COCO/trainvalno5k.txt"`, and execute:
-
-```
-sh yolo_train.sh
-```
-
-Training then starts. The parameters are described as follows:
-
-- `--gpu_num_per_node`: The number of devices on each machine
-- `--batch_size`: The batch size
-- `--base_lr`: The base learning rate
-- `--classes`: The number of target categories (COCO 80; VOC 20)
-- `--model_save_dir`: The model storage path
-- `--dataset_dir`: The path of the training/validation dataset
-- `--num_epoch`: The total number of epochs
-- `--save_frequency`: The epoch interval for saving the model
-
-
-## Descriptions
-
-At present, when `yolo_predict.sh` is called, the data preprocessing depends on `darknet`.
-
-Specifically:
-
-In `predict decoder`, we call the `load_image_color` and `letterbox_image` functions.
-
-In `train decoder`, we call the `load_data_detection` function.
-It mainly involves the following operations, which will be replaced in later versions with `OneFlow Decoder Ops`:
-
-- image read
-- nhwc -> nchw
-- image / 255
-- bgr2rgb
-- resize_image
-- fill_image
-- random_distort_image
-- clip image
-- random flip image and box
-- randomize_boxes
-- correct_boxes
--- a/en/docs/basics_topics/data_input.md
+++ b/en/docs/basics_topics/data_input.md
@@ -7,7 +7,7 @@ Machine learning is driven by data. Data loading and preprocessing require both

 Working directly with Numpy data is easy and convenient but only for small amounts of data. Because when the amount of data is too large, there may be barrier in preparing the Numpy data. Therefore, this approach is more suitable for the initial stages of the project to quickly validate and improve the algorithm.

-The DataLoader of OneFlow use techniques such as multi-threading and data pipelining which make data loading, data pre-processing more efficient.However, you need to [prepare dataset](... /extended_topics/how_to_make_of_dataset.md) which already supported by Oneflow or [develop you own DataLoader](../extended_topics/implement_data_loader.md) for the datatype which not supported by Oneflow. Thus we recommend use that in mature projects.
+The DataLoader of OneFlow uses techniques such as multi-threading and data pipelining, which make data loading and preprocessing more efficient. However, you need to [prepare a dataset](../extended_topics/how_to_make_ofdataset.md) in a format already supported by OneFlow. Thus, we recommend using it in mature projects.


 ## Use Numpy as Data Input
@@ -72,7 +72,7 @@ In addition, there are other data preprocessing operators that are used to proce

 The following example reads the `OFRecord` data format file and dealing with images from the ImageNet dataset. The complete code can be downloaded here: [of_data_pipeline.py](../code/basics_topics/of_data_pipeline.py).

-This script requires an OFRecord dataset and you can make your own one according to [this article] (. /extended_topics/how_to_make_of_dataset.md).
+This script requires an OFRecord dataset and you can make your own one according to [this article](../extended_topics/how_to_make_ofdataset.md).

 Or you can download the [part-00000](https://oneflow-public.oss-cn-beijing.aliyuncs.com/online_document/docs/basics_topics/part-00000) that we have prepared for you which contains 64 images. Then replace `path/to/ImageNet/ofrecord` in the script with the directory where the `part-00000` file **is located** and run the script.

@@ -139,4 +139,4 @@ For example, in the script:

 ## More Formats Support by DataLoader

-OneFlow provides a number of DataLoaders and preprocessing operators, refer to [oneflow.data](https://oneflow.readthedocs.io/en/master/data.html) for details. These operators will be enriched and optimized in the future, but users can also refer to [this article](../extended_topics/implement_data_loader.md) to customize the DataLoader to meet specific needs.
+OneFlow provides a number of DataLoaders and preprocessing operators, refer to [oneflow.data](https://oneflow.readthedocs.io/en/master/data.html) for details. These operators will be enriched and optimized in the future.
--- a/en/docs/extended_topics/debug_by_vscode.md
+++ b/en/docs/extended_topics/debug_by_vscode.md
-This article describes how to configure VS Code to set up a GUI development environment for OneFlow.
-
-If you are not familiar with VS Code, please refer to the [official documentation](https://code.visualstudio.com/docs).
-
-This article covers:
-
-* How to compile the `Debug` version of OneFlow.
-
-* The necessary VS Code extensions, along with installation guidelines.
-
-### Compile the Debug version of OneFlow
-
-If we use the `Release` version of OneFlow, debugging may be difficult because of compiler optimizations: the actual execution position may not correspond to the source line.
-
-Thus, we need to compile the `Debug` version of OneFlow and generate the json file needed by clangd.
-
-When running cmake, we need to add the `Debug` and `CMAKE_EXPORT_COMPILE_COMMANDS` flags.
-
-```
-cmake .. \
--DCMAKE_BUILD_TYPE=Debug \
--DCMAKE_EXPORT_COMPILE_COMMANDS=1
-```
-
-* `-DCMAKE_BUILD_TYPE=Debug` selects the Debug build.
-
-* `-DCMAKE_EXPORT_COMPILE_COMMANDS` will generate a file named `compile_commands.json` in the `build` folder. The json file is required by clangd and we will configure it later.
-
-### Remote - SSH
-This section is intended only for those who need to develop remotely. Developers working on a local host **may skip this section**.
-With the "Remote - SSH" extension of VS Code, we can connect to a remote server through SSH.
-
-![RemoteSSH](imgs/plugin-remote-ssh.png)
-
-With the help of "Remote SSH", we can attach to OneFlow running on a remote server and debug OneFlow as if we are debugging a local program.
-
-After installing the extension "Remote - SSH", press F1 and select `Remote-SSH: Connect to Host..` in the pop-up search bar. After that, we can set the SSH connection configuration and connect to the remote host.
-
-After connecting to the remote host, the extensions window is automatically divided into "remote" and "local".
-
-If an extension that needs to be installed on the remote host is detected, it is grayed out with a button **Install in SSH: remote server name**. Click it to install the corresponding extension on the remote host.
-
-![remotePlugin](imgs/plugin-remote-ssh-install.png)
-
-As shown in the figure above, we have installed python, clangd and native debug extensions on the remote host to support remote debugging of OneFlow.
-
-But the extensions Go, HTML CSS Support are not installed remotely.
-
-
-### clangd
-After some simple configuration, clangd provides code completion, symbol navigation and other conveniences.
-
-The following are required before configuring clangd:
-
-* We have already compiled OneFlow and generated the `compile_commands.json` file.
-* We have already installed clangd on the remote host through "Remote - SSH".
-* It is **NOT** recommended to install the extension "ms-vscode.cpptools C/C++" that VS Code recommends, because it conflicts with clangd.
-
-#### Configure clangd in VS Code
-
-Create a soft link in OneFlow's source root to the `compile_commands.json` file in the `build` directory. To do so, change to OneFlow's source root directory and run the command below:
-
-```
-ln -s ./build/compile_commands.json compile_commands.json
-```
-
-Then press `Ctrl+Shift+P` (`command+shift+p` on MacOS) to find the `Open Remote Settings` option and open the `settings.json` file and add the following configuration:
-
-```json
-    "clangd.path": "/path/to/bin/clangd",
-    "clangd.arguments": [
-        "-j",
-        "12",
-        "-clang-tidy"
-    ]
-```
-The meaning of `clangd.arguments` and more options can be found by `clangd --help`.
-
-#### Using clangd
-In the View->Output panel of VS Code, choose "Clang Language Server" in the dropdown list to see the parsing output of clangd. After that, VS Code can jump between C/C++ symbols.
-
-Press `Ctrl+P` (`command+P` on MacOS), and then through `@symbols name` or `#symbols name` we can find the symbols in current file or in project scope respectively.
-
-### native debug
-Press `Ctrl + Shift + D` (`command+shift+D` on MacOS) or click the Run button on the activity bar to switch VS Code to the Run view.
-
-![Run View](imgs/run-view.png)
-
-Then choose `Create a launch.json file` and select the gdb template.
-
-![gdb](imgs/gdb-select.png)
-
-Then we can set the options:
-```json
-{
-    "version": "0.2.0",
-    "configurations": [
-        {
-            "name": "lenet", //defined job name
-            "type": "gdb",
-            "request": "launch",
-            "target": "/home/yaochi/.conda/envs/ycof/bin/python3", //python path
-            "arguments": "lenet_train.py", //script
-            "cwd": "/home/yaochi/of_example", //script path
-            "valuesFormatting": "parseText"
-        }
-    ]
-}
-```
-
-After we set the breakpoint, we can press F5 to start debugging.
-
-![snapshot of debugging](imgs/debug_snapshot.png)
-
-### Others
-
-* If downloads are too slow in VS Code, refer to the [official documentation](https://code.visualstudio.com/docs/setup/network) for changing `hostname` or setting a proxy.
-
-* The [official introduction](https://clang.llvm.org/extra/clangd/Installation.html) to installing clangd.
-
-* The [official introduction](https://code.visualstudio.com/docs/editor/debugging) to debugging configuration in VS Code.
-
-* The latest version of clangd may have special glibc requirements, which may cause errors about missing libraries.
-
-```
-./bin/clangd: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by ./bin/clangd)
-```
-
-In that case, we can download an older version of clangd (this article recommends 9.0.0). Older versions of clangd are available on the [LLVM official site](https://releases.llvm.org/download.html). Download the LLVM tools package, which includes clangd.
--- a/en/docs/extended_topics/how_to_make_ofdataset.md
+++ b/en/docs/extended_topics/how_to_make_ofdataset.md
@@ -104,8 +104,8 @@ def ofrecord_reader(

 The benefit of using `ofrecord_reader` is that `ofrecord_reader` acts as a normal operator which participates in OneFlow composition optimization and enjoys OneFlow pipeline acceleration.
 For flexibility and extensibility of the code, we can define a preprocessing OP for `ofrecord_reader` to deal with specific data formats which are coupled with operational logic (e.g. decoding, decompression and etc.).
+
 - For more information on DataLoader and related operator usage refer to [Data input](../basics_topics/data_input.md) .
- For more information on customized OP please refer to [User op](./user_op.md).

 ## The transition between other data format data and OFRecord dataset


--- a/en/docs/extended_topics/implement_data_loader.md
+++ b/en/docs/extended_topics/implement_data_loader.md
--- a/en/docs/extended_topics/user_op.md
+++ b/en/docs/extended_topics/user_op.md
--- a/en/docs/index.md
+++ b/en/docs/index.md
@@ -36,6 +36,4 @@ Looking forward to your [feedbacks](https://github.com/Oneflow-Inc/oneflow/issue

 - If you want to know more about the characteristics of OneFlow, such as the format of OneFlow's dataset, the parallelism view of OneFlow or how to debug OneFlow framework with vscode, please refer to [extended topic](extended_topics/job_function_define_call.md).

-In [advanced examples](adv_examples/resnet.md), we introduce models in [OneFlow Model Zoo repository](https://github.com/Oneflow-Inc/OneFlow-Benchmark). It is helpful for users to understand the models and other details.
-
 We highly expect developers and geeks to join our [contributor community](contribute/intro.md). Together we can build a perfect deep learning framework.
--- a/en/mkdocs.yml
+++ b/en/mkdocs.yml
@@ -125,19 +125,9 @@ nav:
      - Loading and Preparing OFRecord Dataset: extended_topics/how_to_make_ofdataset.md
      - Convert Image Files to OFRecord Datasets: extended_topics/how_to_convert_image_to_ofrecord.md
      - Obtain Runtime Data: extended_topics/watch_watch_diff.md
-      - Use VS Code to Debug OneFlow: extended_topics/debug_by_vscode.md
-      - User Defined OP: extended_topics/user_op.md
      - OneFlow And ONNX Convert: extended_topics/oneflow_convert_tools.md


-    #- Advanced Examples:
-      #- Image Processing:
-       # - ResNet: adv_examples/resnet.md
-        #- AlexNet: adv_examples/alexnet.md
-        #- MaskRCNN: adv_examples/mask_rcnn.md
-      #- NLP handling-BERT: adv_examples/bert.md
-      #- Recommendation System -Wide&Deep: adv_examples/wide_deep.md
-      #- Deep Faking -Generate Conflict Network: adv_examples/dcgan.md
    - API:
      - API: https://oneflow.readthedocs.io/en/master/