diff --git a/fluid/DeepASR/README_cn.md b/fluid/DeepASR/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..be78a048701a621bd90942bdfe30ef4d7c7f082f --- /dev/null +++ b/fluid/DeepASR/README_cn.md @@ -0,0 +1,186 @@ +运行本目录下的程序示例需要使用 PaddlePaddle v0.14及以上版本。如果您的 PaddlePaddle 安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新 PaddlePaddle 安装版本。 + +--- + +DeepASR (Deep Automatic Speech Recognition) 是一个基于PaddlePaddle FLuid与[Kaldi](http://www.kaldi-asr.org)的语音识别系统。其利用Fluid框架完成语音识别中声学模型的配置和训练,并集成 Kaldi 的解码器。旨在方便已对 Kaldi 的较为熟悉的用户实现中声学模型的快速、大规模训练,并利用kaldi完成复杂的语音数据预处理和最终的解码过程。 + +### 目录 +- [模型概览](#model-overview) +- [安装](#installation) +- [数据预处理](#data-reprocessing) +- [模型训练](#training) +- [训练过程中的时间分析](#perf-profiling) +- [预测和解码](#infer-decoding) +- [评估错误率](#scoring-error-rate) +- [Aishell 实例](#aishell-example) +- [欢迎贡献更多的实例](#how-to-contrib) + +### 模型概览 + +DeepASR的声学模型是一个单卷积层加多层层叠LSTMP 的结构,利用卷积来进行初步的特征提取,并用多层的LSTMP来对时序关系进行建模,所用到的损失函数是交叉熵。[LSTMP](https://arxiv.org/abs/1402.1128)(LSTM with recurrent projection layer)是传统 LSTM 的拓展,在 LSTM 的基础上增加了一个映射层,将隐含层映射到较低的维度并输入下一个时间步,这种结构在大为减小 LSTM 的参数规模和计算复杂度的同时还提升了 LSTM 的性能表现。 + +

+<img src="images/lstmp.png"><br/>
+图1 LSTMP 的拓扑结构
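The LSTMP description above can be made concrete with a few lines of code. Below is a minimal NumPy sketch of a single LSTMP time step, added for illustration only: the gate layout, weight names and shapes are assumptions and do not correspond to the model code in this repository.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, r_prev, c_prev, W, U, b, W_proj):
    """One LSTMP step: a standard LSTM cell followed by a linear projection.

    Illustrative shapes: x (d_in,), r_prev (d_proj,), c_prev (d_hid,),
    W (4*d_hid, d_in), U (4*d_hid, d_proj), b (4*d_hid,), W_proj (d_proj, d_hid).
    """
    z = W.dot(x) + U.dot(r_prev) + b
    i, f, o, g = np.split(z, 4)          # input, forget, output gates and cell candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)          # full-size hidden state (d_hid)
    r = W_proj.dot(h)                    # recurrent projection down to d_proj
    return r, c                          # the projected state r is fed to the next step
```

The projection keeps the recurrent weight matrix at size 4*d_hid x d_proj instead of 4*d_hid x d_hid, which is where the parameter and computation savings described above come from.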

+ +### 安装 + + +#### kaldi的安装与设置 + + +DeepASR解码过程中所用的解码器依赖于[Kaldi的安装](https://github.com/kaldi-asr/kaldi),如环境中无Kaldi, 请`git clone`其源代码,并按给定的命令安装好kaldi,最后设置环境变量`KALDI_ROOT`: + +```shell +export KALDI_ROOT= + +``` +#### 解码器的安装 +进入解码器源码所在的目录 + +```shell +cd models/fluid/DeepASR/decoder +``` +运行安装脚本 + +```shell +sh setup.sh +``` + 编译过程完成即成功地安转了解码器。 + +### 数据预处理 + +参考[Kaldi的数据准备流程](http://kaldi-asr.org/doc/data_prep.html)完成音频数据的特征提取和标签对齐 + +### 声学模型的训练 + +可以选择在CPU或GPU模式下进行声学模型的训练,例如在GPU模式下的训练 + +```shell +CUDA_VISIBLE_DEVICES=0,1,2,3 python -u train.py \ + --train_feature_lst train_feature.lst \ + --train_label_lst train_label.lst \ + --val_feature_lst val_feature.lst \ + --val_label_lst val_label.lst \ + --mean_var global_mean_var \ + --parallel +``` +其中`train_feature.lst`和`train_label.lst`分别是训练数据集的特征列表文件和标注列表文件,类似的,`val_feature.lst`和`val_label.lst`对应的则是验证集的列表文件。实际训练过程中要正确指定建模单元大小、学习率等重要参数。关于这些参数的说明,请运行 + +```shell +python train.py --help +``` +获取更多信息。 + +### 训练过程中的时间分析 + +利用Fluid提供的性能分析工具profiler,可对训练过程进行性能分析,获取网络中operator级别的执行时间 + +```shell +CUDA_VISIBLE_DEVICES=0 python -u tools/profile.py \ + --train_feature_lst train_feature.lst \ + --train_label_lst train_label.lst \ + --val_feature_lst val_feature.lst \ + --val_label_lst val_label.lst \ + --mean_var global_mean_var +``` + + +### 预测和解码 + +在充分训练好声学模型之后,利用训练过程中保存下来的模型checkpoint,可对输入的音频数据进行解码输出,得到声音到文字的识别结果 + +``` +CUDA_VISIBLE_DEVICES=0,1,2,3 python -u infer_by_ckpt.py \ + --batch_size 96 \ + --checkpoint deep_asr.pass_1.checkpoint \ + --infer_feature_lst test_feature.lst \ + --infer_label_lst test_label.lst \ + --mean_var global_mean_var \ + --parallel +``` + +### 评估错误率 + +对语音识别系统的评价常用的指标有词错误率(Word Error Rate, WER)和字错误率(Character Error Rate, CER), 在DeepASR中也实现了相关的度量工具,其运行方式为 + +``` +python score_error_rate.py --error_rate_type cer --ref ref.txt --hyp decoding.txt +``` +参数`error_rate_type`表示测量错误率的类型,即 WER 或 CER;`ref.txt` 和 `decoding.txt` 分别表示参考文本和实际解码出的文本,它们有着同样的格式: + +``` +key1 text1 +key2 text2 +key3 text3 +... + +``` + + +### Aishell 实例 + +本节以[Aishell数据集](http://www.aishelltech.com/kysjcp)为例,展示如何完成从数据预处理到解码输出。Aishell是由北京希尔贝克公司所开放的中文普通话语音数据集,时长178小时,包含了400名来自不同口音区域录制者的语音,原始数据可由[openslr](http://www.openslr.org/33)获取。为简化流程,这里提供了已完成预处理的数据集供下载: + +``` +cd examples/aishell +sh prepare_data.sh +``` + +其中包括了声学模型的训练数据以及解码过程中所用到的辅助文件等。下载数据完成后,在开始训练之前可对训练过程进行分析 + +``` +sh profile.sh +``` + +执行训练 + +``` +sh train.sh +``` +默认是用4卡GPU进行训练,在实际过程中可根据可用GPU的数目和显存大小对`batch_size`、学习率等参数进行动态调整。训练过程中典型的损失函数和精度的变化趋势如图2所示 + +

+<img src="images/learning_curve.png"><br/>
+图2 在Aishell数据集上训练声学模型的学习曲线

+ +完成模型训练后,即可执行预测识别测试集语音中的文字: + +``` +sh infer_by_ckpt.sh +``` + +其中包括了声学模型的预测和解码器的解码输出两个重要的过程。以下是解码输出的样例: + +``` +... +BAC009S0764W0239 十一 五 期间 我 国 累计 境外 投资 七千亿 美元 +BAC009S0765W0140 在 了解 送 方 的 资产 情况 与 需求 之后 +BAC009S0915W0291 这 对 苹果 来说 不 是 件 容易 的 事 儿 +BAC009S0769W0159 今年 土地 收入 预计 近 四万亿 元 +BAC009S0907W0451 由 浦东 商店 作为 掩护 +BAC009S0768W0128 土地 交易 可能 随着 供应 淡季 的 到来 而 降温 +... +``` + +每行对应一个输出,均以音频样本的关键字开头,随后是按词分隔的解码出的中文文本。解码完成后运行脚本评估字错误率(CER) + +``` +sh score_cer.sh +``` + +其输出类似于如下所示 + +``` +Error rate[cer] = 0.101971 (10683/104765), +total 7176 sentences in hyp, 0 not presented in ref. +``` + +利用经过20轮左右训练的声学模型,可以在Aishell的测试集上得到CER约10%的识别结果。 + + +### 欢迎贡献更多的实例 + +DeepASR目前只开放了Aishell实例,我们欢迎用户在更多的数据集上测试完整的训练流程并贡献到这个项目中。 diff --git a/fluid/DeepASR/decoder/post_decode_faster.cc b/fluid/DeepASR/decoder/post_decode_faster.cc deleted file mode 100644 index ce2b45bc6cecec5466f3d20841e5b8ba38151a6c..0000000000000000000000000000000000000000 --- a/fluid/DeepASR/decoder/post_decode_faster.cc +++ /dev/null @@ -1,145 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. */ - -#include "post_decode_faster.h" - -typedef kaldi::int32 int32; -using fst::SymbolTable; -using fst::VectorFst; -using fst::StdArc; - -Decoder::Decoder(std::string word_syms_filename, - std::string fst_in_filename, - std::string logprior_rxfilename, - kaldi::BaseFloat acoustic_scale) { - const char* usage = - "Decode, reading log-likelihoods (of transition-ids or whatever symbol " - "is on the graph) as matrices."; - - kaldi::ParseOptions po(usage); - binary = true; - this->acoustic_scale = acoustic_scale; - allow_partial = true; - kaldi::FasterDecoderOptions decoder_opts; - decoder_opts.Register(&po, true); // true == include obscure settings. - po.Register("binary", &binary, "Write output in binary mode"); - po.Register("allow-partial", - &allow_partial, - "Produce output even when final state was not reached"); - po.Register("acoustic-scale", - &acoustic_scale, - "Scaling factor for acoustic likelihoods"); - - word_syms = NULL; - if (word_syms_filename != "") { - word_syms = fst::SymbolTable::ReadText(word_syms_filename); - if (!word_syms) - KALDI_ERR << "Could not read symbol table from file " - << word_syms_filename; - } - - std::ifstream is_logprior(logprior_rxfilename); - logprior.Read(is_logprior, false); - - // It's important that we initialize decode_fst after loglikes_reader, as it - // can prevent crashes on systems installed without enough virtual memory. - // It has to do with what happens on UNIX systems if you call fork() on a - // large process: the page-table entries are duplicated, which requires a - // lot of virtual memory. 
- decode_fst = fst::ReadFstKaldi(fst_in_filename); - - decoder = new kaldi::FasterDecoder(*decode_fst, decoder_opts); -} - - -Decoder::~Decoder() { - if (!word_syms) delete word_syms; - delete decode_fst; - delete decoder; -} - -std::string Decoder::decode( - std::string key, - const std::vector>& log_probs) { - size_t num_frames = log_probs.size(); - size_t dim_label = log_probs[0].size(); - - kaldi::Matrix loglikes( - num_frames, dim_label, kaldi::kSetZero, kaldi::kStrideEqualNumCols); - for (size_t i = 0; i < num_frames; ++i) { - memcpy(loglikes.Data() + i * dim_label, - log_probs[i].data(), - sizeof(kaldi::BaseFloat) * dim_label); - } - - return decode(key, loglikes); -} - - -std::vector Decoder::decode(std::string posterior_rspecifier) { - kaldi::SequentialBaseFloatMatrixReader posterior_reader(posterior_rspecifier); - std::vector decoding_results; - - for (; !posterior_reader.Done(); posterior_reader.Next()) { - std::string key = posterior_reader.Key(); - kaldi::Matrix loglikes(posterior_reader.Value()); - - decoding_results.push_back(decode(key, loglikes)); - } - - return decoding_results; -} - - -std::string Decoder::decode(std::string key, - kaldi::Matrix& loglikes) { - std::string decoding_result; - - if (loglikes.NumRows() == 0) { - KALDI_WARN << "Zero-length utterance: " << key; - } - KALDI_ASSERT(loglikes.NumCols() == logprior.Dim()); - - loglikes.ApplyLog(); - loglikes.AddVecToRows(-1.0, logprior); - - kaldi::DecodableMatrixScaled decodable(loglikes, acoustic_scale); - decoder->Decode(&decodable); - - VectorFst decoded; // linear FST. - - if ((allow_partial || decoder->ReachedFinal()) && - decoder->GetBestPath(&decoded)) { - if (!decoder->ReachedFinal()) - KALDI_WARN << "Decoder did not reach end-state, outputting partial " - "traceback."; - - std::vector alignment; - std::vector words; - kaldi::LatticeWeight weight; - - GetLinearSymbolSequence(decoded, &alignment, &words, &weight); - - if (word_syms != NULL) { - for (size_t i = 0; i < words.size(); i++) { - std::string s = word_syms->Find(words[i]); - decoding_result += s; - if (s == "") - KALDI_ERR << "Word-id " << words[i] << " not in symbol table."; - } - } - } - - return decoding_result; -} diff --git a/fluid/DeepASR/decoder/post_decode_faster.h b/fluid/DeepASR/decoder/post_decode_faster.h deleted file mode 100644 index 8bade8d6988f02ef4caab8ecf6fc50209aa3642a..0000000000000000000000000000000000000000 --- a/fluid/DeepASR/decoder/post_decode_faster.h +++ /dev/null @@ -1,58 +0,0 @@ -/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. 
*/ - -#include -#include -#include "base/kaldi-common.h" -#include "base/timer.h" -#include "decoder/decodable-matrix.h" -#include "decoder/faster-decoder.h" -#include "fstext/fstext-lib.h" -#include "hmm/transition-model.h" -#include "lat/kaldi-lattice.h" // for {Compact}LatticeArc -#include "tree/context-dep.h" -#include "util/common-utils.h" - - -class Decoder { -public: - Decoder(std::string word_syms_filename, - std::string fst_in_filename, - std::string logprior_rxfilename, - kaldi::BaseFloat acoustic_scale); - ~Decoder(); - - // Interface to accept the scores read from specifier and return - // the batch decoding results - std::vector decode(std::string posterior_rspecifier); - - // Accept the scores of one utterance and return the decoding result - std::string decode( - std::string key, - const std::vector> &log_probs); - -private: - // For decoding one utterance - std::string decode(std::string key, - kaldi::Matrix &loglikes); - - fst::SymbolTable *word_syms; - fst::VectorFst *decode_fst; - kaldi::FasterDecoder *decoder; - kaldi::Vector logprior; - - bool binary; - kaldi::BaseFloat acoustic_scale; - bool allow_partial; -}; diff --git a/fluid/DeepASR/decoder/post_latgen_faster_mapped.cc b/fluid/DeepASR/decoder/post_latgen_faster_mapped.cc new file mode 100644 index 0000000000000000000000000000000000000000..ad8aaa84803d61bbce3d76757954e47f8585ed8b --- /dev/null +++ b/fluid/DeepASR/decoder/post_latgen_faster_mapped.cc @@ -0,0 +1,305 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
*/ + +#include "post_latgen_faster_mapped.h" +#include +#include "ThreadPool.h" + +using namespace kaldi; +typedef kaldi::int32 int32; +using fst::SymbolTable; +using fst::Fst; +using fst::StdArc; + +Decoder::Decoder(std::string trans_model_in_filename, + std::string word_syms_filename, + std::string fst_in_filename, + std::string logprior_in_filename, + size_t beam_size, + kaldi::BaseFloat acoustic_scale) { + const char *usage = + "Generate lattices using neural net model.\n" + "Usage: post-latgen-faster-mapped [options] " + " " + " [ [] " + "]\n"; + ParseOptions po(usage); + allow_partial = false; + this->acoustic_scale = acoustic_scale; + + config.Register(&po); + int32 beam = 11; + po.Register("acoustic-scale", + &acoustic_scale, + "Scaling factor for acoustic likelihoods"); + po.Register("word-symbol-table", + &word_syms_filename, + "Symbol table for words [for debug output]"); + po.Register("allow-partial", + &allow_partial, + "If true, produce output even if end state was not reached."); + + int argc = 2; + char *argv[] = {(char *)"post-latgen-faster-mapped", + (char *)("--beam=" + std::to_string(beam_size)).c_str()}; + + po.Read(argc, argv); + + std::ifstream is_logprior(logprior_in_filename); + logprior.Read(is_logprior, false); + + { + bool binary; + Input ki(trans_model_in_filename, &binary); + this->trans_model.Read(ki.Stream(), binary); + } + + this->determinize = config.determinize_lattice; + + this->word_syms = NULL; + if (word_syms_filename != "") { + if (!(word_syms = fst::SymbolTable::ReadText(word_syms_filename))) { + KALDI_ERR << "Could not read symbol table from file " + << word_syms_filename; + } + } + + // Input FST is just one FST, not a table of FSTs. + this->decode_fst = fst::ReadFstKaldiGeneric(fst_in_filename); + + kaldi::LatticeFasterDecoder *decoder = + new LatticeFasterDecoder(*decode_fst, config); + decoder_pool.emplace_back(decoder); + + std::string lattice_wspecifier = + "ark:|gzip -c > mapped_decoder_data/lat.JOB.gz"; + if (!(determinize ? 
compact_lattice_writer.Open(lattice_wspecifier) + : lattice_writer.Open(lattice_wspecifier))) + KALDI_ERR << "Could not open table for writing lattices: " + << lattice_wspecifier; + + words_writer = new Int32VectorWriter(""); + alignment_writer = new Int32VectorWriter(""); +} + +Decoder::~Decoder() { + if (!this->word_syms) delete this->word_syms; + delete this->decode_fst; + for (size_t i = 0; i < decoder_pool.size(); ++i) { + delete decoder_pool[i]; + } + delete words_writer; + delete alignment_writer; +} + + +void Decoder::decode_from_file(std::string posterior_rspecifier, + size_t num_processes) { + try { + double tot_like = 0.0; + kaldi::int64 frame_count = 0; + // int num_success = 0, num_fail = 0; + + KALDI_ASSERT(ClassifyRspecifier(fst_in_filename, NULL, NULL) == + kNoRspecifier); + SequentialBaseFloatMatrixReader posterior_reader("ark:" + + posterior_rspecifier); + + Timer timer; + timer.Reset(); + double elapsed = 0.0; + + for (size_t n = decoder_pool.size(); n < num_processes; ++n) { + kaldi::LatticeFasterDecoder *decoder = + new LatticeFasterDecoder(*decode_fst, config); + decoder_pool.emplace_back(decoder); + } + elapsed = timer.Elapsed(); + ThreadPool thread_pool(num_processes); + + while (!posterior_reader.Done()) { + timer.Reset(); + std::vector> que; + for (size_t i = 0; i < num_processes && !posterior_reader.Done(); ++i) { + std::string utt = posterior_reader.Key(); + Matrix &loglikes(posterior_reader.Value()); + que.emplace_back(thread_pool.enqueue(std::bind( + &Decoder::decode_internal, this, decoder_pool[i], utt, loglikes))); + posterior_reader.Next(); + } + timer.Reset(); + for (size_t i = 0; i < que.size(); ++i) { + std::cout << que[i].get() << std::endl; + } + } + + } catch (const std::exception &e) { + std::cerr << e.what(); + } +} + +inline kaldi::Matrix vector2kaldi_mat( + const std::vector> &log_probs) { + size_t num_frames = log_probs.size(); + size_t dim_label = log_probs[0].size(); + kaldi::Matrix loglikes( + num_frames, dim_label, kaldi::kSetZero, kaldi::kStrideEqualNumCols); + for (size_t i = 0; i < num_frames; ++i) { + memcpy(loglikes.Data() + i * dim_label, + log_probs[i].data(), + sizeof(kaldi::BaseFloat) * dim_label); + } + return loglikes; +} + +std::vector Decoder::decode_batch( + std::vector keys, + const std::vector>> + &log_probs_batch, + size_t num_processes) { + ThreadPool thread_pool(num_processes); + std::vector decoding_results; //(keys.size(), ""); + + for (size_t n = decoder_pool.size(); n < num_processes; ++n) { + kaldi::LatticeFasterDecoder *decoder = + new LatticeFasterDecoder(*decode_fst, config); + decoder_pool.emplace_back(decoder); + } + + size_t index = 0; + while (index < keys.size()) { + std::vector> res_in_que; + for (size_t t = 0; t < num_processes && index < keys.size(); ++t) { + kaldi::Matrix loglikes = + vector2kaldi_mat(log_probs_batch[index]); + res_in_que.emplace_back( + thread_pool.enqueue(std::bind(&Decoder::decode_internal, + this, + decoder_pool[t], + keys[index], + loglikes))); + index++; + } + for (size_t i = 0; i < res_in_que.size(); ++i) { + decoding_results.emplace_back(res_in_que[i].get()); + } + } + return decoding_results; +} + +std::string Decoder::decode( + std::string key, + const std::vector> &log_probs) { + kaldi::Matrix loglikes = vector2kaldi_mat(log_probs); + return decode_internal(decoder_pool[0], key, loglikes); +} + + +std::string Decoder::decode_internal( + LatticeFasterDecoder *decoder, + std::string key, + kaldi::Matrix &loglikes) { + if (loglikes.NumRows() == 0) { + KALDI_WARN << "Zero-length 
utterance: " << key; + // num_fail++; + } + KALDI_ASSERT(loglikes.NumCols() == logprior.Dim()); + + loglikes.ApplyLog(); + loglikes.AddVecToRows(-1.0, logprior); + + DecodableMatrixScaledMapped matrix_decodable( + trans_model, loglikes, acoustic_scale); + double like; + return this->DecodeUtteranceLatticeFaster( + decoder, matrix_decodable, key, &like); +} + + +std::string Decoder::DecodeUtteranceLatticeFaster( + LatticeFasterDecoder *decoder, + DecodableInterface &decodable, // not const but is really an input. + std::string utt, + double *like_ptr) { // puts utterance's like in like_ptr on success. + using fst::VectorFst; + std::string ret = utt + ' '; + + if (!decoder->Decode(&decodable)) { + KALDI_WARN << "Failed to decode file " << utt; + return ret; + } + if (!decoder->ReachedFinal()) { + if (allow_partial) { + KALDI_WARN << "Outputting partial output for utterance " << utt + << " since no final-state reached\n"; + } else { + KALDI_WARN << "Not producing output for utterance " << utt + << " since no final-state reached and " + << "--allow-partial=false.\n"; + return ret; + } + } + + double likelihood; + LatticeWeight weight; + int32 num_frames; + { // First do some stuff with word-level traceback... + VectorFst decoded; + if (!decoder->GetBestPath(&decoded)) + // Shouldn't really reach this point as already checked success. + KALDI_ERR << "Failed to get traceback for utterance " << utt; + + std::vector alignment; + std::vector words; + GetLinearSymbolSequence(decoded, &alignment, &words, &weight); + num_frames = alignment.size(); + // if (alignment_writer->IsOpen()) alignment_writer->Write(utt, alignment); + if (word_syms != NULL) { + for (size_t i = 0; i < words.size(); i++) { + std::string s = word_syms->Find(words[i]); + ret += s + ' '; + } + } + likelihood = -(weight.Value1() + weight.Value2()); + } + + // Get lattice, and do determinization if requested. + Lattice lat; + decoder->GetRawLattice(&lat); + if (lat.NumStates() == 0) + KALDI_ERR << "Unexpected problem getting lattice for utterance " << utt; + fst::Connect(&lat); + if (determinize) { + CompactLattice clat; + if (!DeterminizeLatticePhonePrunedWrapper( + trans_model, + &lat, + decoder->GetOptions().lattice_beam, + &clat, + decoder->GetOptions().det_opts)) + KALDI_WARN << "Determinization finished earlier than the beam for " + << "utterance " << utt; + // We'll write the lattice without acoustic scaling. + if (acoustic_scale != 0.0) + fst::ScaleLattice(fst::AcousticLatticeScale(1.0 / acoustic_scale), &clat); + // disable output lattice temporarily + // compact_lattice_writer.Write(utt, clat); + } else { + // We'll write the lattice without acoustic scaling. + if (acoustic_scale != 0.0) + fst::ScaleLattice(fst::AcousticLatticeScale(1.0 / acoustic_scale), &lat); + // lattice_writer.Write(utt, lat); + } + return ret; +} diff --git a/fluid/DeepASR/decoder/post_latgen_faster_mapped.h b/fluid/DeepASR/decoder/post_latgen_faster_mapped.h new file mode 100644 index 0000000000000000000000000000000000000000..9c234b8681690b9f1e3d30b61ac3b97b7055887f --- /dev/null +++ b/fluid/DeepASR/decoder/post_latgen_faster_mapped.h @@ -0,0 +1,80 @@ +/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. */ + +#include +#include +#include "base/kaldi-common.h" +#include "base/timer.h" +#include "decoder/decodable-matrix.h" +#include "decoder/decoder-wrappers.h" +#include "fstext/kaldi-fst-io.h" +#include "hmm/transition-model.h" +#include "tree/context-dep.h" +#include "util/common-utils.h" + +class Decoder { +public: + Decoder(std::string trans_model_in_filename, + std::string word_syms_filename, + std::string fst_in_filename, + std::string logprior_in_filename, + size_t beam_size, + kaldi::BaseFloat acoustic_scale); + ~Decoder(); + + // Interface to accept the scores read from specifier and print + // the decoding results directly + void decode_from_file(std::string posterior_rspecifier, + size_t num_processes = 1); + + // Accept the scores of one utterance and return the decoding result + std::string decode( + std::string key, + const std::vector> &log_probs); + + // Accept the scores of utterances in batch and return the decoding results + std::vector decode_batch( + std::vector key, + const std::vector>> + &log_probs_batch, + size_t num_processes = 1); + +private: + // For decoding one utterance + std::string decode_internal(kaldi::LatticeFasterDecoder *decoder, + std::string key, + kaldi::Matrix &loglikes); + + std::string DecodeUtteranceLatticeFaster(kaldi::LatticeFasterDecoder *decoder, + kaldi::DecodableInterface &decodable, + std::string utt, + double *like_ptr); + + fst::SymbolTable *word_syms; + fst::Fst *decode_fst; + std::vector decoder_pool; + kaldi::Vector logprior; + kaldi::TransitionModel trans_model; + kaldi::LatticeFasterDecoderConfig config; + + kaldi::CompactLatticeWriter compact_lattice_writer; + kaldi::LatticeWriter lattice_writer; + kaldi::Int32VectorWriter *words_writer; + kaldi::Int32VectorWriter *alignment_writer; + + bool binary; + bool determinize; + kaldi::BaseFloat acoustic_scale; + bool allow_partial; +}; diff --git a/fluid/DeepASR/decoder/pybind.cc b/fluid/DeepASR/decoder/pybind.cc index 90ea38ffb535677dc66d74fc64ff3fe4a27bf824..4a9b27d4cf862e5c1492875512fdeba3e95ecb15 100644 --- a/fluid/DeepASR/decoder/pybind.cc +++ b/fluid/DeepASR/decoder/pybind.cc @@ -15,25 +15,37 @@ limitations under the License. 
*/ #include #include -#include "post_decode_faster.h" +#include "post_latgen_faster_mapped.h" namespace py = pybind11; -PYBIND11_MODULE(post_decode_faster, m) { +PYBIND11_MODULE(post_latgen_faster_mapped, m) { m.doc() = "Decoder for Deep ASR model"; py::class_(m, "Decoder") - .def(py::init()) - .def("decode", - (std::vector (Decoder::*)(std::string)) & - Decoder::decode, + .def(py::init()) + .def("decode_from_file", + (void (Decoder::*)(std::string, size_t)) & Decoder::decode_from_file, "Decode for the probability matrices in specifier " - "and return the transcriptions.") + "and print the transcriptions.") .def( "decode", (std::string (Decoder::*)( std::string, const std::vector>&)) & Decoder::decode, "Decode one input probability matrix " - "and return the transcription."); + "and return the transcription.") + .def("decode_batch", + (std::vector (Decoder::*)( + std::vector, + const std::vector>>&, + size_t num_processes)) & + Decoder::decode_batch, + "Decode one batch of probability matrices " + "and return the transcriptions."); } diff --git a/fluid/DeepASR/decoder/setup.py b/fluid/DeepASR/decoder/setup.py index a98c0b4cc17717a6769b8322e4f5afe3de6ab2de..81fc857cce5b57af5bce7b34a1f4243fb853c0b6 100644 --- a/fluid/DeepASR/decoder/setup.py +++ b/fluid/DeepASR/decoder/setup.py @@ -24,7 +24,7 @@ except: "install kaldi and export KALDI_ROOT= .") args = [ - '-std=c++11', '-Wno-sign-compare', '-Wno-unused-variable', + '-std=c++11', '-fopenmp', '-Wno-sign-compare', '-Wno-unused-variable', '-Wno-unused-local-typedefs', '-Wno-unused-but-set-variable', '-Wno-deprecated-declarations', '-Wno-unused-function' ] @@ -49,11 +49,11 @@ LIB_DIRS = [os.path.abspath(path) for path in LIB_DIRS] ext_modules = [ Extension( - 'post_decode_faster', - ['pybind.cc', 'post_decode_faster.cc'], + 'post_latgen_faster_mapped', + ['pybind.cc', 'post_latgen_faster_mapped.cc'], include_dirs=[ 'pybind11/include', '.', os.path.join(kaldi_root, 'src'), - os.path.join(kaldi_root, 'tools/openfst/src/include') + os.path.join(kaldi_root, 'tools/openfst/src/include'), 'ThreadPool' ], language='c++', libraries=LIBS, @@ -63,8 +63,8 @@ ext_modules = [ ] setup( - name='post_decode_faster', - version='0.0.1', + name='post_latgen_faster_mapped', + version='0.1.0', author='Paddle', author_email='', description='Decoder for Deep ASR model', diff --git a/fluid/DeepASR/decoder/setup.sh b/fluid/DeepASR/decoder/setup.sh index 1471f85f414ae8dd5230f04cf08da282adc3b0b7..238cc64986900bae6fa0bb403d8134981212b8ea 100644 --- a/fluid/DeepASR/decoder/setup.sh +++ b/fluid/DeepASR/decoder/setup.sh @@ -4,4 +4,9 @@ if [ ! -d pybind11 ]; then git clone https://github.com/pybind/pybind11.git fi +if [ ! -d ThreadPool ]; then + git clone https://github.com/progschj/ThreadPool.git + echo -e "\n" +fi + python setup.py build_ext -i diff --git a/fluid/DeepASR/examples/aishell/download_pretrained_model.sh b/fluid/DeepASR/examples/aishell/download_pretrained_model.sh new file mode 100644 index 0000000000000000000000000000000000000000..a8813e241c4f6e40392dff6f173160d2bbd77175 --- /dev/null +++ b/fluid/DeepASR/examples/aishell/download_pretrained_model.sh @@ -0,0 +1,15 @@ +url=http://deep-asr-data.gz.bcebos.com/aishell_pretrained_model.tar.gz +md5=7b51bde64e884f43901b7a3461ccbfa3 + +wget -c $url + +echo "Checking md5 sum ..." +md5sum_tmp=`md5sum aishell_pretrained_model.tar.gz | cut -d ' ' -f1` + +if [ $md5sum_tmp != $md5 ]; then + echo "Md5sum check failed, please remove and redownload " + "aishell_pretrained_model.tar.gz." 
+ exit 1 +fi + +tar xvf aishell_pretrained_model.tar.gz diff --git a/fluid/DeepASR/examples/aishell/infer_by_ckpt.sh b/fluid/DeepASR/examples/aishell/infer_by_ckpt.sh new file mode 100644 index 0000000000000000000000000000000000000000..2d31757451849afc1412421376484d2ad41962bc --- /dev/null +++ b/fluid/DeepASR/examples/aishell/infer_by_ckpt.sh @@ -0,0 +1,18 @@ +decode_to_path=./decoding_result.txt + +export CUDA_VISIBLE_DEVICES=0,1,2,3 +python -u ../../infer_by_ckpt.py --batch_size 96 \ + --checkpoint checkpoints/deep_asr.latest.checkpoint \ + --infer_feature_lst data/test_feature.lst \ + --mean_var data/global_mean_var \ + --frame_dim 80 \ + --class_num 3040 \ + --num_threads 24 \ + --beam_size 11 \ + --decode_to_path $decode_to_path \ + --trans_model aux/final.mdl \ + --log_prior aux/logprior \ + --vocabulary aux/graph/words.txt \ + --graphs aux/graph/HCLG.fst \ + --acoustic_scale 0.059 \ + --parallel diff --git a/fluid/DeepASR/examples/aishell/prepare_data.sh b/fluid/DeepASR/examples/aishell/prepare_data.sh index 3b4a8753a7558c8fe5dc9b1045862ec3d29b2734..8bb7ac5cccb2ba72fd6351fc1e6755f5135740d8 100644 --- a/fluid/DeepASR/examples/aishell/prepare_data.sh +++ b/fluid/DeepASR/examples/aishell/prepare_data.sh @@ -1,7 +1,9 @@ data_dir=~/.cache/paddle/dataset/speech/deep_asr_data/aishell data_url='http://deep-asr-data.gz.bcebos.com/aishell_data.tar.gz' lst_url='http://deep-asr-data.gz.bcebos.com/aishell_lst.tar.gz' +aux_url='http://deep-asr-data.gz.bcebos.com/aux.tar.gz' md5=17669b8d63331c9326f4a9393d289bfb +aux_md5=50e3125eba1e3a2768a6f2e499cc1749 if [ ! -e $data_dir ]; then mkdir -p $data_dir @@ -35,3 +37,7 @@ wget -c -P data $lst_url tar xvf data/aishell_lst.tar.gz -C data ln -s $data_dir data/aishell + +echo "Download and untar aux files ..." +wget -c $aux_url +tar xvf aux.tar.gz diff --git a/fluid/DeepASR/examples/aishell/score_cer.sh b/fluid/DeepASR/examples/aishell/score_cer.sh new file mode 100644 index 0000000000000000000000000000000000000000..70dfcbad4a8427adcc1149fbab02ec674dacde0c --- /dev/null +++ b/fluid/DeepASR/examples/aishell/score_cer.sh @@ -0,0 +1,4 @@ +ref_txt=aux/test.ref.txt +hyp_txt=decoding_result.txt + +python ../../score_error_rate.py --error_rate_type cer --ref $ref_txt --hyp $hyp_txt diff --git a/fluid/DeepASR/images/learning_curve.png b/fluid/DeepASR/images/learning_curve.png new file mode 100644 index 0000000000000000000000000000000000000000..f09e8514e16fa09c8c32f3b455a5515f270df27a Binary files /dev/null and b/fluid/DeepASR/images/learning_curve.png differ diff --git a/fluid/DeepASR/images/lstmp.png b/fluid/DeepASR/images/lstmp.png new file mode 100644 index 0000000000000000000000000000000000000000..72c2fc28998b09218f5dfd9d4c4d09a773b4f503 Binary files /dev/null and b/fluid/DeepASR/images/lstmp.png differ diff --git a/fluid/DeepASR/infer_by_ckpt.py b/fluid/DeepASR/infer_by_ckpt.py index d216335c71294c0e4b54e891ab7a67e471b4f1a4..1e0fb15c6d6f05aa1e054b37333b0fa0cb5cd8d9 100644 --- a/fluid/DeepASR/infer_by_ckpt.py +++ b/fluid/DeepASR/infer_by_ckpt.py @@ -14,10 +14,9 @@ import data_utils.augmentor.trans_add_delta as trans_add_delta import data_utils.augmentor.trans_splice as trans_splice import data_utils.augmentor.trans_delay as trans_delay import data_utils.async_data_reader as reader -from decoder.post_decode_faster import Decoder -from data_utils.util import lodtensor_to_ndarray +from data_utils.util import lodtensor_to_ndarray, split_infer_result from model_utils.model import stacked_lstmp_model -from data_utils.util import split_infer_result +from 
decoder.post_latgen_faster_mapped import Decoder from tools.error_rate import char_errors @@ -28,6 +27,11 @@ def parse_args(): type=int, default=32, help='The sequence number of a batch data. (default: %(default)d)') + parser.add_argument( + '--beam_size', + type=int, + default=11, + help='The beam size for decoding. (default: %(default)d)') parser.add_argument( '--minimum_batch_size', type=int, @@ -60,10 +64,10 @@ def parse_args(): default=1749, help='Number of classes in label. (default: %(default)d)') parser.add_argument( - '--learning_rate', - type=float, - default=0.00016, - help='Learning rate used to train. (default: %(default)f)') + '--num_threads', + type=int, + default=10, + help='The number of threads for decoding. (default: %(default)d)') parser.add_argument( '--device', type=str, @@ -75,7 +79,7 @@ def parse_args(): parser.add_argument( '--mean_var', type=str, - default='data/global_mean_var_search26kHr', + default='data/global_mean_var', help="The path for feature's global mean and variance. " "(default: %(default)s)") parser.add_argument( @@ -83,35 +87,30 @@ def parse_args(): type=str, default='data/infer_feature.lst', help='The feature list path for inference. (default: %(default)s)') - parser.add_argument( - '--infer_label_lst', - type=str, - default='data/infer_label.lst', - help='The label list path for inference. (default: %(default)s)') - parser.add_argument( - '--ref_txt', - type=str, - default='data/text.test', - help='The reference text for decoding. (default: %(default)s)') parser.add_argument( '--checkpoint', type=str, default='./checkpoint', help="The checkpoint path to init model. (default: %(default)s)") + parser.add_argument( + '--trans_model', + type=str, + default='./graph/trans_model', + help="The path to vocabulary. (default: %(default)s)") parser.add_argument( '--vocabulary', type=str, - default='./decoder/graph/words.txt', + default='./graph/words.txt', help="The path to vocabulary. (default: %(default)s)") parser.add_argument( '--graphs', type=str, - default='./decoder/graph/TLG.fst', + default='./graph/TLG.fst', help="The path to TLG graphs for decoding. (default: %(default)s)") parser.add_argument( '--log_prior', type=str, - default="./decoder/logprior", + default="./logprior", help="The log prior probs for training data. (default: %(default)s)") parser.add_argument( '--acoustic_scale', @@ -119,10 +118,16 @@ def parse_args(): default=0.2, help="Scaling factor for acoustic likelihoods. (default: %(default)f)") parser.add_argument( - '--target_trans', + '--post_matrix_path', + type=str, + default=None, + help="The path to output post prob matrix. (default: %(default)s)") + parser.add_argument( + '--decode_to_path', type=str, - default="./decoder/target_trans.txt", - help="The path to target transcription. (default: %(default)s)") + default='./decoding_result.txt', + required=True, + help="The path to output the decoding result. 
(default: %(default)s)") args = parser.parse_args() return args @@ -134,16 +139,47 @@ def print_arguments(args): print('------------------------------------------------') -def get_trg_trans(args): - trans_dict = {} - with open(args.target_trans) as trg_trans: - line = trg_trans.readline() - while line: - items = line.strip().split() - key = items[0] - trans_dict[key] = ''.join(items[1:]) - line = trg_trans.readline() - return trans_dict +class PostMatrixWriter: + """ The writer for outputing the post probability matrix + """ + + def __init__(self, to_path): + self._to_path = to_path + with open(self._to_path, "w") as post_matrix: + post_matrix.seek(0) + post_matrix.truncate() + + def write(self, keys, probs): + with open(self._to_path, "a") as post_matrix: + if isinstance(keys, str): + keys, probs = [keys], [probs] + + for key, prob in zip(keys, probs): + post_matrix.write(key + " [\n") + for i in range(prob.shape[0]): + for j in range(prob.shape[1]): + post_matrix.write(str(prob[i][j]) + " ") + post_matrix.write("\n") + post_matrix.write("]\n") + + +class DecodingResultWriter: + """ The writer for writing out decoding results + """ + + def __init__(self, to_path): + self._to_path = to_path + with open(self._to_path, "w") as decoding_result: + decoding_result.seek(0) + decoding_result.truncate() + + def write(self, results): + with open(self._to_path, "a") as decoding_result: + if isinstance(results, str): + decoding_result.write(results.encode("utf8") + "\n") + else: + for result in results: + decoding_result.write(result.encode("utf8") + "\n") def infer_from_ckpt(args): @@ -162,9 +198,10 @@ def infer_from_ckpt(args): infer_program = fluid.default_main_program().clone() + # optimizer, placeholder optimizer = fluid.optimizer.Adam( learning_rate=fluid.layers.exponential_decay( - learning_rate=args.learning_rate, + learning_rate=0.0001, decay_steps=1879, decay_rate=1 / 1.2, staircase=True)) @@ -174,34 +211,38 @@ def infer_from_ckpt(args): exe = fluid.Executor(place) exe.run(fluid.default_startup_program()) - trg_trans = get_trg_trans(args) # load checkpoint. 
fluid.io.load_persistables(exe, args.checkpoint) # init decoder - decoder = Decoder(args.vocabulary, args.graphs, args.log_prior, - args.acoustic_scale) + decoder = Decoder(args.trans_model, args.vocabulary, args.graphs, + args.log_prior, args.beam_size, args.acoustic_scale) ltrans = [ trans_add_delta.TransAddDelta(2, 2), trans_mean_variance_norm.TransMeanVarianceNorm(args.mean_var), - trans_splice.TransSplice(), trans_delay.TransDelay(5) + trans_splice.TransSplice(5, 5), trans_delay.TransDelay(5) ] feature_t = fluid.LoDTensor() label_t = fluid.LoDTensor() # infer data reader - infer_data_reader = reader.AsyncDataReader(args.infer_feature_lst, - args.infer_label_lst) + infer_data_reader = reader.AsyncDataReader( + args.infer_feature_lst, drop_frame_len=-1, split_sentence_threshold=-1) infer_data_reader.set_transformers(ltrans) - infer_costs, infer_accs = [], [] - total_edit_dist, total_ref_len = 0.0, 0 + + decoding_result_writer = DecodingResultWriter(args.decode_to_path) + post_matrix_writer = None if args.post_matrix_path is None \ + else PostMatrixWriter(args.post_matrix_path) + for batch_id, batch_data in enumerate( infer_data_reader.batch_iterator(args.batch_size, args.minimum_batch_size)): # load_data (features, labels, lod, name_lst) = batch_data + features = np.reshape(features, (-1, 11, 3, args.frame_dim)) + features = np.transpose(features, (0, 2, 1, 3)) feature_t.set(features, place) feature_t.set_lod([lod]) label_t.set(labels, place) @@ -212,24 +253,17 @@ def infer_from_ckpt(args): "label": label_t}, fetch_list=[prediction, avg_cost, accuracy], return_numpy=False) - infer_costs.append(lodtensor_to_ndarray(results[1])[0]) - infer_accs.append(lodtensor_to_ndarray(results[2])[0]) probs, lod = lodtensor_to_ndarray(results[0]) infer_batch = split_infer_result(probs, lod) - for index, sample in enumerate(infer_batch): - key = name_lst[index] - ref = trg_trans[key] - hyp = decoder.decode(key, sample) - edit_dist, ref_len = char_errors(ref.decode("utf8"), hyp) - total_edit_dist += edit_dist - total_ref_len += ref_len - print(key + "|Ref:", ref) - print(key + "|Hyp:", hyp.encode("utf8")) - print("Instance CER: ", edit_dist / ref_len) - - print("Total CER = %f" % (total_edit_dist / total_ref_len)) + print("Decoding batch %d ..." % batch_id) + decoded = decoder.decode_batch(name_lst, infer_batch, args.num_threads) + + decoding_result_writer.write(decoded) + + if args.post_matrix_path is not None: + post_matrix_writer.write(name_lst, infer_batch) if __name__ == '__main__': diff --git a/fluid/DeepASR/score_error_rate.py b/fluid/DeepASR/score_error_rate.py new file mode 100644 index 0000000000000000000000000000000000000000..dde5a2448afffcae61c4d033159a5b081e6c79e8 --- /dev/null +++ b/fluid/DeepASR/score_error_rate.py @@ -0,0 +1,80 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import argparse +from tools.error_rate import char_errors, word_errors + + +def parse_args(): + parser = argparse.ArgumentParser( + "Score word/character error rate (WER/CER) " + "for decoding result.") + parser.add_argument( + '--error_rate_type', + type=str, + default='cer', + choices=['cer', 'wer'], + help="Error rate type. (default: %(default)s)") + parser.add_argument( + '--special_tokens', + type=str, + default='', + help="Special tokens in scoring CER, seperated by space. " + "They shouldn't be splitted and should be treated as one special " + "character. 
Example: ' ' " + "(default: %(default)s)") + parser.add_argument( + '--ref', type=str, required=True, help="The ground truth text.") + parser.add_argument( + '--hyp', type=str, required=True, help="The decoding result text.") + args = parser.parse_args() + return args + + +if __name__ == '__main__': + + args = parse_args() + ref_dict = {} + sum_errors, sum_ref_len = 0.0, 0 + sent_cnt, not_in_ref_cnt = 0, 0 + + special_tokens = args.special_tokens.split(" ") + + with open(args.ref, "r") as ref_txt: + line = ref_txt.readline() + while line: + del_pos = line.find(" ") + key, sent = line[0:del_pos], line[del_pos + 1:-1].strip() + ref_dict[key] = sent + line = ref_txt.readline() + + with open(args.hyp, "r") as hyp_txt: + line = hyp_txt.readline() + while line: + del_pos = line.find(" ") + key, sent = line[0:del_pos], line[del_pos + 1:-1].strip() + sent_cnt += 1 + line = hyp_txt.readline() + if key not in ref_dict: + not_in_ref_cnt += 1 + continue + + if args.error_rate_type == 'cer': + for sp_tok in special_tokens: + sent = sent.replace(sp_tok, '\0') + errors, ref_len = char_errors( + ref_dict[key].decode("utf8"), + sent.decode("utf8"), + remove_space=True) + else: + errors, ref_len = word_errors(ref_dict[key].decode("utf8"), + sent.decode("utf8")) + sum_errors += errors + sum_ref_len += ref_len + + print("Error rate[%s] = %f (%d/%d)," % + (args.error_rate_type, sum_errors / sum_ref_len, int(sum_errors), + sum_ref_len)) + print("total %d sentences in hyp, %d not presented in ref." % + (sent_cnt, not_in_ref_cnt)) diff --git a/fluid/DeepQNetwork/README.md b/fluid/DeepQNetwork/README.md index 6df88ecbf50e5d0375070c772e8b5b2340791b78..e72920bcad29ce7ffd78bfb90a1406654298248d 100644 --- a/fluid/DeepQNetwork/README.md +++ b/fluid/DeepQNetwork/README.md @@ -1,44 +1,67 @@ -# Reproduce DQN, DoubleDQN, DuelingDQN model with fluid version of PaddlePaddle +[中文版](README_cn.md) -+ DQN in: +## Reproduce DQN, DoubleDQN, DuelingDQN model with Fluid version of PaddlePaddle +Based on PaddlePaddle's next-generation API Fluid, the DQN model of deep reinforcement learning is reproduced, and the same level of indicators of the paper is reproduced in the classic Atari game. The model receives the image of the game as input, and uses the end-to-end model to directly predict the next step. The repository contains the following three types of models: ++ DQN in [Human-level Control Through Deep Reinforcement Learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) + DoubleDQN in: [Deep Reinforcement Learning with Double Q-Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389) + DuelingDQN in: [Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.html) -# Atari benchmark & performance -## [Atari games introduction](https://gym.openai.com/envs/#atari) +## Atari benchmark & performance -+ Pong game result -![DQN result](assets/dqn.png) +### Atari games introduction -# How to use -+ Dependencies: - + python2.7 - + gym - + tqdm - + paddlepaddle-gpu==0.12.0 +Please see [here](https://gym.openai.com/envs/#atari) to know more about Atari game. 
-+ Start Training: - ``` - # To train a model for Pong game with gpu (use DQN model as default) - python train.py --rom ./rom_files/pong.bin --use_cuda +### Pong game result - # To train a model for Pong with DoubleDQN - python train.py --rom ./rom_files/pong.bin --use_cuda --alg DoubleDQN +The average game rewards that can be obtained for the three models as the number of training steps changes during the training are as follows(about 3 hours/1 Million steps): - # To train a model for Pong with DuelingDQN - python train.py --rom ./rom_files/pong.bin --use_cuda --alg DuelingDQN - ``` +
+<img src="assets/dqn.png" alt="DQN result">
-To train more games, can install more rom files from [here](https://github.com/openai/atari-py/tree/master/atari_py/atari_roms) +## How to use +### Dependencies: ++ python2.7 ++ gym ++ tqdm ++ opencv-python ++ paddlepaddle-gpu>=0.12.0 ++ ale_python_interface -+ Start Testing: +### Install Dependencies: ++ Install PaddlePaddle: + recommended to compile and install PaddlePaddle from source code ++ Install other dependencies: ``` - # Play the game with saved model and calculate the average rewards - python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong/stepXXXXX - - # Play the game with visualization - python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong/stepXXXXX --viz 0.01 + pip install -r requirement.txt + pip install gym[atari] ``` + Install ale_python_interface, please see [here](https://github.com/mgbellemare/Arcade-Learning-Environment). + +### Start Training: +``` +# To train a model for Pong game with gpu (use DQN model as default) +python train.py --rom ./rom_files/pong.bin --use_cuda + +# To train a model for Pong with DoubleDQN +python train.py --rom ./rom_files/pong.bin --use_cuda --alg DoubleDQN + +# To train a model for Pong with DuelingDQN +python train.py --rom ./rom_files/pong.bin --use_cuda --alg DuelingDQN +``` + +To train more games, you can install more rom files from [here](https://github.com/openai/atari-py/tree/master/atari_py/atari_roms). + +### Start Testing: +``` +# Play the game with saved best model and calculate the average rewards +python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong + +# Play the game with visualization +python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong --viz 0.01 +``` +[Here](https://pan.baidu.com/s/1gIsbNw5V7tMeb74ojx-TMA) is saved models for Pong and Breakout games. You can use it to play the game directly. diff --git a/fluid/DeepQNetwork/README_cn.md b/fluid/DeepQNetwork/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..68a65bffe8fab79ce563fefc894dd035c1572065 --- /dev/null +++ b/fluid/DeepQNetwork/README_cn.md @@ -0,0 +1,71 @@ +## 基于PaddlePaddle的Fluid版本复现DQN, DoubleDQN, DuelingDQN三个模型 + +基于PaddlePaddle下一代API Fluid复现了深度强化学习领域的DQN模型,在经典的Atari 游戏上复现了论文同等水平的指标,模型接收游戏的图像作为输入,采用端到端的模型直接预测下一步要执行的控制信号,本仓库一共包含以下3类模型: ++ DQN模型: +[Human-level Control Through Deep Reinforcement Learning](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) ++ DoubleDQN模型: +[Deep Reinforcement Learning with Double Q-Learning](https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/viewPaper/12389) ++ DuelingDQN模型: +[Dueling Network Architectures for Deep Reinforcement Learning](http://proceedings.mlr.press/v48/wangf16.html) + +## 模型效果:Atari游戏表现 + +### Atari游戏介绍 + +请点击[这里](https://gym.openai.com/envs/#atari)了解Atari游戏。 + +### Pong游戏训练结果 +三个模型在训练过程中随着训练步数的变化,能得到的平均游戏奖励如下图所示(大概3小时每1百万步): + +
+DQN result +
+ +## 使用教程 + +### 依赖: ++ python2.7 ++ gym ++ tqdm ++ opencv-python ++ paddlepaddle-gpu>=0.12.0 ++ ale_python_interface + +### 下载依赖: + ++ 安装PaddlePaddle: + 建议通过PaddlePaddle源码进行编译安装 ++ 下载其它依赖: + ``` + pip install -r requirement.txt + pip install gym[atari] + ``` + 安装ale_python_interface可以参考[这里](https://github.com/mgbellemare/Arcade-Learning-Environment) + +### 训练模型: + +``` +# 使用GPU训练Pong游戏(默认使用DQN模型) +python train.py --rom ./rom_files/pong.bin --use_cuda + +# 训练DoubleDQN模型 +python train.py --rom ./rom_files/pong.bin --use_cuda --alg DoubleDQN + +# 训练DuelingDQN模型 +python train.py --rom ./rom_files/pong.bin --use_cuda --alg DuelingDQN +``` + +训练更多游戏,可以从[这里](https://github.com/openai/atari-py/tree/master/atari_py/atari_roms)下载游戏rom + +### 测试模型: + +``` +# Play the game with saved model and calculate the average rewards +# 使用训练过程中保存的最好模型玩游戏,以及计算平均奖励(rewards) +python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong + +# 以可视化的形式来玩游戏 +python play.py --rom ./rom_files/pong.bin --use_cuda --model_path ./saved_model/DQN-pong --viz 0.01 +``` + +[这里](https://pan.baidu.com/s/1gIsbNw5V7tMeb74ojx-TMA)是Pong和Breakout游戏训练好的模型,可以直接用来测试。 diff --git a/fluid/DeepQNetwork/play.py b/fluid/DeepQNetwork/play.py index 2920391f105aeca1e99c347174464688edb47dae..b956343f3e78543ad702461175e859d3bef2af88 100644 --- a/fluid/DeepQNetwork/play.py +++ b/fluid/DeepQNetwork/play.py @@ -11,7 +11,7 @@ from tqdm import tqdm def predict_action(exe, state, predict_program, feed_names, fetch_targets, action_dim): - if np.random.randint(100) == 0: + if np.random.random() < 0.01: act = np.random.randint(action_dim) else: state = np.expand_dims(state, axis=0) diff --git a/fluid/DeepQNetwork/requirement.txt b/fluid/DeepQNetwork/requirement.txt new file mode 100644 index 0000000000000000000000000000000000000000..be84b259f066e9a26dd207fb5e4e6f66ea9fba03 --- /dev/null +++ b/fluid/DeepQNetwork/requirement.txt @@ -0,0 +1,5 @@ +numpy +gym +tqdm +opencv-python +paddlepaddle-gpu==0.12.0 diff --git a/fluid/DeepQNetwork/train.py b/fluid/DeepQNetwork/train.py index 6e75fe77bc53df24cab2f5bebad9f59ee88a8a3e..63439be7c8da481c946b0cb0bd571637bd875105 100644 --- a/fluid/DeepQNetwork/train.py +++ b/fluid/DeepQNetwork/train.py @@ -120,6 +120,9 @@ def train_agent(): pbar = tqdm(total=1e8) recent_100_reward = [] total_step = 0 + max_reward = None + save_path = os.path.join(args.model_dirname, '{}-{}'.format( + args.alg, os.path.basename(args.rom).split('.')[0])) while True: # start epoch total_reward, step = run_train_episode(agent, env, exp) @@ -134,14 +137,11 @@ def train_agent(): print("eval_agent done, (steps, eval_reward): ({}, {})".format( total_step, eval_reward)) - if total_step // args.save_every_steps == save_flag: - save_flag += 1 - save_path = os.path.join(args.model_dirname, '{}-{}'.format( - args.alg, os.path.basename(args.rom).split('.')[0]), - 'step{}'.format(total_step)) - fluid.io.save_inference_model(save_path, ['state'], - agent.pred_value, agent.exe, - agent.predict_program) + if max_reward is None or eval_reward > max_reward: + max_reward = eval_reward + fluid.io.save_inference_model(save_path, ['state'], + agent.pred_value, agent.exe, + agent.predict_program) pbar.close() @@ -173,11 +173,6 @@ if __name__ == '__main__': type=str, default='saved_model', help='dirname to save model') - parser.add_argument( - '--save_every_steps', - type=int, - default=100000, - help='every steps number to save model') parser.add_argument( '--test_every_steps', type=int, diff --git a/fluid/README.cn.md b/fluid/README.cn.md 
new file mode 100644 index 0000000000000000000000000000000000000000..6c454aa3fa03159267a41c636b2fa0524eecadd7 --- /dev/null +++ b/fluid/README.cn.md @@ -0,0 +1,74 @@ +# models 简介 + +## 图像分类 + +图像分类是根据图像的语义信息对不同类别图像进行区分,是计算机视觉中重要的基础问题,是物体检测、图像分割、物体跟踪、行为分析、人脸识别等其他高层视觉任务的基础,在许多领域都有着广泛的应用。如:安防领域的人脸识别和智能视频分析等,交通领域的交通场景识别,互联网领域基于内容的图像检索和相册自动归类,医学领域的图像识别等。 + +在深度学习时代,图像分类的准确率大幅度提升,在图像分类任务中,我们向大家介绍了如何在经典的数据集ImageNet上,训练常用的模型,包括AlexNet、VGG、GoogLeNet、ResNet、Inception-v4、MobileNet、DPN(Dual Path Network)、SE-ResNeXt模型,也开源了[训练的模型](https://github.com/PaddlePaddle/models/blob/develop/fluid/image_classification/README_cn.md#已有模型及其性能)方便用户下载使用。同时提供了能够将Caffe模型转换为PaddlePaddle Fluid模型配置和参数文件的工具。 + +- [AlexNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [VGG](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [GoogleNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [Residual Network](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [Inception-v4](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [MobileNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [Dual Path Network](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [SE-ResNeXt](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/models) +- [Caffe模型转换为Paddle Fluid配置和模型文件工具](https://github.com/PaddlePaddle/models/tree/develop/fluid/image_classification/caffe2fluid) + +## 目标检测 + +目标检测任务的目标是给定一张图像或是一个视频帧,让计算机找出其中所有目标的位置,并给出每个目标的具体类别。对于人类来说,目标检测是一个非常简单的任务。然而,计算机能够“看到”的是图像被编码之后的数字,很难解图像或是视频帧中出现了人或是物体这样的高层语义概念,也就更加难以定位目标出现在图像中哪个区域。与此同时,由于目标会出现在图像或是视频帧中的任何位置,目标的形态千变万化,图像或是视频帧的背景千差万别,诸多因素都使得目标检测对计算机来说是一个具有挑战性的问题。 + +在目标检测任务中,我们介绍了如何基于[PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/)、[MS COCO](http://cocodataset.org/#home)数据训练通用物体检测模型,当前介绍了SSD算法,SSD全称Single Shot MultiBox Detector,是目标检测领域较新且效果较好的检测算法之一,具有检测速度快且检测精度高的特点。 + +开放环境中的检测人脸,尤其是小的、模糊的和部分遮挡的人脸也是一个具有挑战的任务。我们也介绍了如何基于[WIDER FACE](http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/)数据训练百度自研的人脸检测PyramidBox模型,该算法于2018年3月份在WIDER FACE的多项评测中均获得[第一名](http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/WiderFace_Results.html)。 + +- [Single Shot MultiBox Detector](https://github.com/PaddlePaddle/models/blob/develop/fluid/object_detection/README_cn.md) +- [Face Detector: PyramidBox](https://github.com/PaddlePaddle/models/tree/develop/fluid/face_detection/README_cn.md) + + +## 图像语义分割 + +图像语意分割顾名思义是将图像像素按照表达的语义含义的不同进行分组/分割,图像语义是指对图像内容的理解,例如,能够描绘出什么物体在哪里做了什么事情等,分割是指对图片中的每个像素点进行标注,标注属于哪一类别。近年来用在无人车驾驶技术中分割街景来避让行人和车辆、医疗影像分析中辅助诊断等。 + +在图像语义分割任务中,我们介绍如何基于图像级联网络(Image Cascade Network,ICNet)进行语义分割,相比其他分割算法,ICNet兼顾了准确率和速度。 + + +- [ICNet](https://github.com/PaddlePaddle/models/tree/develop/fluid/icnet) + + +## 场景文字识别 + +许多场景图像中包含着丰富的文本信息,对理解图像信息有着重要作用,能够极大地帮助人们认知和理解场景图像的内容。场景文字识别是在图像背景复杂、分辨率低下、字体多样、分布随意等情况下,将图像信息转化为文字序列的过程,可认为是一种特别的翻译过程:将图像输入翻译为自然语言输出。场景图像文字识别技术的发展也促进了一些新型应用的产生,如通过自动识别路牌中的文字帮助街景应用获取更加准确的地址信息等。 + +在场景文字识别任务中,我们介绍如何将基于CNN的图像特征提取和基于RNN的序列翻译技术结合,免除人工定义特征,避免字符分割,使用自动学习到的图像特征,完成端到端地无约束字符定位和识别。当前,介绍了CRNN-CTC模型,后续会引入基于注意力机制的序列到序列模型。 + +- [CRNN-CTC模型](https://github.com/PaddlePaddle/models/tree/develop/fluid/ocr_recognition) + + +## 语音识别 + + +自动语音识别(Automatic Speech Recognition, ASR)是将人类声音中的词汇内容转录成计算机可输入的文字的技术。语音识别的相关研究经历了漫长的探索过程,在HMM/GMM模型之后其发展一直较为缓慢,随着深度学习的兴起,其迎来了春天。在多种语言识别任务中,将深度神经网络(DNN)作为声学模型,取得了比GMM更好的性能,使得 
ASR 成为深度学习应用最为成功的领域之一。而由于识别准确率的不断提高,有越来越多的语言技术产品得以落地,例如语言输入法、以智能音箱为代表的智能家居设备等 —— 基于语言的交互方式正在深刻的改变人类的生活。 + +与 [DeepSpeech](https://github.com/PaddlePaddle/DeepSpeech) 中深度学习模型端到端直接预测字词的分布不同,本实例更接近传统的语言识别流程,以音素为建模单元,关注语言识别中声学模型的训练,利用[kaldi](http://www.kaldi-asr.org)进行音频数据的特征提取和标签对齐,并集成 kaldi 的解码器完成解码。 + + +- [DeepASR](https://github.com/PaddlePaddle/models/tree/develop/fluid/DeepASR) + +## 机器翻译 + +机器翻译(Machine Translation)将一种自然语言(源语言)转换成一种自然语言(目标语音),是自然语言处理中非常基础和重要的研究方向。在全球化的浪潮中,机器翻译在促进跨语言文明的交流中所起的重要作用是不言而喻的。其发展经历了统计机器翻译和基于神经网络的神经机器翻译(Nueural Machine Translation, NMT)等阶段。在 NMT 成熟后,机器翻译才真正得以大规模应用。而早阶段的 NMT 主要是基于循环神经网络 RNN 的,其训练过程中当前时间步依赖于前一个时间步的计算,时间步之间难以并行化以提高训练速度。因此,非 RNN 结构的 NMT 得以应运而生,例如基于卷积神经网络 CNN 的结构和基于自注意力机制(Self-Attention)的结构。 + +本实例所实现的 Transformer 就是一个基于自注意力机制的机器翻译模型,其中不再有RNN或CNN结构,而是完全利用 Attention 学习语言中的上下文依赖。相较于RNN/CNN, 这种结构在单层内计算复杂度更低、易于并行化、对长程依赖更易建模,最终在多种语言之间取得了最好的翻译效果。 + +- [Transformer](https://github.com/PaddlePaddle/models/tree/develop/fluid/neural_machine_translation/transformer) + +## 强化学习 + +强化学习是近年来一个愈发重要的机器学习方向,特别是与深度学习相结合而形成的深度强化学习(Deep Reinforcement Learning, DRL),取得了很多令人惊异的成就。人们所熟知的战胜人类顶级围棋职业选手的 AlphaGo 就是 DRL 应用的一个典型例子,除游戏领域外,其它的应用还包括机器人、自然语言处理等。 + +深度强化学习的开山之作是在Atari视频游戏中的成功应用, 其可直接接受视频帧这种高维输入并根据图像内容端到端地预测下一步的动作,所用到的模型被称为深度Q网络(Deep Q-Network, DQN)。本实例就是利用PaddlePaddle Fluid这个灵活的框架,实现了 DQN 及其变体,并测试了它们在 Atari 游戏中的表现。 + +- [DeepQNetwork](https://github.com/PaddlePaddle/models/tree/develop/fluid/DeepQNetwork) diff --git a/fluid/face_detection/.gitignore b/fluid/face_detection/.gitignore index eeee7d7057bcb73e738e6df94c702a9e8c5dced6..ea3e7b052591ddb7d19525a685c13971bededf6f 100644 --- a/fluid/face_detection/.gitignore +++ b/fluid/face_detection/.gitignore @@ -1,9 +1,11 @@ model/ -pretrained/ -data/ -label/ +data/WIDER_train +data/WIDER_val +data/wider_face_split +vgg_ilsvrc_16_fc_reduced* *.swp *.log log* output* -infer_results* +pred +eval_tools diff --git a/fluid/face_detection/README.md b/fluid/face_detection/README.md new file mode 120000 index 0000000000000000000000000000000000000000..4015683cfa5969297febc12e7ca1264afabbc0b5 --- /dev/null +++ b/fluid/face_detection/README.md @@ -0,0 +1 @@ +README_cn.md \ No newline at end of file diff --git a/fluid/face_detection/README_cn.md b/fluid/face_detection/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..1213a59dba4dc7b4c001deef7e2029f45c232ff0 --- /dev/null +++ b/fluid/face_detection/README_cn.md @@ -0,0 +1,173 @@ +运行本目录下的程序示例需要使用 PaddlePaddle 最新的 develop branch 版本。如果您的 PaddlePaddle 安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新 PaddlePaddle 安装版本。 + +--- + + +## Pyramidbox 人脸检测 + +## Table of Contents +- [简介](#简介) +- [数据准备](#数据准备) +- [模型训练](#模型训练) +- [模型评估](#模型评估) +- [模型发布](#模型发布) + +### 简介 + +人脸检测是经典的计算机视觉任务,非受控场景中的小脸、模糊和遮挡的人脸检测是这个方向上最有挑战的问题。[PyramidBox](https://arxiv.org/pdf/1803.07737.pdf) 是一种基于SSD的单阶段人脸检测器,它利用上下文信息解决困难人脸的检测问题。如下图所示,PyramidBox在六个尺度的特征图上进行不同层级的预测。该工作主要包括以下模块:LFPN、Pyramid Anchors、CPM、Data-anchor-sampling。具体可以参考该方法对应的论文 https://arxiv.org/pdf/1803.07737.pdf ,下面进行简要的介绍。 + +

+<img src="images/architecture_of_pyramidbox.jpg"><br/>
+Pyramidbox 人脸检测模型

+ +**LFPN**: LFPN全称Low-level Feature Pyramid Networks, 在检测任务中,LFPN可以充分结合高层次的包含更多上下文的特征和低层次的包含更多纹理的特征。高层级特征被用于检测尺寸较大的人脸,而低层级特征被用于检测尺寸较小的人脸。为了将高层级特征整合到高分辨率的低层级特征上,我们从中间层开始做自上而下的融合,构建Low-level FPN。 + +**Pyramid Anchors**: 该算法使用半监督解决方案来生成与人脸检测相关的具有语义的近似标签,提出基于anchor的语境辅助方法,它引入有监督的信息来学习较小的、模糊的和部分遮挡的人脸的语境特征。使用者可以根据标注的人脸标签,按照一定的比例扩充,得到头部的标签(上下左右各扩充1/2)和人体的标签(可自定义扩充比例)。 + +**CPM**: CPM全称Context-sensitive Predict Module, 本方法设计了一种上下文敏感结构(CPM)来提高预测网络的表达能力。 + +**Data-anchor-sampling**: 设计了一种新的采样方法,称作Data-anchor-sampling,该方法可以增加训练样本在不同尺度上的多样性。该方法改变训练样本的分布,重点关注较小的人脸。 + +Pyramidbox模型可以在以下示例图片上展示鲁棒的检测性能,该图有一千张人脸,该模型检测出其中的880张人脸。 +

+<img src="images/demo_img.jpg"><br/>
+Pyramidbox 人脸检测性能展示
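To make the Data-anchor-sampling step described above more concrete, here is a small illustrative Python sketch of the scale selection: measure a randomly picked face by v = sqrt(width * height), find the anchor interval in [16, 32, 64, 128, 256, 512] that contains it, bias the choice toward smaller anchors, and sample a target size around the chosen anchor. The function name and the exact sampling bounds are assumptions for illustration, not the implementation in reader.py or image_util.py.

```python
import math
import random

def sample_face_target_size(face_w, face_h,
                            anchors=(16, 32, 64, 128, 256, 512)):
    # Measure the selected face by v = sqrt(w * h).
    v = math.sqrt(face_w * face_h)
    # Find the anchor interval that v falls into (e.g. v = 45 -> the 32..64 bin).
    idx = 0
    while idx < len(anchors) - 1 and v > anchors[idx]:
        idx += 1
    # Bias the choice toward smaller anchors so that small faces are emphasised.
    target_anchor = anchors[random.randint(0, idx)]
    # Sample a target face size around the chosen anchor, capped near 2 * v.
    return random.uniform(target_anchor / 2.0, min(target_anchor * 2.0, 2.0 * v))
```

The image is then resized so that the selected face ends up close to this target size, which shifts the training distribution toward small faces.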

+ + + +### 数据准备 + +本教程使用 [WIDER FACE 数据集](http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/) 来进行模型的训练测试工作,官网给出了详尽的数据介绍。 + +WIDER FACE数据集包含32,203张图片,其中包含393,703个人脸,数据集的人脸在尺度、姿态、遮挡方面有较大的差异性。另外WIDER FACE数据集是基于61个场景归类的,然后针对每个场景,随机的挑选40%作为训练集,10%作为验证集,50%作为测试集。 + +首先,从官网训练集和验证集,放在`data`目录,官网提供了谷歌云和百度云下载地址,请依据情况自行下载。并下载训练集和验证集的标注信息: + +```bash +./data/download.sh +``` + +准备好数据之后,`data`目录如下: + +``` +data +|-- download.sh +|-- wider_face_split +| |-- readme.txt +| |-- wider_face_train_bbx_gt.txt +| |-- wider_face_val_bbx_gt.txt +| `-- ... +|-- WIDER_train +| `-- images +| |-- 0--Parade +| ... +| `-- 9--Press_Conference +`-- WIDER_val + `-- images + |-- 0--Parade + ... + `-- 9--Press_Conference +``` + + +### 模型训练 + +#### 下载预训练模型 + +我们提供了预训练模型,模型是基于VGGNet的主干网络,使用如下命令下载: + + +```bash +wget http://paddlemodels.bj.bcebos.com/vgg_ilsvrc_16_fc_reduced.tar.gz +tar -xf vgg_ilsvrc_16_fc_reduced.tar.gz && rm -f vgg_ilsvrc_16_fc_reduced.tar.gz +``` + +声明:该预训练模型转换自[Caffe](http://cs.unc.edu/~wliu/projects/ParseNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel)。不久,我们会发布自己预训练的模型。 + + +#### 开始训练 + + +`train.py` 是训练模块的主要执行程序,调用示例如下: + +```bash +python -u train.py --batch_size=16 --pretrained_model=vgg_ilsvrc_16_fc_reduced +``` + - 可以通过设置 `export CUDA_VISIBLE_DEVICES=0,1,2,3` 指定想要使用的GPU数量。 + - 更多的可选参数见: + ```bash + python train.py --help + ``` + +模型训练所采用的数据增强: + +**数据增强**:数据的读取行为定义在 `reader.py` 中,所有的图片都会被缩放到640x640。在训练时还会对图片进行数据增强,包括随机扰动、翻转、裁剪等,和[物体检测SSD算法](https://github.com/PaddlePaddle/models/blob/develop/fluid/object_detection/README_cn.md#%E8%AE%AD%E7%BB%83-pascal-voc-%E6%95%B0%E6%8D%AE%E9%9B%86)中数据增强类似,除此之外,增加了上面提到的Data-anchor-sampling: + + **尺度变换(Data-anchor-sampling)**:随机将图片尺度变换到一定范围的尺度,大大增强人脸的尺度变化。具体操作为根据随机选择的人脸高(height)和宽(width),得到$v=\\sqrt{width * height}$,判断$v$的值位于缩放区间$[16,32,64,128,256,512]$中的的哪一个。假设$v=45$,则选定$32 + + +
+WIDER FACE Easy/Medium/Hard set +

+ +> 目前,基于PaddlePaddle的实现过程中模型参数仍在调优,比上图更优的结果会在后续发布 diff --git a/fluid/face_detection/data/download.sh b/fluid/face_detection/data/download.sh new file mode 100755 index 0000000000000000000000000000000000000000..aa32b53dd44f4286b4a6e24fba75f098d797487f --- /dev/null +++ b/fluid/face_detection/data/download.sh @@ -0,0 +1,8 @@ +DIR="$( cd "$(dirname "$0")" ; pwd -P )" +cd "$DIR" + +echo "Downloading..." +wget http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/support/bbx_annotation/wider_face_split.zip + +echo "Extracting..." +unzip wider_face_split.zip && rm -f wider_face_split.zip diff --git a/fluid/face_detection/data_util.py b/fluid/face_detection/data_util.py new file mode 100644 index 0000000000000000000000000000000000000000..ac022593119e0008c3f7f3858303cbf5bc717650 --- /dev/null +++ b/fluid/face_detection/data_util.py @@ -0,0 +1,151 @@ +""" +This code is based on https://github.com/fchollet/keras/blob/master/keras/utils/data_utils.py +""" + +import time +import numpy as np +import threading +import multiprocessing +try: + import queue +except ImportError: + import Queue as queue + + +class GeneratorEnqueuer(object): + """ + Builds a queue out of a data generator. + + Args: + generator: a generator function which endlessly yields data + use_multiprocessing (bool): use multiprocessing if True, + otherwise use threading. + wait_time (float): time to sleep in-between calls to `put()`. + random_seed (int): Initial seed for workers, + will be incremented by one for each workers. + """ + + def __init__(self, + generator, + use_multiprocessing=False, + wait_time=0.05, + random_seed=None): + self.wait_time = wait_time + self._generator = generator + self._use_multiprocessing = use_multiprocessing + self._threads = [] + self._stop_event = None + self.queue = None + self._manager = None + self.seed = random_seed + + def start(self, workers=1, max_queue_size=10): + """ + Start worker threads which add data from the generator into the queue. + + Args: + workers (int): number of worker threads + max_queue_size (int): queue size + (when full, threads could block on `put()`) + """ + + def data_generator_task(): + """ + Data generator task. + """ + + def task(): + if (self.queue is not None and + self.queue.qsize() < max_queue_size): + generator_output = next(self._generator) + self.queue.put((generator_output)) + else: + time.sleep(self.wait_time) + + if not self._use_multiprocessing: + while not self._stop_event.is_set(): + with self.genlock: + try: + task() + except Exception: + self._stop_event.set() + break + else: + while not self._stop_event.is_set(): + try: + task() + except Exception: + self._stop_event.set() + break + + try: + if self._use_multiprocessing: + self._manager = multiprocessing.Manager() + self.queue = self._manager.Queue(maxsize=max_queue_size) + self._stop_event = multiprocessing.Event() + else: + self.genlock = threading.Lock() + self.queue = queue.Queue() + self._stop_event = threading.Event() + for _ in range(workers): + if self._use_multiprocessing: + # Reset random seed else all children processes + # share the same seed + np.random.seed(self.seed) + thread = multiprocessing.Process(target=data_generator_task) + thread.daemon = True + if self.seed is not None: + self.seed += 1 + else: + thread = threading.Thread(target=data_generator_task) + self._threads.append(thread) + thread.start() + except: + self.stop() + raise + + def is_running(self): + """ + Returns: + bool: Whether the worker theads are running. 
+ """ + return self._stop_event is not None and not self._stop_event.is_set() + + def stop(self, timeout=None): + """ + Stops running threads and wait for them to exit, if necessary. + Should be called by the same thread which called `start()`. + + Args: + timeout(int|None): maximum time to wait on `thread.join()`. + """ + if self.is_running(): + self._stop_event.set() + for thread in self._threads: + if self._use_multiprocessing: + if thread.is_alive(): + thread.terminate() + else: + thread.join(timeout) + if self._manager: + self._manager.shutdown() + + self._threads = [] + self._stop_event = None + self.queue = None + + def get(self): + """ + Creates a generator to extract data from the queue. + Skip the data if it is `None`. + + # Yields + tuple of data in the queue. + """ + while self.is_running(): + if not self.queue.empty(): + inputs = self.queue.get() + if inputs is not None: + yield inputs + else: + time.sleep(self.wait_time) diff --git a/fluid/face_detection/image_util.py b/fluid/face_detection/image_util.py index f39538285637c1a284c4058130be40d89435dcef..8f3728a90402f07665c2678a2eae3e86bb128068 100644 --- a/fluid/face_detection/image_util.py +++ b/fluid/face_detection/image_util.py @@ -131,12 +131,13 @@ def data_anchor_sampling(sampler, bbox_labels, image_width, image_height, rand_idx_size = range_size + 1 else: # np.random.randint range: [low, high) - rng_rand_size = np.random.randint(0, range_size) - rand_idx_size = rng_rand_size % range_size - - scale_choose = random.uniform(scale_array[rand_idx_size] / 2.0, - 2.0 * scale_array[rand_idx_size]) + rng_rand_size = np.random.randint(0, range_size + 1) + rand_idx_size = rng_rand_size % (range_size + 1) + min_resize_val = scale_array[rand_idx_size] / 2.0 + max_resize_val = min(2.0 * scale_array[rand_idx_size], + 2 * math.sqrt(wid * hei)) + scale_choose = random.uniform(min_resize_val, max_resize_val) sample_bbox_size = wid * resize_width / scale_choose w_off_orig = 0.0 @@ -389,9 +390,19 @@ def crop_image_sampling(img, bbox_labels, sample_bbox, image_width, roi_width = cross_width roi_height = cross_height + roi_y1 = int(roi_ymin) + roi_y2 = int(roi_ymin + roi_height) + roi_x1 = int(roi_xmin) + roi_x2 = int(roi_xmin + roi_width) + + cross_y1 = int(cross_ymin) + cross_y2 = int(cross_ymin + cross_height) + cross_x1 = int(cross_xmin) + cross_x2 = int(cross_xmin + cross_width) + sample_img = np.zeros((height, width, 3)) - sample_img[int(roi_ymin) : int(roi_ymin + roi_height), int(roi_xmin) : int(roi_xmin + roi_width)] = \ - img[int(cross_ymin) : int(cross_ymin + cross_height), int(cross_xmin) : int(cross_xmin + cross_width)] + sample_img[roi_y1 : roi_y2, roi_x1 : roi_x2] = \ + img[cross_y1 : cross_y2, cross_x1 : cross_x2] sample_img = cv2.resize( sample_img, (resize_width, resize_height), interpolation=cv2.INTER_AREA) diff --git a/fluid/face_detection/images/architecture_of_pyramidbox.jpg b/fluid/face_detection/images/architecture_of_pyramidbox.jpg new file mode 100644 index 0000000000000000000000000000000000000000..d453ce3d80478c1606d784056cea5e9d599f5120 Binary files /dev/null and b/fluid/face_detection/images/architecture_of_pyramidbox.jpg differ diff --git a/fluid/face_detection/images/demo_img.jpg b/fluid/face_detection/images/demo_img.jpg new file mode 100644 index 0000000000000000000000000000000000000000..4d950e723a01aa32e2b848333ef903bcb8779d8f Binary files /dev/null and b/fluid/face_detection/images/demo_img.jpg differ diff --git a/fluid/face_detection/images/wider_pr_cruve_int_easy_val.jpg 
b/fluid/face_detection/images/wider_pr_cruve_int_easy_val.jpg new file mode 100644 index 0000000000000000000000000000000000000000..29f902491ea35a527fb3d5822e5bc3a7c4d976cb Binary files /dev/null and b/fluid/face_detection/images/wider_pr_cruve_int_easy_val.jpg differ diff --git a/fluid/face_detection/images/wider_pr_cruve_int_hard_val.jpg b/fluid/face_detection/images/wider_pr_cruve_int_hard_val.jpg new file mode 100644 index 0000000000000000000000000000000000000000..58f941be640c130bb7cfdf013f6e61d1ca948dba Binary files /dev/null and b/fluid/face_detection/images/wider_pr_cruve_int_hard_val.jpg differ diff --git a/fluid/face_detection/images/wider_pr_cruve_int_medium_val.jpg b/fluid/face_detection/images/wider_pr_cruve_int_medium_val.jpg new file mode 100644 index 0000000000000000000000000000000000000000..3c21b78e059e8185b8ad458ebf1c0e88aa3d993e Binary files /dev/null and b/fluid/face_detection/images/wider_pr_cruve_int_medium_val.jpg differ diff --git a/fluid/face_detection/profile.py b/fluid/face_detection/profile.py new file mode 100644 index 0000000000000000000000000000000000000000..fd686ad0784abd730d41263e3982560345ca6908 --- /dev/null +++ b/fluid/face_detection/profile.py @@ -0,0 +1,190 @@ +import os +import shutil +import numpy as np +import time +import argparse +import functools + +import reader +import paddle +import paddle.fluid as fluid +import paddle.fluid.profiler as profiler +from pyramidbox import PyramidBox +from utility import add_arguments, print_arguments + +parser = argparse.ArgumentParser(description=__doc__) +add_arg = functools.partial(add_arguments, argparser=parser) + +# yapf: disable +add_arg('parallel', bool, True, "parallel") +add_arg('learning_rate', float, 0.001, "Learning rate.") +add_arg('batch_size', int, 20, "Minibatch size.") +add_arg('num_iteration', int, 10, "Epoch number.") +add_arg('skip_reader', bool, False, "Whether to skip data reader.") +add_arg('use_gpu', bool, True, "Whether use GPU.") +add_arg('use_pyramidbox', bool, True, "Whether use PyramidBox model.") +add_arg('model_save_dir', str, 'output', "The path to save model.") +add_arg('pretrained_model', str, './pretrained/', "The init model path.") +add_arg('resize_h', int, 640, "The resized image height.") +add_arg('resize_w', int, 640, "The resized image height.") +#yapf: enable + + +def train(args, config, train_file_list, optimizer_method): + learning_rate = args.learning_rate + batch_size = args.batch_size + height = args.resize_h + width = args.resize_w + use_gpu = args.use_gpu + use_pyramidbox = args.use_pyramidbox + model_save_dir = args.model_save_dir + pretrained_model = args.pretrained_model + skip_reader = args.skip_reader + num_iterations = args.num_iteration + parallel = args.parallel + + num_classes = 2 + image_shape = [3, height, width] + + devices = os.getenv("CUDA_VISIBLE_DEVICES") or "" + devices_num = len(devices.split(",")) + + fetches = [] + network = PyramidBox(image_shape, num_classes, + sub_network=use_pyramidbox) + if use_pyramidbox: + face_loss, head_loss, loss = network.train() + fetches = [face_loss, head_loss] + else: + loss = network.vgg_ssd_loss() + fetches = [loss] + + epocs = 12880 / batch_size + boundaries = [epocs * 40, epocs * 60, epocs * 80, epocs * 100] + values = [ + learning_rate, learning_rate * 0.5, learning_rate * 0.25, + learning_rate * 0.1, learning_rate * 0.01 + ] + + if optimizer_method == "momentum": + optimizer = fluid.optimizer.Momentum( + learning_rate=fluid.layers.piecewise_decay( + boundaries=boundaries, values=values), + momentum=0.9, + 
regularization=fluid.regularizer.L2Decay(0.0005), + ) + else: + optimizer = fluid.optimizer.RMSProp( + learning_rate=fluid.layers.piecewise_decay(boundaries, values), + regularization=fluid.regularizer.L2Decay(0.0005), + ) + + optimizer.minimize(loss) + fluid.memory_optimize(fluid.default_main_program()) + + place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + exe.run(fluid.default_startup_program()) + + start_pass = 0 + if pretrained_model: + if pretrained_model.isdigit(): + start_pass = int(pretrained_model) + 1 + pretrained_model = os.path.join(model_save_dir, pretrained_model) + print("Resume from %s " %(pretrained_model)) + + if not os.path.exists(pretrained_model): + raise ValueError("The pre-trained model path [%s] does not exist." % + (pretrained_model)) + def if_exist(var): + return os.path.exists(os.path.join(pretrained_model, var.name)) + fluid.io.load_vars(exe, pretrained_model, predicate=if_exist) + + if parallel: + train_exe = fluid.ParallelExecutor( + use_cuda=use_gpu, loss_name=loss.name) + + train_reader = reader.train_batch_reader(config, train_file_list, batch_size=batch_size) + + def tensor(data, place, lod=None): + t = fluid.core.LoDTensor() + t.set(data, place) + if lod: + t.set_lod(lod) + return t + + im, face_box, head_box, labels, lod = next(train_reader) + im_t = tensor(im, place) + box1 = tensor(face_box, place, [lod]) + box2 = tensor(head_box, place, [lod]) + lbl_t = tensor(labels, place, [lod]) + feed_data = {'image': im_t, 'face_box': box1, + 'head_box': box2, 'gt_label': lbl_t} + + def run(iterations, feed_data): + # global feed_data + reader_time = [] + run_time = [] + for batch_id in range(iterations): + start_time = time.time() + if not skip_reader: + im, face_box, head_box, labels, lod = next(train_reader) + im_t = tensor(im, place) + box1 = tensor(face_box, place, [lod]) + box2 = tensor(head_box, place, [lod]) + lbl_t = tensor(labels, place, [lod]) + feed_data = {'image': im_t, 'face_box': box1, + 'head_box': box2, 'gt_label': lbl_t} + end_time = time.time() + reader_time.append(end_time - start_time) + + start_time = time.time() + if parallel: + fetch_vars = train_exe.run(fetch_list=[v.name for v in fetches], + feed=feed_data) + else: + fetch_vars = exe.run(fluid.default_main_program(), + feed=feed_data, + fetch_list=fetches) + end_time = time.time() + run_time.append(end_time - start_time) + fetch_vars = [np.mean(np.array(v)) for v in fetch_vars] + if not args.use_pyramidbox: + print("Batch {0}, loss {1}".format(batch_id, fetch_vars[0])) + else: + print("Batch {0}, face loss {1}, head loss {2}".format( + batch_id, fetch_vars[0], fetch_vars[1])) + + return reader_time, run_time + + # start-up + run(2, feed_data) + + # profiling + start = time.time() + if not parallel: + with profiler.profiler('All', 'total', '/tmp/profile_file'): + reader_time, run_time = run(num_iterations, feed_data) + else: + reader_time, run_time = run(num_iterations, feed_data) + end = time.time() + total_time = end - start + print("Total time: {0}, reader time: {1} s, run time: {2} s".format( + total_time, np.sum(reader_time), np.sum(run_time))) + + +if __name__ == '__main__': + args = parser.parse_args() + print_arguments(args) + + data_dir = 'data/WIDERFACE/WIDER_train/images/' + train_file_list = 'label/train_gt_widerface.res' + + config = reader.Settings( + data_dir=data_dir, + resize_h=args.resize_h, + resize_w=args.resize_w, + apply_expand=False, + mean_value=[104., 117., 123.], + ap_version='11point') + train(args, config, 
train_file_list, optimizer_method="momentum") diff --git a/fluid/face_detection/pyramidbox.py b/fluid/face_detection/pyramidbox.py index 74641f62eff18772337849f521269fecf9cef912..ba1a99356003f3482fcaf87874bb0cabd5733762 100644 --- a/fluid/face_detection/pyramidbox.py +++ b/fluid/face_detection/pyramidbox.py @@ -52,7 +52,7 @@ def conv_block(input, groups, filters, ksizes, strides=None, with_pool=True): class PyramidBox(object): def __init__(self, data_shape, - num_classes, + num_classes=None, use_transposed_conv2d=True, is_infer=False, sub_network=False): @@ -81,10 +81,7 @@ class PyramidBox(object): if self.is_infer: return [self.image] else: - return [ - self.image, self.face_box, self.head_box, self.gt_label, - self.difficult - ] + return [self.image, self.face_box, self.head_box, self.gt_label] def _input(self): self.image = fluid.layers.data( @@ -96,8 +93,6 @@ class PyramidBox(object): name='head_box', shape=[4], dtype='float32', lod_level=1) self.gt_label = fluid.layers.data( name='gt_label', shape=[1], dtype='int32', lod_level=1) - self.difficult = fluid.layers.data( - name='gt_difficult', shape=[1], dtype='int32', lod_level=1) def _vgg(self): self.conv1, self.pool1 = conv_block(self.image, 2, [64] * 2, [3] * 2) @@ -144,7 +139,8 @@ class PyramidBox(object): stride=2, groups=ch, param_attr=w_attr, - bias_attr=False) + bias_attr=False, + use_cudnn=True) else: upsampling = fluid.layers.resize_bilinear( conv1, out_shape=up_to.shape[2:]) @@ -418,5 +414,5 @@ class PyramidBox(object): nms_threshold=0.3, nms_top_k=5000, keep_top_k=750, - score_threshold=0.05) + score_threshold=0.01) return test_program, face_nmsed_out diff --git a/fluid/face_detection/reader.py b/fluid/face_detection/reader.py index 5db54a010a266823c7f00ca1be654f70b9980244..5ac6e506f4cf2d45e3b5ee688492787a99f9264c 100644 --- a/fluid/face_detection/reader.py +++ b/fluid/face_detection/reader.py @@ -24,6 +24,7 @@ import time import copy import random import cv2 +from data_util import GeneratorEnqueuer class Settings(object): @@ -58,30 +59,25 @@ class Settings(object): self.saturation_delta = 0.5 self.brightness_prob = 0.5 # _brightness_delta is the normalized value by 256 - # self._brightness_delta = 32 self.brightness_delta = 0.125 self.scale = 0.007843 # 1 / 127.5 self.data_anchor_sampling_prob = 0.5 self.min_face_size = 8.0 -def draw_image(faces_pred, img, resize_val): - for i in range(len(faces_pred)): - draw_rotate_rectange(img, faces_pred[i], resize_val, (0, 255, 0), 3) - - -def draw_rotate_rectange(img, face, resize_val, color, thickness): - cv2.line(img, (int(face[1] * resize_val), int(face[2] * resize_val)), (int( - face[3] * resize_val), int(face[2] * resize_val)), color, thickness) - - cv2.line(img, (int(face[3] * resize_val), int(face[2] * resize_val)), (int( - face[3] * resize_val), int(face[4] * resize_val)), color, thickness) - - cv2.line(img, (int(face[1] * resize_val), int(face[2] * resize_val)), (int( - face[1] * resize_val), int(face[4] * resize_val)), color, thickness) - - cv2.line(img, (int(face[3] * resize_val), int(face[4] * resize_val)), (int( - face[1] * resize_val), int(face[4] * resize_val)), color, thickness) +def to_chw_bgr(image): + """ + Transpose image from HWC to CHW and from RBG to BGR. + Args: + image (np.array): an image with HWC and RBG layout. 
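+    Returns:
+        np.array: the same image in CHW layout with BGR channel order.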
+ """ + # HWC to CHW + if len(image.shape) == 3: + image = np.swapaxes(image, 1, 2) + image = np.swapaxes(image, 1, 0) + # RBG to BGR + image = image[[2, 1, 0], :, :] + return image def preprocess(img, bbox_labels, mode, settings, image_path): @@ -107,9 +103,6 @@ def preprocess(img, bbox_labels, mode, settings, image_path): batch_sampler, bbox_labels, img_width, img_height, scale_array, settings.resize_width, settings.resize_height) img = np.array(img) - # Debug - # img_save = Image.fromarray(img) - # img_save.save('img_orig.jpg') if len(sampled_bbox) > 0: idx = int(random.uniform(0, len(sampled_bbox))) img, sampled_labels = image_util.crop_image_sampling( @@ -118,17 +111,7 @@ def preprocess(img, bbox_labels, mode, settings, image_path): settings.min_face_size) img = img.astype('uint8') - # Debug: visualize the gt bbox - visualize_bbox = 0 - if visualize_bbox: - img_show = img - draw_image(sampled_labels, img_show, settings.resize_height) - img_show = Image.fromarray(img_show) - img_show.save('final_img_show.jpg') - img = Image.fromarray(img) - # Debug - # img.save('final_img.jpg') else: # hard-code here @@ -172,46 +155,41 @@ def preprocess(img, bbox_labels, mode, settings, image_path): tmp = sampled_labels[i][1] sampled_labels[i][1] = 1 - sampled_labels[i][3] sampled_labels[i][3] = 1 - tmp - # HWC to CHW - if len(img.shape) == 3: - img = np.swapaxes(img, 1, 2) - img = np.swapaxes(img, 1, 0) - # RBG to BGR - img = img[[2, 1, 0], :, :] + + img = to_chw_bgr(img) img = img.astype('float32') img -= settings.img_mean img = img * settings.scale return img, sampled_labels -def put_txt_in_dict(input_txt): +def load_file_list(input_txt): with open(input_txt, 'r') as f_dir: lines_input_txt = f_dir.readlines() - dict_input_txt = {} + file_dict = {} num_class = 0 for i in range(len(lines_input_txt)): - tmp_line_txt = lines_input_txt[i].strip('\n\t\r') - if '--' in tmp_line_txt: + line_txt = lines_input_txt[i].strip('\n\t\r') + if '--' in line_txt: if i != 0: num_class += 1 - dict_input_txt[num_class] = [] - dict_name = tmp_line_txt - dict_input_txt[num_class].append(tmp_line_txt) - if '--' not in tmp_line_txt: - if len(tmp_line_txt) > 6: - split_str = tmp_line_txt.split(' ') + file_dict[num_class] = [] + file_dict[num_class].append(line_txt) + if '--' not in line_txt: + if len(line_txt) > 6: + split_str = line_txt.split(' ') x1_min = float(split_str[0]) y1_min = float(split_str[1]) x2_max = float(split_str[2]) y2_max = float(split_str[3]) - tmp_line_txt = str(x1_min) + ' ' + str(y1_min) + ' ' + str( + line_txt = str(x1_min) + ' ' + str(y1_min) + ' ' + str( x2_max) + ' ' + str(y2_max) - dict_input_txt[num_class].append(tmp_line_txt) + file_dict[num_class].append(line_txt) else: - dict_input_txt[num_class].append(tmp_line_txt) + file_dict[num_class].append(line_txt) - return dict_input_txt + return file_dict def expand_bboxes(bboxes, @@ -238,68 +216,106 @@ def expand_bboxes(bboxes, return expand_boxes -def pyramidbox(settings, file_list, mode, shuffle): - - dict_input_txt = {} - dict_input_txt = put_txt_in_dict(file_list) +def train_generator(settings, file_list, batch_size, shuffle=True): + file_dict = load_file_list(file_list) + while True: + if shuffle: + random.shuffle(file_dict) + images, face_boxes, head_boxes, label_ids = [], [], [], [] + label_offs = [0] - def reader(): - if mode == 'train' and shuffle: - random.shuffle(dict_input_txt) - for index_image in range(len(dict_input_txt)): - - image_name = dict_input_txt[index_image][0] + '.jpg' + for index_image in file_dict.keys(): + image_name = 
file_dict[index_image][0] image_path = os.path.join(settings.data_dir, image_name) - im = Image.open(image_path) if im.mode == 'L': im = im.convert('RGB') im_width, im_height = im.size # layout: label | xmin | ymin | xmax | ymax - if mode == 'train': - bbox_labels = [] - for index_box in range(len(dict_input_txt[index_image])): - if index_box >= 2: - bbox_sample = [] - temp_info_box = dict_input_txt[index_image][ - index_box].split(' ') - xmin = float(temp_info_box[0]) - ymin = float(temp_info_box[1]) - w = float(temp_info_box[2]) - h = float(temp_info_box[3]) - xmax = xmin + w - ymax = ymin + h - - bbox_sample.append(1) - bbox_sample.append(float(xmin) / im_width) - bbox_sample.append(float(ymin) / im_height) - bbox_sample.append(float(xmax) / im_width) - bbox_sample.append(float(ymax) / im_height) - bbox_labels.append(bbox_sample) - - im, sample_labels = preprocess(im, bbox_labels, mode, settings, - image_path) - sample_labels = np.array(sample_labels) - if len(sample_labels) == 0: continue - im = im.astype('float32') - boxes = sample_labels[:, 1:5] - lbls = [1] * len(boxes) - difficults = [1] * len(boxes) - yield im, boxes, expand_bboxes(boxes), lbls, difficults - - if mode == 'test': - yield im, image_path + bbox_labels = [] + for index_box in range(len(file_dict[index_image])): + if index_box >= 2: + bbox_sample = [] + temp_info_box = file_dict[index_image][index_box].split(' ') + xmin = float(temp_info_box[0]) + ymin = float(temp_info_box[1]) + w = float(temp_info_box[2]) + h = float(temp_info_box[3]) + xmax = xmin + w + ymax = ymin + h + + bbox_sample.append(1) + bbox_sample.append(float(xmin) / im_width) + bbox_sample.append(float(ymin) / im_height) + bbox_sample.append(float(xmax) / im_width) + bbox_sample.append(float(ymax) / im_height) + bbox_labels.append(bbox_sample) + + im, sample_labels = preprocess(im, bbox_labels, "train", settings, + image_path) + sample_labels = np.array(sample_labels) + if len(sample_labels) == 0: continue + + im = im.astype('float32') + face_box = sample_labels[:, 1:5] + head_box = expand_bboxes(face_box) + label = [1] * len(face_box) + + images.append(im) + face_boxes.extend(face_box) + head_boxes.extend(head_box) + label_ids.extend(label) + label_offs.append(label_offs[-1] + len(face_box)) + + if len(images) == batch_size: + images = np.array(images).astype('float32') + face_boxes = np.array(face_boxes).astype('float32') + head_boxes = np.array(head_boxes).astype('float32') + label_ids = np.array(label_ids).astype('int32') + yield images, face_boxes, head_boxes, label_ids, label_offs + images, face_boxes, head_boxes = [], [], [] + label_ids, label_offs = [], [0] + + +def train_batch_reader(settings, + file_list, + batch_size, + shuffle=True, + num_workers=8): + try: + enqueuer = GeneratorEnqueuer( + train_generator(settings, file_list, batch_size, shuffle), + use_multiprocessing=False) + enqueuer.start(max_queue_size=24, workers=num_workers) + generator_output = None + while True: + while enqueuer.is_running(): + if not enqueuer.queue.empty(): + generator_output = enqueuer.queue.get() + break + else: + time.sleep(0.01) + yield generator_output + generator_output = None + finally: + if enqueuer is not None: + enqueuer.stop() - return reader +def test(settings, file_list): + file_dict = load_file_list(file_list) -def train(settings, file_list, shuffle=True): - return pyramidbox(settings, file_list, 'train', shuffle) - + def reader(): + for index_image in file_dict.keys(): + image_name = file_dict[index_image][0] + image_path = 
os.path.join(settings.data_dir, image_name) + im = Image.open(image_path) + if im.mode == 'L': + im = im.convert('RGB') + yield im, image_path -def test(settings, file_list): - return pyramidbox(settings, file_list, 'test', False) + return reader def infer(settings, image_path): @@ -312,12 +328,7 @@ def infer(settings, image_path): img = img.resize((settings.resize_width, settings.resize_height), Image.ANTIALIAS) img = np.array(img) - # HWC to CHW - if len(img.shape) == 3: - img = np.swapaxes(img, 1, 2) - img = np.swapaxes(img, 1, 0) - # RBG to BGR - img = img[[2, 1, 0], :, :] + img = to_chw_bgr(img) img = img.astype('float32') img -= settings.img_mean img = img * settings.scale diff --git a/fluid/face_detection/train.py b/fluid/face_detection/train.py index acff16ecc354ac625699596e75b2db2c8f164a95..b62ac26d0d7236421e80ed4396c6ed3d0f72c310 100644 --- a/fluid/face_detection/train.py +++ b/fluid/face_detection/train.py @@ -5,27 +5,26 @@ import time import argparse import functools -import reader -import paddle import paddle.fluid as fluid from pyramidbox import PyramidBox +import reader from utility import add_arguments, print_arguments parser = argparse.ArgumentParser(description=__doc__) add_arg = functools.partial(add_arguments, argparser=parser) # yapf: disable -add_arg('parallel', bool, True, "parallel") -add_arg('learning_rate', float, 0.001, "Learning rate.") -add_arg('batch_size', int, 12, "Minibatch size.") +add_arg('parallel', bool, True, "Whether use multi-GPU/threads or not.") +add_arg('learning_rate', float, 0.001, "The start learning rate.") +add_arg('batch_size', int, 16, "Minibatch size.") add_arg('num_passes', int, 160, "Epoch number.") add_arg('use_gpu', bool, True, "Whether use GPU.") add_arg('use_pyramidbox', bool, True, "Whether use PyramidBox model.") add_arg('model_save_dir', str, 'output', "The path to save model.") -add_arg('pretrained_model', str, './pretrained/', "The init model path.") add_arg('resize_h', int, 640, "The resized image height.") -add_arg('resize_w', int, 640, "The resized image height.") -add_arg('with_mem_opt', bool, False, "Whether to use memory optimization or not.") +add_arg('resize_w', int, 640, "The resized image width.") +add_arg('with_mem_opt', bool, True, "Whether to use memory optimization or not.") +add_arg('pretrained_model', str, './vgg_ilsvrc_16_fc_reduced/', "The init model path.") #yapf: enable @@ -58,8 +57,9 @@ def train(args, config, train_file_list, optimizer_method): loss = network.vgg_ssd_loss() fetches = [loss] - epocs = 12880 / batch_size - boundaries = [epocs * 50, epocs * 80, epocs * 120, epocs * 140] + steps_per_pass = 12880 / batch_size + boundaries = [steps_per_pass * 50, steps_per_pass * 80, + steps_per_pass * 120, steps_per_pass * 140] values = [ learning_rate, learning_rate * 0.5, learning_rate * 0.25, learning_rate * 0.1, learning_rate * 0.01 @@ -104,9 +104,7 @@ def train(args, config, train_file_list, optimizer_method): train_exe = fluid.ParallelExecutor( use_cuda=use_gpu, loss_name=loss.name) - train_reader = paddle.batch( - reader.train(config, train_file_list), batch_size=batch_size) - feeder = fluid.DataFeeder(place=place, feed_list=network.feeds()) + train_reader = reader.train_batch_reader(config, train_file_list, batch_size=batch_size) def save_model(postfix): model_path = os.path.join(model_save_dir, postfix) @@ -115,24 +113,38 @@ def train(args, config, train_file_list, optimizer_method): print 'save models to %s' % (model_path) fluid.io.save_persistables(exe, model_path) + def tensor(data, place, lod=None): 
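+        # Wrap a numpy array into a fluid LoDTensor on `place`; the optional
+        # `lod` carries sequence-length offsets so the variable-length
+        # box/label arrays of one batch can be fed as a single tensor.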
+ t = fluid.core.LoDTensor() + t.set(data, place) + if lod: + t.set_lod(lod) + return t + for pass_id in range(start_pass, num_passes): start_time = time.time() prev_start_time = start_time end_time = 0 - for batch_id, data in enumerate(train_reader()): + for batch_id in range(steps_per_pass): + im, face_box, head_box, labels, lod = next(train_reader) + im_t = tensor(im, place) + box1 = tensor(face_box, place, [lod]) + box2 = tensor(head_box, place, [lod]) + lbl_t = tensor(labels, place, [lod]) + feeding = {'image': im_t, 'face_box': box1, + 'head_box': box2, 'gt_label': lbl_t} + prev_start_time = start_time start_time = time.time() - if len(data) < 2 * devices_num: continue if args.parallel: fetch_vars = train_exe.run(fetch_list=[v.name for v in fetches], - feed=feeder.feed(data)) + feed=feeding) else: fetch_vars = exe.run(fluid.default_main_program(), - feed=feeder.feed(data), + feed=feeding, fetch_list=fetches) end_time = time.time() fetch_vars = [np.mean(np.array(v)) for v in fetch_vars] - if batch_id % 1 == 0: + if batch_id % 10 == 0: if not args.use_pyramidbox: print("Pass {0}, batch {1}, loss {2}, time {3}".format( pass_id, batch_id, fetch_vars[0], @@ -151,8 +163,8 @@ if __name__ == '__main__': args = parser.parse_args() print_arguments(args) - data_dir = 'data/WIDERFACE/WIDER_train/images/' - train_file_list = 'label/train_gt_widerface.res' + data_dir = 'data/WIDER_train/images/' + train_file_list = 'data/wider_face_split/wider_face_train_bbx_gt.txt' config = reader.Settings( data_dir=data_dir, diff --git a/fluid/face_detection/visualize.py b/fluid/face_detection/visualize.py new file mode 100644 index 0000000000000000000000000000000000000000..418ef533cf9f89dfe3526583f76f2228583e378a --- /dev/null +++ b/fluid/face_detection/visualize.py @@ -0,0 +1,54 @@ +import os +from PIL import Image +from PIL import ImageDraw + + +def draw_bbox(image, bbox): + """ + Draw one bounding box on image. + Args: + image (PIL.Image): a PIL Image object. + bbox (np.array|list|tuple): (xmin, ymin, xmax, ymax). + """ + draw = ImageDraw.Draw(image) + xmin, ymin, xmax, ymax = box + (left, right, top, bottom) = (xmin, xmax, ymin, ymax) + draw.line( + [(left, top), (left, bottom), (right, bottom), (right, top), + (left, top)], + width=4, + fill='red') + + +def draw_bboxes(image_file, bboxes, labels=None, output_dir=None): + """ + Draw bounding boxes on image. + + Args: + image_file (string): input image path. + bboxes (np.array): bounding boxes. + labels (list of string): the label names of bboxes. + output_dir (string): output directory. 
+ """ + if labels: + assert len(bboxes) == len(labels) + + image = Image.open(image_file) + draw = ImageDraw.Draw(image) + for i in range(len(bboxes)): + xmin, ymin, xmax, ymax = bboxes[i] + (left, right, top, bottom) = (xmin, xmax, ymin, ymax) + draw.line( + [(left, top), (left, bottom), (right, bottom), (right, top), + (left, top)], + width=4, + fill='red') + if labels and image.mode == 'RGB': + draw.text((left, top), labels[i], (255, 255, 0)) + + output_file = image_file.split('/')[-1] + if output_dir: + output_file = os.path.join(output_dir, output_file) + + print("The image with bbox is saved as {}".format(output_file)) + image.save(output_file) diff --git a/fluid/face_detection/infer.py b/fluid/face_detection/widerface_eval.py similarity index 53% rename from fluid/face_detection/infer.py rename to fluid/face_detection/widerface_eval.py index a9468c33c110e04c82c9845414e1d83fee0bb7a7..72be5fa64d3ae96ca5f4933bca6036c05c2c6e5b 100644 --- a/fluid/face_detection/infer.py +++ b/fluid/face_detection/widerface_eval.py @@ -4,68 +4,130 @@ import numpy as np import argparse import functools from PIL import Image -from PIL import ImageDraw -import paddle import paddle.fluid as fluid import reader from pyramidbox import PyramidBox from utility import add_arguments, print_arguments parser = argparse.ArgumentParser(description=__doc__) add_arg = functools.partial(add_arguments, argparser=parser) + # yapf: disable -add_arg('use_gpu', bool, True, "Whether use GPU.") -add_arg('use_pyramidbox', bool, True, "Whether use PyramidBox model.") -add_arg('confs_threshold', float, 0.25, "Confidence threshold to draw bbox.") -add_arg('image_path', str, '', "The data root path.") -add_arg('model_dir', str, '', "The model path.") +add_arg('use_gpu', bool, True, "Whether use GPU or not.") +add_arg('use_pyramidbox', bool, True, "Whether use PyramidBox model.") +add_arg('data_dir', str, 'data/WIDER_val/images/', "The validation dataset path.") +add_arg('model_dir', str, '', "The model path.") +add_arg('pred_dir', str, 'pred', "The path to save the evaluation results.") +add_arg('file_list', str, 'data/wider_face_split/wider_face_val_bbx_gt.txt', "The validation dataset path.") # yapf: enable -def draw_bounding_box_on_image(image_path, nms_out, confs_threshold): - image = Image.open(image_path) - draw = ImageDraw.Draw(image) - for dt in nms_out: - xmin, ymin, xmax, ymax, score = dt - if score < confs_threshold: - continue - (left, right, top, bottom) = (xmin, xmax, ymin, ymax) - draw.line( - [(left, top), (left, bottom), (right, bottom), (right, top), - (left, top)], - width=4, - fill='red') - image_name = image_path.split('/')[-1] - image_class = image_path.split('/')[-2] - print("image with bbox drawed saved as {}".format(image_name)) - image.save('./infer_results/' + image_class.encode('utf-8') + '/' + - image_name.encode('utf-8')) +def infer(args, config): + batch_size = 1 + model_dir = args.model_dir + data_dir = args.data_dir + file_list = args.file_list + pred_dir = args.pred_dir + + if not os.path.exists(model_dir): + raise ValueError("The model path [%s] does not exist." 
% (model_dir)) + + test_reader = reader.test(config, file_list) + + for image, image_path in test_reader(): + shrink, max_shrink = get_shrink(image.size[1], image.size[0]) + + det0 = detect_face(image, shrink) + det1 = flip_test(image, shrink) + [det2, det3] = multi_scale_test(image, max_shrink) + det4 = multi_scale_test_pyramid(image, max_shrink) + det = np.row_stack((det0, det1, det2, det3, det4)) + dets = bbox_vote(det) + + save_widerface_bboxes(image_path, dets, pred_dir) + print("Finish evaluation.") -def write_to_txt(image_path, f, nms_out): + +def save_widerface_bboxes(image_path, bboxes_scores, output_dir): + """ + Save predicted results, including bbox and score into text file. + Args: + image_path (string): file name. + bboxes_scores (np.array|list): the predicted bboxed and scores, layout + is (xmin, ymin, xmax, ymax, score) + output_dir (string): output directory. + """ image_name = image_path.split('/')[-1] image_class = image_path.split('/')[-2] - f.write('{:s}\n'.format( - image_class.encode('utf-8') + '/' + image_name.encode('utf-8'))) - f.write('{:d}\n'.format(nms_out.shape[0])) - for dt in nms_out: - xmin, ymin, xmax, ymax, score = dt + + image_name = image_name.encode('utf-8') + image_class = image_class.encode('utf-8') + + odir = os.path.join(output_dir, image_class) + if not os.path.exists(odir): + os.makedirs(odir) + + ofname = os.path.join(odir, '%s.txt' % (image_name[:-4])) + f = open(ofname, 'w') + f.write('{:s}\n'.format(image_class + '/' + image_name)) + f.write('{:d}\n'.format(bboxes_scores.shape[0])) + for box_score in bboxes_scores: + xmin, ymin, xmax, ymax, score = box_score f.write('{:.1f} {:.1f} {:.1f} {:.1f} {:.3f}\n'.format(xmin, ymin, ( xmax - xmin + 1), (ymax - ymin + 1), score)) - print("image infer result saved {}".format(image_name[:-4])) + f.close() + print("The predicted result is saved as {}".format(ofname)) + + +def detect_face(image, shrink): + image_shape = [3, image.size[1], image.size[0]] + if shrink != 1: + h, w = int(image_shape[1] * shrink), int(image_shape[2] * shrink) + image = image.resize((w, h), Image.ANTIALIAS) + image_shape = [3, h, w] + img = np.array(image) + img = reader.to_chw_bgr(img) + mean = [104., 117., 123.] + scale = 0.007843 + img = img.astype('float32') + img -= np.array(mean)[:, np.newaxis, np.newaxis].astype('float32') + img = img * scale + img = [img] + img = np.array(img) + + place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() + exe = fluid.Executor(place) + main_program = fluid.Program() + startup_program = fluid.Program() + + with fluid.unique_name.guard(): + with fluid.program_guard(main_program, startup_program): + network = PyramidBox( + image_shape, sub_network=args.use_pyramidbox, is_infer=True) + infer_program, nmsed_out = network.infer(main_program) + fetches = [nmsed_out] + fluid.io.load_persistables( + exe, args.model_dir, main_program=main_program) + + detection, = exe.run(infer_program, + feed={'image': img}, + fetch_list=fetches, + return_numpy=False) + detection = np.array(detection) + # layout: xmin, ymin, xmax. ymax, score + if detection.shape == (1, ): + print("No face detected") + return np.array([[0, 0, 0, 0, 0]]) + det_conf = detection[:, 1] + det_xmin = image_shape[2] * detection[:, 2] / shrink + det_ymin = image_shape[1] * detection[:, 3] / shrink + det_xmax = image_shape[2] * detection[:, 4] / shrink + det_ymax = image_shape[1] * detection[:, 5] / shrink -def get_round(x, loc): - str_x = str(x) - if '.' 
in str_x: - len_after = len(str_x.split('.')[1]) - str_before = str_x.split('.')[0] - str_after = str_x.split('.')[1] - if len_after >= 3: - str_final = str_before + '.' + str_after[0:loc] - return float(str_final) - else: - return x + det = np.column_stack((det_xmin, det_ymin, det_xmax, det_ymax, det_conf)) + return det def bbox_vote(det): @@ -86,7 +148,7 @@ def bbox_vote(det): inter = w * h o = inter / (area[0] + area[:] - inter) - # get needed merge det and delete these det + # nms merge_index = np.where(o >= 0.3)[0] det_accu = det[merge_index, :] det = np.delete(det, merge_index, 0) @@ -111,78 +173,6 @@ def bbox_vote(det): return dets -def image_preprocess(image): - img = np.array(image) - # HWC to CHW - if len(img.shape) == 3: - img = np.swapaxes(img, 1, 2) - img = np.swapaxes(img, 1, 0) - # RBG to BGR - img = img[[2, 1, 0], :, :] - img = img.astype('float32') - img -= np.array( - [104., 117., 123.])[:, np.newaxis, np.newaxis].astype('float32') - img = img * 0.007843 - img = [img] - img = np.array(img) - return img - - -def detect_face(image, shrink): - image_shape = [3, image.size[1], image.size[0]] - num_classes = 2 - place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace() - exe = fluid.Executor(place) - - if shrink != 1: - image = image.resize((int(image_shape[2] * shrink), - int(image_shape[1] * shrink)), Image.ANTIALIAS) - image_shape = [ - image_shape[0], int(image_shape[1] * shrink), - int(image_shape[2] * shrink) - ] - print "image_shape:", image_shape - img = image_preprocess(image) - - scope = fluid.core.Scope() - main_program = fluid.Program() - startup_program = fluid.Program() - - with fluid.scope_guard(scope): - with fluid.unique_name.guard(): - with fluid.program_guard(main_program, startup_program): - fetches = [] - network = PyramidBox( - image_shape, - num_classes, - sub_network=args.use_pyramidbox, - is_infer=True) - infer_program, nmsed_out = network.infer(main_program) - fetches = [nmsed_out] - fluid.io.load_persistables( - exe, args.model_dir, main_program=main_program) - - detection, = exe.run(infer_program, - feed={'image': img}, - fetch_list=fetches, - return_numpy=False) - detection = np.array(detection) - # layout: xmin, ymin, xmax. 
ymax, score - if detection.shape == (1, ): - print("No face detected") - return np.array([[0, 0, 0, 0, 0]]) - det_conf = detection[:, 1] - det_xmin = image_shape[2] * detection[:, 2] / shrink - det_ymin = image_shape[1] * detection[:, 3] / shrink - det_xmax = image_shape[2] * detection[:, 4] / shrink - det_ymax = image_shape[1] * detection[:, 5] / shrink - - det = np.column_stack((det_xmin, det_ymin, det_xmax, det_ymax, det_conf)) - keep_index = np.where(det[:, 4] >= 0)[0] - det = det[keep_index, :] - return det - - def flip_test(image, shrink): img = image.transpose(Image.FLIP_LEFT_RIGHT) det_f = detect_face(img, shrink) @@ -197,18 +187,18 @@ def flip_test(image, shrink): def multi_scale_test(image, max_shrink): - # shrink detecting and shrink only detect big face + # Shrink detecting is only used to detect big faces st = 0.5 if max_shrink >= 0.75 else 0.5 * max_shrink det_s = detect_face(image, st) index = np.where( np.maximum(det_s[:, 2] - det_s[:, 0] + 1, det_s[:, 3] - det_s[:, 1] + 1) > 30)[0] det_s = det_s[index, :] - # enlarge one times + # Enlarge one times bt = min(2, max_shrink) if max_shrink > 1 else (st + max_shrink) / 2 det_b = detect_face(image, bt) - # enlarge small image x times for small face + # Enlarge small image x times for small faces if max_shrink > 2: bt *= 2 while bt < max_shrink: @@ -216,12 +206,13 @@ def multi_scale_test(image, max_shrink): bt *= 2 det_b = np.row_stack((det_b, detect_face(image, max_shrink))) - # enlarge only detect small face + # Enlarged images are only used to detect small faces. if bt > 1: index = np.where( np.minimum(det_b[:, 2] - det_b[:, 0] + 1, det_b[:, 3] - det_b[:, 1] + 1) < 100)[0] det_b = det_b[index, :] + # Shrinked images are only used to detect big faces. else: index = np.where( np.maximum(det_b[:, 2] - det_b[:, 0] + 1, @@ -231,23 +222,24 @@ def multi_scale_test(image, max_shrink): def multi_scale_test_pyramid(image, max_shrink): - # shrink detecting and shrink only detect big face + # Use image pyramids to detect faces det_b = detect_face(image, 0.25) index = np.where( np.maximum(det_b[:, 2] - det_b[:, 0] + 1, det_b[:, 3] - det_b[:, 1] + 1) > 30)[0] det_b = det_b[index, :] - st = [0.5, 0.75, 1.25, 1.5, 1.75, 2.25] + st = [0.75, 1.25, 1.5, 1.75] for i in range(len(st)): if (st[i] <= max_shrink): det_temp = detect_face(image, st[i]) - # enlarge only detect small face + # Enlarged images are only used to detect small faces. if st[i] > 1: index = np.where( np.minimum(det_temp[:, 2] - det_temp[:, 0] + 1, det_temp[:, 3] - det_temp[:, 1] + 1) < 100)[0] det_temp = det_temp[index, :] + # Shrinked images are only used to detect big faces. else: index = np.where( np.maximum(det_temp[:, 2] - det_temp[:, 0] + 1, @@ -257,13 +249,28 @@ def multi_scale_test_pyramid(image, max_shrink): return det_b -def get_im_shrink(image_shape): - max_shrink_v1 = (0x7fffffff / 577.0 / - (image_shape[1] * image_shape[2]))**0.5 - max_shrink_v2 = ( - (678 * 1024 * 2.0 * 2.0) / (image_shape[1] * image_shape[2]))**0.5 - max_shrink = get_round(min(max_shrink_v1, max_shrink_v2), 2) - 0.3 +def get_shrink(height, width): + """ + Args: + height (int): image height. + width (int): image width. + """ + # avoid out of memory + max_shrink_v1 = (0x7fffffff / 577.0 / (height * width))**0.5 + max_shrink_v2 = ((678 * 1024 * 2.0 * 2.0) / (height * width))**0.5 + + def get_round(x, loc): + str_x = str(x) + if '.' in str_x: + str_before, str_after = str_x.split('.') + len_after = len(str_after) + if len_after >= 3: + str_final = str_before + '.' 
+ str_after[0:loc] + return float(str_final) + else: + return x + max_shrink = get_round(min(max_shrink_v1, max_shrink_v2), 2) - 0.3 if max_shrink >= 1.5 and max_shrink < 2: max_shrink = max_shrink - 0.1 elif max_shrink >= 2 and max_shrink < 3: @@ -275,60 +282,12 @@ def get_im_shrink(image_shape): elif max_shrink >= 5: max_shrink = max_shrink - 0.5 - print 'max_shrink = ', max_shrink shrink = max_shrink if max_shrink < 1 else 1 - print "shrink = ", shrink - return shrink, max_shrink -def infer(args, batch_size, data_args): - if not os.path.exists(args.model_dir): - raise ValueError("The model path [%s] does not exist." % - (args.model_dir)) - - infer_reader = paddle.batch( - reader.test(data_args, file_list), batch_size=batch_size) - - for batch_id, img in enumerate(infer_reader()): - image = img[0][0] - image_path = img[0][1] - - # image.size: [width, height] - image_shape = [3, image.size[1], image.size[0]] - - shrink, max_shrink = get_im_shrink(image_shape) - - det0 = detect_face(image, shrink) - det1 = flip_test(image, shrink) - [det2, det3] = multi_scale_test(image, max_shrink) - det4 = multi_scale_test_pyramid(image, max_shrink) - det = np.row_stack((det0, det1, det2, det3, det4)) - dets = bbox_vote(det) - - image_name = image_path.split('/')[-1] - image_class = image_path.split('/')[-2] - if not os.path.exists('./infer_results/' + image_class.encode('utf-8')): - os.makedirs('./infer_results/' + image_class.encode('utf-8')) - - f = open('./infer_results/' + image_class.encode('utf-8') + '/' + - image_name.encode('utf-8')[:-4] + '.txt', 'w') - write_to_txt(image_path, f, dets) - # draw_bounding_box_on_image(image_path, dets, args.confs_threshold) - print "Done" - - if __name__ == '__main__': args = parser.parse_args() print_arguments(args) - - data_dir = 'data/WIDERFACE/WIDER_val/images/' - file_list = 'label/val_gt_widerface.res' - - data_args = reader.Settings( - data_dir=data_dir, - mean_value=[104., 117., 123], - apply_distort=False, - apply_expand=False, - ap_version='11point') - infer(args, batch_size=1, data_args=data_args) + config = reader.Settings(data_dir=args.data_dir) + infer(args, config) diff --git a/fluid/image_classification/data/ILSVRC2012/download_imagenet2012.sh b/fluid/image_classification/data/ILSVRC2012/download_imagenet2012.sh index 947b8900bd944759437a55c20fb32bca4a1b9380..3e6e0ce6d6df0b8c5a5e7814e510eb64006ce34d 100644 --- a/fluid/image_classification/data/ILSVRC2012/download_imagenet2012.sh +++ b/fluid/image_classification/data/ILSVRC2012/download_imagenet2012.sh @@ -34,7 +34,7 @@ tar xf ${valid_tar} -C ${valid_folder} echo "Download imagenet label file: val_list.txt & train_list.txt" label_file=ImageNet_label.tgz -label_url=http://imagenet-data.bj.bcebos.com/${label_file} +label_url=http://paddle-imagenet-models.bj.bcebos.com/${label_file} wget -nd -c ${label_url} tar zxf ${label_file} diff --git a/fluid/neural_machine_translation/transformer/README_cn.md b/fluid/neural_machine_translation/transformer/README_cn.md new file mode 100644 index 0000000000000000000000000000000000000000..547b525b40abbfc3009e3948273db52ff394e535 --- /dev/null +++ b/fluid/neural_machine_translation/transformer/README_cn.md @@ -0,0 +1,163 @@ +运行本目录下的程序示例需要使用 PaddlePaddle 最新的 develop branch 版本。如果您的 PaddlePaddle 安装版本低于此要求,请按照[安装文档](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html)中的说明更新 PaddlePaddle 安装版本。 + +--- + +## Transformer + +以下是本例的简要目录结构及说明: + +```text +. 
+├── images # README 文档中的图片 +├── optim.py # learning rate scheduling 计算程序 +├── infer.py # 预测脚本 +├── model.py # 模型定义 +├── reader.py # 数据读取接口 +├── README.md # 文档 +├── train.py # 训练脚本 +└── config.py # 训练、预测以及模型参数配置 +``` + +### 简介 + +Transformer 是论文 [Attention Is All You Need](https://arxiv.org/abs/1706.03762) 中提出的用以完成机器翻译(machine translation, MT)等序列到序列(sequence to sequence, Seq2Seq)学习任务的一种全新网络结构,其完全使用注意力(Attention)机制来实现序列到序列的建模[1]。 + +相较于此前 Seq2Seq 模型中广泛使用的循环神经网络(Recurrent Neural Network, RNN),使用(Self)Attention 进行输入序列到输出序列的变换主要具有以下优势: + +- 计算复杂度小 + - 特征维度为 d 、长度为 n 的序列,在 RNN 中计算复杂度为 `O(n * d * d)` (n 个时间步,每个时间步计算 d 维的矩阵向量乘法),在 Self-Attention 中计算复杂度为 `O(n * n * d)` (n 个时间步两两计算 d 维的向量点积或其他相关度函数),n 通常要小于 d 。 +- 计算并行度高 + - RNN 中当前时间步的计算要依赖前一个时间步的计算结果;Self-Attention 中各时间步的计算只依赖输入不依赖之前时间步输出,各时间步可以完全并行。 +- 容易学习长程依赖(long-range dependencies) + - RNN 中相距为 n 的两个位置间的关联需要 n 步才能建立;Self-Attention 中任何两个位置都直接相连;路径越短信号传播越容易。 + +这些也在机器翻译任务中得到了印证,Transformer 模型在训练时间大幅减少的同时取得了 WMT'14 英德翻译任务 BLEU 值的新高。此外,Transformer 在应用于成分句法分析(Constituency Parsing)任务时也有着不俗的表现,这也说明其具有较高的通用性,容易迁移到其他应用场景中。这些都表明 Transformer 有着广阔的前景。 + +### 模型概览 + +Transformer 同样使用了 Seq2Seq 模型中典型的编码器-解码器(Encoder-Decoder)的框架结构,整体网络结构如图1所示。 + +

+
+图 1. Transformer 网络结构图 +

+ +Encoder 由若干相同的 layer 堆叠组成,每个 layer 主要由多头注意力(Multi-Head Attention)和全连接的前馈(Feed-Forward)网络这两个 sub-layer 构成。 +- Multi-Head Attention 在这里用于实现 Self-Attention,相比于简单的 Attention 机制,其将输入进行多路线性变换后分别计算 Attention 的结果,并将所有结果拼接后再次进行线性变换作为输出。参见图2,其中 Attention 使用的是点积(Dot-Product),并在点积后进行了 scale 的处理以避免因点积结果过大进入 softmax 的饱和区域。 +- Feed-Forward 网络会对序列中的每个位置进行相同的计算(Position-wise),其采用的是两次线性变换中间加以 ReLU 激活的结构。 + +此外,每个 sub-layer 后还施以 Residual Connection [2]和 Layer Normalization [3]来促进梯度传播和模型收敛。 + +

+
+图 2. Multi-Head Attention +
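+
+结合上文对点积注意力的描述(点积后除以 sqrt(d_k) 做 scale,以避免 softmax 进入饱和区域),这里给出一个最小化的 NumPy 示意实现,便于理解其计算过程。仅为草图,函数与变量名为示例假设,并非本目录 `model.py` 中的实现。
+
+```python
+import numpy as np
+
+def softmax(x, axis=-1):
+    x = x - x.max(axis=axis, keepdims=True)
+    e = np.exp(x)
+    return e / e.sum(axis=axis, keepdims=True)
+
+def scaled_dot_product_attention(q, k, v):
+    """q: [len_q, d_k], k: [len_k, d_k], v: [len_k, d_v]"""
+    d_k = q.shape[-1]
+    # 点积打分并除以 sqrt(d_k),防止 d_k 较大时 softmax 饱和
+    scores = q.dot(k.T) / np.sqrt(d_k)
+    weights = softmax(scores, axis=-1)   # [len_q, len_k]
+    return weights.dot(v)                # [len_q, d_v]
+
+# 示例:4 个查询位置对 6 个键/值位置做注意力
+np.random.seed(0)
+q, k, v = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 32)
+print(scaled_dot_product_attention(q, k, v).shape)  # (4, 32)
+```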

+ +Decoder 具有和 Encoder 类似的结构,只是相比于组成 Encoder 的 layer ,在组成 Decoder 的 layer 中还多了一个 Multi-Head Attention 的 sub-layer 来实现对 Encoder 输出的 Attention,这个 Encoder-Decoder Attention 在其他 Seq2Seq 模型中也是存在的。 + + +### 数据准备 + +我们以 [WMT'16 EN-DE 数据集](http://www.statmt.org/wmt16/translation-task.html)作为示例,同时参照论文中的设置使用 BPE(byte-pair encoding)[4]编码的数据,使用这种方式表示的数据能够更好的解决未登录词(out-of-vocabulary,OOV)的问题。用到的 BPE 数据可以参照[这里](https://github.com/google/seq2seq/blob/master/docs/data.md)进行下载,下载后解压,其中 `train.tok.clean.bpe.32000.en` 和 `train.tok.clean.bpe.32000.de` 为使用 BPE 的训练数据(平行语料,分别对应了英语和德语,经过了 tokenize 和 BPE 的处理),`newstest2013.tok.bpe.32000.en` 和 `newstest2013.tok.bpe.32000.de` 等为测试数据(`newstest2013.tok.en` 和 `newstest2013.tok.de` 等则为对应的未使用 BPE 的测试数据),`vocab.bpe.32000` 为相应的词典文件(源语言和目标语言共享该词典文件)。 + +由于本示例中的数据读取脚本 `reader.py` 使用的样本数据的格式为 `\t` 分隔的的源语言和目标语言句子对(句子中的词之间使用空格分隔), 因此需要将源语言到目标语言的平行语料库文件合并为一个文件,可以执行以下命令进行合并: +```sh +paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de +``` +此外,下载的词典文件 `vocab.bpe.32000` 中未包含表示序列开始、序列结束和未登录词的特殊符号,可以使用如下命令在词典中加入 `` 、`` 和 `` 作为这三个特殊符号。 +```sh +sed -i '1i\\n\n' vocab.bpe.32000 +``` + +对于其他自定义数据,遵循或转换为上述的数据格式即可。如果希望在自定义数据中使用 BPE 编码,可以参照[这里](https://github.com/rsennrich/subword-nmt)进行预处理。 + +### 模型训练 + +`train.py` 是模型训练脚本,可以执行以下命令进行模型训练: +```sh +python -u train.py \ + --src_vocab_fpath data/vocab.bpe.32000 \ + --trg_vocab_fpath data/vocab.bpe.32000 \ + --special_token '' '' '' \ + --train_file_pattern data/train.tok.clean.bpe.32000.en-de \ + --use_token_batch True \ + --batch_size 3200 \ + --sort_type pool \ + --pool_size 200000 \ +``` +上述命令中设置了源语言词典文件路径(`src_vocab_fpath`)、目标语言词典文件路径(`trg_vocab_fpath`)、训练数据文件(`train_file_pattern`,支持通配符)等数据相关的参数和构造 batch 方式(`use_token_batch` 指出数据按照 token 数目或者 sequence 数目组成 batch)等 reader 相关的参数。有关这些参数更详细的信息可以通过执行以下命令查看: +```sh +python train.py --help +``` + +更多模型训练相关的参数则在 `config.py` 中的 `ModelHyperParams` 和 `TrainTaskConfig` 内定义;`ModelHyperParams` 定义了 embedding 维度等模型超参数,`TrainTaskConfig` 定义了 warmup 步数等训练需要的参数。这些参数默认使用了 Transformer 论文中 base model 的配置,如需调整可以在该脚本中进行修改。另外这些参数同样可在执行训练脚本的命令行中设置,传入的配置会合并并覆盖 `config.py` 中的配置,如可以通过以下命令来训练 Transformer 论文中的 big model : + +```sh +python -u train.py \ + --src_vocab_fpath data/vocab.bpe.32000 \ + --trg_vocab_fpath data/vocab.bpe.32000 \ + --special_token '' '' '' \ + --train_file_pattern data/train.tok.clean.bpe.32000.en-de \ + --use_token_batch True \ + --batch_size 3200 \ + --sort_type pool \ + --pool_size 200000 \ + n_layer 8 \ + n_head 16 \ + d_model 1024 \ + d_inner_hid 4096 \ + dropout 0.3 +``` +有关这些参数更详细信息的还请参考 `config.py` 中的注释说明。 + +训练时默认使用所有 GPU,可以通过 `CUDA_VISIBLE_DEVICES` 环境变量来设置使用的 GPU 数目。在训练过程中,每个 epoch 结束后将保存模型到参数 `model_dir` 指定的目录,每个 iteration 将打印如下的日志到标准输出: +```txt +epoch: 0, batch: 0, sum loss: 258793.343750, avg loss: 11.069005, ppl: 64151.644531 +epoch: 0, batch: 1, sum loss: 256140.718750, avg loss: 11.059616, ppl: 63552.148438 +epoch: 0, batch: 2, sum loss: 258931.093750, avg loss: 11.064013, ppl: 63832.167969 +epoch: 0, batch: 3, sum loss: 256837.875000, avg loss: 11.058206, ppl: 63462.574219 +epoch: 0, batch: 4, sum loss: 256461.000000, avg loss: 11.053401, ppl: 63158.390625 +epoch: 0, batch: 5, sum loss: 257064.562500, avg loss: 11.019099, ppl: 61028.683594 +epoch: 0, batch: 6, sum loss: 256180.125000, avg loss: 11.008556, ppl: 60388.644531 +epoch: 0, batch: 7, sum loss: 256619.671875, avg loss: 11.007106, ppl: 60301.113281 +epoch: 0, batch: 8, sum loss: 255716.734375, avg loss: 10.966025, ppl: 57874.105469 +epoch: 0, batch: 9, sum loss: 
245157.500000, avg loss: 10.966562, ppl: 57905.187500 +``` + +### 模型预测 + +`infer.py` 是模型预测脚本,模型训练完成后可以执行以下命令对指定文件中的文本进行翻译: +```sh +python -u infer.py \ + --src_vocab_fpath data/vocab.bpe.32000 \ + --trg_vocab_fpath data/vocab.bpe.32000 \ + --special_token '' '' '' \ + --test_file_pattern data/newstest2013.tok.bpe.32000.en-de \ + --batch_size 4 \ + model_path trained_models/pass_20.infer.model \ + beam_size 5 + max_out_len 256 +``` +和模型训练时类似,预测时也需要设置数据和 reader 相关的参数,并可以执行 `python infer.py --help` 查看这些参数的说明(部分参数意义和训练时略有不同);同样可以在预测命令中设置模型超参数,但应与模型训练时的设置一致;此外相比于模型训练,预测时还有一些额外的参数,如需要设置 `model_path` 来给出模型所在目录,可以设置 `beam_size` 和 `max_out_len` 来指定 Beam Search 算法的搜索宽度和最大深度(翻译长度),这些参数也可以在 `config.py` 中的 `InferTaskConfig` 内查阅注释说明并进行更改设置。 + +执行以上预测命令会打印翻译结果到标准输出,每行输出是对应行输入的得分最高的翻译。需要注意,对于使用 BPE 的数据,预测出的翻译结果也将是 BPE 表示的数据,要恢复成原始的数据(这里指 tokenize 后的数据)才能进行正确的评估,可以使用以下命令来恢复 `predict.txt` 内的翻译结果到 `predict.tok.txt` 中。 + +```sh +sed 's/@@ //g' predict.txt > predict.tok.txt +``` + +接下来就可以使用参考翻译(这里使用的是 `newstest2013.tok.de`)对翻译结果进行 BLEU 指标的评估了。计算 BLEU 值的一个较为广泛使用的脚本可以从[这里](https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl)获取,获取后执行如下命令: +```sh +perl multi-bleu.perl data/newstest2013.tok.de < predict.tok.txt +``` +可以看到类似如下的结果。 +``` +BLEU = 25.08, 58.3/31.5/19.6/12.6 (BP=0.966, ratio=0.967, hyp_len=61321, ref_len=63412) +``` + +### 参考文献 + +1. Vaswani A, Shazeer N, Parmar N, et al. [Attention is all you need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf)[C]//Advances in Neural Information Processing Systems. 2017: 6000-6010. +2. He K, Zhang X, Ren S, et al. [Deep residual learning for image recognition](http://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. +3. Ba J L, Kiros J R, Hinton G E. [Layer normalization](https://arxiv.org/pdf/1607.06450.pdf)[J]. arXiv preprint arXiv:1607.06450, 2016. +4. Sennrich R, Haddow B, Birch A. [Neural machine translation of rare words with subword units](https://arxiv.org/pdf/1508.07909)[J]. arXiv preprint arXiv:1508.07909, 2015. diff --git a/fluid/neural_machine_translation/transformer/config.py b/fluid/neural_machine_translation/transformer/config.py index a4e588c620f21c4f38eb1906f55d68ddf93214b6..e68ab17e69eff890cb8e6b028ead5e6163213761 100644 --- a/fluid/neural_machine_translation/transformer/config.py +++ b/fluid/neural_machine_translation/transformer/config.py @@ -62,10 +62,8 @@ class ModelHyperParams(object): eos_idx = 1 # index for token unk_idx = 2 - # max length of sequences. - # The size of position encoding table should at least plus 1, since the - # sinusoid position encoding starts from 1 and 0 can be used as the padding - # token for position encoding. + # max length of sequences deciding the size of position encoding table. + # Start from 1 and count start and end tokens in. 
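+    # Note: the readers in train.py/infer.py pass max_length - 2 so that the
+    # start and end tokens still fit within this limit.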
max_length = 256 # the dimension for word embeddings, which is also the last dimension of # the input and output of multi-head attention, position-wise feed-forward diff --git a/fluid/neural_machine_translation/transformer/images/multi_head_attention.png b/fluid/neural_machine_translation/transformer/images/multi_head_attention.png new file mode 100644 index 0000000000000000000000000000000000000000..427fb6b32aaeb7013066a167aab4fb97c024c2d6 Binary files /dev/null and b/fluid/neural_machine_translation/transformer/images/multi_head_attention.png differ diff --git a/fluid/neural_machine_translation/transformer/images/transformer_network.png b/fluid/neural_machine_translation/transformer/images/transformer_network.png new file mode 100644 index 0000000000000000000000000000000000000000..34be0e5c7e2b08f858683d86353db5e81049c7ca Binary files /dev/null and b/fluid/neural_machine_translation/transformer/images/transformer_network.png differ diff --git a/fluid/neural_machine_translation/transformer/infer.py b/fluid/neural_machine_translation/transformer/infer.py index 874028081cca218ae16559af9ea9b05d3494c977..505bf0b0062bda27a0299ed7d844e2f05abd95b8 100644 --- a/fluid/neural_machine_translation/transformer/infer.py +++ b/fluid/neural_machine_translation/transformer/infer.py @@ -543,7 +543,8 @@ def infer(args, inferencer=fast_infer): start_mark=args.special_token[0], end_mark=args.special_token[1], unk_mark=args.special_token[2], - max_length=ModelHyperParams.max_length, + # count start and end tokens out + max_length=ModelHyperParams.max_length - 2, clip_last_batch=False) trg_idx2word = test_data.load_dict( dict_path=args.trg_vocab_fpath, reverse=True) diff --git a/fluid/neural_machine_translation/transformer/train.py b/fluid/neural_machine_translation/transformer/train.py index 3f0c216d6b2d4846d07525695eceb01252baeb96..0456e1a740d2c5d33aeedf3e7c4c5133259762ec 100644 --- a/fluid/neural_machine_translation/transformer/train.py +++ b/fluid/neural_machine_translation/transformer/train.py @@ -340,9 +340,13 @@ def train(args): start_mark=args.special_token[0], end_mark=args.special_token[1], unk_mark=args.special_token[2], + # count start and end tokens out + max_length=ModelHyperParams.max_length - 2, clip_last_batch=False) + train_data = read_multiple( + reader=train_data.batch_generator, + count=dev_count if args.use_token_batch else 1) - train_data = read_multiple(reader=train_data.batch_generator) build_strategy = fluid.BuildStrategy() # Since the token number differs among devices, customize gradient scale to # use token average cost among multi-devices. 
and the gradient scale is @@ -372,6 +376,8 @@ def train(args): start_mark=args.special_token[0], end_mark=args.special_token[1], unk_mark=args.special_token[2], + # count start and end tokens out + max_length=ModelHyperParams.max_length - 2, clip_last_batch=False, shuffle=False, shuffle_batch=False) diff --git a/fluid/object_detection/README_cn.md b/fluid/object_detection/README_cn.md index a5769eccd4a9ae6b3cab1b5788cff193e6302130..6595c05460128223296f8fdd1cddbc482812616f 100644 --- a/fluid/object_detection/README_cn.md +++ b/fluid/object_detection/README_cn.md @@ -60,7 +60,7 @@ cd data/coco ./pretrained/download_imagenet.sh ``` -#### 训练 PASCAL VOC 数据集 +#### 训练 `train.py` 是训练模块的主要执行程序,调用示例如下: ```bash diff --git a/fluid/object_detection/data/coco/download.sh b/fluid/object_detection/data/coco/download.sh index cd6e18c8e5be1690d12b31600450034a222e17b8..50bc8a6894463549a2b18197704450621e969c9d 100644 --- a/fluid/object_detection/data/coco/download.sh +++ b/fluid/object_detection/data/coco/download.sh @@ -10,7 +10,7 @@ wget http://images.cocodataset.org/zips/val2017.zip wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip # Extract the data. -echo "Extractint..." +echo "Extracting..." unzip train2014.tar unzip val2014.tar unzip train2017.tar diff --git a/fluid/object_detection/data/pascalvoc/download.sh b/fluid/object_detection/data/pascalvoc/download.sh index 55bbb0e5a43f937ee478c9502444b22c493890ae..e16073915c98815c1a23e8aded67ab2db4cfba10 100755 --- a/fluid/object_detection/data/pascalvoc/download.sh +++ b/fluid/object_detection/data/pascalvoc/download.sh @@ -7,7 +7,7 @@ wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar # Extract the data. -echo "Extractint..." +echo "Extracting..." tar -xf VOCtrainval_11-May-2012.tar tar -xf VOCtrainval_06-Nov-2007.tar tar -xf VOCtest_06-Nov-2007.tar