提交 202a06a2 编写于 作者: Y Yibing Liu

Merge branch 'develop' of https://github.com/PaddlePaddle/models into ctc_decoder_deploy

group: deprecated-2017Q2
language: cpp
cache: ccache
sudo: required
......
......@@ -10,6 +10,7 @@ unittest(){
cd $1 > /dev/null
if [ -f "setup.sh" ]; then
sh setup.sh
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
fi
if [ $? != 0 ]; then
exit 1
......
......@@ -47,29 +47,37 @@ PaddlePaddle provides a rich set of computational units that help users build models in a modular way
- 5.1 [Learning to rank with Pairwise and Listwise approaches](https://github.com/PaddlePaddle/models/tree/develop/ltr)
## 6. Deep Structured Semantic Model
The Deep Structured Semantic Model (DSSM) uses a DNN to learn low-dimensional vector representations of text in a continuous semantic space, and ultimately models the semantic similarity between two sentences.
This example demonstrates how to use PaddlePaddle to implement a generic DSSM that models the semantic similarity between two strings.
The model supports different network structures such as CNN (convolutional network), FC (fully connected network) and RNN (recurrent neural network), as well as different loss functions for classification, regression and ranking. It uses a fairly generic data format, so users can apply it to real-world scenarios simply by replacing the data.
- 6.1 [Deep Structured Semantic Model](https://github.com/PaddlePaddle/models/tree/develop/dssm)
## 7. Sequence Tagging
Given an input sequence, a sequence tagging model assigns a class label to every element of the sequence; this is one of the most fundamental tasks in natural language processing. With the development of deep learning, using a recurrent neural network to learn a feature representation of the input sequence and a Conditional Random Field (CRF) to perform the tagging on top of those features has gradually become the standard solution to sequence tagging problems.
In the sequence tagging example, we take the Named Entity Recognition (NER) task as an example and show how to train an end-to-end sequence tagging model.
- 7.1 [Named Entity Recognition](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
## 8. Sequence-to-Sequence Learning
Sequence-to-sequence learning maps between two or even more variable-length sequences and has a wide range of applications, including machine translation, dialogue and question answering, ad copy generation, auto-encoding (e.g. encoding financial profiles), and judging the semantic relevance between multiple text strings.
In the sequence-to-sequence example, we take machine translation as the task and provide several improved models, including: a sequence-to-sequence model without an attention mechanism, which is the basis of all sequence-to-sequence models; scheduled sampling, which alleviates the error accumulation of RNN models in generation tasks; and neural machine translation with an external memory, which enhances the network's memory capacity to handle complex sequence-to-sequence learning tasks.
- 8.1 [Encoder-decoder model without attention](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
## 9. Image Classification
Compared with text, images convey information that is more vivid, easier to understand and more artistic, and they are an important medium for people to exchange information. In the image classification example, we show how to train AlexNet, VGG, GoogLeNet and ResNet models with PaddlePaddle, and also provide a conversion tool that converts Caffe-trained model files into PaddlePaddle model files.
- 9.1 [Convert Caffe model files to PaddlePaddle model files](https://github.com/PaddlePaddle/models/tree/develop/image_classification/caffe2paddle)
- 9.2 [AlexNet](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
- 9.3 [VGG](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
- 9.4 [Residual Network](https://github.com/PaddlePaddle/models/tree/develop/image_classification)
## Copyright and License
......
# Click-Through Rate Prediction
The files in this example's directory and what they contain:
```
├── README.md                 # this tutorial (markdown)
├── dataset.md                # dataset preprocessing tutorial
├── images                    # images used in this tutorial
│   ├── lr_vs_dnn.jpg
│   └── wide_deep.png
├── infer.py                  # inference script
├── network_conf.py           # model network configuration
├── reader.py                 # data reader
├── train.py                  # training script
├── utils.py                  # helper functions
└── avazu_data_processer.py   # demo data preprocessing script
```
## Background
CTR (Click-Through Rate) prediction \[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
estimates the probability that a user clicks on a specific link. It is an important step in ad serving, and accurate CTR prediction is essential for maximizing the revenue of an online advertising system.
When there are multiple ad slots, the CTR estimate is generally used as the basis for ranking. For example, in a search engine's advertising system, when a user issues a query with commercial value, the system roughly performs the following steps to display ads:
1. Retrieve the set of ads relevant to the user's query
2. Filter by business rules and relevance
3. Rank by the auction mechanism and CTR
4. Display the ads
......@@ -36,13 +51,11 @@ Figure 1. Comparison of the LR and DNN model structures
</p>
The parts of LR marked by blue arrows map directly onto corresponding structures in the DNN, so LR and DNN share some common elements (e.g. weighted summation);
however, at the same input dimensionality the complexity of the LR model can be much lower than that of the DNN (roughly speaking, the more complex a model is, the more potential it has to learn complex patterns).
For LR to match the learning capacity of a DNN, its input dimensionality, i.e. the number of features, has to grow,
which is why LR is always coupled with large-scale feature engineering.
The advantage of LR over DNN models is its capacity for very large-scale sparse features, in terms of both memory and computation, for which industry has very mature optimizations;
DNN models, on the other hand, can learn new features on their own, which improves feature efficiency to some extent
and makes it more likely for a DNN to achieve better results with the same set of features.
......@@ -59,10 +72,62 @@ The advantage of LR over DNN models is its capacity for large-scale sparse features, including
We directly use the first approach for the classification task.
We use the dataset of the Kaggle `Click-through rate prediction` challenge \[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the models in this example.
See [data process](./dataset.md) for the detailed feature preprocessing.
The input format of the demo model in this tutorial is as follows:
```
# <dnn input ids> \t <lr input sparse values> \t click
1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
23 231 \t 1230:0.12 13421:0.9 \t 1
```
A detailed description of the format:
- `dnn input ids`: a one-hot representation; only the IDs whose value is 1 are listed (note that this is not variable-length input)
- `lr input sparse values`: an `ID:VALUE` representation; the values are best normalized to the range `[-1, 1]`
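As a quick illustration (not part of the project code), one demo line could be parsed in Python like this:

```python
# Minimal sketch, assuming the tab-separated demo format described above.
line = "1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"
dnn_part, lr_part, click = line.split("\t")
dnn_input = [int(x) for x in dnn_part.split()]          # ids of the positions that are 1
lr_input = [(int(i), float(v)) for i, v in
            (kv.split(":") for kv in lr_part.split())]  # (id, value) pairs
label = int(click)                                      # 0 or 1
```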
In addition, training requires a meta file that describes the input dimensions of the dnn and lr submodels, in the following format:
```
dnn_input_dim: <int>
lr_input_dim: <int>
```
Here `<int>` denotes an integer value.
`avazu_data_processer.py` in this directory preprocesses the downloaded demo dataset \[[2](#references)\]; its usage is as follows:
```
usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
OUTPUT_DIR
[--num_lines_to_detect NUM_LINES_TO_DETECT]
[--test_set_size TEST_SET_SIZE]
[--train_size TRAIN_SIZE]
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--data_path DATA_PATH
path of the Avazu dataset
--output_dir OUTPUT_DIR
directory to output
--num_lines_to_detect NUM_LINES_TO_DETECT
number of records to detect dataset's meta info
--test_set_size TEST_SET_SIZE
size of the validation dataset(default: 10000)
--train_size TRAIN_SIZE
size of the trainset (default: 100000)
```
- `data_path`: path of the raw data to process
- `output_dir`: output directory for the generated data
- `num_lines_to_detect`: number of lines to pre-scan when building the ID dictionaries (i.e. the number of file lines scanned)
- `test_set_size`: number of lines in the generated test set
- `train_size`: number of lines in the generated training set
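After the script runs, `output_dir` contains the files it generates:

```
output/
├── train.txt       # training set in the demo format above
├── test.txt        # test set
├── infer.txt       # inference set (no click label)
└── data.meta.txt   # meta file with dnn_input_dim / lr_input_dim
```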
## Wide & Deep Learning Model
......@@ -201,18 +266,20 @@ trainer.train(
## Running training and testing
Training the model takes the following steps:
1. Prepare the training data
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Unpack train.gz to get train.txt
    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
In step 2 above, command-line arguments can be passed to `train.py` to customize training; the arguments and their usage are as follows:
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
[--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
[--num_passes NUM_PASSES]
[--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
DATA_META_FILE --model_type MODEL_TYPE
PaddlePaddle CTR example
......@@ -220,16 +287,78 @@ optional arguments:
-h, --help show this help message and exit
--train_data_path TRAIN_DATA_PATH
path of training dataset
--test_data_path TEST_DATA_PATH
path of testing dataset
--batch_size BATCH_SIZE
size of mini-batch (default:10000)
--num_passes NUM_PASSES
number of passes to train
--model_output_prefix MODEL_OUTPUT_PREFIX
prefix of path for model to store (default:
./ctr_models)
--data_meta_file DATA_META_FILE
path of data meta info file
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `train_data_path`: path of the training set
- `test_data_path`: path of the test set
- `num_passes`: number of passes to train
- `data_meta_file`: see the description in the [数据和任务抽象](#数据和任务抽象) (data and task abstraction) section.
- `model_type`: classification or regression
## Predicting with the trained model
The trained model can be used to make predictions on new data; the prediction data format is
```
# <dnn input ids> \t <lr input sparse values>
1 23 190 \t 230:0.12 3421:0.9 23451:0.12
23 231 \t 1230:0.12 13421:0.9
```
The only difference from the training data format is that there is no label, i.e. no value for the 3rd column `click` of the training data.
The usage of `infer.py` is as follows:
```
usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
--prediction_output_path PREDICTION_OUTPUT_PATH
[--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
PaddlePaddle CTR example
optional arguments:
-h, --help show this help message and exit
--model_gz_path MODEL_GZ_PATH
path of model parameters gz file
--data_path DATA_PATH
path of the dataset to infer
--prediction_output_path PREDICTION_OUTPUT_PATH
path to output the prediction
--data_meta_path DATA_META_PATH
path of trainset's meta info, default is ./data.meta
--model_type MODEL_TYPE
model type, classification: 0, regression 1 (default
classification)
```
- `model_gz_path`: path of the `gz`-compressed model
- `data_path`: path of the data to predict on
- `prediction_output_path`: path of the prediction output
- `data_meta_path`: see the description in the [数据和任务抽象](#数据和任务抽象) (data and task abstraction) section.
- `model_type`: classification or regression
The demo data can be predicted with the following command:
```
python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
```
The final predictions are written to `predictions.txt`.
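A minimal sketch (assuming one prediction per line, which is how `infer.py` writes its output) for loading the results:

```python
with open("predictions.txt") as f:
    predictions = [float(line) for line in f if line.strip()]
```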
## References
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import csv
import cPickle
import argparse
import numpy as np
from utils import logger, TaskMode
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the Avazu dataset")
parser.add_argument(
'--output_dir', type=str, required=True, help="directory to output")
parser.add_argument(
'--num_lines_to_detect',
type=int,
default=500000,
help="number of records to detect dataset's meta info")
parser.add_argument(
'--test_set_size',
type=int,
default=10000,
help="size of the validation dataset(default: 10000)")
parser.add_argument(
'--train_size',
type=int,
default=100000,
help="size of the trainset (default: 100000)")
args = parser.parse_args()
'''
The fields of the dataset are:
......@@ -22,7 +50,7 @@ The fields of the dataset are:
15. device_conn_type
16. C14-C21 -- anonymized categorical variables
We will treat the following fields as categorical features:
- C1
- banner_pos
......@@ -40,6 +68,14 @@ and some other features as id features:
The `hour` field will be treated as a continuous feature and will be transformed
to one-hot representation which has 24 bits.
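For example (illustrative), a raw `hour` value ending in `23` maps to index 23 of the 24-bit one-hot vector (the script takes the last two digits of the field).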
This script will output 3 files:
1. train.txt
2. test.txt
3. infer.txt
All three files are for the demo.
'''
feature_dims = {}
......@@ -161,6 +197,7 @@ def detect_dataset(path, topn, id_fea_space=10000):
NOTE the records should be randomly shuffled first.
'''
# create categorical statistics objects.
logger.warning('detecting dataset')
with open(path, 'rb') as csvfile:
reader = csv.DictReader(csvfile)
......@@ -174,9 +211,6 @@ def detect_dataset(path, topn, id_fea_space=10000):
for key, item in fields.items():
feature_dims[key] = item.size()
#for key in id_features:
#feature_dims[key] = id_fea_space
feature_dims['hour'] = 24
feature_dims['click'] = 1
......@@ -184,10 +218,17 @@ def detect_dataset(path, topn, id_fea_space=10000):
feature_dims[key] for key in categorial_features + ['hour']) + 1
feature_dims['lr_input'] = np.sum(feature_dims[key]
for key in id_features) + 1
return feature_dims
def load_data_meta(meta_path):
'''
Load the dataset's meta information.
'''
feature_dims, fields = cPickle.load(open(meta_path, 'rb'))
return feature_dims, fields
def concat_sparse_vectors(inputs, dims):
'''
Concatenate more than one sparse vector into one.
......@@ -211,67 +252,162 @@ class AvazuDataset(object):
'''
Load AVAZU dataset as train set.
'''
TRAIN_MODE = 0
TEST_MODE = 1
def __init__(self, train_path, n_records_as_test=-1):
def __init__(self,
train_path,
n_records_as_test=-1,
fields=None,
feature_dims=None):
self.train_path = train_path
self.n_records_as_test = n_records_as_test
# task model: 0 train, 1 test
self.mode = 0
self.fields = fields
# default is train mode.
self.mode = TaskMode.create_train()
def train(self):
self.mode = self.TRAIN_MODE
return self._parse(self.train_path, skip_n_lines=self.n_records_as_test)
self.categorial_dims = [
feature_dims[key] for key in categorial_features + ['hour']
]
self.id_dims = [feature_dims[key] for key in id_features]
def test(self):
self.mode = self.TEST_MODE
return self._parse(self.train_path, top_n_lines=self.n_records_as_test)
def _parse(self, path, skip_n_lines=-1, top_n_lines=-1):
with open(path, 'rb') as csvfile:
reader = csv.DictReader(csvfile)
categorial_dims = [
feature_dims[key] for key in categorial_features + ['hour']
]
id_dims = [feature_dims[key] for key in id_features]
def train(self):
'''
Load trainset.
'''
logger.info("load trainset from %s" % self.train_path)
self.mode = TaskMode.create_train()
with open(self.train_path) as f:
reader = csv.DictReader(f)
for row_id, row in enumerate(reader):
if skip_n_lines > 0 and row_id < skip_n_lines:
# skip top n lines
if self.n_records_as_test > 0 and row_id < self.n_records_as_test:
continue
if top_n_lines > 0 and row_id > top_n_lines:
break
record = []
for key in categorial_features:
record.append(fields[key].gen(row[key]))
record.append([int(row['hour'][-2:])])
dense_input = concat_sparse_vectors(record, categorial_dims)
record = []
for key in id_features:
if 'cross' not in key:
record.append(fields[key].gen(row[key]))
else:
fea0 = fields[key].cross_fea0
fea1 = fields[key].cross_fea1
record.append(
fields[key].gen_cross_fea(row[fea0], row[fea1]))
rcd = self._parse_record(row)
if rcd:
yield rcd
sparse_input = concat_sparse_vectors(record, id_dims)
def test(self):
'''
Load testset.
'''
logger.info("load testset from %s" % self.train_path)
self.mode = TaskMode.create_test()
with open(self.train_path) as f:
reader = csv.DictReader(f)
record = [dense_input, sparse_input]
for row_id, row in enumerate(reader):
# skip top n lines
if self.n_records_as_test > 0 and row_id > self.n_records_as_test:
break
record.append(list((int(row['click']), )))
yield record
rcd = self._parse_record(row)
if rcd:
yield rcd
def infer(self):
'''
Load inferset.
'''
logger.info("load inferset from %s" % self.train_path)
self.mode = TaskMode.create_infer()
with open(self.train_path) as f:
reader = csv.DictReader(f)
if __name__ == '__main__':
path = 'train.txt'
print detect_dataset(path, 400000)
for row_id, row in enumerate(reader):
rcd = self._parse_record(row)
if rcd:
yield rcd
filereader = AvazuDataset(path)
for no, rcd in enumerate(filereader.train()):
print no, rcd
if no > 1000: break
def _parse_record(self, row):
'''
Parse a CSV row and get a record.
'''
record = []
for key in categorial_features:
record.append(self.fields[key].gen(row[key]))
record.append([int(row['hour'][-2:])])
dense_input = concat_sparse_vectors(record, self.categorial_dims)
record = []
for key in id_features:
if 'cross' not in key:
record.append(self.fields[key].gen(row[key]))
else:
fea0 = self.fields[key].cross_fea0
fea1 = self.fields[key].cross_fea1
record.append(
self.fields[key].gen_cross_fea(row[fea0], row[fea1]))
sparse_input = concat_sparse_vectors(record, self.id_dims)
record = [dense_input, sparse_input]
if not self.mode.is_infer():
record.append(list((int(row['click']), )))
return record
def ids2dense(vec, dim):
return vec
def ids2sparse(vec):
return ["%d:1" % x for x in vec]
detect_dataset(args.data_path, args.num_lines_to_detect)
dataset = AvazuDataset(
args.data_path,
args.test_set_size,
fields=fields,
feature_dims=feature_dims)
output_trainset_path = os.path.join(args.output_dir, 'train.txt')
output_testset_path = os.path.join(args.output_dir, 'test.txt')
output_infer_path = os.path.join(args.output_dir, 'infer.txt')
output_meta_path = os.path.join(args.output_dir, 'data.meta.txt')
with open(output_trainset_path, 'w') as f:
for id, record in enumerate(dataset.train()):
if id and id % 10000 == 0:
logger.info("load %d records" % id)
if id > args.train_size:
break
dnn_input, lr_input, click = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), click[0])
f.write(line)
logger.info('write to %s' % output_trainset_path)
with open(output_testset_path, 'w') as f:
for id, record in enumerate(dataset.test()):
dnn_input, lr_input, click = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\t%d\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), click[0])
f.write(line)
logger.info('write to %s' % output_testset_path)
with open(output_infer_path, 'w') as f:
for id, record in enumerate(dataset.infer()):
dnn_input, lr_input = record
dnn_input = ids2dense(dnn_input, feature_dims['dnn_input'])
lr_input = ids2sparse(lr_input)
line = "%s\t%s\n" % (' '.join(map(str, dnn_input)),
' '.join(map(str, lr_input)), )
f.write(line)
if id > args.test_set_size:
break
logger.info('write to %s' % output_infer_path)
with open(output_meta_path, 'w') as f:
lines = [
"dnn_input_dim: %d" % feature_dims['dnn_input'],
"lr_input_dim: %d" % feature_dims['lr_input']
]
f.write('\n'.join(lines))
logger.info('write data meta into %s' % output_meta_path)
# Dataset and preprocessing
## Dataset
This tutorial demonstrates how to preprocess the dataset of the Kaggle CTR task \[[3](#参考文献)\] into the format required by this model; see [README.md](./README.md) for the detailed data format.
The strength of the Wide && Deep Model \[[2](#参考文献)\] is that it combines dense features with large-scale sparse features,
so feature processing also handles the dense and the sparse features separately:
the dense values used by the Deep part are all converted into ID features
and turned into dense vector inputs through embeddings, while the Wide part mainly increases dimensionality by crossing IDs.
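As an illustrative sketch (not the project's actual implementation), an ID cross feature can be thought of as hashing the concatenation of two categorical values into a bounded ID space:

```python
def cross_feature_id(val_a, val_b, id_space=10000):
    # Hypothetical helper: combine two categorical values into one sparse id.
    return hash(val_a + "_" + val_b) % id_space
```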
The dataset is stored in `csv` format, with the following fields:
- `id` : ad identifier
......
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import gzip
import argparse
import itertools
import paddle.v2 as paddle
import network_conf
from train import dnn_layer_dims
import reader
from utils import logger, ModelType
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--model_gz_path',
type=str,
required=True,
help="path of model parameters gz file")
parser.add_argument(
'--data_path', type=str, required=True, help="path of the dataset to infer")
parser.add_argument(
'--prediction_output_path',
type=str,
required=True,
help="path to output the prediction")
parser.add_argument(
'--data_meta_path',
type=str,
default="./data.meta",
help="path of trainset's meta info, default is ./data.meta")
parser.add_argument(
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION,
help='model type, classification: %d, regression %d (default classification)'
% (ModelType.CLASSIFICATION, ModelType.REGRESSION))
args = parser.parse_args()
paddle.init(use_gpu=False, trainer_count=1)
class CTRInferer(object):
def __init__(self, param_path):
logger.info("create CTR model")
dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
# create the model
self.ctr_model = network_conf.CTRmodel(
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=ModelType(args.model_type),
is_infer=True)
# load parameter
logger.info("load model parameters from %s" % param_path)
self.parameters = paddle.parameters.Parameters.from_tar(
gzip.open(param_path, 'r'))
self.inferer = paddle.inference.Inference(
output_layer=self.ctr_model.model,
parameters=self.parameters, )
def infer(self, data_path):
logger.info("infer data...")
dataset = reader.Dataset()
infer_reader = paddle.batch(
dataset.infer(args.data_path), batch_size=1000)
logger.warning('write predictions to %s' % args.prediction_output_path)
output_f = open(args.prediction_output_path, 'w')
for id, batch in enumerate(infer_reader()):
res = self.inferer.infer(input=batch)
predictions = [x for x in itertools.chain.from_iterable(res)]
assert len(batch) == len(
predictions), "predict error, %d inputs, but %d predictions" % (
len(batch), len(predictions))
output_f.write('\n'.join(map(str, predictions)) + '\n')
if __name__ == '__main__':
ctr_inferer = CTRInferer(args.model_gz_path)
ctr_inferer.infer(args.data_path)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from utils import logger, ModelType
class CTRmodel(object):
'''
A CTR model which implements wide && deep learning model.
'''
def __init__(self,
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=ModelType.create_classification(),
is_infer=False):
'''
@dnn_layer_dims: list of integer
dims of each layer in dnn
@dnn_input_dim: int
size of dnn's input layer
@lr_input_dim: int
size of lr's input layer
@is_infer: bool
whether to build an inference model
'''
self.dnn_layer_dims = dnn_layer_dims
self.dnn_input_dim = dnn_input_dim
self.lr_input_dim = lr_input_dim
self.model_type = model_type
self.is_infer = is_infer
self._declare_input_layers()
self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
self.lr = self._build_lr_submodel_()
# model's prediction
# TODO(superjom) rename it to prediction
if self.model_type.is_classification():
self.model = self._build_classification_model(self.dnn, self.lr)
if self.model_type.is_regression():
self.model = self._build_regression_model(self.dnn, self.lr)
def _declare_input_layers(self):
self.dnn_merged_input = layer.data(
name='dnn_input',
type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))
self.lr_merged_input = layer.data(
name='lr_input',
type=paddle.data_type.sparse_vector(self.lr_input_dim))
if not self.is_infer:
self.click = paddle.layer.data(
name='click', type=dtype.dense_vector(1))
def _build_dnn_submodel_(self, dnn_layer_dims):
'''
build DNN submodel.
'''
dnn_embedding = layer.fc(
input=self.dnn_merged_input, size=dnn_layer_dims[0])
_input_layer = dnn_embedding
for i, dim in enumerate(dnn_layer_dims[1:]):
fc = layer.fc(
input=_input_layer,
size=dim,
act=paddle.activation.Relu(),
name='dnn-fc-%d' % i)
_input_layer = fc
return _input_layer
def _build_lr_submodel_(self):
'''
config LR submodel
'''
fc = layer.fc(
input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
return fc
def _build_classification_model(self, dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
self.output = layer.fc(
input=merge_layer,
size=1,
# use sigmoid function to approximate ctr rate, a float value between 0 and 1.
act=paddle.activation.Sigmoid())
if not self.is_infer:
self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
input=self.output, label=self.click)
return self.output
def _build_regression_model(self, dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
self.output = layer.fc(
input=merge_layer, size=1, act=paddle.activation.Sigmoid())
if not self.is_infer:
self.train_cost = paddle.layer.mse_cost(
input=self.output, label=self.click)
return self.output
from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record
feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}
class Dataset(object):
def __init__(self):
self.mode = TaskMode.create_train()
def train(self, path):
'''
Load trainset.
'''
logger.info("load trainset from %s" % path)
self.mode = TaskMode.create_train()
self.path = path
return self._parse
def test(self, path):
'''
Load testset.
'''
logger.info("load testset from %s" % path)
self.path = path
self.mode = TaskMode.create_test()
return self._parse
def infer(self, path):
'''
Load infer set.
'''
logger.info("load inferset from %s" % path)
self.path = path
self.mode = TaskMode.create_infer()
return self._parse
def _parse(self):
'''
Parse dataset.
'''
with open(self.path) as f:
for line_id, line in enumerate(f):
fs = line.strip().split('\t')
dnn_input = load_dnn_input_record(fs[0])
lr_input = load_lr_input_record(fs[1])
if not self.mode.is_infer():
click = [int(fs[2])]
yield dnn_input, lr_input, click
else:
yield dnn_input, lr_input
def load_data_meta(path):
'''
load data meta info from path, return (dnn_input_dim, lr_input_dim)
'''
with open(path) as f:
lines = f.read().split('\n')
err_info = "wrong meta format"
assert len(lines) == 2, err_info
assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
1], err_info
res = map(int, [_.split(':')[1] for _ in lines])
logger.info('dnn input dim: %d' % res[0])
logger.info('lr input dim: %d' % res[1])
return res
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import argparse
import logging
import paddle.v2 as paddle
from paddle.v2 import layer
from paddle.v2 import data_type as dtype
from data_provider import field_index, detect_dataset, AvazuDataset
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--train_data_path',
type=str,
required=True,
help="path of training dataset")
parser.add_argument(
'--batch_size',
type=int,
default=10000,
help="size of mini-batch (default:10000)")
parser.add_argument(
'--test_set_size',
type=int,
default=10000,
help="size of the validation dataset(default: 10000)")
parser.add_argument(
'--num_passes', type=int, default=10, help="number of passes to train")
parser.add_argument(
'--num_lines_to_detact',
type=int,
default=500000,
help="number of records to detect dataset's meta info")
args = parser.parse_args()
dnn_layer_dims = [128, 64, 32, 1]
data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
logging.warning('detect categorical fields in dataset %s' %
args.train_data_path)
for key, item in data_meta_info.items():
logging.warning(' - {}\t{}'.format(key, item))
paddle.init(use_gpu=False, trainer_count=1)
import gzip
# ==============================================================================
# input layers
# ==============================================================================
dnn_merged_input = layer.data(
name='dnn_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
lr_merged_input = layer.data(
name='lr_input',
type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
import reader
import paddle.v2 as paddle
from utils import logger, ModelType
from network_conf import CTRmodel
def parse_args():
parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
parser.add_argument(
'--train_data_path',
type=str,
required=True,
help="path of training dataset")
parser.add_argument(
'--test_data_path', type=str, help='path of testing dataset')
parser.add_argument(
'--batch_size',
type=int,
default=10000,
help="size of mini-batch (default:10000)")
parser.add_argument(
'--num_passes', type=int, default=10, help="number of passes to train")
parser.add_argument(
'--model_output_prefix',
type=str,
default='./ctr_models',
help='prefix of path for model to store (default: ./ctr_models)')
parser.add_argument(
'--data_meta_file',
type=str,
required=True,
help='path of data meta info file', )
parser.add_argument(
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION,
help='model type, classification: %d, regression %d (default classification)'
% (ModelType.CLASSIFICATION, ModelType.REGRESSION))
return parser.parse_args()
# ==============================================================================
# network structure
# ==============================================================================
def build_dnn_submodel(dnn_layer_dims):
dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
_input_layer = dnn_embedding
for i, dim in enumerate(dnn_layer_dims[1:]):
fc = layer.fc(
input=_input_layer,
size=dim,
act=paddle.activation.Relu(),
name='dnn-fc-%d' % i)
_input_layer = fc
return _input_layer
# config LR submodel
def build_lr_submodel():
fc = layer.fc(
input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
return fc
# combine DNN and LR submodels
def combine_submodels(dnn, lr):
merge_layer = layer.concat(input=[dnn, lr])
fc = layer.fc(
input=merge_layer,
size=1,
name='output',
# use sigmoid function to approximate ctr rate, a float value between 0 and 1.
act=paddle.activation.Sigmoid())
return fc
dnn = build_dnn_submodel(dnn_layer_dims)
lr = build_lr_submodel()
output = combine_submodels(dnn, lr)
dnn_layer_dims = [128, 64, 32, 1]
# ==============================================================================
# cost and train period
# ==============================================================================
classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
input=output, label=click)
params = paddle.parameters.create(classification_cost)
optimizer = paddle.optimizer.Momentum(momentum=0.01)
trainer = paddle.trainer.SGD(
cost=classification_cost, parameters=params, update_equation=optimizer)
dataset = AvazuDataset(
args.train_data_path, n_records_as_test=args.test_set_size)
def event_handler(event):
if isinstance(event, paddle.event.EndIteration):
num_samples = event.batch_id * args.batch_size
if event.batch_id % 100 == 0:
logging.warning("Pass %d, Samples %d, Cost %f" %
(event.pass_id, num_samples, event.cost))
if event.batch_id % 1000 == 0:
result = trainer.test(
reader=paddle.batch(dataset.test, batch_size=args.batch_size),
feeding=field_index)
logging.warning("Test %d-%d, Cost %f" %
(event.pass_id, event.batch_id, result.cost))
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(dataset.train, buf_size=500),
batch_size=args.batch_size),
feeding=field_index,
event_handler=event_handler,
num_passes=args.num_passes)
def train():
args = parse_args()
args.model_type = ModelType(args.model_type)
paddle.init(use_gpu=False, trainer_count=1)
dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
# create ctr model.
model = CTRmodel(
dnn_layer_dims,
dnn_input_dim,
lr_input_dim,
model_type=args.model_type,
is_infer=False)
params = paddle.parameters.create(model.train_cost)
optimizer = paddle.optimizer.AdaGrad()
trainer = paddle.trainer.SGD(
cost=model.train_cost, parameters=params, update_equation=optimizer)
dataset = reader.Dataset()
def __event_handler__(event):
if isinstance(event, paddle.event.EndIteration):
num_samples = event.batch_id * args.batch_size
if event.batch_id % 100 == 0:
logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
event.pass_id, num_samples, event.cost, event.metrics))
if event.batch_id % 1000 == 0:
if args.test_data_path:
result = trainer.test(
reader=paddle.batch(
dataset.test(args.test_data_path),
batch_size=args.batch_size),
feeding=reader.feeding_index)
logger.warning("Test %d-%d, Cost %f, %s" %
(event.pass_id, event.batch_id, result.cost,
result.metrics))
path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
args.model_output_prefix, event.pass_id, event.batch_id,
result.cost)
with gzip.open(path, 'w') as f:
params.to_tar(f)
trainer.train(
reader=paddle.batch(
paddle.reader.shuffle(
dataset.train(args.train_data_path), buf_size=500),
batch_size=args.batch_size),
feeding=reader.feeding_index,
event_handler=__event_handler__,
num_passes=args.num_passes)
if __name__ == '__main__':
train()
import logging
logging.basicConfig()
logger = logging.getLogger("paddle")
logger.setLevel(logging.INFO)
class TaskMode:
TRAIN_MODE = 0
TEST_MODE = 1
INFER_MODE = 2
def __init__(self, mode):
self.mode = mode
def is_train(self):
return self.mode == self.TRAIN_MODE
def is_test(self):
return self.mode == self.TEST_MODE
def is_infer(self):
return self.mode == self.INFER_MODE
@staticmethod
def create_train():
return TaskMode(TaskMode.TRAIN_MODE)
@staticmethod
def create_test():
return TaskMode(TaskMode.TEST_MODE)
@staticmethod
def create_infer():
return TaskMode(TaskMode.INFER_MODE)
class ModelType:
CLASSIFICATION = 0
REGRESSION = 1
def __init__(self, mode):
self.mode = mode
def is_classification(self):
return self.mode == self.CLASSIFICATION
def is_regression(self):
return self.mode == self.REGRESSION
@staticmethod
def create_classification():
return ModelType(ModelType.CLASSIFICATION)
@staticmethod
def create_regression():
return ModelType(ModelType.REGRESSION)
def load_dnn_input_record(sent):
return map(int, sent.split())
def load_lr_input_record(sent):
res = []
for _ in [x.split(':') for x in sent.split()]:
res.append((int(_[0]), float(_[1]), ))
return res
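# Example (illustrative) of the helpers above on one demo record:
#   load_dnn_input_record("1 23 190")           -> [1, 23, 190]
#   load_lr_input_record("230:0.12 3421:0.9")   -> [(230, 0.12), (3421, 0.9)]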
manifest*
mean_std.npz
thirdparty/
# DeepSpeech2 on PaddlePaddle
## Installation
```
sh setup.sh
export LD_LIBRARY_PATH=$PADDLE_INSTALL_DIR/Paddle/third_party/install/warpctc/lib:$LD_LIBRARY_PATH
```
For some machines, we also need to install libsndfile1. Details to be added.
Please replace `$PADDLE_INSTALL_DIR` with your own paddle installation directory.
## Usage
......@@ -35,15 +32,21 @@ python datasets/librispeech/librispeech.py --help
### Preparing for Training
```
python tools/compute_mean_std.py
```
It will compute the mean and standard deviation of the audio features and save them to a file with the default name `./mean_std.npz`. This file will be used in both training and inference. The default audio feature is the power spectrum; the MFCC feature is also supported. To train and infer with MFCC features, please generate this file by
```
python tools/compute_mean_std.py --specgram_type mfcc
```
and specify `--specgram_type mfcc` when running train.py, infer.py, evaluator.py or tune.py.
More help for arguments:
```
python tools/compute_mean_std.py --help
```
### Training
......@@ -66,12 +69,31 @@ More help for arguments:
python train.py --help
```
### Preparing language model
The following steps (inference, parameter tuning and evaluation) require a language model during decoding.
A compressed language model is provided and can be accessed by
```
cd ./lm
sh run.sh
cd ..
```
### Inference
For GPU inference
```
CUDA_VISIBLE_DEVICES=0 python infer.py
```
For CPU inference
```
python infer.py --use_gpu=False
```
More help for arguments:
```
......@@ -92,14 +114,55 @@ python evaluate.py --help
### Parameters tuning
Usually, the parameters $\alpha$ and $\beta$ for the CTC [prefix beam search](https://arxiv.org/abs/1408.2873) decoder need to be tuned after retraining the acoustic model.
For GPU tuning
```
CUDA_VISIBLE_DEVICES=0 python tune.py
```
For CPU tuning
```
python tune.py --use_gpu=False
```
More help for arguments:
```
python tune.py --help
```
Then reset the decoder parameters with the tuned values before running inference or evaluation.
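For example, pass the tuned values to the inference script (the argument names and values below are assumptions for illustration; check `python infer.py --help` for the exact names):

```
python infer.py --alpha 2.15 --beta 0.35
```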
### Playing with the ASR Demo
A real-time ASR demo is built for users to try out the ASR model with their own voice. Please do the following installation on the machine you'd like to run the demo's client (no need for the machine running the demo's server).
For example, on Mac OS X:
```
brew install portaudio
pip install pyaudio
pip install pynput
```
After an acoustic model and a language model are prepared, we can first start the demo's server:
```
CUDA_VISIBLE_DEVICES=0 python demo_server.py
```
And then in another console, start the demo's client:
```
python demo_client.py
```
On the client console, press and hold the "white-space" key on the keyboard to start talking, and release it when you finish your speech. The decoding results (the inferred transcription) will then be displayed.
The server and the client can also be started on two separate machines, e.g. `demo_client.py` is usually started on a machine with a microphone, while `demo_server.py` is usually started on a remote server with powerful GPUs. Please first make sure that the two machines can reach each other over the network, and then use `--host_ip` and `--host_port` to indicate the server machine's actual IP address (instead of the default `localhost`) and TCP port, in both `demo_server.py` and `demo_client.py`.
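For example (the IP address and port below are placeholders):

```
# on the server machine
CUDA_VISIBLE_DEVICES=0 python demo_server.py --host_ip 192.168.1.10 --host_port 8086
# on the client machine
python demo_client.py --host_ip 192.168.1.10 --host_port 8086
```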
## PaddleCloud Training
If you wish to train DeepSpeech2 on PaddleCloud, please refer to
[Train DeepSpeech2 on PaddleCloud](https://github.com/PaddlePaddle/models/tree/develop/deep_speech_2/cloud).
# Train DeepSpeech2 on PaddleCloud
>Note:
>Please make sure [PaddleCloud Client](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#%E4%B8%8B%E8%BD%BD%E5%B9%B6%E9%85%8D%E7%BD%AEpaddlecloud) has been installed and that the current directory is `deep_speech_2/cloud/`.
## Step 1: Upload Data
Given several input manifests, `pcloud_upload_data.sh` packs all the audio files they reference and uploads them to the PaddleCloud filesystem, and also generates corresponding manifest files with the cloud paths filled in.
Please modify the following arguments in `pcloud_upload_data.sh`:
- `IN_MANIFESTS`: Paths (in the local filesystem) of the manifest files containing the audio files to be uploaded. Multiple paths can be concatenated with a whitespace delimiter.
- `OUT_MANIFESTS`: Paths (in the local filesystem) to write the updated output manifest files to. Multiple paths can be concatenated with a whitespace delimiter. The values of `audio_filepath` in the output manifests are updated with cloud filesystem paths.
- `CLOUD_DATA_DIR`: Directory (in the PaddleCloud filesystem) to upload the data to. Don't forget to replace `USERNAME` in the default directory and make sure that you have permission to write to it.
- `NUM_SHARDS`: Number of data shards / parts (in tar files) to be generated when packing and uploading the data. A smaller `num_shards` requires more temporary local disk space for packing the data.
By running:
```
sh pcloud_upload_data.sh
```
all the audio files will be uploaded to PaddleCloud filesystem, and you will get modified manifests files in `OUT_MANIFESTS`.
You only have to take this step once, the very first time you do cloud training. Afterwards, the data is persistent on the cloud filesystem and can be reused for further job submissions.
## Step 2: Configure Training
Configure cloud training arguments in `pcloud_submit.sh`, with the following arguments:
- `TRAIN_MANIFEST`: Manifest filepath (in the local filesystem) for training. Notice that the `audio_filepath` entries should point to the cloud filesystem, like those generated by `pcloud_upload_data.sh`.
- `DEV_MANIFEST`: Manifest filepath (in local filesystem) for validation.
- `CLOUD_MODEL_DIR`: Directory (in PaddleCloud filesystem) to save the model parameters (checkpoints). Don't forget to replace `USERNAME` in the default directory and make sure that you have the permission to write it.
- `BATCH_SIZE`: Training batch size for a single node.
- `NUM_GPU`: Number of GPUs allocated for a single node.
- `NUM_NODE`: Number of nodes (machines) allocated for this job.
- `IS_LOCAL`: Set to False to enable parameter server, if using multiple nodes.
Configure other training hyper-parameters in `pcloud_train.sh` as you wish, just as what you can do in local training.
By running:
```
sh pcloud_submit.sh
```
you submit a training job to PaddleCloud, and you will see the job name once the submission is done.
## Step 3: Get Job Logs
Run this to list all the jobs you have submitted, as well as their running status:
```
paddlecloud get jobs
```
Run this to print the corresponding job's logs:
```
paddlecloud logs -n 10000 $REPLACED_WITH_YOUR_ACTUAL_JOB_NAME
```
## More Help
For more information about the usage of PaddleCloud, please refer to [PaddleCloud Usage](https://github.com/PaddlePaddle/cloud/blob/develop/doc/usage_cn.md#提交任务).
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
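# Typical usage (as in the cloud scripts below): `import _init_paths` before importing
# project modules such as `data_utils`, so that the project root is on sys.path.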
TRAIN_MANIFEST="cloud/cloud.manifest.train"
DEV_MANIFEST="cloud/cloud.manifest.dev"
CLOUD_MODEL_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/model"
BATCH_SIZE=256
NUM_GPU=8
NUM_NODE=1
IS_LOCAL="True"
JOB_NAME=deepspeech-`date +%Y%m%d%H%M%S`
DS2_PATH=${PWD%/*}
cp -f pcloud_train.sh ${DS2_PATH}
paddlecloud submit \
-image bootstrapper:5000/wanghaoshuang/pcloud_ds2:latest \
-jobname ${JOB_NAME} \
-cpu ${NUM_GPU} \
-gpu ${NUM_GPU} \
-memory 64Gi \
-parallelism ${NUM_NODE} \
-pscpu 1 \
-pservers 1 \
-psmemory 64Gi \
-passes 1 \
-entry "sh pcloud_train.sh ${TRAIN_MANIFEST} ${DEV_MANIFEST} ${CLOUD_MODEL_DIR} ${NUM_GPU} ${BATCH_SIZE} ${IS_LOCAL}" \
${DS2_PATH}
rm ${DS2_PATH}/pcloud_train.sh
TRAIN_MANIFEST=$1
DEV_MANIFEST=$2
MODEL_PATH=$3
NUM_GPU=$4
BATCH_SIZE=$5
IS_LOCAL=$6
python ./cloud/split_data.py \
--in_manifest_path=${TRAIN_MANIFEST} \
--out_manifest_path='/local.manifest.train'
python ./cloud/split_data.py \
--in_manifest_path=${DEV_MANIFEST} \
--out_manifest_path='/local.manifest.dev'
python train.py \
--batch_size=$BATCH_SIZE \
--use_gpu=1 \
--trainer_count=${NUM_GPU} \
--num_threads_data=${NUM_GPU} \
--is_local=${IS_LOCAL} \
--train_manifest_path='/local.manifest.train' \
--dev_manifest_path='/local.manifest.dev' \
--output_model_dir=${MODEL_PATH} \
IN_MANIFESTS="../datasets/manifest.train ../datasets/manifest.dev ../datasets/manifest.test"
OUT_MANIFESTS="./cloud.manifest.train ./cloud.manifest.dev ./cloud.manifest.test"
CLOUD_DATA_DIR="/pfs/dlnel/home/USERNAME/deepspeech2/data/librispeech"
NUM_SHARDS=50
python upload_data.py \
--in_manifest_paths ${IN_MANIFESTS} \
--out_manifest_paths ${OUT_MANIFESTS} \
--cloud_data_dir ${CLOUD_DATA_DIR} \
--num_shards ${NUM_SHARDS}
if [ $? -ne 0 ]
then
echo "Upload Data Failed!"
exit 1
fi
echo "All Done."
"""This tool is used for splitting data into each node of
paddlecloud. This script should be called in paddlecloud.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import json
import argparse
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_path",
type=str,
required=True,
help="Input manifest path for all nodes.")
parser.add_argument(
"--out_manifest_path",
type=str,
required=True,
help="Output manifest file path for current node.")
args = parser.parse_args()
def split_data(in_manifest_path, out_manifest_path):
with open("/trainer_id", "r") as f:
trainer_id = int(f.readline()[:-1])
with open("/trainer_count", "r") as f:
trainer_count = int(f.readline()[:-1])
out_manifest = []
for index, json_line in enumerate(open(in_manifest_path, 'r')):
if (index % trainer_count) == trainer_id:
out_manifest.append("%s\n" % json_line.strip())
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
if __name__ == '__main__':
split_data(args.in_manifest_path, args.out_manifest_path)
"""This script is for uploading data for DeepSpeech2 training on paddlecloud.
Steps:
1. Read original manifests and extract local sound files.
2. Tar all local sound files into multiple tar files and upload them.
3. Modify original manifests with updated paths in cloud filesystem.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import json
import os
import tarfile
import sys
import argparse
import shutil
from subprocess import call
import _init_paths
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--in_manifest_paths",
default=[
"../datasets/manifest.train", "../datasets/manifest.dev",
"../datasets/manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of input manifests to load, pack and upload."
"(default: %(default)s)")
parser.add_argument(
"--out_manifest_paths",
default=[
"./cloud.manifest.train", "./cloud.manifest.dev",
"./cloud.manifest.test"
],
type=str,
nargs='+',
help="Local filepaths of modified manifests to write to. "
"(default: %(default)s)")
parser.add_argument(
"--cloud_data_dir",
required=True,
type=str,
help="Destination directory on paddlecloud to upload data to.")
parser.add_argument(
"--num_shards",
default=10,
type=int,
help="Number of parts to split data to. (default: %(default)s)")
parser.add_argument(
"--local_tmp_dir",
default="./tmp/",
type=str,
help="Local directory for storing temporary data. (default: %(default)s)")
args = parser.parse_args()
def upload_data(in_manifest_path_list, out_manifest_path_list, local_tmp_dir,
upload_tar_dir, num_shards):
"""Extract and pack sound files listed in the manifest files into multple
tar files and upload them to padldecloud. Besides, generate new manifest
files with updated paths in paddlecloud.
"""
# compute total audio number
total_line = 0
for manifest_path in in_manifest_path_list:
with open(manifest_path, 'r') as f:
total_line += len(f.readlines())
line_per_tar = (total_line // num_shards) + 1
# pack and upload shard by shard
line_count, tar_file = 0, None
for manifest_path, out_manifest_path in zip(in_manifest_path_list,
out_manifest_path_list):
manifest = read_manifest(manifest_path)
out_manifest = []
for json_data in manifest:
sound_filepath = json_data['audio_filepath']
sound_filename = os.path.basename(sound_filepath)
if line_count % line_per_tar == 0:
if tar_file is not None:
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
tar_name = 'part-%s-of-%s.tar' % (
str(line_count // line_per_tar).zfill(5),
str(num_shards).zfill(5))
tar_path = os.path.join(local_tmp_dir, tar_name)
tar_file = tarfile.open(tar_path, 'w')
tar_file.add(sound_filepath, arcname=sound_filename)
line_count += 1
json_data['audio_filepath'] = "tar:%s#%s" % (
os.path.join(upload_tar_dir, tar_name), sound_filename)
out_manifest.append("%s\n" % json.dumps(json_data))
with open(out_manifest_path, 'w') as f:
f.writelines(out_manifest)
pcloud_cp(out_manifest_path, upload_tar_dir)
tar_file.close()
pcloud_cp(tar_path, upload_tar_dir)
os.remove(tar_path)
def pcloud_mkdir(dir):
"""Make directory in PaddleCloud filesystem.
"""
if call(['paddlecloud', 'mkdir', dir]) != 0:
raise IOError("PaddleCloud mkdir failed: %s." % dir)
def pcloud_cp(src, dst):
"""Copy src from local filesytem to dst in PaddleCloud filesystem,
or downlowd src from PaddleCloud filesystem to dst in local filesystem.
"""
if call(['paddlecloud', 'cp', src, dst]) != 0:
raise IOError("PaddleCloud cp failed: from [%s] to [%s]." % (src, dst))
if __name__ == '__main__':
if not os.path.exists(args.local_tmp_dir):
os.makedirs(args.local_tmp_dir)
pcloud_mkdir(args.cloud_data_dir)
upload_data(args.in_manifest_paths, args.out_manifest_paths,
args.local_tmp_dir, args.cloud_data_dir, args.num_shards)
shutil.rmtree(args.local_tmp_dir)
[
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
}
]
[
{
"type": "noise",
"params": {"min_snr_dB": 40,
"max_snr_dB": 50,
"noise_manifest_path": "datasets/manifest.noise"},
"prob": 0.6
},
{
"type": "impulse",
"params": {"impulse_manifest_path": "datasets/manifest.impulse"},
"prob": 0.5
},
{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.5
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
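The JSON lists above are augmentation pipeline configurations. A minimal sketch (assumed usage; the exact constructor signature should be checked against `AugmentationPipeline` in data_utils/augmentor/augmentation.py) of feeding such a config to the pipeline:

```python
from data_utils.augmentor.augmentation import AugmentationPipeline

with open("conf/augmentation.config") as f:  # hypothetical config path
    config_str = f.read()
# Assumed signature: the pipeline is built from the JSON string and an optional random seed.
pipeline = AugmentationPipeline(augmentation_config=config_str, random_seed=0)
# pipeline.transform_audio(audio_segment)   # in-place augmentation of an AudioSegment/SpeechSegment
```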
......@@ -204,7 +204,7 @@ class AudioSegment(object):
:raise ValueError: If the sample rates of the two segments are not
equal, or if the lengths of segments don't match.
"""
if not isinstance(other, type(self)):
raise TypeError("Cannot add segments of different types: %s "
"and %s." % (type(self), type(other)))
if self._sample_rate != other._sample_rate:
......@@ -231,7 +231,7 @@ class AudioSegment(object):
Note that this is an in-place transformation.
:param gain: Gain in decibels to apply to samples.
:type gain: float|1darray
"""
self._samples *= 10.**(gain / 20.)
......@@ -457,9 +457,9 @@ class AudioSegment(object):
audio segments when resample is not allowed.
"""
if allow_resample and self.sample_rate != impulse_segment.sample_rate:
impulse_segment.resample(self.sample_rate)
if self.sample_rate != impulse_segment.sample_rate:
raise ValueError("Impulse segment's sample rate (%d Hz) is not"
raise ValueError("Impulse segment's sample rate (%d Hz) is not "
"equal to base signal sample rate (%d Hz)." %
(impulse_segment.sample_rate, self.sample_rate))
samples = signal.fftconvolve(self.samples, impulse_segment.samples,
......
......@@ -8,6 +8,8 @@ import random
from data_utils.augmentor.volume_perturb import VolumePerturbAugmentor
from data_utils.augmentor.shift_perturb import ShiftPerturbAugmentor
from data_utils.augmentor.speed_perturb import SpeedPerturbAugmentor
from data_utils.augmentor.noise_perturb import NoisePerturbAugmentor
from data_utils.augmentor.impulse_response import ImpulseResponseAugmentor
from data_utils.augmentor.resample import ResampleAugmentor
from data_utils.augmentor.online_bayesian_normalization import \
OnlineBayesianNormalizationAugmentor
......@@ -23,21 +25,46 @@ class AugmentationPipeline(object):
string, e.g.
.. code-block::
'[{"type": "volume",
"params": {"min_gain_dBFS": -15,
"max_gain_dBFS": 15},
"prob": 0.5},
{"type": "speed",
"params": {"min_speed_rate": 0.8,
"max_speed_rate": 1.2},
"prob": 0.5}
]'
[ {
"type": "noise",
"params": {"min_snr_dB": 10,
"max_snr_dB": 20,
"noise_manifest_path": "datasets/manifest.noise"},
"prob": 0.0
},
{
"type": "speed",
"params": {"min_speed_rate": 0.9,
"max_speed_rate": 1.1},
"prob": 1.0
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 1.0
},
{
"type": "volume",
"params": {"min_gain_dBFS": -10,
"max_gain_dBFS": 10},
"prob": 0.0
},
{
"type": "bayesian_normal",
"params": {"target_db": -20,
"prior_db": -20,
"prior_samples": 100},
"prob": 0.0
}
]
This augmentation configuration inserts several augmentors into the pipeline, e.g.
NoisePerturbAugmentor, SpeedPerturbAugmentor, ShiftPerturbAugmentor,
VolumePerturbAugmentor and OnlineBayesianNormalizationAugmentor. "prob" indicates the
probability of the current augmentor to take effect. If "prob" is zero, the
augmentor does not take effect.
:param augmentation_config: Augmentation configuration in json string.
:type augmentation_config: str
......@@ -60,7 +87,7 @@ class AugmentationPipeline(object):
:type audio_segment: AudioSegment|SpeechSegment
"""
for augmentor, rate in zip(self._augmentors, self._rates):
if self._rng.uniform(0., 1.) < rate:
augmentor.transform_audio(audio_segment)
def _parse_pipeline_from(self, config_json):
......@@ -89,5 +116,9 @@ class AugmentationPipeline(object):
return ResampleAugmentor(self._rng, **params)
elif augmentor_type == "bayesian_normal":
return OnlineBayesianNormalizationAugmentor(self._rng, **params)
elif augmentor_type == "noise":
return NoisePerturbAugmentor(self._rng, **params)
elif augmentor_type == "impulse":
return ImpulseResponseAugmentor(self._rng, **params)
else:
raise ValueError("Unknown augmentor type [%s]." % augmentor_type)
"""Contains the impulse response augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment
class ImpulseResponseAugmentor(AugmentorBase):
"""Augmentation model for adding impulse response effect.
:param rng: Random generator object.
:type rng: random.Random
:param impulse_manifest_path: Manifest path for impulse audio data.
:type impulse_manifest_path: basestring
"""
def __init__(self, rng, impulse_manifest_path):
self._rng = rng
self._impulse_manifest = utils.read_manifest(
manifest_path=impulse_manifest_path)
def transform_audio(self, audio_segment):
"""Add impulse response effect.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
impulse_json = self._rng.sample(self._impulse_manifest, 1)[0]
impulse_segment = AudioSegment.from_file(impulse_json['audio_filepath'])
audio_segment.convolve(impulse_segment, allow_resample=True)
"""Contains the noise perturb augmentation model."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from data_utils.augmentor.base import AugmentorBase
from data_utils import utils
from data_utils.audio import AudioSegment
class NoisePerturbAugmentor(AugmentorBase):
"""Augmentation model for adding background noise.
:param rng: Random generator object.
:type rng: random.Random
:param min_snr_dB: Minimal signal noise ratio, in decibels.
:type min_snr_dB: float
:param max_snr_dB: Maximal signal noise ratio, in decibels.
:type max_snr_dB: float
:param noise_manifest_path: Manifest path for noise audio data.
:type noise_manifest_path: basestring
"""
def __init__(self, rng, min_snr_dB, max_snr_dB, noise_manifest_path):
self._min_snr_dB = min_snr_dB
self._max_snr_dB = max_snr_dB
self._rng = rng
self._noise_manifest = utils.read_manifest(
manifest_path=noise_manifest_path)
def transform_audio(self, audio_segment):
"""Add background noise audio.
Note that this is an in-place transformation.
:param audio_segment: Audio segment to add effects to.
:type audio_segment: AudioSegmenet|SpeechSegment
"""
noise_json = self._rng.sample(self._noise_manifest, 1)[0]
if noise_json['duration'] < audio_segment.duration:
raise RuntimeError("The duration of sampled noise audio is smaller "
"than the audio segment to add effects to.")
diff_duration = noise_json['duration'] - audio_segment.duration
start = self._rng.uniform(0, diff_duration)
end = start + audio_segment.duration
noise_segment = AudioSegment.slice_from_file(
noise_json['audio_filepath'], start=start, end=end)
snr_dB = self._rng.uniform(self._min_snr_dB, self._max_snr_dB)
audio_segment.add_noise(
noise_segment, snr_dB, allow_downsampling=True, rng=self._rng)
文件模式从 100755 更改为 100644
......@@ -6,9 +6,11 @@ from __future__ import division
from __future__ import print_function
import random
import numpy as np
import tarfile
import multiprocessing
import numpy as np
import paddle.v2 as paddle
from threading import local
from data_utils import utils
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.speech_featurizer import SpeechFeaturizer
......@@ -46,7 +48,7 @@ class DataGenerator(object):
:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
:param use_dB_normalization: Whether to normalize the audio to -20 dB
before extracting the features.
before extracting the features.
:type use_dB_normalization: bool
:param num_threads: Number of CPU threads for processing data.
:type num_threads: int
......@@ -65,7 +67,7 @@ class DataGenerator(object):
max_freq=None,
specgram_type='linear',
use_dB_normalization=True,
num_threads=multiprocessing.cpu_count(),
num_threads=multiprocessing.cpu_count() // 2,
random_seed=0):
self._max_duration = max_duration
self._min_duration = min_duration
......@@ -82,6 +84,27 @@ class DataGenerator(object):
self._num_threads = num_threads
self._rng = random.Random(random_seed)
self._epoch = 0
# for caching tar files info
self.local_data = local()
self.local_data.tar2info = {}
self.local_data.tar2object = {}
def process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data.
:param filename: Audio filepath
:type filename: basestring | file
:param transcript: Transcription text.
:type transcript: basestring
:return: Tuple of audio feature tensor and list of token ids for
transcription.
:rtype: tuple of (2darray, list)
"""
speech_segment = SpeechSegment.from_file(filename, transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
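A hedged usage sketch of this method, reusing the DataGenerator constructor arguments that appear elsewhere in this change (all file paths are placeholders):

```
# Sketch: featurize a single utterance with a configured DataGenerator.
from data_utils.data import DataGenerator

data_generator = DataGenerator(
    vocab_filepath='datasets/vocab/eng_vocab.txt',
    mean_std_filepath='mean_std.npz',
    augmentation_config='{}',
    specgram_type='linear',
    num_threads=1)
specgram, text_ids = data_generator.process_utterance(
    'some_utterance.wav', 'some transcript')  # placeholder inputs
print(specgram.shape)  # 2-D feature array (feature dim x time steps)
print(text_ids)        # token indices looked up from the vocabulary
```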
def batch_reader_creator(self,
manifest_path,
......@@ -94,7 +117,7 @@ class DataGenerator(object):
"""
Batch data reader creator for audio data. Return a callable generator
function to produce batches of data.
Audio features within one batch will be padded with zeros to have the
same shape, or a user-defined shape.
......@@ -152,7 +175,7 @@ class DataGenerator(object):
manifest, batch_size, clipped=True)
elif shuffle_method == "instance_shuffle":
self._rng.shuffle(manifest)
elif not shuffle_method:
elif shuffle_method == None:
pass
else:
raise ValueError("Unknown shuffle method %s." %
......@@ -174,9 +197,9 @@ class DataGenerator(object):
@property
def feeding(self):
"""Returns data reader's feeding dict.
:return: Data feeding dict.
:rtype: dict
:rtype: dict
"""
return {"audio_spectrogram": 0, "transcript_text": 1}
......@@ -198,13 +221,37 @@ class DataGenerator(object):
"""
return self._speech_featurizer.vocab_list
def _process_utterance(self, filename, transcript):
"""Load, augment, featurize and normalize for speech data."""
speech_segment = SpeechSegment.from_file(filename, transcript)
self._augmentation_pipeline.transform_audio(speech_segment)
specgram, text_ids = self._speech_featurizer.featurize(speech_segment)
specgram = self._normalizer.apply(specgram)
return specgram, text_ids
def _parse_tar(self, file):
"""Parse a tar file to get a tarfile object
and a map containing tarinfoes
"""
result = {}
f = tarfile.open(file)
for tarinfo in f.getmembers():
result[tarinfo.name] = tarinfo
return f, result
def _get_file_object(self, file):
"""Get file object by file path.
If file startwith tar, it will return a tar file object
and cached tar file info for next reading request.
It will return file directly, if the type of file is not str.
"""
if file.startswith('tar:'):
tarpath, filename = file.split(':', 1)[1].split('#', 1)
if 'tar2info' not in self.local_data.__dict__:
self.local_data.tar2info = {}
if 'tar2object' not in self.local_data.__dict__:
self.local_data.tar2object = {}
if tarpath not in self.local_data.tar2info:
object, infoes = self._parse_tar(tarpath)
self.local_data.tar2info[tarpath] = infoes
self.local_data.tar2object[tarpath] = object
return self.local_data.tar2object[tarpath].extractfile(
self.local_data.tar2info[tarpath][filename])
else:
return open(file, 'r')
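The 'tar:' convention parsed above packs the archive path and the member name into one string; a small sketch of the expected format (paths are placeholders):

```
# "tar:<path to tar file>#<member name>" is split into archive path and member.
file_str = "tar:/data/librispeech/train.tar#utt001.wav"
tarpath, filename = file_str.split(':', 1)[1].split('#', 1)
print(tarpath)   # /data/librispeech/train.tar
print(filename)  # utt001.wav
```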
def _instance_reader_creator(self, manifest):
"""
......@@ -220,8 +267,9 @@ class DataGenerator(object):
yield instance
def mapper(instance):
return self._process_utterance(instance["audio_filepath"],
instance["text"])
return self.process_utterance(
self._get_file_object(instance["audio_filepath"]),
instance["text"])
return paddle.reader.xmap_readers(
mapper, reader, self._num_threads, 1024, order=True)
......
......@@ -6,13 +6,15 @@ from __future__ import print_function
import numpy as np
from data_utils import utils
from data_utils.audio import AudioSegment
from python_speech_features import mfcc
from python_speech_features import delta
class AudioFeaturizer(object):
"""Audio featurizer, for extracting features from audio contents of
AudioSegment or SpeechSegment.
Currently, it only supports feature type of linear spectrogram.
Currently, it supports feature types of linear spectrogram and mfcc.
:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
......@@ -20,9 +22,10 @@ class AudioFeaturizer(object):
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
:param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
                     returned; when specgram_type is 'mfcc', max_freq is the
highest band edge of mel filters.
    :type max_freq: None|float
    :param target_sample_rate: Audio is resampled (if upsampling or
downsampling is allowed) to this before
......@@ -54,7 +57,7 @@ class AudioFeaturizer(object):
def featurize(self,
audio_segment,
allow_downsampling=True,
allow_upsamplling=True):
allow_upsampling=True):
"""Extract audio features from AudioSegment or SpeechSegment.
:param audio_segment: Audio/speech segment to extract features from.
......@@ -91,6 +94,9 @@ class AudioFeaturizer(object):
return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms,
self._max_freq)
elif self._specgram_type == 'mfcc':
return self._compute_mfcc(samples, sample_rate, self._stride_ms,
self._window_ms, self._max_freq)
else:
raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type)
......@@ -142,3 +148,39 @@ class AudioFeaturizer(object):
# prepare fft frequency list
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs
def _compute_mfcc(self,
samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
"""Compute mfcc from samples."""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must not be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
# compute the 13 cepstral coefficients, and the first one is replaced
# by log(frame energy)
mfcc_feat = mfcc(
signal=samples,
samplerate=sample_rate,
winlen=0.001 * window_ms,
winstep=0.001 * stride_ms,
highfreq=max_freq)
# Deltas
d_mfcc_feat = delta(mfcc_feat, 2)
# Deltas-Deltas
dd_mfcc_feat = delta(d_mfcc_feat, 2)
# transpose
mfcc_feat = np.transpose(mfcc_feat)
d_mfcc_feat = np.transpose(d_mfcc_feat)
dd_mfcc_feat = np.transpose(dd_mfcc_feat)
# concat above three features
concat_mfcc_feat = np.concatenate(
(mfcc_feat, d_mfcc_feat, dd_mfcc_feat))
return concat_mfcc_feat
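The MFCC branch above yields 13 cepstral coefficients plus their deltas and delta-deltas, i.e. 39 rows after transposition and concatenation; a sketch with dummy samples (assuming the default 13 coefficients of python_speech_features):

```
# Sketch of the MFCC feature layout: 13 coefficients + deltas + delta-deltas.
import numpy as np
from python_speech_features import mfcc, delta

sample_rate = 16000
samples = np.random.randn(sample_rate)  # one second of dummy audio
feat = mfcc(signal=samples, samplerate=sample_rate,
            winlen=0.020, winstep=0.010, highfreq=sample_rate / 2)
d_feat = delta(feat, 2)
dd_feat = delta(d_feat, 2)
stacked = np.concatenate(
    (feat.transpose(), d_feat.transpose(), dd_feat.transpose()))
print(stacked.shape)  # (39, number of frames)
```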
......@@ -11,23 +11,24 @@ class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment.
Currently, for audio parts, it only supports feature type of linear
spectrogram; for transcript parts, it only supports char-level tokenizing
and conversion into a list of token indices. Note that the token indexing
order follows the given vocabulary file.
Currently, for audio parts, it supports feature types of linear
spectrogram and mfcc; for transcript parts, it only supports char-level
tokenizing and conversion into a list of token indices. Note that the
token indexing order follows the given vocabulary file.
:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
    :type vocab_filepath: basestring
:param specgram_type: Specgram feature type. Options: 'linear'.
:param specgram_type: Specgram feature type. Options: 'linear', 'mfcc'.
:type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
:param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
returned; when specgram_type is 'mfcc', max_freq is the
highest band edge of mel filters.
    :type max_freq: None|float
    :param target_sample_rate: Speech is resampled (if upsampling or
downsampling is allowed) to this before
......
......@@ -4,6 +4,7 @@ from __future__ import division
from __future__ import print_function
import os
import codecs
class TextFeaturizer(object):
......@@ -59,7 +60,7 @@ class TextFeaturizer(object):
def _load_vocabulary_from_file(self, vocab_filepath):
"""Load vocabulary from file."""
vocab_lines = []
with open(vocab_filepath, 'r') as file:
with codecs.open(vocab_filepath, 'r', 'utf-8') as file:
vocab_lines.extend(file.readlines())
vocab_list = [line[:-1] for line in vocab_lines]
vocab_dict = dict(
......
......@@ -16,7 +16,7 @@ class FeatureNormalizer(object):
if mean_std_filepath is provided (not None), the normalizer will directly
    initialize from the file. Otherwise, both manifest_path and featurize_func
should be given for on-the-fly mean and stddev computing.
:param mean_std_filepath: File containing the pre-computed mean and stddev.
:type mean_std_filepath: None|basestring
:param manifest_path: Manifest of instances for computing mean and stddev.
......
......@@ -115,7 +115,7 @@ class SpeechSegment(AudioSegment):
speech file.
:rtype: SpeechSegment
"""
audio = Audiosegment.slice_from_file(filepath, start, end)
audio = AudioSegment.slice_from_file(filepath, start, end)
return cls(audio.samples, audio.sample_rate, transcript)
@classmethod
......
......@@ -4,15 +4,16 @@ from __future__ import division
from __future__ import print_function
import json
import codecs
def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
"""Load and parse manifest file.
Instances with durations outside [min_duration, max_duration] will be
filtered out.
:param manifest_path: Manifest file to load and parse.
:param manifest_path: Manifest file to load and parse.
:type manifest_path: basestring
:param max_duration: Maximal duration in seconds for instance filter.
:type max_duration: float
......@@ -23,7 +24,7 @@ def read_manifest(manifest_path, max_duration=float('inf'), min_duration=0.0):
:raises IOError: If failed to parse the manifest.
"""
manifest = []
for json_line in open(manifest_path):
for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
try:
json_data = json.loads(json_line)
except Exception as e:
......
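Each manifest line read here is a standalone JSON object with the fields written by the dataset preparation scripts; a hedged usage sketch (all values and paths are placeholders):

```
# One manifest line looks like (placeholder values):
#   {"audio_filepath": "/data/librispeech/utt001.wav", "duration": 3.52,
#    "text": "some transcript"}
from data_utils.utils import read_manifest

manifest = read_manifest('datasets/manifest.test', max_duration=27.0)
print(manifest[0]['audio_filepath'], manifest[0]['duration'],
      manifest[0]['text'])
```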
......@@ -11,11 +11,12 @@ from __future__ import print_function
import distutils.util
import os
import wget
import sys
import tarfile
import argparse
import soundfile
import json
import codecs
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
......@@ -66,7 +67,7 @@ def download(url, md5sum, target_dir):
filepath = os.path.join(target_dir, url.split("/")[-1])
if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url)
wget.download(url, target_dir)
os.system("wget -c " + url + " -P " + target_dir)
print("\nMD5 Chesksum %s ..." % filepath)
if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.")
......@@ -112,7 +113,7 @@ def create_manifest(data_dir, manifest_path):
'duration': duration,
'text': text
}))
with open(manifest_path, 'w') as out_file:
with codecs.open(manifest_path, 'w', 'utf-8') as out_file:
for line in json_lines:
out_file.write(line + '\n')
......
"""Prepare CHiME3 background data.
Download, unpack and create manifest files.
Manifest file is a json-format file with each line containing the
meta data (i.e. audio filepath, transcript and audio duration)
of each audio file in the data set.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import distutils.util
import os
import wget
import zipfile
import tarfile
import argparse
import soundfile
import json
from paddle.v2.dataset.common import md5file
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL = "https://d4s.myairbridge.com/packagev2/AG0Y3DNBE5IWRRTV/?dlid=W19XG7T0NNHB027139H0EQ"
MD5 = "c3ff512618d7a67d4f85566ea1bc39ec"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/chime3_background",
type=str,
help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument(
"--manifest_filepath",
default="manifest.chime3.background",
type=str,
help="Filepath for output manifests. (default: %(default)s)")
args = parser.parse_args()
def download(url, md5sum, target_dir, filename=None):
"""Download file from url to target_dir, and check md5sum."""
if filename == None:
filename = url.split("/")[-1]
if not os.path.exists(target_dir): os.makedirs(target_dir)
filepath = os.path.join(target_dir, filename)
if not (os.path.exists(filepath) and md5file(filepath) == md5sum):
print("Downloading %s ..." % url)
wget.download(url, target_dir)
print("\nMD5 Chesksum %s ..." % filepath)
if not md5file(filepath) == md5sum:
raise RuntimeError("MD5 checksum failed.")
else:
print("File exists, skip downloading. (%s)" % filepath)
return filepath
def unpack(filepath, target_dir):
"""Unpack the file to the target_dir."""
print("Unpacking %s ..." % filepath)
if filepath.endswith('.zip'):
zip = zipfile.ZipFile(filepath, 'r')
zip.extractall(target_dir)
zip.close()
elif filepath.endswith('.tar') or filepath.endswith('.tar.gz'):
        tar = tarfile.open(filepath)
tar.extractall(target_dir)
tar.close()
else:
raise ValueError("File format is not supported for unpacking.")
def create_manifest(data_dir, manifest_path):
"""Create a manifest json file summarizing the data set, with each line
containing the meta data (i.e. audio filepath, transcription text, audio
duration) of each audio file within the data set.
"""
print("Creating manifest %s ..." % manifest_path)
json_lines = []
for subfolder, _, filelist in sorted(os.walk(data_dir)):
for filename in filelist:
if filename.endswith('.wav'):
filepath = os.path.join(data_dir, subfolder, filename)
audio_data, samplerate = soundfile.read(filepath)
duration = float(len(audio_data)) / samplerate
json_lines.append(
json.dumps({
'audio_filepath': filepath,
'duration': duration,
'text': ''
}))
with open(manifest_path, 'w') as out_file:
for line in json_lines:
out_file.write(line + '\n')
def prepare_chime3(url, md5sum, target_dir, manifest_path):
"""Download, unpack and create summmary manifest file."""
if not os.path.exists(os.path.join(target_dir, "CHiME3")):
# download
filepath = download(url, md5sum, target_dir,
"myairbridge-AG0Y3DNBE5IWRRTV.zip")
# unpack
unpack(filepath, target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_bus.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_caf.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_ped.zip'), target_dir)
unpack(
os.path.join(target_dir, 'CHiME3_background_str.zip'), target_dir)
else:
print("Skip downloading and unpacking. Data already exists in %s." %
target_dir)
# create manifest json file
create_manifest(target_dir, manifest_path)
def main():
prepare_chime3(
url=URL,
md5sum=MD5,
target_dir=args.target_dir,
manifest_path=args.manifest_filepath)
if __name__ == '__main__':
main()
cd noise
python chime3_background.py
if [ $? -ne 0 ]; then
echo "Prepare CHiME3 background noise failed. Terminated."
exit 1
fi
cd -
cat noise/manifest.* > manifest.noise
echo "All done."
......@@ -205,9 +205,9 @@ def ctc_beam_search_decoder_batch(probs_split,
:type num_processes: int
:param cutoff_prob: Cutoff probability in pruning,
default 1.0, no pruning.
:type cutoff_prob: float
:param num_processes: Number of parallel processes.
:type num_processes: int
:type cutoff_prob: float
:param ext_scoring_func: External scoring function for
partially decoded sentence, e.g. word count
or language model.
......
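A hedged usage sketch of the batch decoder, mirroring how it is called from the evaluation and inference scripts in this change (the probabilities and vocabulary are toy placeholders, and passing ext_scoring_func=None is assumed to disable the external scorer):

```
# Toy sketch of batch CTC beam-search decoding.
from decoder import ctc_beam_search_decoder_batch

vocab_list = ['a', 'b', 'c']
# One utterance with two time steps; each row is a distribution over
# the vocabulary plus the trailing blank.
probs_split = [[[0.4, 0.3, 0.2, 0.1],
                [0.3, 0.4, 0.2, 0.1]]]
results = ctc_beam_search_decoder_batch(
    probs_split=probs_split,
    vocabulary=vocab_list,
    beam_size=10,
    blank_id=len(vocab_list),
    num_processes=1,
    cutoff_prob=1.0,
    ext_scoring_func=None)
# results[0] holds (log probability, transcript) pairs for the first
# utterance, best first, so results[0][0][1] is the top transcript.
print(results[0][0])
```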
"""Client-end for the ASR demo."""
from pynput import keyboard
import struct
import socket
import sys
import argparse
import pyaudio
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
args = parser.parse_args()
is_recording = False
enable_trigger_record = True
def on_press(key):
"""On-press keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.space:
if (not is_recording) and enable_trigger_record:
sys.stdout.write("Start Recording ... ")
sys.stdout.flush()
is_recording = True
def on_release(key):
"""On-release keyboard callback function."""
global is_recording, enable_trigger_record
if key == keyboard.Key.esc:
return False
elif key == keyboard.Key.space:
if is_recording == True:
is_recording = False
data_list = []
def callback(in_data, frame_count, time_info, status):
"""Audio recorder's stream callback function."""
global data_list, is_recording, enable_trigger_record
if is_recording:
data_list.append(in_data)
enable_trigger_record = False
elif len(data_list) > 0:
# Connect to server and send data
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((args.host_ip, args.host_port))
sent = ''.join(data_list)
sock.sendall(struct.pack('>i', len(sent)) + sent)
print('Speech[length=%d] Sent.' % len(sent))
# Receive data from the server and shut down
received = sock.recv(1024)
print "Recognition Results: {}".format(received)
sock.close()
data_list = []
enable_trigger_record = True
return (in_data, pyaudio.paContinue)
def main():
# prepare audio recorder
p = pyaudio.PyAudio()
stream = p.open(
format=pyaudio.paInt32,
channels=1,
rate=16000,
input=True,
stream_callback=callback)
stream.start_stream()
# prepare keyboard listener
with keyboard.Listener(
on_press=on_press, on_release=on_release) as listener:
listener.join()
# close up
stream.stop_stream()
stream.close()
p.terminate()
if __name__ == "__main__":
main()
"""Server-end for the ASR demo."""
import os
import time
import random
import argparse
import distutils.util
from time import gmtime, strftime
import SocketServer
import struct
import wave
import paddle.v2 as paddle
from utils import print_arguments
from data_utils.data import DataGenerator
from model import DeepSpeech2Model
from data_utils.utils import read_manifest
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--host_ip",
default="localhost",
type=str,
help="Server IP address. (default: %(default)s)")
parser.add_argument(
"--host_port",
default=8086,
type=int,
help="Server Port. (default: %(default)s)")
parser.add_argument(
"--speech_save_dir",
default="demo_cache",
type=str,
help="Directory for saving demo speech. (default: %(default)s)")
parser.add_argument(
"--vocab_filepath",
default='datasets/vocab/eng_vocab.txt',
type=str,
help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument(
"--warmup_manifest_path",
default='datasets/manifest.test',
type=str,
help="Manifest path for warmup test. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--num_conv_layers",
default=2,
type=int,
help="Convolution layer number. (default: %(default)s)")
parser.add_argument(
"--num_rnn_layers",
default=3,
type=int,
help="RNN layer number. (default: %(default)s)")
parser.add_argument(
"--rnn_layer_size",
default=512,
type=int,
help="RNN layer cell number. (default: %(default)s)")
parser.add_argument(
"--use_gpu",
default=True,
type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--model_filepath",
default='checkpoints/params.latest.tar.gz',
type=str,
help="Model filepath. (default: %(default)s)")
parser.add_argument(
"--decode_method",
default='beam_search',
type=str,
help="Method for ctc decoding: best_path or beam_search. "
"(default: %(default)s)")
parser.add_argument(
"--beam_size",
default=100,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--language_model_path",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
"--alpha",
default=0.36,
type=float,
help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument(
"--beta",
default=0.25,
type=float,
help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument(
"--cutoff_prob",
default=0.99,
type=float,
help="The cutoff probability of pruning"
"in beam search. (default: %(default)f)")
args = parser.parse_args()
class AsrTCPServer(SocketServer.TCPServer):
"""The ASR TCP Server."""
def __init__(self,
server_address,
RequestHandlerClass,
speech_save_dir,
audio_process_handler,
bind_and_activate=True):
self.speech_save_dir = speech_save_dir
self.audio_process_handler = audio_process_handler
SocketServer.TCPServer.__init__(
self, server_address, RequestHandlerClass, bind_and_activate=True)
class AsrRequestHandler(SocketServer.BaseRequestHandler):
"""The ASR request handler."""
def handle(self):
# receive data through TCP socket
chunk = self.request.recv(1024)
target_len = struct.unpack('>i', chunk[:4])[0]
data = chunk[4:]
while len(data) < target_len:
chunk = self.request.recv(1024)
data += chunk
# write to file
filename = self._write_to_file(data)
print("Received utterance[length=%d] from %s, saved to %s." %
(len(data), self.client_address[0], filename))
start_time = time.time()
transcript = self.server.audio_process_handler(filename)
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
self.request.sendall(transcript)
def _write_to_file(self, data):
# prepare save dir and filename
if not os.path.exists(self.server.speech_save_dir):
os.mkdir(self.server.speech_save_dir)
timestamp = strftime("%Y%m%d%H%M%S", gmtime())
out_filename = os.path.join(
self.server.speech_save_dir,
timestamp + "_" + self.client_address[0] + ".wav")
# write to wav file
file = wave.open(out_filename, 'wb')
file.setnchannels(1)
file.setsampwidth(4)
file.setframerate(16000)
file.writeframes(data)
file.close()
return out_filename
def warm_up_test(audio_process_handler,
manifest_path,
num_test_cases,
random_seed=0):
"""Warming-up test."""
manifest = read_manifest(manifest_path)
rng = random.Random(random_seed)
samples = rng.sample(manifest, num_test_cases)
for idx, sample in enumerate(samples):
print("Warm-up Test Case %d: %s", idx, sample['audio_filepath'])
start_time = time.time()
transcript = audio_process_handler(sample['audio_filepath'])
finish_time = time.time()
print("Response Time: %f, Transcript: %s" %
(finish_time - start_time, transcript))
def start_server():
"""Start the ASR server"""
# prepare data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=1)
# prepare ASR model
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# prepare ASR inference handler
def file_to_transcript(filename):
feature = data_generator.process_utterance(filename, "")
result_transcript = ds2_model.infer_batch(
infer_data=[feature],
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=1)
return result_transcript[0]
    # warming up with utterances sampled from Librispeech
print('-----------------------------------------------------------')
print('Warming up ...')
warm_up_test(
audio_process_handler=file_to_transcript,
manifest_path=args.warmup_manifest_path,
num_test_cases=3)
print('-----------------------------------------------------------')
# start the server
server = AsrTCPServer(
server_address=(args.host_ip, args.host_port),
RequestHandlerClass=AsrRequestHandler,
speech_save_dir=args.speech_save_dir,
audio_process_handler=file_to_transcript)
print("ASR Server Started.")
server.serve_forever()
def main():
print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
start_server()
if __name__ == "__main__":
main()
......@@ -10,47 +10,54 @@ import numpy as np
def _levenshtein_distance(ref, hyp):
"""Levenshtein distance is a string metric for measuring the difference between
two sequences. Informally, the levenshtein disctance is defined as the minimum
number of single-character edits (substitutions, insertions or deletions)
required to change one word into the other. We can naturally extend the edits to
word level when calculate levenshtein disctance for two sentences.
"""Levenshtein distance is a string metric for measuring the difference
between two sequences. Informally, the levenshtein disctance is defined as
the minimum number of single-character edits (substitutions, insertions or
deletions) required to change one word into the other. We can naturally
extend the edits to word level when calculate levenshtein disctance for
two sentences.
"""
ref_len = len(ref)
hyp_len = len(hyp)
m = len(ref)
n = len(hyp)
# special case
if ref == hyp:
return 0
if ref_len == 0:
return hyp_len
if hyp_len == 0:
return ref_len
if m == 0:
return n
if n == 0:
return m
distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32)
if m < n:
ref, hyp = hyp, ref
m, n = n, m
# use O(min(m, n)) space
distance = np.zeros((2, n + 1), dtype=np.int32)
# initialize distance matrix
for j in xrange(hyp_len + 1):
for j in xrange(n + 1):
distance[0][j] = j
for i in xrange(ref_len + 1):
distance[i][0] = i
# calculate levenshtein distance
for i in xrange(1, ref_len + 1):
for j in xrange(1, hyp_len + 1):
for i in xrange(1, m + 1):
prev_row_idx = (i - 1) % 2
cur_row_idx = i % 2
distance[cur_row_idx][0] = i
for j in xrange(1, n + 1):
if ref[i - 1] == hyp[j - 1]:
distance[i][j] = distance[i - 1][j - 1]
distance[cur_row_idx][j] = distance[prev_row_idx][j - 1]
else:
s_num = distance[i - 1][j - 1] + 1
i_num = distance[i][j - 1] + 1
d_num = distance[i - 1][j] + 1
distance[i][j] = min(s_num, i_num, d_num)
s_num = distance[prev_row_idx][j - 1] + 1
i_num = distance[cur_row_idx][j - 1] + 1
d_num = distance[prev_row_idx][j] + 1
distance[cur_row_idx][j] = min(s_num, i_num, d_num)
return distance[ref_len][hyp_len]
return distance[m % 2][n]
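For intuition, the classic example "kitten" -> "sitting" needs three edits (two substitutions and one insertion):

```
# Classic Levenshtein example: k->s, e->i, then insert a trailing g.
print(_levenshtein_distance('kitten', 'sitting'))  # 3
```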
def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
"""Calculate word error rate (WER). WER compares reference text and
"""Calculate word error rate (WER). WER compares reference text and
hypothesis text in word-level. WER is defined as:
.. math::
......@@ -65,8 +72,8 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
Iw is the number of words inserted,
Nw is the number of words in the reference
We can use levenshtein distance to calculate WER. Please draw an attention that
empty items will be removed when splitting sentences by delimiter.
    We can use Levenshtein distance to calculate WER. Note that empty items
    will be removed when splitting sentences by the delimiter.
:param reference: The reference sentence.
:type reference: basestring
......@@ -95,7 +102,7 @@ def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
return wer
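A small worked example: against the reference "the cat sat", the hypothesis "the cat sit on" needs one substitution and one insertion over three reference words, giving a WER of 2/3:

```
# Worked WER example: 2 word-level edits over a 3-word reference.
print(wer('the cat sat', 'the cat sit on'))  # ~0.6667
```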
def cer(reference, hypothesis, ignore_case=False):
def cer(reference, hypothesis, ignore_case=False, remove_space=False):
"""Calculate charactor error rate (CER). CER compares reference text and
hypothesis text in char-level. CER is defined as:
......@@ -111,10 +118,10 @@ def cer(reference, hypothesis, ignore_case=False):
Ic is the number of characters inserted
Nc is the number of characters in the reference
We can use levenshtein distance to calculate CER. Chinese input should be
encoded to unicode. Please draw an attention that the leading and tailing
white space characters will be truncated and multiple consecutive white
space characters in a sentence will be replaced by one white space character.
    We can use Levenshtein distance to calculate CER. Chinese input should be
    encoded to unicode. Note that leading and trailing space characters will
    be truncated and multiple consecutive space characters in a sentence will
    be replaced by one space character.
:param reference: The reference sentence.
:type reference: basestring
......@@ -122,6 +129,8 @@ def cer(reference, hypothesis, ignore_case=False):
:type hypothesis: basestring
:param ignore_case: Whether case-sensitive or not.
:type ignore_case: bool
    :param remove_space: Whether to remove internal space characters.
:type remove_space: bool
:return: Character error rate.
:rtype: float
:raises ValueError: If the reference length is zero.
......@@ -130,8 +139,12 @@ def cer(reference, hypothesis, ignore_case=False):
reference = reference.lower()
hypothesis = hypothesis.lower()
reference = ' '.join(filter(None, reference.split(' ')))
hypothesis = ' '.join(filter(None, hypothesis.split(' ')))
join_char = ' '
if remove_space == True:
join_char = ''
reference = join_char.join(filter(None, reference.split(' ')))
hypothesis = join_char.join(filter(None, hypothesis.split(' ')))
if len(reference) == 0:
raise ValueError("Length of reference should be greater than 0.")
......
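Analogously for CER: comparing "abcd" against "abed" at character level needs one substitution over four reference characters, giving 0.25:

```
# Worked CER example: 1 character substitution over a 4-character reference.
print(cer('abcd', 'abed'))  # 0.25
```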
......@@ -5,20 +5,24 @@ from __future__ import print_function
import distutils.util
import argparse
import gzip
import multiprocessing
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import deep_speech2
from decoder import *
from lm.lm_scorer import LmScorer
from error_rate import wer
from model import DeepSpeech2Model
from error_rate import wer, cer
import utils
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--batch_size",
default=100,
default=128,
type=int,
help="Minibatch size for evaluation. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument(
"--num_conv_layers",
default=2,
......@@ -41,12 +45,12 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--num_threads_data",
default=multiprocessing.cpu_count(),
default=multiprocessing.cpu_count() // 2,
type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument(
"--num_processes_beam_search",
default=multiprocessing.cpu_count(),
default=multiprocessing.cpu_count() // 2,
type=int,
help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
......@@ -58,21 +62,21 @@ parser.add_argument(
"--decode_method",
default='beam_search',
type=str,
help="Method for ctc decoding, best_path or beam_search. (default: %(default)s)"
)
help="Method for ctc decoding, best_path or beam_search. "
"(default: %(default)s)")
parser.add_argument(
"--language_model_path",
default="lm/data/1Billion.klm",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
"--alpha",
default=0.26,
default=0.36,
type=float,
help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument(
"--beta",
default=0.1,
default=0.25,
type=float,
help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument(
......@@ -86,6 +90,12 @@ parser.add_argument(
default=500,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--decode_manifest_path",
default='datasets/manifest.test',
......@@ -101,102 +111,68 @@ parser.add_argument(
default='datasets/vocab/eng_vocab.txt',
type=str,
help="Vocabulary filepath. (default: %(default)s)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args()
def evaluate():
"""Evaluate on whole test data for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path,
batch_size=args.batch_size,
min_batch_size=1,
sortagrad=False,
shuffle_method=None)
# define inferer
inferer = paddle.inference.Inference(
output_layer=output_probs, parameters=parameters)
# initialize external scorer for beam search decoding
if args.decode_method == 'beam_search':
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
wer_counter, wer_sum = 0, 0.0
error_rate_func = cer if args.error_rate_type == 'cer' else wer
error_sum, num_ins = 0.0, 0
for infer_data in batch_reader():
# run inference
infer_results = inferer.infer(input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
result_transcripts = ds2_model.infer_batch(
infer_data=infer_data,
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=args.num_processes_beam_search)
target_transcripts = [
''.join([data_generator.vocab_list[token] for token in transcript])
for _, transcript in infer_data
]
# target transcription
target_transcription = [
''.join([
data_generator.vocab_list[index] for index in infer_data[i][1]
]) for i, probs in enumerate(probs_split)
]
# decode and print
# best path decode
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
wer_sum += wer(target_transcription[i], output_transcription)
wer_counter += 1
# beam search decode
elif args.decode_method == "beam_search":
# beam search using multiple processes
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list),
num_processes=args.num_processes_beam_search,
ext_scoring_func=ext_scorer,
cutoff_prob=args.cutoff_prob, )
for i, beam_search_result in enumerate(beam_search_results):
wer_sum += wer(target_transcription[i],
beam_search_result[0][1])
wer_counter += 1
else:
raise ValueError("Decoding method [%s] is not supported." %
decode_method)
print("Final WER = %f" % (wer_sum / wer_counter))
for target, result in zip(target_transcripts, result_transcripts):
error_sum += error_rate_func(target, result)
num_ins += 1
print("Error rate [%s] (%d/?) = %f" %
(args.error_rate_type, num_ins, error_sum / num_ins))
print("Final error rate [%s] (%d/%d) = %f" %
(args.error_rate_type, num_ins, num_ins, error_sum / num_ins))
def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
evaluate()
......
......@@ -4,15 +4,12 @@ from __future__ import division
from __future__ import print_function
import argparse
import gzip
import distutils.util
import multiprocessing
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import deep_speech2
from decoder import *
from lm.lm_scorer import LmScorer
from error_rate import wer
from model import DeepSpeech2Model
from error_rate import wer, cer
import utils
parser = argparse.ArgumentParser(description=__doc__)
......@@ -43,14 +40,25 @@ parser.add_argument(
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--num_threads_data",
default=multiprocessing.cpu_count(),
default=1,
type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument(
"--num_processes_beam_search",
default=multiprocessing.cpu_count(),
default=multiprocessing.cpu_count() // 2,
type=int,
help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
......@@ -75,31 +83,26 @@ parser.add_argument(
"--decode_method",
default='beam_search',
type=str,
help="Method for ctc decoding: best_path or beam_search. (default: %(default)s)"
)
help="Method for ctc decoding: best_path or beam_search. "
"(default: %(default)s)")
parser.add_argument(
"--beam_size",
default=500,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--num_results_per_sample",
default=1,
type=int,
help="Number of output per sample in beam search. (default: %(default)d)")
parser.add_argument(
"--language_model_path",
default="lm/data/1Billion.klm",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
"--alpha",
default=0.26,
default=0.36,
type=float,
help="Parameter associated with language model. (default: %(default)f)")
parser.add_argument(
"--beta",
default=0.1,
default=0.25,
type=float,
help="Parameter associated with word count. (default: %(default)f)")
parser.add_argument(
......@@ -108,41 +111,25 @@ parser.add_argument(
type=float,
help="The cutoff probability of pruning"
"in beam search. (default: %(default)f)")
parser.add_argument(
"--error_rate_type",
default='wer',
choices=['wer', 'cer'],
type=str,
help="Error rate type for evaluation. 'wer' for word error rate and 'cer' "
"for character error rate. "
"(default: %(default)s)")
args = parser.parse_args()
def infer():
"""Inference for DeepSpeech2."""
# initialize data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only an placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path,
batch_size=args.num_samples,
......@@ -151,66 +138,38 @@ def infer():
shuffle_method=None)
infer_data = batch_reader().next()
# run inference
infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(len(infer_data))
]
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
result_transcripts = ds2_model.infer_batch(
infer_data=infer_data,
decode_method=args.decode_method,
beam_alpha=args.alpha,
beam_beta=args.beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=args.num_processes_beam_search)
    # target transcription
target_transcription = [
''.join(
[data_generator.vocab_list[index] for index in infer_data[i][1]])
for i, probs in enumerate(probs_split)
error_rate_func = cer if args.error_rate_type == 'cer' else wer
target_transcripts = [
''.join([data_generator.vocab_list[token] for token in transcript])
for _, transcript in infer_data
]
## decode and print
# best path decode
wer_sum, wer_counter = 0, 0
if args.decode_method == "best_path":
for i, probs in enumerate(probs_split):
best_path_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=data_generator.vocab_list)
print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target_transcription[i], best_path_transcription))
wer_cur = wer(target_transcription[i], best_path_transcription)
wer_sum += wer_cur
wer_counter += 1
print("cur wer = %f, average wer = %f" %
(wer_cur, wer_sum / wer_counter))
# beam search decode
elif args.decode_method == "beam_search":
ext_scorer = LmScorer(args.alpha, args.beta, args.language_model_path)
beam_search_batch_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
beam_size=args.beam_size,
blank_id=len(data_generator.vocab_list),
num_processes=args.num_processes_beam_search,
cutoff_prob=args.cutoff_prob,
ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_batch_results):
print("\nTarget Transcription:\t%s" % target_transcription[i])
for index in xrange(args.num_results_per_sample):
result = beam_search_result[index]
#output: index, log prob, beam result
print("Beam %d: %f \t%s" % (index, result[0], result[1]))
wer_cur = wer(target_transcription[i], beam_search_result[0][1])
wer_sum += wer_cur
wer_counter += 1
print("cur wer = %f , average wer = %f" %
(wer_cur, wer_sum / wer_counter))
else:
raise ValueError("Decoding method [%s] is not supported." %
decode_method)
for target, result in zip(target_transcripts, result_transcripts):
print("\nTarget Transcription: %s\nOutput Transcription: %s" %
(target, result))
print("Current error rate [%s] = %f" %
(args.error_rate_type, error_rate_func(target, result)))
def main():
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
infer()
......
"""Contains DeepSpeech2 layers."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle.v2 as paddle
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act):
"""Convolution layer with batch normalization.
:param input: Input layer.
:type input: LayerOutput
:param filter_size: The x dimension of a filter kernel. Or input a tuple for
two image dimension.
:type filter_size: int|tuple|list
:param num_channels_in: Number of input channels.
:type num_channels_in: int
    :param num_channels_out: Number of output channels.
    :type num_channels_out: int
    :param stride: The x dimension of the stride. Or input a tuple for two
                   image dimension.
    :type stride: int|tuple|list
:param padding: The x dimension of the padding. Or input a tuple for two
image dimension.
:type padding: int|tuple|list
:param act: Activation type.
:type act: BaseActivation
:return: Batch norm layer after convolution layer.
:rtype: LayerOutput
"""
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""Bidirectonal simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
:param name: Name of the layer.
:type name: string
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells.
:type size: int
:param act: Activation type.
:type act: BaseActivation
:return: Bidirectional simple rnn layer.
:rtype: LayerOutput
"""
    # input-hidden weights shared across bi-directional rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""Convolution group with stacked convolution layers.
:param input: Input layer.
:type input: LayerOutput
:param num_stacks: Number of stacked convolution layers.
:type num_stacks: int
:return: Output layer of the convolution group.
:rtype: LayerOutput
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""RNN group with stacked bidirectional simple RNN layers.
:param input: Input layer.
:type input: LayerOutput
:param size: Number of RNN cells in each layer.
:type size: int
:param num_stacks: Number of stacked rnn layers.
:type num_stacks: int
:return: Output layer of the RNN group.
:rtype: LayerOutput
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
:param num_conv_layers: Number of stacking convolution layers.
:type num_conv_layers: int
:param num_rnn_layers: Number of stacking RNN layers.
:type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells).
:type rnn_size: int
    :return: A tuple of an output probability distribution layer (after
             softmax) and a ctc cost layer.
:rtype: tuple of LayerOutput
"""
# convolution group
conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
input=audio_data, num_stacks=num_conv_layers)
# convert data form convolution feature map to sequence of vectors
conv2seq = paddle.layer.block_expand(
input=conv_group_output,
num_channels=conv_group_num_channels,
stride_x=1,
stride_y=1,
block_x=1,
block_y=conv_group_height)
# rnn group
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers)
fc = paddle.layer.fc(
input=rnn_group_output,
size=dict_size + 1,
act=paddle.activation.Linear(),
bias_attr=True)
# probability distribution with softmax
log_probs = paddle.layer.mixed(
input=paddle.layer.identity_projection(input=fc),
act=paddle.activation.Softmax())
# ctc cost
ctc_loss = paddle.layer.warp_ctc(
input=fc,
label=text_data,
size=dict_size + 1,
blank=dict_size,
norm_by_times=True)
return log_probs, ctc_loss
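A hedged sketch of wiring these layers into a network configuration, following the data layer shapes used by the evaluation script in this change (vocab_size is a placeholder; 161 * 161 is only a placeholder shape, as the real one is induced from the batch data):

```
# Sketch: build the DeepSpeech2 network from the layers above.
import paddle.v2 as paddle
from layer import deep_speech2

vocab_size = 28  # placeholder vocabulary size
audio_data = paddle.layer.data(
    name="audio_spectrogram",
    type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
    name="transcript_text",
    type=paddle.data_type.integer_value_sequence(vocab_size))
log_probs, ctc_loss = deep_speech2(
    audio_data=audio_data,
    text_data=text_data,
    dict_size=vocab_size,
    num_conv_layers=2,
    num_rnn_layers=3,
    rnn_size=256)
```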
echo "Downloading language model."
echo "Downloading language model ..."
mkdir data
LM=common_crawl_00.prune01111.trie.klm
MD5="099a601759d467cd0a8523ff939819c5"
wget -c http://paddlepaddle.bj.bcebos.com/model_zoo/speech/$LM -P ./data
echo "Checking md5sum ..."
md5_tmp=`md5sum ./data/$LM | awk -F[' '] '{print $1}'`
if [ $MD5 != $md5_tmp ]; then
echo "Fail to download the language model!"
exit 1
fi
wget -c ftp://xxx/xxx/en.00.UNKNOWN.klm -P ./data
......@@ -3,141 +3,244 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import time
import gzip
from decoder import *
from lm.lm_scorer import LmScorer
import paddle.v2 as paddle
from layer import *
def conv_bn_layer(input, filter_size, num_channels_in, num_channels_out, stride,
padding, act):
"""
Convolution layer with batch normalization.
"""
conv_layer = paddle.layer.img_conv(
input=input,
filter_size=filter_size,
num_channels=num_channels_in,
num_filters=num_channels_out,
stride=stride,
padding=padding,
act=paddle.activation.Linear(),
bias_attr=False)
return paddle.layer.batch_norm(input=conv_layer, act=act)
def bidirectional_simple_rnn_bn_layer(name, input, size, act):
"""
    Bidirectional simple rnn layer with sequence-wise batch normalization.
The batch normalization is only performed on input-state weights.
"""
    # input-hidden weights shared across bi-directional rnn.
input_proj = paddle.layer.fc(
input=input, size=size, act=paddle.activation.Linear(), bias_attr=False)
# batch norm is only performed on input-state projection
input_proj_bn = paddle.layer.batch_norm(
input=input_proj, act=paddle.activation.Linear())
# forward and backward in time
forward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=False)
backward_simple_rnn = paddle.layer.recurrent(
input=input_proj_bn, act=act, reverse=True)
return paddle.layer.concat(input=[forward_simple_rnn, backward_simple_rnn])
def conv_group(input, num_stacks):
"""
Convolution group with several stacking convolution layers.
"""
conv = conv_bn_layer(
input=input,
filter_size=(11, 41),
num_channels_in=1,
num_channels_out=32,
stride=(3, 2),
padding=(5, 20),
act=paddle.activation.BRelu())
for i in xrange(num_stacks - 1):
conv = conv_bn_layer(
input=conv,
filter_size=(11, 21),
num_channels_in=32,
num_channels_out=32,
stride=(1, 2),
padding=(5, 10),
act=paddle.activation.BRelu())
output_num_channels = 32
output_height = 160 // pow(2, num_stacks) + 1
return conv, output_num_channels, output_height
def rnn_group(input, size, num_stacks):
"""
RNN group with several stacking RNN layers.
"""
output = input
for i in xrange(num_stacks):
output = bidirectional_simple_rnn_bn_layer(
name=str(i), input=output, size=size, act=paddle.activation.BRelu())
return output
def deep_speech2(audio_data,
text_data,
dict_size,
num_conv_layers=2,
num_rnn_layers=3,
rnn_size=256,
is_inference=False):
"""
The whole DeepSpeech2 model structure (a simplified version).
:param audio_data: Audio spectrogram data layer.
:type audio_data: LayerOutput
:param text_data: Transcription text data layer.
:type text_data: LayerOutput
:param dict_size: Dictionary size for tokenized transcription.
:type dict_size: int
class DeepSpeech2Model(object):
"""DeepSpeech2Model class.
:param vocab_size: Decoding vocabulary size.
:type vocab_size: int
:param num_conv_layers: Number of stacking convolution layers.
:type num_conv_layers: int
:param num_rnn_layers: Number of stacking RNN layers.
:type num_rnn_layers: int
:param rnn_size: RNN layer size (number of RNN cells).
:type rnn_size: int
:param is_inference: False in the training mode, and True in the
inference mode.
:type is_inference: bool
:return: If is_inference set False, return a ctc cost layer;
if is_inference set True, return a sequence layer of output
probability distribution.
:rtype: tuple of LayerOutput
:param rnn_layer_size: RNN layer size (number of RNN cells).
:type rnn_layer_size: int
:param pretrained_model_path: Pretrained model path. If None, will train
from scratch.
:type pretrained_model_path: basestring|None
"""
# convolution group
conv_group_output, conv_group_num_channels, conv_group_height = conv_group(
input=audio_data, num_stacks=num_conv_layers)
# convert data form convolution feature map to sequence of vectors
conv2seq = paddle.layer.block_expand(
input=conv_group_output,
num_channels=conv_group_num_channels,
stride_x=1,
stride_y=1,
block_x=1,
block_y=conv_group_height)
# rnn group
rnn_group_output = rnn_group(
input=conv2seq, size=rnn_size, num_stacks=num_rnn_layers)
fc = paddle.layer.fc(
input=rnn_group_output,
size=dict_size + 1,
act=paddle.activation.Linear(),
bias_attr=True)
if is_inference:
# probability distribution with softmax
return paddle.layer.mixed(
input=paddle.layer.identity_projection(input=fc),
act=paddle.activation.Softmax())
else:
# ctc cost
return paddle.layer.warp_ctc(
input=fc,
label=text_data,
size=dict_size + 1,
blank=dict_size,
norm_by_times=True)
def __init__(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size, pretrained_model_path):
self._create_network(vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size)
self._create_parameters(pretrained_model_path)
self._inferer = None
self._loss_inferer = None
self._ext_scorer = None
def train(self,
train_batch_reader,
dev_batch_reader,
feeding_dict,
learning_rate,
gradient_clipping,
num_passes,
output_model_dir,
is_local=True,
num_iterations_print=100):
"""Train the model.
:param train_batch_reader: Train data reader.
:type train_batch_reader: callable
:param dev_batch_reader: Validation data reader.
:type dev_batch_reader: callable
:param feeding_dict: Feeding is a map of field name and tuple index
of the data that reader returns.
:type feeding_dict: dict|list
:param learning_rate: Learning rate for ADAM optimizer.
:type learning_rate: float
:param gradient_clipping: Gradient clipping threshold.
:type gradient_clipping: float
:param num_passes: Number of training epochs.
:type num_passes: int
:param num_iterations_print: Number of training iterations for printing
a training loss.
:type num_iterations_print: int
:param is_local: Set to False if running with pserver with multi-nodes.
:type is_local: bool
:param output_model_dir: Directory for saving the model (every pass).
:type output_model_dir: basestring
"""
# prepare model output directory
if not os.path.exists(output_model_dir):
os.mkdir(output_model_dir)
# prepare optimizer and trainer
optimizer = paddle.optimizer.Adam(
learning_rate=learning_rate,
gradient_clipping_threshold=gradient_clipping)
trainer = paddle.trainer.SGD(
cost=self._loss,
parameters=self._parameters,
update_equation=optimizer,
is_local=is_local)
# create event handler
def event_handler(event):
global start_time, cost_sum, cost_counter
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if (event.batch_id + 1) % num_iterations_print == 0:
output_model_path = os.path.join(output_model_dir,
"params.latest.tar.gz")
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\nPass: %d, Batch: %d, TrainCost: %f" %
(event.pass_id, event.batch_id + 1,
cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=dev_batch_reader, feeding=feeding_dict)
output_model_path = os.path.join(
output_model_dir, "params.pass-%d.tar.gz" % event.pass_id)
with gzip.open(output_model_path, 'w') as f:
self._parameters.to_tar(f)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
num_passes=num_passes,
feeding=feeding_dict)
def infer_loss_batch(self, infer_data):
"""Model inference. Infer the ctc loss for a batch of speech
utterances.
:param infer_data: List of utterances to infer, with each utterance a
tuple of audio features and transcription text (empty
string).
:type infer_data: list
:return: List of ctc loss.
:rtype: List of float
"""
# define inferer
if self._loss_inferer is None:
self._loss_inferer = paddle.inference.Inference(
output_layer=self._loss, parameters=self._parameters)
# run inference
return self._loss_inferer.infer(input=infer_data)
def infer_batch(self, infer_data, decode_method, beam_alpha, beam_beta,
beam_size, cutoff_prob, vocab_list, language_model_path,
num_processes):
"""Model inference. Infer the transcription for a batch of speech
utterances.
:param infer_data: List of utterances to infer, with each utterance
consisting of a tuple of audio features and
transcription text (empty string).
:type infer_data: list
:param decode_method: Decoding method name, 'best_path' or
'beam_search'.
:type decode_method: string
:param beam_alpha: Parameter associated with language model.
:type beam_alpha: float
:param beam_beta: Parameter associated with word count.
:type beam_beta: float
:param beam_size: Width for Beam search.
:type beam_size: int
:param cutoff_prob: Cutoff probability in pruning,
default 1.0, no pruning.
:type cutoff_prob: float
:param vocab_list: List of tokens in the vocabulary, for decoding.
:type vocab_list: list
:param language_model_path: Filepath for language model.
:type language_model_path: basestring|None
:param num_processes: Number of processes (CPU) for decoder.
:type num_processes: int
:return: List of transcription texts.
:rtype: List of basestring
"""
# define inferer
if self._inferer is None:
self._inferer = paddle.inference.Inference(
output_layer=self._log_probs, parameters=self._parameters)
# run inference
infer_results = self._inferer.infer(input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
]
# run decoder
results = []
if decode_method == "best_path":
# best path decode
for i, probs in enumerate(probs_split):
output_transcription = ctc_best_path_decoder(
probs_seq=probs, vocabulary=vocab_list)
results.append(output_transcription)
elif decode_method == "beam_search":
# initialize external scorer
if self._ext_scorer is None:
self._ext_scorer = LmScorer(beam_alpha, beam_beta,
language_model_path)
self._loaded_lm_path = language_model_path
else:
self._ext_scorer.reset_params(beam_alpha, beam_beta)
assert self._loaded_lm_path == language_model_path
# beam search decode
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=vocab_list,
beam_size=beam_size,
blank_id=len(vocab_list),
num_processes=num_processes,
ext_scoring_func=self._ext_scorer,
cutoff_prob=cutoff_prob)
results = [result[0][1] for result in beam_search_results]
else:
raise ValueError("Decoding method [%s] is not supported." %
decode_method)
return results
def _create_parameters(self, model_path=None):
"""Load or create model parameters."""
if model_path is None:
self._parameters = paddle.parameters.create(self._loss)
else:
self._parameters = paddle.parameters.Parameters.from_tar(
gzip.open(model_path))
def _create_network(self, vocab_size, num_conv_layers, num_rnn_layers,
rnn_layer_size):
"""Create data layers and model network."""
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only a placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram",
type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(vocab_size))
self._log_probs, self._loss = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=vocab_size,
num_conv_layers=num_conv_layers,
num_rnn_layers=num_rnn_layers,
rnn_size=rnn_layer_size)
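For reference, a minimal decoding sketch using the `DeepSpeech2Model` class defined above. It assumes a trained checkpoint, a prepared `DataGenerator` and a KenLM language model file are already available; every path and hyper-parameter value below is a placeholder rather than a recommended setting.

```
# Minimal decoding sketch for DeepSpeech2Model (paths and values are placeholders).
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import DeepSpeech2Model

paddle.init(use_gpu=False, trainer_count=1)
data_generator = DataGenerator(
    vocab_filepath='datasets/vocab/eng_vocab.txt',   # placeholder path
    mean_std_filepath='mean_std.npz',                # placeholder path
    augmentation_config='{}',
    specgram_type='linear',
    num_threads=1)
batch_reader = data_generator.batch_reader_creator(
    manifest_path='datasets/manifest.test',          # placeholder path
    batch_size=16,
    sortagrad=False,
    shuffle_method=None)
infer_data = batch_reader().next()

ds2_model = DeepSpeech2Model(
    vocab_size=data_generator.vocab_size,
    num_conv_layers=2,
    num_rnn_layers=3,
    rnn_layer_size=512,                              # placeholder value
    pretrained_model_path='checkpoints/params.latest.tar.gz')
transcripts = ds2_model.infer_batch(
    infer_data=infer_data,
    decode_method='beam_search',
    beam_alpha=0.36,                                 # placeholder value
    beam_beta=0.25,                                  # placeholder value
    beam_size=500,
    cutoff_prob=1.0,
    vocab_list=data_generator.vocab_list,
    language_model_path='lm/data/common_crawl_00.prune01111.trie.klm',
    num_processes=4)
for text in transcripts:
    print(text)
```

The same class drives training via `ds2_model.train(...)`, as shown in the updated `train.py` further down in this diff.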
wget==3.2
scipy==0.13.1
resampy==0.1.5
https://github.com/kpu/kenlm/archive/master.zip
SoundFile==0.9.0.post1
python_speech_features
https://github.com/luotao1/kenlm/archive/master.zip
......@@ -9,25 +9,21 @@ if [ $? != 0 ]; then
exit 1
fi
# install package Soundfile
curl -O "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz"
# install package libsndfile
python -c "import soundfile"
if [ $? != 0 ]; then
echo "Download libsndfile-1.0.28.tar.gz failed !!!"
exit 1
echo "Install package libsndfile into default system path."
wget "http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.28.tar.gz"
if [ $? != 0 ]; then
echo "Download libsndfile-1.0.28.tar.gz failed !!!"
exit 1
fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd ..
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
fi
tar -zxvf libsndfile-1.0.28.tar.gz
cd libsndfile-1.0.28
./configure && make && make install
cd -
rm -rf libsndfile-1.0.28
rm libsndfile-1.0.28.tar.gz
pip install SoundFile==0.9.0.post1
if [ $? != 0 ]; then
echo "Install SoundFile failed !!!"
exit 1
fi
# prepare ./checkpoints
mkdir -p checkpoints
echo "Install all dependencies successfully."
......@@ -11,16 +11,54 @@ import error_rate
class TestParse(unittest.TestCase):
def test_wer_1(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night'
hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last '\
'night'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
def test_wer_2(self):
ref = 'as any in england i would say said gamewell proudly that is '\
'in his day'
hyp = 'as any in england i would say said came well proudly that is '\
'in his day'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.1333333) < 1e-6)
def test_wer_3(self):
ref = 'the lieutenant governor lilburn w boggs afterward governor '\
'was a pronounced mormon hater and throughout the period of '\
'the troubles he manifested sympathy with the persecutors'
hyp = 'the lieutenant governor little bit how bags afterward '\
'governor was a pronounced warman hater and throughout the '\
'period of th troubles he manifests sympathy with the '\
'persecutors'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2692307692) < 1e-6)
def test_wer_4(self):
ref = 'the wood flamed up splendidly under the large brewing copper '\
'and it sighed so deeply'
hyp = 'the wood flame do splendidly under the large brewing copper '\
'and its side so deeply'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.2666666667) < 1e-6)
def test_wer_5(self):
ref = 'all the morning they trudged up the mountain path and at noon '\
'unc and ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
hyp = 'all the morning they trudged up the mountain path and at noon '\
'unc in ojo sat on a fallen tree trunk and ate the last of '\
'the bread which the old munchkin had placed in his pocket'
word_error_rate = error_rate.wer(ref, hyp)
self.assertTrue(abs(word_error_rate - 0.027027027) < 1e-6)
def test_wer_6(self):
ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
word_error_rate = error_rate.wer(ref, ref)
self.assertEqual(word_error_rate, 0.0)
def test_wer_3(self):
def test_wer_7(self):
ref = ' '
hyp = 'Hypothesis sentence'
with self.assertRaises(ValueError):
......@@ -33,22 +71,40 @@ class TestParse(unittest.TestCase):
self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
def test_cer_2(self):
ref = 'werewolf'
hyp = 'weae wolf'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.125) < 1e-6)
def test_cer_3(self):
ref = 'were wolf'
hyp = 'were wolf'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.0) < 1e-6)
def test_cer_4(self):
ref = 'werewolf'
char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0)
def test_cer_3(self):
def test_cer_5(self):
ref = u'我是中国人'
hyp = u'我是 美洲人'
char_error_rate = error_rate.cer(ref, hyp)
self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
def test_cer_4(self):
def test_cer_6(self):
ref = u'我 是 中 国 人'
hyp = u'我 是 美 洲 人'
char_error_rate = error_rate.cer(ref, hyp, remove_space=True)
self.assertTrue(abs(char_error_rate - 0.4) < 1e-6)
def test_cer_7(self):
ref = u'我是中国人'
char_error_rate = error_rate.cer(ref, ref)
self.assertEqual(char_error_rate, 0.0)
def test_cer_5(self):
def test_cer_8(self):
ref = ''
hyp = 'Hypothesis'
with self.assertRaises(ValueError):
......
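The assertions above exercise `error_rate.wer` and `error_rate.cer`, whose implementation is not part of this diff. For orientation, here is a hedged sketch of how metrics of this kind are commonly computed: a Levenshtein edit distance between hypothesis and reference, normalized by the reference length. The real `error_rate` module may differ in details such as tokenization, space handling or memory optimizations.

```
# Hedged sketch of a word-error-rate computation; illustration only.
def _levenshtein(ref_tokens, hyp_tokens):
    """Dynamic-programming edit distance between two token lists."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]


def wer_sketch(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if len(ref_words) == 0:
        raise ValueError("Reference sentence must not be empty.")
    return float(_levenshtein(ref_words, hypothesis.split())) / len(ref_words)
```

A character error rate works the same way over character lists, optionally with spaces removed first, as the `remove_space=True` cases above suggest.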
"""Test Setup."""
import unittest
import numpy as np
import os
class TestSetup(unittest.TestCase):
def test_soundfile(self):
import soundfile as sf
# floating point data is typically limited to the interval [-1.0, 1.0],
# but smaller/larger values are supported as well
data = np.array([[1.75, -1.75], [1.0, -1.0], [0.5, -0.5],
[0.25, -0.25]])
file = 'test.wav'
sf.write(file, data, 44100, format='WAV', subtype='FLOAT')
read, fs = sf.read(file)
self.assertTrue(np.all(read == data))
self.assertEqual(fs, 44100)
os.remove(file)
if __name__ == '__main__':
unittest.main()
"""Set up paths for DS2"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os.path
import sys
def add_path(path):
if path not in sys.path:
sys.path.insert(0, path)
this_dir = os.path.dirname(__file__)
# Add project path to PYTHONPATH
proj_path = os.path.join(this_dir, '..')
add_path(proj_path)
"""Build vocabulary from manifest files.
Each item in vocabulary file is a character.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import codecs
import json
from collections import Counter
import os.path
import _init_paths
from data_utils import utils
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--manifest_paths",
type=str,
help="Manifest paths for building vocabulary."
"You can provide multiple manifest files.",
nargs='+',
required=True)
parser.add_argument(
"--count_threshold",
default=0,
type=int,
help="Characters whose counts are below the threshold will be truncated. "
"(default: %(default)i)")
parser.add_argument(
"--vocab_path",
default='datasets/vocab/zh_vocab.txt',
type=str,
help="File path to write the vocabulary. (default: %(default)s)")
args = parser.parse_args()
def count_manifest(counter, manifest_path):
manifest_jsons = utils.read_manifest(manifest_path)
for line_json in manifest_jsons:
for char in line_json['text']:
counter.update(char)
def main():
counter = Counter()
for manifest_path in args.manifest_paths:
count_manifest(counter, manifest_path)
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with codecs.open(args.vocab_path, 'w', 'utf-8') as fout:
for char, count in count_sorted:
if count < args.count_threshold: break
fout.write(char + '\n')
if __name__ == '__main__':
main()
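To make the behaviour of the script above concrete, here is a toy illustration of its counting and thresholding logic, with `read_manifest` replaced by an in-memory list; the texts and the threshold are made up for the example.

```
# Toy illustration of the vocabulary-building logic above; data is made up.
from collections import Counter

manifest_jsons = [{'text': u'我是中国人'}, {'text': u'我是学生'}]
counter = Counter()
for line_json in manifest_jsons:
    for char in line_json['text']:
        counter.update(char)

count_threshold = 2
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
vocab = [char for char, count in count_sorted if count >= count_threshold]
print(vocab)  # e.g. [u'我', u'是'] -- only characters seen at least twice survive
```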
......@@ -4,12 +4,19 @@ from __future__ import division
from __future__ import print_function
import argparse
import _init_paths
from data_utils.normalizer import FeatureNormalizer
from data_utils.augmentor.augmentation import AugmentationPipeline
from data_utils.featurizer.audio_featurizer import AudioFeaturizer
parser = argparse.ArgumentParser(
description='Computing mean and stddev for feature normalizer.')
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--manifest_path",
default='datasets/manifest.train',
......@@ -39,7 +46,7 @@ args = parser.parse_args()
def main():
augmentation_pipeline = AugmentationPipeline(args.augmentation_config)
audio_featurizer = AudioFeaturizer()
audio_featurizer = AudioFeaturizer(specgram_type=args.specgram_type)
def augment_and_featurize(audio_segment):
augmentation_pipeline.transform_audio(audio_segment)
......
......@@ -3,15 +3,11 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import os
import argparse
import gzip
import time
import distutils.util
import multiprocessing
import paddle.v2 as paddle
from model import deep_speech2
from model import DeepSpeech2Model
from data_utils.data import DataGenerator
import utils
......@@ -23,6 +19,12 @@ parser.add_argument(
default=200,
type=int,
help="Training pass number. (default: %(default)s)")
parser.add_argument(
"--num_iterations_print",
default=100,
type=int,
help="Number of iterations for every train cost printing. "
"(default: %(default)s)")
parser.add_argument(
"--num_conv_layers",
default=2,
......@@ -53,6 +55,12 @@ parser.add_argument(
default=True,
type=distutils.util.strtobool,
help="Use sortagrad or not. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--max_duration",
default=27.0,
......@@ -78,7 +86,7 @@ parser.add_argument(
help="Trainer number. (default: %(default)s)")
parser.add_argument(
"--num_threads_data",
default=multiprocessing.cpu_count(),
default=multiprocessing.cpu_count() // 2,
type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument(
......@@ -108,112 +116,71 @@ parser.add_argument(
help="If set None, the training will start from scratch. "
"Otherwise, the training will resume from "
"the existing model of this path. (default: %(default)s)")
parser.add_argument(
"--output_model_dir",
default="./checkpoints",
type=str,
help="Directory for saving models. (default: %(default)s)")
parser.add_argument(
"--augmentation_config",
default='[{"type": "shift", '
'"params": {"min_shift_ms": -5, "max_shift_ms": 5},'
'"prob": 1.0}]',
default=open('conf/augmentation.config', 'r').read(),
type=str,
help="Augmentation configuration in json-format. "
"(default: %(default)s)")
parser.add_argument(
"--is_local",
default=True,
type=distutils.util.strtobool,
help="Set to false if running with pserver in paddlecloud. "
"(default: %(default)s)")
args = parser.parse_args()
def train():
"""DeepSpeech2 training."""
# initialize data generator
def data_generator():
return DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config=args.augmentation_config,
max_duration=args.max_duration,
min_duration=args.min_duration,
num_threads=args.num_threads_data)
train_generator = data_generator()
test_generator = data_generator()
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only a placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(
train_generator.vocab_size))
cost = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=train_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=False)
# create/load parameters and optimizer
if args.init_model_path is None:
parameters = paddle.parameters.create(cost)
else:
if not os.path.isfile(args.init_model_path):
raise IOError("Invalid model!")
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.init_model_path))
optimizer = paddle.optimizer.Adam(
learning_rate=args.adam_learning_rate, gradient_clipping_threshold=400)
trainer = paddle.trainer.SGD(
cost=cost, parameters=parameters, update_equation=optimizer)
# prepare data reader
train_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config=args.augmentation_config,
max_duration=args.max_duration,
min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
dev_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config="{}",
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
train_batch_reader = train_generator.batch_reader_creator(
manifest_path=args.train_manifest_path,
batch_size=args.batch_size,
min_batch_size=args.trainer_count,
sortagrad=args.use_sortagrad if args.init_model_path is None else False,
shuffle_method=args.shuffle_method)
test_batch_reader = test_generator.batch_reader_creator(
dev_batch_reader = dev_generator.batch_reader_creator(
manifest_path=args.dev_manifest_path,
batch_size=args.batch_size,
min_batch_size=1, # must be 1, but will have errors.
sortagrad=False,
shuffle_method=None)
# create event handler
def event_handler(event):
global start_time, cost_sum, cost_counter
if isinstance(event, paddle.event.EndIteration):
cost_sum += event.cost
cost_counter += 1
if (event.batch_id + 1) % 100 == 0:
print("\nPass: %d, Batch: %d, TrainCost: %f" % (
event.pass_id, event.batch_id + 1, cost_sum / cost_counter))
cost_sum, cost_counter = 0.0, 0
with gzip.open("checkpoints/params.latest.tar.gz", 'w') as f:
parameters.to_tar(f)
else:
sys.stdout.write('.')
sys.stdout.flush()
if isinstance(event, paddle.event.BeginPass):
start_time = time.time()
cost_sum, cost_counter = 0.0, 0
if isinstance(event, paddle.event.EndPass):
result = trainer.test(
reader=test_batch_reader, feeding=test_generator.feeding)
print("\n------- Time: %d sec, Pass: %d, ValidationCost: %s" %
(time.time() - start_time, event.pass_id, result.cost))
with gzip.open("checkpoints/params.pass-%d.tar.gz" % event.pass_id,
'w') as f:
parameters.to_tar(f)
# run train
trainer.train(
reader=train_batch_reader,
event_handler=event_handler,
ds2_model = DeepSpeech2Model(
vocab_size=train_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.init_model_path)
ds2_model.train(
train_batch_reader=train_batch_reader,
dev_batch_reader=dev_batch_reader,
feeding_dict=train_generator.feeding,
learning_rate=args.adam_learning_rate,
gradient_clipping=400,
num_passes=args.num_passes,
feeding=train_generator.feeding)
num_iterations_print=args.num_iterations_print,
output_model_dir=args.output_model_dir,
is_local=args.is_local)
def main():
......
......@@ -3,14 +3,13 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import distutils.util
import argparse
import gzip
import multiprocessing
import paddle.v2 as paddle
from data_utils.data import DataGenerator
from model import deep_speech2
from decoder import *
from lm.lm_scorer import LmScorer
from model import DeepSpeech2Model
from error_rate import wer
import utils
......@@ -40,26 +39,37 @@ parser.add_argument(
default=True,
type=distutils.util.strtobool,
help="Use gpu or not. (default: %(default)s)")
parser.add_argument(
"--trainer_count",
default=8,
type=int,
help="Trainer number. (default: %(default)s)")
parser.add_argument(
"--num_threads_data",
default=multiprocessing.cpu_count(),
default=1,
type=int,
help="Number of cpu threads for preprocessing data. (default: %(default)s)")
parser.add_argument(
"--num_processes_beam_search",
default=multiprocessing.cpu_count(),
default=multiprocessing.cpu_count() // 2,
type=int,
help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
type=str,
help="Manifest path for normalizer. (default: %(default)s)")
parser.add_argument(
"--decode_manifest_path",
default='datasets/manifest.test',
"--tune_manifest_path",
default='datasets/manifest.dev',
type=str,
help="Manifest path for decoding. (default: %(default)s)")
help="Manifest path for tuning. (default: %(default)s)")
parser.add_argument(
"--model_filepath",
default='checkpoints/params.latest.tar.gz',
......@@ -77,7 +87,7 @@ parser.add_argument(
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--language_model_path",
default="lm/data/1Billion.klm",
default="lm/data/common_crawl_00.prune01111.trie.klm",
type=str,
help="Path for language model. (default: %(default)s)")
parser.add_argument(
......@@ -121,95 +131,64 @@ args = parser.parse_args()
def tune():
"""Tune parameters alpha and beta on one minibatch."""
if not args.num_alphas >= 0:
raise ValueError("num_alphas must be non-negative!")
if not args.num_betas >= 0:
raise ValueError("num_betas must be non-negative!")
# initialize data generator
data_generator = DataGenerator(
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)
# create network config
# paddle.data_type.dense_array is used for variable batch input.
# The size 161 * 161 is only a placeholder value and the real shape
# of input batch data will be induced during training.
audio_data = paddle.layer.data(
name="audio_spectrogram", type=paddle.data_type.dense_array(161 * 161))
text_data = paddle.layer.data(
name="transcript_text",
type=paddle.data_type.integer_value_sequence(data_generator.vocab_size))
output_probs = deep_speech2(
audio_data=audio_data,
text_data=text_data,
dict_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_size=args.rnn_layer_size,
is_inference=True)
# load parameters
parameters = paddle.parameters.Parameters.from_tar(
gzip.open(args.model_filepath))
# prepare infer data
batch_reader = data_generator.batch_reader_creator(
manifest_path=args.decode_manifest_path,
manifest_path=args.tune_manifest_path,
batch_size=args.num_samples,
sortagrad=False,
shuffle_method=None)
# get one batch data for tuning
infer_data = batch_reader().next()
# run inference
infer_results = paddle.infer(
output_layer=output_probs, parameters=parameters, input=infer_data)
num_steps = len(infer_results) // len(infer_data)
probs_split = [
infer_results[i * num_steps:(i + 1) * num_steps]
for i in xrange(0, len(infer_data))
tune_data = batch_reader().next()
target_transcripts = [
''.join([data_generator.vocab_list[token] for token in transcript])
for _, transcript in tune_data
]
ds2_model = DeepSpeech2Model(
vocab_size=data_generator.vocab_size,
num_conv_layers=args.num_conv_layers,
num_rnn_layers=args.num_rnn_layers,
rnn_layer_size=args.rnn_layer_size,
pretrained_model_path=args.model_filepath)
# create grid for search
cand_alphas = np.linspace(args.alpha_from, args.alpha_to, args.num_alphas)
cand_betas = np.linspace(args.beta_from, args.beta_to, args.num_betas)
params_grid = [(alpha, beta) for alpha in cand_alphas
for beta in cand_betas]
ext_scorer = LmScorer(args.alpha_from, args.beta_from,
args.language_model_path)
## tune parameters in loop
for alpha, beta in params_grid:
wer_sum, wer_counter = 0, 0
# reset scorer
ext_scorer.reset_params(alpha, beta)
# beam search using multiple processes
beam_search_results = ctc_beam_search_decoder_batch(
probs_split=probs_split,
vocabulary=data_generator.vocab_list,
result_transcripts = ds2_model.infer_batch(
infer_data=tune_data,
decode_method='beam_search',
beam_alpha=alpha,
beam_beta=beta,
beam_size=args.beam_size,
cutoff_prob=args.cutoff_prob,
blank_id=len(data_generator.vocab_list),
num_processes=args.num_processes_beam_search,
ext_scoring_func=ext_scorer, )
for i, beam_search_result in enumerate(beam_search_results):
target_transcription = ''.join([
data_generator.vocab_list[index] for index in infer_data[i][1]
])
wer_sum += wer(target_transcription, beam_search_result[0][1])
wer_counter += 1
vocab_list=data_generator.vocab_list,
language_model_path=args.language_model_path,
num_processes=args.num_processes_beam_search)
wer_sum, num_ins = 0.0, 0
for target, result in zip(target_transcripts, result_transcripts):
wer_sum += wer(target, result)
num_ins += 1
print("alpha = %f\tbeta = %f\tWER = %f" %
(alpha, beta, wer_sum / wer_counter))
(alpha, beta, wer_sum / num_ins))
def main():
paddle.init(use_gpu=args.use_gpu, trainer_count=1)
utils.print_arguments(args)
paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
tune()
......
苹果 苹果 6s 0
汽车 驾驶 驾校 培训 1
苹果 六 袋 苹果 6s 0
新手 汽车 驾驶 驾校 培训 1
苹果 六 袋 苹果 6s 新手 汽车 驾驶 1
新手 汽车 驾驶 驾校 培训 苹果 6s 0
苹果 六 袋 苹果 6s 新手 汽车 驾驶 1
新手 汽车 驾驶 驾校 培训 苹果 6s 1
UNK
苹果
6s
新手
汽车
驾驶
驾校
培训
\ No newline at end of file
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import itertools
import reader
import paddle.v2 as paddle
from network_conf import DSSM
from utils import logger, ModelType, ModelArch, load_dic
parser = argparse.ArgumentParser(description="PaddlePaddle DSSM infer")
parser.add_argument(
'--model_path',
type=str,
required=True,
help="path of model parameters file")
parser.add_argument(
'-i',
'--data_path',
type=str,
required=True,
help="path of the dataset to infer")
parser.add_argument(
'-o',
'--prediction_output_path',
type=str,
required=True,
help="path to output the prediction")
parser.add_argument(
'-y',
'--model_type',
type=int,
required=True,
default=ModelType.CLASSIFICATION_MODE,
help="model type, %d for classification, %d for pairwise rank, %d for regression (default: classification)"
% (ModelType.CLASSIFICATION_MODE, ModelType.RANK_MODE,
ModelType.REGRESSION_MODE))
parser.add_argument(
'-s',
'--source_dic_path',
type=str,
required=False,
help="path of the source's word dic")
parser.add_argument(
'--target_dic_path',
type=str,
required=False,
help="path of the target's word dic, if not set, the `source_dic_path` will be used"
)
parser.add_argument(
'-a',
'--model_arch',
type=int,
required=True,
default=ModelArch.CNN_MODE,
help="model architecture, %d for CNN, %d for FC, %d for RNN" %
(ModelArch.CNN_MODE, ModelArch.FC_MODE, ModelArch.RNN_MODE))
parser.add_argument(
'--share_network_between_source_target',
type=bool,
default=False,
help="whether to share network parameters between source and target")
parser.add_argument(
'--share_embed',
type=bool,
default=False,
help="whether to share word embedding between source and target")
parser.add_argument(
'--dnn_dims',
type=str,
default='256,128,64,32',
help="dimentions of dnn layers, default is '256,128,64,32', which means create a 4-layer dnn, demention of each layer is 256, 128, 64 and 32"
)
parser.add_argument(
'-c',
'--class_num',
type=int,
default=0,
help="number of categories for classification task.")
args = parser.parse_args()
args.model_type = ModelType(args.model_type)
args.model_arch = ModelArch(args.model_arch)
if args.model_type.is_classification():
assert args.class_num > 1, "--class_num should be set in classification task."
layer_dims = map(int, args.dnn_dims.split(','))
args.target_dic_path = args.source_dic_path if not args.target_dic_path else args.target_dic_path
paddle.init(use_gpu=False, trainer_count=1)
class Inferer(object):
def __init__(self, param_path):
logger.info("create DSSM model")
prediction = DSSM(
dnn_dims=layer_dims,
vocab_sizes=[
len(load_dic(path))
for path in [args.source_dic_path, args.target_dic_path]
],
model_type=args.model_type,
model_arch=args.model_arch,
share_semantic_generator=args.share_network_between_source_target,
class_num=args.class_num,
share_embed=args.share_embed,
is_infer=True)()
# load parameter
logger.info("load model parameters from %s" % param_path)
self.parameters = paddle.parameters.Parameters.from_tar(
open(param_path, 'r'))
self.inferer = paddle.inference.Inference(
output_layer=prediction, parameters=self.parameters)
def infer(self, data_path):
logger.info("infer data...")
dataset = reader.Dataset(
train_path=data_path,
test_path=None,
source_dic_path=args.source_dic_path,
target_dic_path=args.target_dic_path,
model_type=args.model_type, )
infer_reader = paddle.batch(dataset.infer, batch_size=1000)
logger.warning('write predictions to %s' % args.prediction_output_path)
output_f = open(args.prediction_output_path, 'w')
for id, batch in enumerate(infer_reader()):
res = self.inferer.infer(input=batch)
predictions = [' '.join(map(str, x)) for x in res]
assert len(batch) == len(
predictions), "predict error, %d inputs, but %d predictions" % (
len(batch), len(predictions))
output_f.write('\n'.join(map(str, predictions)) + '\n')
if __name__ == '__main__':
inferer = Inferer(args.model_path)
inferer.infer(args.data_path)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from utils import UNK, ModelType, TaskType, load_dic, sent2ids, logger
class Dataset(object):
def __init__(self, train_path, test_path, source_dic_path, target_dic_path,
model_type):
self.train_path = train_path
self.test_path = test_path
self.source_dic_path = source_dic_path
self.target_dic_path = target_dic_path
self.model_type = ModelType(model_type)
self.source_dic = load_dic(self.source_dic_path)
self.target_dic = load_dic(self.target_dic_path)
_record_reader = {
ModelType.CLASSIFICATION_MODE: self._read_classification_record,
ModelType.REGRESSION_MODE: self._read_regression_record,
ModelType.RANK_MODE: self._read_rank_record,
}
assert isinstance(model_type, ModelType)
self.record_reader = _record_reader[model_type.mode]
self.is_infer = False
def train(self):
'''
Load trainset.
'''
logger.info("[reader] load trainset from %s" % self.train_path)
with open(self.train_path) as f:
for line_id, line in enumerate(f):
yield self.record_reader(line)
def test(self):
'''
Load testset.
'''
# logger.info("[reader] load testset from %s" % self.test_path)
with open(self.test_path) as f:
for line_id, line in enumerate(f):
yield self.record_reader(line)
def infer(self):
self.is_infer = True
with open(self.train_path) as f:
for line in f:
yield self.record_reader(line)
def _read_classification_record(self, line):
'''
data format:
<source words> [TAB] <target words> [TAB] <label>
@line: str
a string line which represents a record.
'''
fs = line.strip().split('\t')
assert len(fs) == 3, "wrong format for classification\n" + \
"the format shoud be " +\
"<source words> [TAB] <target words> [TAB] <label>'"
source = sent2ids(fs[0], self.source_dic)
target = sent2ids(fs[1], self.target_dic)
if not self.is_infer:
label = int(fs[2])
return (source, target, label, )
return source, target
def _read_regression_record(self, line):
'''
data format:
<source words> [TAB] <target words> [TAB] <label>
@line: str
a string line which represents a record.
'''
fs = line.strip().split('\t')
assert len(fs) == 3, "wrong format for regression\n" + \
"the format shoud be " +\
"<source words> [TAB] <target words> [TAB] <label>'"
source = sent2ids(fs[0], self.source_dic)
target = sent2ids(fs[1], self.target_dic)
if not self.is_infer:
label = float(fs[2])
return (source, target, [label], )
return source, target
def _read_rank_record(self, line):
'''
data format:
<source words> [TAB] <left_target words> [TAB] <right_target words> [TAB] <label>
'''
fs = line.strip().split('\t')
assert len(fs) == 4, "wrong format for rank\n" + \
"the format should be " +\
"<source words> [TAB] <left_target words> [TAB] <right_target words> [TAB] <label>"
source = sent2ids(fs[0], self.source_dic)
left_target = sent2ids(fs[1], self.target_dic)
right_target = sent2ids(fs[2], self.target_dic)
if not self.is_infer:
label = int(fs[3])
return (source, left_target, right_target, label)
return source, left_target, right_target
if __name__ == '__main__':
path = './data/classification/train.txt'
test_path = './data/classification/test.txt'
source_dic = './data/vocab.txt'
dataset = Dataset(path, test_path, source_dic, source_dic,
ModelType.CLASSIFICATION)
for rcd in dataset.train():
print rcd
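The `Dataset` above depends on `load_dic` and `sent2ids` from `utils`, which are not shown in this diff. Below is a hedged sketch of what helpers of this kind typically do: one token per line mapped to an integer id, and unknown words falling back to `UNK`. The actual `utils` module may differ.

```
# Hedged sketch of the dictionary helpers the reader relies on; names with the
# "_sketch" suffix are illustrative stand-ins, not the repository's utils API.
UNK = 0  # assumed id reserved for out-of-vocabulary words


def load_dic_sketch(path):
    """Load a vocabulary file (one token per line) into a token -> id dict."""
    with open(path) as f:
        return {line.strip(): idx for idx, line in enumerate(f)}


def sent2ids_sketch(sent, vocab):
    """Map a whitespace-separated sentence to token ids, using UNK as fallback."""
    return [vocab.get(word, UNK) for word in sent.split()]
```

With the toy `vocab.txt` shown earlier (where `UNK` is the first line), `sent2ids_sketch(u'苹果 6s', load_dic_sketch('./data/vocab.txt'))` would return `[1, 2]` under this sketch.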
......@@ -26,7 +26,7 @@ num_passes = 20 # how many passes to train the model
log_period = 50
save_period_by_batches = 50
use_gpu = True # to use gpu or not
use_gpu = False # to use gpu or not
trainer_count = 1 # number of trainer
################## for model configuration ##################
......
......@@ -63,4 +63,4 @@ def train(save_dir_path, source_dict_dim, target_dict_dim):
if __name__ == '__main__':
train(save_dir_path="models", source_dict_dim=3000, target_dict_dim=3000)
train(save_dir_path="models", source_dict_dim=30000, target_dict_dim=30000)