' % len(self.dic)
-
-
-class IDfeatureGenerator(object):
- def __init__(self, max_dim, cross_fea0=None, cross_fea1=None):
- '''
- @max_dim: int
- Size of the id elements' space
- '''
- self.max_dim = max_dim
- self.cross_fea0 = cross_fea0
- self.cross_fea1 = cross_fea1
-
- def gen(self, key):
- '''
- Generate one-hot representation for records
- '''
- return [hash(key) % self.max_dim]
-
- def gen_cross_fea(self, fea1, fea2):
- key = str(fea1) + str(fea2)
- return self.gen(key)
-
- def size(self):
- return self.max_dim
-
-
-class ContinuousFeatureGenerator(object):
- def __init__(self, n_intervals):
- self.min = sys.maxint
- self.max = sys.minint
- self.n_intervals = n_intervals
-
- def register(self, val):
- self.min = min(self.minint, val)
- self.max = max(self.maxint, val)
-
- def gen(self, val):
- self.len_part = (self.max - self.min) / self.n_intervals
- return (val - self.min) / self.len_part
-
-
-# init all feature generators
-fields = {}
-for key in categorial_features:
- fields[key] = CategoryFeatureGenerator()
-for key in id_features:
- # for cross features
- if 'cross' in key:
- feas = key[1:].split('_cross_')
- fields[key] = IDfeatureGenerator(10000000, *feas)
- # for normal ID features
- else:
- fields[key] = IDfeatureGenerator(10000)
-
-# used as feed_dict in PaddlePaddle
-field_index = dict((key, id)
- for id, key in enumerate(['dnn_input', 'lr_input', 'click']))
-
-
-def detect_dataset(path, topn, id_fea_space=10000):
- '''
- Parse the first `topn` records to collect meta information of this dataset.
-
- NOTE the records should be randomly shuffled first.
- '''
- # create categorical statis objects.
-
- with open(path, 'rb') as csvfile:
- reader = csv.DictReader(csvfile)
- for row_id, row in enumerate(reader):
- if row_id > topn:
- break
-
- for key in categorial_features:
- fields[key].register(row[key])
-
- for key, item in fields.items():
- feature_dims[key] = item.size()
-
- #for key in id_features:
- #feature_dims[key] = id_fea_space
-
- feature_dims['hour'] = 24
- feature_dims['click'] = 1
-
- feature_dims['dnn_input'] = np.sum(
- feature_dims[key] for key in categorial_features + ['hour']) + 1
- feature_dims['lr_input'] = np.sum(feature_dims[key]
- for key in id_features) + 1
-
- return feature_dims
-
-
-def concat_sparse_vectors(inputs, dims):
- '''
- Concaterate more than one sparse vectors into one.
-
- @inputs: list
- list of sparse vector
- @dims: list of int
- dimention of each sparse vector
- '''
- res = []
- assert len(inputs) == len(dims)
- start = 0
- for no, vec in enumerate(inputs):
- for v in vec:
- res.append(v + start)
- start += dims[no]
- return res
-
-
-class AvazuDataset(object):
- '''
- Load AVAZU dataset as train set.
- '''
- TRAIN_MODE = 0
- TEST_MODE = 1
-
- def __init__(self, train_path, n_records_as_test=-1):
- self.train_path = train_path
- self.n_records_as_test = n_records_as_test
- # task model: 0 train, 1 test
- self.mode = 0
-
- def train(self):
- self.mode = self.TRAIN_MODE
- return self._parse(self.train_path, skip_n_lines=self.n_records_as_test)
-
- def test(self):
- self.mode = self.TEST_MODE
- return self._parse(self.train_path, top_n_lines=self.n_records_as_test)
-
- def _parse(self, path, skip_n_lines=-1, top_n_lines=-1):
- with open(path, 'rb') as csvfile:
- reader = csv.DictReader(csvfile)
-
- categorial_dims = [
- feature_dims[key] for key in categorial_features + ['hour']
- ]
- id_dims = [feature_dims[key] for key in id_features]
-
- for row_id, row in enumerate(reader):
- if skip_n_lines > 0 and row_id < skip_n_lines:
- continue
- if top_n_lines > 0 and row_id > top_n_lines:
- break
-
- record = []
- for key in categorial_features:
- record.append(fields[key].gen(row[key]))
- record.append([int(row['hour'][-2:])])
- dense_input = concat_sparse_vectors(record, categorial_dims)
-
- record = []
- for key in id_features:
- if 'cross' not in key:
- record.append(fields[key].gen(row[key]))
- else:
- fea0 = fields[key].cross_fea0
- fea1 = fields[key].cross_fea1
- record.append(
- fields[key].gen_cross_fea(row[fea0], row[fea1]))
-
- sparse_input = concat_sparse_vectors(record, id_dims)
-
- record = [dense_input, sparse_input]
-
- record.append(list((int(row['click']), )))
- yield record
-
-
-if __name__ == '__main__':
- path = 'train.txt'
- print detect_dataset(path, 400000)
-
- filereader = AvazuDataset(path)
- for no, rcd in enumerate(filereader.train()):
- print no, rcd
- if no > 1000: break
diff --git a/ctr/dataset.md b/ctr/dataset.md
index dd6443d56adaf548d6c39458900c711c7f274def..16c0f9784bf3409ac5bbe704f932a9b28680fbf8 100644
--- a/ctr/dataset.md
+++ b/ctr/dataset.md
@@ -1,6 +1,13 @@
# Data and Preprocessing
## Dataset Overview
+This tutorial demonstrates how to preprocess the Kaggle CTR dataset \[[3](#参考文献)\] into the format required by this model; see [README.md](./README.md) for the detailed data format.
+
+The strength of the Wide & Deep model \[[2](#参考文献)\] lies in combining dense features with large-scale sparse features,
+so feature processing here also handles dense and sparse features separately:
+for the Deep part, the dense values are all converted into ID-style features
+and fed in as dense vectors through embeddings; the Wide part mainly raises the dimensionality by crossing IDs, as sketched below.
+
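+The following is a minimal, hypothetical sketch (not the actual preprocessing code in this example) of how such an ID cross can be generated: two categorical values are concatenated and hashed into a fixed ID space, so the crossed feature remains a sparse ID even though the combined dimensionality is much larger.
+
+```python
+# Hypothetical sketch of ID crossing for the Wide part; the helper name and
+# the id_space size are illustrative, not taken from the preprocessing script.
+def cross_feature_id(fea0, fea1, id_space=10000000):
+    return hash(str(fea0) + str(fea1)) % id_space
+
+# e.g. crossing a device id with a site id
+print(cross_feature_id('device_abc', 'site_123'))
+```
+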
The dataset is stored in `csv` format, with the following fields:
- `id` : ad identifier
diff --git a/ctr/index.html b/ctr/index.html
index ff0c5d9b19ec046b61f7f38d6eb9e70dff33e1ec..f1df7456a7aef174254fb20f2710c78079f4f26f 100644
--- a/ctr/index.html
+++ b/ctr/index.html
@@ -42,15 +42,30 @@
# Click-Through Rate Prediction
+The files in this example's directory, with a short description of each:
+
+```
+├── README.md                  # this tutorial in markdown
+├── dataset.md                 # dataset preprocessing tutorial
+├── images                     # figures used in this tutorial
+│   ├── lr_vs_dnn.jpg
+│   └── wide_deep.png
+├── infer.py                   # inference script
+├── network_conf.py            # model network configuration
+├── reader.py                  # data reader
+├── train.py                   # training script
+├── utils.py                   # helper functions
+└── avazu_data_processer.py    # demo data preprocessing script
+```
+
## Background
-CTR (Click-Through Rate) \[[1](https://en.wikipedia.org/wiki/Click-through_rate)\] represents the probability that a user clicks a particular link,
-and is commonly used to measure the effectiveness of an online advertising system.
+CTR (Click-Through Rate) prediction \[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
+estimates the probability that a user will click a particular link. It is an important step in ad serving, and accurate CTR prediction matters greatly for maximizing the revenue of an online advertising system.
-When there are multiple ad slots, CTR prediction is generally used as the basis for ranking.
-For example, in a search engine's advertising system, when a user enters a commercially valuable query, the system roughly performs the following steps to display ads:
+When there are multiple ad slots, CTR prediction is generally used as the basis for ranking. For example, in a search engine's advertising system, when a user enters a commercially valuable query, the system roughly performs the following steps to display ads:
-1. Retrieve the set of ads that match the query
+1. Retrieve the set of ads relevant to the user's query
2. Filter by business rules and relevance
3. Rank by auction mechanism and CTR
4. Display the ads
@@ -78,13 +93,11 @@ Figure 1. Comparison of LR and DNN model structures
The blue-arrow part of LR maps directly onto the corresponding structure in the DNN, so LR and DNN clearly share some common ground (e.g., the weighted accumulation),
-but at the same input dimensionality the former's model complexity may be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn more complex information).
-
+but at the same input dimensionality the former's model complexity may be much lower than the latter's (in a sense, the more complex a model is, the more potential it has to learn more complex information);
if LR is to match the learning capacity of a DNN, the input dimensionality, i.e. the number of features, must be increased,
which is why LR is inevitably tied to large-scale feature engineering.
-LR's advantage over DNN models is its capacity for large-scale sparse features; in terms of memory and computation, industry has very mature optimizations for it.
-
+LR's advantage over DNN models is its capacity for large-scale sparse features; in terms of memory and computation, industry has very mature optimizations for it;
DNN models, on the other hand, can learn new features by themselves, which to some extent improves the efficiency of feature usage
and makes DNN models more likely to achieve better results given the same scale of features.
@@ -101,10 +114,62 @@ LR's advantage over DNN models is its capacity for large-scale sparse features, including
We directly use the first approach for the classification task.
-We use the dataset of the Kaggle `Click-through rate prediction` task \[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model.
+We use the dataset of the Kaggle `Click-through rate prediction` task \[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\] to demonstrate the model in this example.
+
+See [data process](./dataset.md) for the specific feature processing method.
+
+The input format of the model demonstrated in this tutorial is as follows:
+
+```
+# <dnn input ids> \t <lr input sparse values> \t click
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
+23 231 \t 1230:0.12 13421:0.9 \t 1
+```
+
+The format in detail:
+
+- `dnn input ids`: a one-hot representation; only the IDs whose value is 1 need to be listed (note that this is not a variable-length input)
+- `lr input sparse values`: an `ID:VALUE` representation; the values should preferably be normalized to the range `[-1, 1]` (see the parsing sketch below)
+
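+A minimal sketch of how one such line can be split into the three fields (this mirrors what `reader.py` and `utils.py` in this example do; the variable names are only illustrative):
+
+```python
+# Split one input line into its three tab-separated fields.
+line = "1 23 190\t230:0.12 3421:0.9 23451:0.12\t0"
+dnn_part, lr_part, click = line.rstrip('\n').split('\t')
+dnn_input = [int(x) for x in dnn_part.split()]          # one-hot ids
+lr_input = [(int(i), float(v))                          # (id, value) pairs
+            for i, v in (kv.split(':') for kv in lr_part.split())]
+print((dnn_input, lr_input, int(click)))
+```
+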
+In addition, training requires a file describing the input dimensions of the dnn and lr submodels, in the following format:
-See [data process](./dataset.md) for the specific feature processing method
+```
+dnn_input_dim: <int>
+lr_input_dim: <int>
+```
+
+Here, `<int>` denotes an integer value.
+
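+`reader.load_data_meta` in this example parses this file; a minimal equivalent sketch (assuming the meta file produced by the preprocessing step below, `./output/data.meta.txt`):
+
+```python
+# Read the two input dimensions from the meta file; reader.load_data_meta
+# in this example performs essentially the same parsing.
+with open('./output/data.meta.txt') as f:
+    lines = [line for line in f.read().split('\n') if line]
+dnn_input_dim, lr_input_dim = [int(line.split(':')[1]) for line in lines]
+print((dnn_input_dim, lr_input_dim))
+```
+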
+The script `avazu_data_processer.py` in this directory can preprocess the downloaded demo dataset \[[2](#参考文献)\]; its usage is as follows:
+
+```
+usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
+ OUTPUT_DIR
+ [--num_lines_to_detect NUM_LINES_TO_DETECT]
+ [--test_set_size TEST_SET_SIZE]
+ [--train_size TRAIN_SIZE]
+
+PaddlePaddle CTR example
+
+optional arguments:
+ -h, --help show this help message and exit
+ --data_path DATA_PATH
+ path of the Avazu dataset
+ --output_dir OUTPUT_DIR
+ directory to output
+ --num_lines_to_detect NUM_LINES_TO_DETECT
+ number of records to detect dataset's meta info
+ --test_set_size TEST_SET_SIZE
+ size of the validation dataset(default: 10000)
+ --train_size TRAIN_SIZE
+ size of the trainset (default: 100000)
+```
+- `data_path`: path of the data to be processed
+- `output_dir`: output directory for the generated data
+- `num_lines_to_detect`: number of lines to pre-scan in order to collect the ID space, i.e. how many lines of the file are scanned
+- `test_set_size`: number of lines in the generated test set
+- `train_size`: number of lines in the generated training set
## Wide & Deep Learning Model
@@ -243,18 +308,20 @@ trainer.train(
## Run Training and Testing
Training the model takes the following steps:
-1. Download the training data; the data from the Kaggle CTR competition \[[2](#参考文献)\] can be used
+1. Prepare the training data
    1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data)
    2. Unzip train.gz to get train.txt
-2. Run `python train.py --train_data_path train.txt` to start training
+    3. Run `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data
+2. Run `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training
In step 2 above, command-line arguments can be passed to `train.py` to customize the training process; the arguments and their usage are as follows
```
usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
- [--batch_size BATCH_SIZE] [--test_set_size TEST_SET_SIZE]
+ [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
[--num_passes NUM_PASSES]
- [--num_lines_to_detact NUM_LINES_TO_DETACT]
+ [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
+ DATA_META_FILE --model_type MODEL_TYPE
PaddlePaddle CTR example
@@ -262,16 +329,78 @@ optional arguments:
-h, --help show this help message and exit
--train_data_path TRAIN_DATA_PATH
path of training dataset
+ --test_data_path TEST_DATA_PATH
+ path of testing dataset
--batch_size BATCH_SIZE
size of mini-batch (default:10000)
- --test_set_size TEST_SET_SIZE
- size of the validation dataset(default: 10000)
--num_passes NUM_PASSES
number of passes to train
- --num_lines_to_detact NUM_LINES_TO_DETACT
- number of records to detect dataset's meta info
+ --model_output_prefix MODEL_OUTPUT_PREFIX
+ prefix of path for model to store (default:
+ ./ctr_models)
+ --data_meta_file DATA_META_FILE
+ path of data meta info file
+ --model_type MODEL_TYPE
+ model type, classification: 0, regression 1 (default
+ classification)
+```
+
+- `train_data_path`: path of the training set
+- `test_data_path`: path of the test set
+- `num_passes`: number of passes over the training data
+- `data_meta_file`: see the description in [数据和任务抽象](### 数据和任务抽象).
+- `model_type`: classification or regression
+
+
+## Predicting with a Trained Model
+A trained model can be used to predict on new data. The format of the prediction data is
+
+```
+# <dnn input ids> \t <lr input sparse values>
+1 23 190 \t 230:0.12 3421:0.9 23451:0.12
+23 231 \t 1230:0.12 13421:0.9
```
+The only difference from the training-data format is that there is no label, i.e. the value in the third column `click` of the training data.
+
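+The demo preprocessing script already writes such a file (`output/infer.txt`). Purely as an illustration, an inference-format file could also be derived from a training-format file by dropping that third column:
+
+```python
+# Illustration only: drop the click column from a training-format file.
+# The output file name here is hypothetical; normally output/infer.txt
+# generated by the preprocessing script is used directly.
+with open('./output/train.txt') as src, open('./my_infer.txt', 'w') as dst:
+    for line in src:
+        dnn_ids, lr_values, _click = line.rstrip('\n').split('\t')
+        dst.write(dnn_ids + '\t' + lr_values + '\n')
+```
+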
+`infer.py` is used as follows
+
+```
+usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
+ --prediction_output_path PREDICTION_OUTPUT_PATH
+ [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
+
+PaddlePaddle CTR example
+
+optional arguments:
+ -h, --help show this help message and exit
+ --model_gz_path MODEL_GZ_PATH
+ path of model parameters gz file
+ --data_path DATA_PATH
+ path of the dataset to infer
+ --prediction_output_path PREDICTION_OUTPUT_PATH
+ path to output the prediction
+ --data_meta_path DATA_META_PATH
+ path of trainset's meta info, default is ./data.meta
+ --model_type MODEL_TYPE
+ model type, classification: 0, regression 1 (default
+ classification)
+```
+
+- `model_gz_path`: path of the model parameters compressed with `gz`
+- `data_path`: path of the data to run inference on
+- `prediction_output_path`: path to which the predictions are written
+- `data_meta_path`: see the description in [数据和任务抽象](### 数据和任务抽象).
+- `model_type`: classification or regression
+
+Predictions on the demo data can be generated with the following command
+
+```
+python infer.py --model_gz_path <model_path> --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt --model_type=0
+```
+
+The final prediction results are written to `predictions.txt`.
+
## 参考文献
1. <https://en.wikipedia.org/wiki/Click-through_rate>
2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
diff --git a/ctr/infer.py b/ctr/infer.py
new file mode 100644
index 0000000000000000000000000000000000000000..721c6b01b5a82b863e7db69865cd62c496b382d9
--- /dev/null
+++ b/ctr/infer.py
@@ -0,0 +1,81 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+import gzip
+import argparse
+import itertools
+
+import paddle.v2 as paddle
+import network_conf
+from train import dnn_layer_dims
+import reader
+from utils import logger, ModelType
+
+parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+parser.add_argument(
+ '--model_gz_path',
+ type=str,
+ required=True,
+ help="path of model parameters gz file")
+parser.add_argument(
+ '--data_path', type=str, required=True, help="path of the dataset to infer")
+parser.add_argument(
+ '--prediction_output_path',
+ type=str,
+ required=True,
+ help="path to output the prediction")
+parser.add_argument(
+ '--data_meta_path',
+ type=str,
+ default="./data.meta",
+ help="path of trainset's meta info, default is ./data.meta")
+parser.add_argument(
+ '--model_type',
+ type=int,
+ required=True,
+ default=ModelType.CLASSIFICATION,
+ help='model type, classification: %d, regression %d (default classification)'
+ % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+
+args = parser.parse_args()
+
+paddle.init(use_gpu=False, trainer_count=1)
+
+
+class CTRInferer(object):
+ def __init__(self, param_path):
+ logger.info("create CTR model")
+ dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_path)
+        # create the model
+ self.ctr_model = network_conf.CTRmodel(
+ dnn_layer_dims,
+ dnn_input_dim,
+ lr_input_dim,
+ model_type=ModelType(args.model_type),
+ is_infer=True)
+        # load parameters
+ logger.info("load model parameters from %s" % param_path)
+ self.parameters = paddle.parameters.Parameters.from_tar(
+ gzip.open(param_path, 'r'))
+ self.inferer = paddle.inference.Inference(
+ output_layer=self.ctr_model.model,
+ parameters=self.parameters, )
+
+ def infer(self, data_path):
+ logger.info("infer data...")
+ dataset = reader.Dataset()
+ infer_reader = paddle.batch(
+ dataset.infer(args.data_path), batch_size=1000)
+ logger.warning('write predictions to %s' % args.prediction_output_path)
+ output_f = open(args.prediction_output_path, 'w')
+ for id, batch in enumerate(infer_reader()):
+ res = self.inferer.infer(input=batch)
+ predictions = [x for x in itertools.chain.from_iterable(res)]
+ assert len(batch) == len(
+ predictions), "predict error, %d inputs, but %d predictions" % (
+ len(batch), len(predictions))
+ output_f.write('\n'.join(map(str, predictions)) + '\n')
+
+
+if __name__ == '__main__':
+ ctr_inferer = CTRInferer(args.model_gz_path)
+ ctr_inferer.infer(args.data_path)
diff --git a/ctr/network_conf.py b/ctr/network_conf.py
new file mode 100644
index 0000000000000000000000000000000000000000..d577abcfcf3adf57f8278ae7f0bc7e48a3621a06
--- /dev/null
+++ b/ctr/network_conf.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+import paddle.v2 as paddle
+from paddle.v2 import layer
+from paddle.v2 import data_type as dtype
+from utils import logger, ModelType
+
+
+class CTRmodel(object):
+ '''
+    A CTR model which implements the wide & deep learning model.
+ '''
+
+ def __init__(self,
+ dnn_layer_dims,
+ dnn_input_dim,
+ lr_input_dim,
+ model_type=ModelType.create_classification(),
+ is_infer=False):
+ '''
+ @dnn_layer_dims: list of integer
+ dims of each layer in dnn
+ @dnn_input_dim: int
+ size of dnn's input layer
+ @lr_input_dim: int
+ size of lr's input layer
+ @is_infer: bool
+            whether to build an inference model
+ '''
+ self.dnn_layer_dims = dnn_layer_dims
+ self.dnn_input_dim = dnn_input_dim
+ self.lr_input_dim = lr_input_dim
+ self.model_type = model_type
+ self.is_infer = is_infer
+
+ self._declare_input_layers()
+
+ self.dnn = self._build_dnn_submodel_(self.dnn_layer_dims)
+ self.lr = self._build_lr_submodel_()
+
+ # model's prediction
+ # TODO(superjom) rename it to prediction
+ if self.model_type.is_classification():
+ self.model = self._build_classification_model(self.dnn, self.lr)
+ if self.model_type.is_regression():
+ self.model = self._build_regression_model(self.dnn, self.lr)
+
+ def _declare_input_layers(self):
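+        # dnn_input carries one-hot ids (sparse binary vector), while lr_input
+        # carries (id, value) pairs (sparse float vector), matching the input
+        # format described in the README.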
+ self.dnn_merged_input = layer.data(
+ name='dnn_input',
+ type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))
+
+ self.lr_merged_input = layer.data(
+ name='lr_input',
+ type=paddle.data_type.sparse_vector(self.lr_input_dim))
+
+ if not self.is_infer:
+ self.click = paddle.layer.data(
+ name='click', type=dtype.dense_vector(1))
+
+ def _build_dnn_submodel_(self, dnn_layer_dims):
+ '''
+ build DNN submodel.
+ '''
+ dnn_embedding = layer.fc(
+ input=self.dnn_merged_input, size=dnn_layer_dims[0])
+ _input_layer = dnn_embedding
+ for i, dim in enumerate(dnn_layer_dims[1:]):
+ fc = layer.fc(
+ input=_input_layer,
+ size=dim,
+ act=paddle.activation.Relu(),
+ name='dnn-fc-%d' % i)
+ _input_layer = fc
+ return _input_layer
+
+ def _build_lr_submodel_(self):
+ '''
+ config LR submodel
+ '''
+ fc = layer.fc(
+ input=self.lr_merged_input, size=1, act=paddle.activation.Relu())
+ return fc
+
+ def _build_classification_model(self, dnn, lr):
+ merge_layer = layer.concat(input=[dnn, lr])
+ self.output = layer.fc(
+ input=merge_layer,
+ size=1,
+ # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
+ act=paddle.activation.Sigmoid())
+
+ if not self.is_infer:
+ self.train_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
+ input=self.output, label=self.click)
+ return self.output
+
+ def _build_regression_model(self, dnn, lr):
+ merge_layer = layer.concat(input=[dnn, lr])
+ self.output = layer.fc(
+ input=merge_layer, size=1, act=paddle.activation.Sigmoid())
+ if not self.is_infer:
+ self.train_cost = paddle.layer.mse_cost(
+ input=self.output, label=self.click)
+ return self.output
diff --git a/ctr/reader.py b/ctr/reader.py
new file mode 100644
index 0000000000000000000000000000000000000000..d511b9d7c3b3a843b1e2819481b377ba5a49ce1a
--- /dev/null
+++ b/ctr/reader.py
@@ -0,0 +1,66 @@
+from utils import logger, TaskMode, load_dnn_input_record, load_lr_input_record
+
+feeding_index = {'dnn_input': 0, 'lr_input': 1, 'click': 2}
+
+
+class Dataset(object):
+ def __init__(self):
+ self.mode = TaskMode.create_train()
+
+ def train(self, path):
+ '''
+ Load trainset.
+ '''
+ logger.info("load trainset from %s" % path)
+ self.mode = TaskMode.create_train()
+ self.path = path
+ return self._parse
+
+ def test(self, path):
+ '''
+ Load testset.
+ '''
+ logger.info("load testset from %s" % path)
+ self.path = path
+ self.mode = TaskMode.create_test()
+ return self._parse
+
+ def infer(self, path):
+ '''
+ Load infer set.
+ '''
+ logger.info("load inferset from %s" % path)
+ self.path = path
+ self.mode = TaskMode.create_infer()
+ return self._parse
+
+ def _parse(self):
+ '''
+ Parse dataset.
+ '''
+ with open(self.path) as f:
+ for line_id, line in enumerate(f):
+ fs = line.strip().split('\t')
+ dnn_input = load_dnn_input_record(fs[0])
+ lr_input = load_lr_input_record(fs[1])
+ if not self.mode.is_infer():
+ click = [int(fs[2])]
+ yield dnn_input, lr_input, click
+ else:
+ yield dnn_input, lr_input
+
+
+def load_data_meta(path):
+ '''
+ load data meta info from path, return (dnn_input_dim, lr_input_dim)
+ '''
+ with open(path) as f:
+ lines = f.read().split('\n')
+ err_info = "wrong meta format"
+ assert len(lines) == 2, err_info
+ assert 'dnn_input_dim:' in lines[0] and 'lr_input_dim:' in lines[
+ 1], err_info
+ res = map(int, [_.split(':')[1] for _ in lines])
+ logger.info('dnn input dim: %d' % res[0])
+ logger.info('lr input dim: %d' % res[1])
+ return res
diff --git a/ctr/train.py b/ctr/train.py
index da6dc9dd6d9e386a87693b5a5bc0cbf95da0b069..64831089ae1b1df4cb73326824af71acb345f80d 100644
--- a/ctr/train.py
+++ b/ctr/train.py
@@ -1,138 +1,113 @@
#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-
+# -*- coding: utf-8 -*-
+import os
import argparse
-import logging
-import paddle.v2 as paddle
-from paddle.v2 import layer
-from paddle.v2 import data_type as dtype
-from data_provider import field_index, detect_dataset, AvazuDataset
-
-parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
-parser.add_argument(
- '--train_data_path',
- type=str,
- required=True,
- help="path of training dataset")
-parser.add_argument(
- '--batch_size',
- type=int,
- default=10000,
- help="size of mini-batch (default:10000)")
-parser.add_argument(
- '--test_set_size',
- type=int,
- default=10000,
- help="size of the validation dataset(default: 10000)")
-parser.add_argument(
- '--num_passes', type=int, default=10, help="number of passes to train")
-parser.add_argument(
- '--num_lines_to_detact',
- type=int,
- default=500000,
- help="number of records to detect dataset's meta info")
-
-args = parser.parse_args()
-
-dnn_layer_dims = [128, 64, 32, 1]
-data_meta_info = detect_dataset(args.train_data_path, args.num_lines_to_detact)
-
-logging.warning('detect categorical fields in dataset %s' %
- args.train_data_path)
-for key, item in data_meta_info.items():
- logging.warning(' - {}\t{}'.format(key, item))
-
-paddle.init(use_gpu=False, trainer_count=1)
+import gzip
-# ==============================================================================
-# input layers
-# ==============================================================================
-dnn_merged_input = layer.data(
- name='dnn_input',
- type=paddle.data_type.sparse_binary_vector(data_meta_info['dnn_input']))
-
-lr_merged_input = layer.data(
- name='lr_input',
- type=paddle.data_type.sparse_binary_vector(data_meta_info['lr_input']))
-
-click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
+import reader
+import paddle.v2 as paddle
+from utils import logger, ModelType
+from network_conf import CTRmodel
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="PaddlePaddle CTR example")
+ parser.add_argument(
+ '--train_data_path',
+ type=str,
+ required=True,
+ help="path of training dataset")
+ parser.add_argument(
+ '--test_data_path', type=str, help='path of testing dataset')
+ parser.add_argument(
+ '--batch_size',
+ type=int,
+ default=10000,
+ help="size of mini-batch (default:10000)")
+ parser.add_argument(
+ '--num_passes', type=int, default=10, help="number of passes to train")
+ parser.add_argument(
+ '--model_output_prefix',
+ type=str,
+ default='./ctr_models',
+ help='prefix of path for model to store (default: ./ctr_models)')
+ parser.add_argument(
+ '--data_meta_file',
+ type=str,
+ required=True,
+ help='path of data meta info file', )
+ parser.add_argument(
+ '--model_type',
+ type=int,
+ required=True,
+ default=ModelType.CLASSIFICATION,
+ help='model type, classification: %d, regression %d (default classification)'
+ % (ModelType.CLASSIFICATION, ModelType.REGRESSION))
+
+ return parser.parse_args()
-# ==============================================================================
-# network structure
-# ==============================================================================
-def build_dnn_submodel(dnn_layer_dims):
- dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
- _input_layer = dnn_embedding
- for i, dim in enumerate(dnn_layer_dims[1:]):
- fc = layer.fc(
- input=_input_layer,
- size=dim,
- act=paddle.activation.Relu(),
- name='dnn-fc-%d' % i)
- _input_layer = fc
- return _input_layer
-
-
-# config LR submodel
-def build_lr_submodel():
- fc = layer.fc(
- input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
- return fc
-
-
-# conbine DNN and LR submodels
-def combine_submodels(dnn, lr):
- merge_layer = layer.concat(input=[dnn, lr])
- fc = layer.fc(
- input=merge_layer,
- size=1,
- name='output',
- # use sigmoid function to approximate ctr rate, a float value between 0 and 1.
- act=paddle.activation.Sigmoid())
- return fc
-
-
-dnn = build_dnn_submodel(dnn_layer_dims)
-lr = build_lr_submodel()
-output = combine_submodels(dnn, lr)
+dnn_layer_dims = [128, 64, 32, 1]
# ==============================================================================
# cost and train period
# ==============================================================================
-classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
- input=output, label=click)
-
-params = paddle.parameters.create(classification_cost)
-
-optimizer = paddle.optimizer.Momentum(momentum=0.01)
-
-trainer = paddle.trainer.SGD(
- cost=classification_cost, parameters=params, update_equation=optimizer)
-
-dataset = AvazuDataset(
- args.train_data_path, n_records_as_test=args.test_set_size)
-
-
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
- num_samples = event.batch_id * args.batch_size
- if event.batch_id % 100 == 0:
- logging.warning("Pass %d, Samples %d, Cost %f" %
- (event.pass_id, num_samples, event.cost))
-
- if event.batch_id % 1000 == 0:
- result = trainer.test(
- reader=paddle.batch(dataset.test, batch_size=args.batch_size),
- feeding=field_index)
- logging.warning("Test %d-%d, Cost %f" %
- (event.pass_id, event.batch_id, result.cost))
-trainer.train(
- reader=paddle.batch(
- paddle.reader.shuffle(dataset.train, buf_size=500),
- batch_size=args.batch_size),
- feeding=field_index,
- event_handler=event_handler,
- num_passes=args.num_passes)
+def train():
+ args = parse_args()
+ args.model_type = ModelType(args.model_type)
+ paddle.init(use_gpu=False, trainer_count=1)
+ dnn_input_dim, lr_input_dim = reader.load_data_meta(args.data_meta_file)
+
+ # create ctr model.
+ model = CTRmodel(
+ dnn_layer_dims,
+ dnn_input_dim,
+ lr_input_dim,
+ model_type=args.model_type,
+ is_infer=False)
+
+ params = paddle.parameters.create(model.train_cost)
+ optimizer = paddle.optimizer.AdaGrad()
+
+ trainer = paddle.trainer.SGD(
+ cost=model.train_cost, parameters=params, update_equation=optimizer)
+
+ dataset = reader.Dataset()
+
+ def __event_handler__(event):
+ if isinstance(event, paddle.event.EndIteration):
+ num_samples = event.batch_id * args.batch_size
+ if event.batch_id % 100 == 0:
+ logger.warning("Pass %d, Samples %d, Cost %f, %s" % (
+ event.pass_id, num_samples, event.cost, event.metrics))
+
+ if event.batch_id % 1000 == 0:
+ if args.test_data_path:
+ result = trainer.test(
+ reader=paddle.batch(
+ dataset.test(args.test_data_path),
+ batch_size=args.batch_size),
+ feeding=reader.feeding_index)
+ logger.warning("Test %d-%d, Cost %f, %s" %
+ (event.pass_id, event.batch_id, result.cost,
+ result.metrics))
+
+                path = "{}-pass-{}-batch-{}-test-{}.tar.gz".format(
+                    args.model_output_prefix, event.pass_id, event.batch_id,
+                    # use the test cost if a test set was given, else the batch cost
+                    result.cost if args.test_data_path else event.cost)
+ with gzip.open(path, 'w') as f:
+ params.to_tar(f)
+
+ trainer.train(
+ reader=paddle.batch(
+ paddle.reader.shuffle(
+ dataset.train(args.train_data_path), buf_size=500),
+ batch_size=args.batch_size),
+ feeding=reader.feeding_index,
+ event_handler=__event_handler__,
+ num_passes=args.num_passes)
+
+
+if __name__ == '__main__':
+ train()
diff --git a/ctr/utils.py b/ctr/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..d8cf569cc9b28fa04ee389838cef9edd6862e52b
--- /dev/null
+++ b/ctr/utils.py
@@ -0,0 +1,68 @@
+import logging
+
+logging.basicConfig()
+logger = logging.getLogger("paddle")
+logger.setLevel(logging.INFO)
+
+
+class TaskMode:
+ TRAIN_MODE = 0
+ TEST_MODE = 1
+ INFER_MODE = 2
+
+ def __init__(self, mode):
+ self.mode = mode
+
+ def is_train(self):
+ return self.mode == self.TRAIN_MODE
+
+ def is_test(self):
+ return self.mode == self.TEST_MODE
+
+ def is_infer(self):
+ return self.mode == self.INFER_MODE
+
+ @staticmethod
+ def create_train():
+ return TaskMode(TaskMode.TRAIN_MODE)
+
+ @staticmethod
+ def create_test():
+ return TaskMode(TaskMode.TEST_MODE)
+
+ @staticmethod
+ def create_infer():
+ return TaskMode(TaskMode.INFER_MODE)
+
+
+class ModelType:
+ CLASSIFICATION = 0
+ REGRESSION = 1
+
+ def __init__(self, mode):
+ self.mode = mode
+
+ def is_classification(self):
+ return self.mode == self.CLASSIFICATION
+
+ def is_regression(self):
+ return self.mode == self.REGRESSION
+
+ @staticmethod
+ def create_classification():
+ return ModelType(ModelType.CLASSIFICATION)
+
+ @staticmethod
+ def create_regression():
+ return ModelType(ModelType.REGRESSION)
+
+
+def load_dnn_input_record(sent):
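+    # e.g. "1 23 190" -> [1, 23, 190], the one-hot ids of the dnn input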
+ return map(int, sent.split())
+
+
+def load_lr_input_record(sent):
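+    # e.g. "230:0.12 3421:0.9" -> [(230, 0.12), (3421, 0.9)], the (id, value)
+    # pairs of the lr input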
+ res = []
+ for _ in [x.split(':') for x in sent.split()]:
+ res.append((int(_[0]), float(_[1]), ))
+ return res
diff --git a/nce_cost/train.py b/nce_cost/train.py
index 4ab5043725805003cf151c6d0c8af8dbbc8c199f..3babf7fe0963fcff54430cd174b0af523e68846b 100644
--- a/nce_cost/train.py
+++ b/nce_cost/train.py
@@ -43,7 +43,10 @@ def train(model_save_dir):
parameters.to_tar(f)
trainer.train(
- paddle.batch(paddle.dataset.imikolov.train(word_dict, 5), 64),
+ paddle.batch(
+ paddle.reader.shuffle(
+ lambda: paddle.dataset.imikolov.train(word_dict, 5)(),
+ buf_size=1000), 64),
num_passes=1000,
event_handler=event_handler)