Merge branch 'master' into multiview-simnet

bed0a70a · wuzhihua · GitHub · cb9fac25 · ce5e22e9 · bed0a70a
18 changed files
--- a/doc/imgs/match-pyramid.png
+++ b/doc/imgs/match-pyramid.png
--- a/models/contentunderstanding/readme.md
+++ b/models/contentunderstanding/readme.md
@@ -22,12 +22,12 @@

 |       Model        |       Description        |       Paper        |
 | :------------------: | :--------------------: | :---------: |
-| TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://research.fb.com/publications/tagspace-semantic-embeddings-from-hashtags/) |
+| TagSpace | Tag recommendation | [EMNLP 2014][TagSpace: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) |
 | Classification | Text classification | [EMNLP 2014][Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181.pdf) |

 A brief introduction to each model follows (note: the figures are taken from the linked papers).

-[TagSpace model](https://research.fb.com/publications/tagspace-semantic-embeddings-from-hashtags)
+[TagSpace model](https://www.aclweb.org/anthology/D14-1194.pdf)
 <p align="center">
 <img align="center" src="../../doc/imgs/tagspace.png">
 <p>
@@ -37,89 +37,173 @@
 <img align="center" src="../../doc/imgs/cnn-ckim2014.png">
 <p>

-##Tutorial (Quick Start)
+## Tutorial (Quick Start)
 ```
 git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
-cd paddle-rec
-
+cd PaddleRec
 python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
 python -m paddlerec.run -m models/contentunderstanding/classification/config.yaml
 ```

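 ## Tutorial (Reproducing the Paper)

-###Note
+### Note

 To help users get each model running quickly, sample data is provided with every model. To reproduce the results reported in this readme, use the scripts below to download and preprocess the corresponding datasets.

-### Data Processing

 **(1) TagSpace**

+### Data Processing
 [Dataset](https://github.com/mhjabreel/CharCNN/tree/master/data/) , [Backup dataset](https://paddle-tagspace.bj.bcebos.com/data.tar)
- 
+
 The data format is as follows: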
 ```
 "3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
 ```

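-After extracting the data, convert the text data into Paddle format; first place the data in the training and test data directories
-
+A script is provided to quickly convert the text in the dataset into a trainable format. After extracting the dataset, place the raw data in the raw_big_train_data and raw_big_test_data directories and run the provided text2paddle.py under Python 3. This generates the train_big_data and test_big_data directories, which can be used for training directly. The commands are: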
 ```
 mkdir raw_big_train_data
 mkdir raw_big_test_data
 mv train.csv raw_big_train_data
 mv test.csv raw_big_test_data
+python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
 ```

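-Run the text2paddle.py script to generate the Paddle input format
+The data directory after running the script:  
+
+```
+big_vocab_tag.txt  # tag vocabulary size
+big_vocab_text.txt # text vocabulary size
+data.tar  # dataset archive
+raw_big_train_data  # raw training set from the dataset
+raw_big_test_data  # raw test set from the dataset
+train_data  # sample training set
+test_data  # sample test set
+train_big_data  # processed training set
+test_big_data  # processed test set
+text2paddle.py  # preprocessing script
+```

+The processed data format is as follows: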
 ```
-python text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
+2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
+2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
+2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
 ```

+
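 ### Training
+Return to the tagspace directory, open config.yaml, and change the following parameters:  
+Set workspace to the absolute path of your current directory (the pwd command prints it).  
+Change batch_size of sample_1 under dataset from 10 to 128.  
+Change data_path of sample_1 under dataset to {workspace}/data/train_big_data.  
+Change batch_size of inferdata under dataset from 10 to 500.  
+Change data_path of inferdata under dataset to {workspace}/data/test_big_data.  
+Run the following command to start training: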
 ```
-cd modles/contentunderstanding/tagspace
-python -m paddlerec.run -m ./config.yaml # after customizing hyperparameters, pass the config file to use the custom configuration
+python -m paddlerec.run -m ./config.yaml
 ```

 ### Prediction
+After training finishes, the model runs prediction on the validation set.
+Output:
+```
+PaddleRec: Runner infer_runner Begin
+Executor Mode: infer
+processor_register begin
+Running SingleInstance.
+Running SingleNetwork.
+Running SingleInferStartup.
+Running SingleInferRunner.
+load persistables from increment/9
+batch: 1, acc: [0.91], loss: [0.02495437]
+batch: 2, acc: [0.936], loss: [0.01941476]
+batch: 3, acc: [0.918], loss: [0.02116447]
+batch: 4, acc: [0.916], loss: [0.0219945]
+batch: 5, acc: [0.902], loss: [0.02242816]
+batch: 6, acc: [0.9], loss: [0.02421589]
+batch: 7, acc: [0.9], loss: [0.026441]
+batch: 8, acc: [0.934], loss: [0.01797657]
+batch: 9, acc: [0.932], loss: [0.01687362]
+batch: 10, acc: [0.926], loss: [0.02047823]
+batch: 11, acc: [0.918], loss: [0.01998716]
+batch: 12, acc: [0.898], loss: [0.0229556]
+batch: 13, acc: [0.928], loss: [0.01736144]
+batch: 14, acc: [0.93], loss: [0.01911209]
 ```
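-# Edit the model's config.yaml and set workspace to the absolute path of the current directory
-# Edit the model's config.yaml and set mode to infer_runner
-# Example: mode: train_runner -> mode: infer_runner
-# In infer_runner, set class to class: infer
-# Edit the configuration of the infer phase, following the comments in the config

-# After editing config.yaml, run: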
-python -m paddlerec.run -m ./config.yaml
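+**(2) Classification**
+
+### Data Processing
+Sentiment classification (Senta) takes Chinese text containing subjective descriptions and automatically determines its sentiment polarity along with a confidence score. The polarity is either positive or negative. Sentiment analysis helps businesses understand consumer habits, analyze trending topics, and monitor public opinion during crises, providing useful support for decision making.  
+Sentiment is a high-level human cognitive behavior, and identifying the sentiment of a text requires deep semantic modeling. In addition, sentiment is expressed differently across domains (such as dining and sports), so training requires large-scale data covering many domains. We address both issues with a deep-learning-based semantic model and large-scale data mining. For evaluation, we use the open-source sentiment classification dataset ChnSentiCorp.  
+You can run the commands below to download the pre-tokenized dataset. After extraction, the senta_data directory contains the training data (train.tsv), development data (dev.tsv), test data (test.tsv), and the corresponding dictionary (word_dict.txt):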
+
+``` 
+wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
+tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
 ```

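-**(2) Classification**
+Each record is a Chinese review sentence followed by a label for its sentiment, separated by a tab (\t). The sentence is already tokenized, with tokens separated by spaces.  

-### Training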
 ```
-cd modles/contentunderstanding/classification
-python -m paddlerec.run -m ./config.yaml # after customizing hyperparameters, pass the config file to use the custom configuration
+15.4寸 笔记本 的 键盘 确实 爽 ， 基本 跟 台式机 差不多 了 ， 蛮 喜欢 数字 小 键盘 ， 输 数字 特 方便 ， 样子 也 很 美观 ， 做工 也 相当 不错    1
+跟 心灵 鸡汤 没 什么 本质 区别 嘛 ， 至少 我 不 喜欢 这样 读 经典 ， 把 经典 都 解读 成 这样 有点 去 中国 化 的 味道 了 0
+```
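+A script is provided to quickly convert the Chinese text in the dataset into a trainable format. After extracting the dataset, copy preprocess.py into the senta_data directory and run it to convert dev.tsv, test.tsv, and train.tsv into dev.txt, test.txt, and train.txt, which can be used for training directly.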
+```
+cp ./data/preprocess.py ./senta_data/
+cd senta_data/
+python preprocess.py
 ```

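-### Prediction
+### Training
+Create directories for the training and test sets and move the data into them.
 ```
-# Edit the model's config.yaml and set workspace to the absolute path of the current directory
-# Edit the model's config.yaml and set mode to infer_runner
-# Example: mode: train_runner -> mode: infer_runner
-# In infer_runner, set class to class: infer
-# Edit the configuration of the infer phase, following the comments in the config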
+mkdir train
+mv train.txt train
+mkdir test
+mv dev.txt  test
+cd ..
+```  
+
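+Open config.yaml and change the following parameters:  
+Set workspace to the absolute path of your current directory (the pwd command prints it).  
+Change batch_size under data1 from 10 to 128.  
+Change data_path under data1 to {workspace}/senta_data/train.  
+Change batch_size under dataset_infer from 2 to 256.  
+Change data_path under dataset_infer to {workspace}/senta_data/test.  

-# After editing config.yaml, run:
+Run the following command to start training: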
+```
 python -m paddlerec.run -m ./config.yaml
 ```

+### Prediction
+After training finishes, the model runs prediction on the validation set.
+Output:  
+
+```
+PaddleRec: Runner infer_runner Begin
+Executor Mode: infer
+processor_register begin
+Running SingleInstance.
+Running SingleNetwork.
+Running SingleInferStartup.
+Running SingleInferRunner.
+load persistables from increment/14
+batch: 1, acc: [0.91796875], loss: [0.2287855]
+batch: 2, acc: [0.91796875], loss: [0.22827303]
+batch: 3, acc: [0.90234375], loss: [0.27907994]
+```
+
+
 ## Results Comparison
 ### Model Performance (Test)

-|       Dataset        |       Model       |       loss        |       auc          |       acc         |       mae          |
-| :------------------: | :--------------------: | :---------: |:---------: | :---------: |:---------: |
-|       ag news dataset        |       TagSpace       |       --        |       --          |       --          |       --          |
-|       --        |       Classification       |       --        |       --          |       --          |       --          |
+|       Dataset        |       Model       |       loss         |       acc         |
+| :------------------: | :--------------------: | :---------: |:---------: | 
+|       ag news dataset        |       TagSpace       |       0.0199        |       0.9177          | 
+|       ChnSentiCorp        |       Classification       |       0.2282        |       0.9179          | 
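Review note: the readme does not say how the table figures were derived from the per-batch inference logs. The sketch below is one plausible way to summarize such a log, assuming each line follows the `batch: N, acc: [x], loss: [y]` shape shown above. The function name `summarize_infer_log` is hypothetical and is not part of PaddleRec.

```python
import re

# Matches lines such as: "batch: 1, acc: [0.91], loss: [0.02495437]"
LINE_RE = re.compile(
    r"batch:\s*(\d+),\s*acc:\s*\[([^\]]+)\],\s*loss:\s*\[([^\]]+)\]")


def summarize_infer_log(log_text):
    """Return (mean_acc, mean_loss) over all batch lines found in log_text."""
    accs, losses = [], []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            accs.append(float(match.group(2)))
            losses.append(float(match.group(3)))
    if not accs:
        raise ValueError("no batch lines found in log")
    return sum(accs) / len(accs), sum(losses) / len(losses)


# The three Classification batches quoted in the readme diff.
sample = """load persistables from increment/14
batch: 1, acc: [0.91796875], loss: [0.2287855]
batch: 2, acc: [0.91796875], loss: [0.22827303]
batch: 3, acc: [0.90234375], loss: [0.27907994]"""
mean_acc, mean_loss = summarize_infer_log(sample)
print(round(mean_acc, 4), round(mean_loss, 4))
```

A plain mean of per-batch values gives a smaller final batch the same weight as a full one, so it only approximates the per-sample average.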
--- a/models/contentunderstanding/tagspace/config.yaml
+++ b/models/contentunderstanding/tagspace/config.yaml
@@ -16,16 +16,21 @@ workspace: "models/contentunderstanding/tagspace"

 dataset:
 - name: sample_1
-  type: QueueDataset
-  batch_size: 5
+  type: DataLoader
+  batch_size: 10
  data_path: "{workspace}/data/train_data"
  data_converter: "{workspace}/reader.py"
+- name: inferdata
+  type: DataLoader
+  batch_size: 10
+  data_path: "{workspace}/data/test_data"
+  data_converter: "{workspace}/reader.py" 

 hyper_parameters:
  optimizer:
    class: Adagrad
    learning_rate: 0.001
-  vocab_text_size: 11447
+  vocab_text_size: 75378
  vocab_tag_size: 4
  emb_dim: 10
  hid_dim: 1000
@@ -34,22 +39,34 @@ hyper_parameters:
  neg_size: 3
  num_devices: 1

-mode: runner1
+mode: [runner1,infer_runner]

 runner:
 - name: runner1
  class: train
  epochs: 10
  device: cpu
-  save_checkpoint_interval: 2
-  save_inference_interval: 4
+  save_checkpoint_interval: 1
+  save_inference_interval: 1
  save_checkpoint_path: "increment"
  save_inference_path: "inference"
  save_inference_feed_varnames: []
  save_inference_fetch_varnames: []
+  phases: phase1
+- name: infer_runner
+  class: infer
+  # device to run training or infer
+  device: cpu
+  print_interval: 1
+  init_model_path: "increment/9" # load model path
+  phases: phase_infer

 phase:
 - name: phase1
  model: "{workspace}/model.py"
  dataset_name: sample_1
  thread_num: 1
+- name: phase_infer
+  model: "{workspace}/model.py"
+  dataset_name: inferdata
+  thread_num: 1
--- a/models/contentunderstanding/tagspace/data/text2paddle.py
+++ b/models/contentunderstanding/tagspace/data/text2paddle.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+import six
+import collections
+import os
+import csv
+import re
+if six.PY2:
+    reload(sys)
+    sys.setdefaultencoding('utf-8')
+
+
+def word_count(column_num, input_file, word_freq=None):
+    """
+    compute word count from corpus
+    """
+    if word_freq is None:
+        word_freq = collections.defaultdict(int)
+    data_file = csv.reader(input_file)
+    for row in data_file:
+        for w in re.split(r'\W+', row[column_num].strip()):
+            word_freq[w] += 1
+    return word_freq
+
+
+def build_dict(column_num=2, min_word_freq=0, train_dir="", test_dir=""):
+    """
+    Build a word dictionary from the corpus. Keys of the dictionary are words,
+    and values are zero-based IDs of these words.
+    """
+    word_freq = collections.defaultdict(int)
+    files = os.listdir(train_dir)
+    for fi in files:
+        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
+            word_freq = word_count(column_num, f, word_freq)
+    files = os.listdir(test_dir)
+    for fi in files:
+        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
+            word_freq = word_count(column_num, f, word_freq)
+
+    word_freq = [x for x in six.iteritems(word_freq) if x[1] > min_word_freq]
+    word_freq_sorted = sorted(word_freq, key=lambda x: (-x[1], x[0]))
+    words, _ = list(zip(*word_freq_sorted))
+    word_idx = dict(list(zip(words, six.moves.range(len(words)))))
+    return word_idx
+
+
+def write_paddle(text_idx, tag_idx, train_dir, test_dir, output_train_dir,
+                 output_test_dir):
+    files = os.listdir(train_dir)
+    if not os.path.exists(output_train_dir):
+        os.mkdir(output_train_dir)
+    for fi in files:
+        with open(os.path.join(train_dir, fi), "r", encoding='utf-8') as f:
+            with open(
+                    os.path.join(output_train_dir, fi), "w",
+                    encoding='utf-8') as wf:
+                data_file = csv.reader(f)
+                for row in data_file:
+                    tag_raw = re.split(r'\W+', row[0].strip())
+                    pos_index = tag_idx.get(tag_raw[0])
+                    wf.write(str(pos_index) + ",")
+                    text_raw = re.split(r'\W+', row[2].strip())
+                    l = [text_idx.get(w) for w in text_raw]
+                    for w in l:
+                        wf.write(str(w) + " ")
+                    wf.write("\n")
+
+    files = os.listdir(test_dir)
+    if not os.path.exists(output_test_dir):
+        os.mkdir(output_test_dir)
+    for fi in files:
+        with open(os.path.join(test_dir, fi), "r", encoding='utf-8') as f:
+            with open(
+                    os.path.join(output_test_dir, fi), "w",
+                    encoding='utf-8') as wf:
+                data_file = csv.reader(f)
+                for row in data_file:
+                    tag_raw = re.split(r'\W+', row[0].strip())
+                    pos_index = tag_idx.get(tag_raw[0])
+                    wf.write(str(pos_index) + ",")
+                    text_raw = re.split(r'\W+', row[2].strip())
+                    l = [text_idx.get(w) for w in text_raw]
+                    for w in l:
+                        wf.write(str(w) + " ")
+                    wf.write("\n")
+
+
+def text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
+                output_vocab_text, output_vocab_tag):
+    print("start construct word dict")
+    vocab_text = build_dict(2, 0, train_dir, test_dir)
+    with open(output_vocab_text, "w", encoding='utf-8') as wf:
+        wf.write(str(len(vocab_text)) + "\n")
+
+    vocab_tag = build_dict(0, 0, train_dir, test_dir)
+    with open(output_vocab_tag, "w", encoding='utf-8') as wf:
+        wf.write(str(len(vocab_tag)) + "\n")
+
+    print("construct word dict done\n")
+    write_paddle(vocab_text, vocab_tag, train_dir, test_dir, output_train_dir,
+                 output_test_dir)
+
+
+train_dir = sys.argv[1]
+test_dir = sys.argv[2]
+output_train_dir = sys.argv[3]
+output_test_dir = sys.argv[4]
+output_vocab_text = sys.argv[5]
+output_vocab_tag = sys.argv[6]
+text2paddle(train_dir, test_dir, output_train_dir, output_test_dir,
+            output_vocab_text, output_vocab_tag)
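Review note: a minimal in-memory sketch of what `build_dict` and `write_paddle` do together, for readers who want to check the ID assignment without touching the file system. The helper names `build_vocab` and `encode` are hypothetical. Unlike the script, `encode` joins IDs with single spaces and writes no trailing space.

```python
import collections
import re


def build_vocab(texts, min_word_freq=0):
    """Map each token to a zero-based ID, most frequent first, ties broken alphabetically."""
    freq = collections.Counter()
    for text in texts:
        # Same tokenization as the script: split on runs of non-word characters.
        for token in re.split(r"\W+", text.strip()):
            freq[token] += 1
    kept = [(tok, cnt) for tok, cnt in freq.items() if cnt > min_word_freq]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {tok: idx for idx, (tok, _) in enumerate(kept)}


def encode(label, text, text_idx, tag_idx):
    """Render one sample as '<tag_id>,<id> <id> ...'."""
    ids = [text_idx[tok] for tok in re.split(r"\W+", text.strip())]
    return "{},{}".format(tag_idx[label], " ".join(str(i) for i in ids))


texts = ["the cat sat.", "the dog sat."]
vocab = build_vocab(texts)
line = encode("3", "the cat sat.", vocab, {"3": 0})
print(vocab)
print(line)
```

Because `re.split` yields an empty string when the text ends in punctuation, the empty token receives its own ID here, just as it would in the script.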
--- a/models/contentunderstanding/tagspace/model.py
+++ b/models/contentunderstanding/tagspace/model.py
@@ -16,7 +16,6 @@ import paddle.fluid as fluid
 import paddle.fluid.layers.nn as nn
 import paddle.fluid.layers.tensor as tensor
 import paddle.fluid.layers.control_flow as cf
-
 from paddlerec.core.model import ModelBase
 from paddlerec.core.utils import envs

@@ -98,14 +97,19 @@ class Model(ModelBase):
            tensor.fill_constant_batch_size_like(
                input=loss_part2, shape=[-1, 1], value=0.0, dtype='float32'),
            loss_part2)
-        avg_cost = nn.mean(loss_part3)
+        avg_cost = fluid.layers.mean(loss_part3)
+
        less = tensor.cast(cf.less_than(cos_neg, cos_pos), dtype='float32')
+        label_ones = fluid.layers.fill_constant_batch_size_like(
+            input=cos_neg, dtype='float32', shape=[-1, 1], value=1.0)
        correct = nn.reduce_sum(less)
+        total = fluid.layers.reduce_sum(label_ones)
+        acc = fluid.layers.elementwise_div(correct, total)
        self._cost = avg_cost

        if is_infer:
-            self._infer_results["correct"] = correct
-            self._infer_results["cos_pos"] = cos_pos
+            self._infer_results["acc"] = acc
+            self._infer_results["loss"] = self._cost
        else:
-            self._metrics["correct"] = correct
-            self._metrics["cos_pos"] = cos_pos
+            self._metrics["acc"] = acc
+            self._metrics["loss"] = self._cost
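Review note: the new `acc` metric divides the number of samples whose negative similarity is below the positive similarity by the batch size. The plain-Python sketch below mirrors that arithmetic, assuming one `cos_pos` and one `cos_neg` value per sample, which is what the `[-1, 1]` shape of `label_ones` suggests. `pairwise_accuracy` is a hypothetical name.

```python
def pairwise_accuracy(cos_pos, cos_neg):
    """Fraction of samples where the negative score is strictly below the positive score."""
    if not cos_pos or len(cos_pos) != len(cos_neg):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mirrors: less = cast(less_than(cos_neg, cos_pos)); correct = reduce_sum(less)
    correct = sum(1.0 for pos, neg in zip(cos_pos, cos_neg) if neg < pos)
    # Mirrors: total = reduce_sum(ones shaped like cos_neg)
    total = float(len(cos_neg))
    return correct / total


acc = pairwise_accuracy([0.9, 0.2, 0.5, 0.7], [0.1, 0.4, 0.5, 0.3])
print(acc)
```

Ties count as incorrect because the comparison is strict.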
--- a/models/contentunderstanding/tagspace/readme.md
+++ b/models/contentunderstanding/tagspace/readme.md
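+# TagSpace Text Classification Model
+
+The directory structure of this example, with brief descriptions: 
+
+```
+├── data # sample data
+	├── train_data
+		├── small_train.csv # sample training data
+	├── test_data
+    	├── small_test.csv # sample test data
+	├── text2paddle.py # data processing script
+├── __init__.py
+├── README.md # documentation
+├── model.py # model file
+├── config.yaml # configuration file
+├── reader.py # data reader
+```
+
+Note: before reading this example, we recommend that you go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)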
+
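+## Contents
+
+- [Model Overview](#模型简介)
+- [Data Preparation](#数据准备)
+- [Environment](#运行环境)
+- [Quick Start](#快速开始)
+- [Reproducing Results](#效果复现)
+- [Advanced Usage](#进阶使用)
+- [FAQ](#FAQ)
+
+
+## Model Overview
+TagSpace is a method for tagging text. It learns a mapping from short texts to related topic tags. The paper uses a CNN to build the doc vector and takes the distance between f(w,t+) and f(w,t-) as the objective to optimize. This yields vector representations of t (the tag) and the doc in a shared feature space, which makes it possible to find the hashtags of a doc.  
+
+The network structure from the paper [TAGSPACE: Semantic Embeddings from Hashtags](https://www.aclweb.org/anthology/D14-1194.pdf) is shown in the figure: an input layer, a convolution layer, a pooling layer, and a final fully connected layer for dimensionality reduction.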
+<p align="center">
+<img align="center" src="../../../doc/imgs/tagspace.png">
+<p>
+
+## Data Preparation
+[Dataset](https://github.com/mhjabreel/CharCNN/tree/master/data/) , [Backup dataset](https://paddle-tagspace.bj.bcebos.com/data.tar)
+
+The data format is as follows:  
+```
+"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again."
+```
+
+## Environment
+PaddlePaddle>=1.7.2  
+
+python 2.7/3.5/3.6/3.7  
+
+PaddleRec >=0.1  
+
+os : windows/linux/macos    
+
+
+## Quick Start
+Sample data is provided for a quick trial. Run the following command from the paddlerec directory to start training: 
+
+```
+python -m paddlerec.run -m models/contentunderstanding/tagspace/config.yaml
+```   
+
+
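+## Reproducing Results
+To help users get each model running quickly, sample data is provided with every model. To reproduce the results reported in this readme, follow the steps below in order.  
+1. Make sure your current directory is PaddleRec/models/contentunderstanding/tagspace  
+2. Download and extract the dataset in the data directory with the following commands:  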
+``` 
+cd data
+wget https://paddle-tagspace.bj.bcebos.com/data.tar
+tar -xvf data.tar
+```
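+3. A script is provided to quickly convert the text in the dataset into a trainable format. After extracting the dataset, place the raw data in the raw_big_train_data and raw_big_test_data directories and run the provided text2paddle.py under Python 3. This generates the train_big_data and test_big_data directories, which can be used for training directly. The commands are: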
+```
+mkdir raw_big_train_data
+mkdir raw_big_test_data
+mv train.csv raw_big_train_data
+mv test.csv raw_big_test_data
+python3 text2paddle.py raw_big_train_data/ raw_big_test_data/ train_big_data test_big_data big_vocab_text.txt big_vocab_tag.txt
+```
+
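+The data directory after running the script:  
+
+```
+big_vocab_tag.txt  # tag vocabulary size
+big_vocab_text.txt # text vocabulary size
+data.tar  # dataset archive
+raw_big_train_data  # raw training set from the dataset
+raw_big_test_data  # raw test set from the dataset
+train_data  # sample training set
+test_data  # sample test set
+train_big_data  # processed training set
+test_big_data  # processed test set
+text2paddle.py  # preprocessing script
+```
+
+The processed data format is as follows: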
+```
+2,27 7062 8390 456 407 8 11589 3166 4 7278 31046 33 3898 2897 426 1
+2,27 9493 836 355 20871 300 81 19 3 4125 9 449 462 13832 6 16570 1380 2874 5 0 797 236 19 3688 2106 14 8615 7 209 304 4 0 123 1
+2,27 12754 637 106 3839 1532 66 0 379 6 0 1246 9 307 33 161 2 8100 36 0 350 123 101 74 181 0 6657 4 0 1222 17195 1
+```
+
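+4. Return to the tagspace directory, open config.yaml, and change the following parameters:  
+Set workspace to the absolute path of your current directory (the pwd command prints it).  
+Change batch_size of sample_1 under dataset from 10 to 128.  
+Change data_path of sample_1 under dataset to {workspace}/data/train_big_data.  
+Change batch_size of inferdata under dataset from 10 to 500.  
+Change data_path of inferdata under dataset to {workspace}/data/test_big_data.  
+
+5.  Run the following command to start training:
+```
+python -m paddlerec.run -m ./config.yaml
+```
+6. Output: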
+```
+PaddleRec: Runner infer_runner Begin
+Executor Mode: infer
+processor_register begin
+Running SingleInstance.
+Running SingleNetwork.
+Running SingleInferStartup.
+Running SingleInferRunner.
+load persistables from increment/9
+batch: 1, acc: [0.91], loss: [0.02495437]
+batch: 2, acc: [0.936], loss: [0.01941476]
+batch: 3, acc: [0.918], loss: [0.02116447]
+batch: 4, acc: [0.916], loss: [0.0219945]
+batch: 5, acc: [0.902], loss: [0.02242816]
+batch: 6, acc: [0.9], loss: [0.02421589]
+batch: 7, acc: [0.9], loss: [0.026441]
+batch: 8, acc: [0.934], loss: [0.01797657]
+batch: 9, acc: [0.932], loss: [0.01687362]
+batch: 10, acc: [0.926], loss: [0.02047823]
+batch: 11, acc: [0.918], loss: [0.01998716]
+batch: 12, acc: [0.898], loss: [0.0229556]
+batch: 13, acc: [0.928], loss: [0.01736144]
+batch: 14, acc: [0.93], loss: [0.01911209]
+```
+
+## Advanced Usage
+  
+## FAQ
--- a/models/match/dssm/config.yaml
+++ b/models/match/dssm/config.yaml
@@ -17,50 +17,52 @@ workspace: "models/match/dssm"

 dataset:
 - name: dataset_train
-  batch_size: 4
-  type: QueueDataset
+  batch_size: 8
+  type: DataLoader # or QueueDataset
  data_path: "{workspace}/data/train" 
  data_converter: "{workspace}/synthetic_reader.py"
 - name: dataset_infer
  batch_size: 1
-  type: QueueDataset
-  data_path: "{workspace}/data/train"
+  type: DataLoader # or QueueDataset
+  data_path: "{workspace}/data/test"
  data_converter: "{workspace}/synthetic_evaluate_reader.py"

 hyper_parameters:
  optimizer:
    class: sgd
-    learning_rate: 0.01
+    learning_rate: 0.001
    strategy: async
-  trigram_d: 1000
-  neg_num: 4
+  trigram_d: 1439
+  neg_num: 1
  fc_sizes: [300, 300, 128]
  fc_acts: ['tanh', 'tanh', 'tanh']

-mode: train_runner
+mode: [train_runner,infer_runner]
 # config of each runner.
 # runner is a kind of paddle training class, which wraps the train/infer process.
 runner:
 - name: train_runner
  class: train
  # num of epochs
-  epochs: 4
+  epochs: 3
  # device to run training or infer
  device: cpu
-  save_checkpoint_interval: 2 # save model interval of epochs
-  save_inference_interval: 4 # save inference
+  save_checkpoint_interval: 1 # save model interval of epochs
+  save_inference_interval: 1 # save inference
  save_checkpoint_path: "increment" # save checkpoint path
  save_inference_path: "inference" # save inference path
  save_inference_feed_varnames: ["query", "doc_pos"] # feed vars of save inference
  save_inference_fetch_varnames: ["cos_sim_0.tmp_0"] # fetch vars of save inference
  init_model_path: "" # load model path
  print_interval: 2
+  phases: phase1
 - name: infer_runner
  class: infer
  # device to run training or infer
  device: cpu
  print_interval: 1
  init_model_path: "increment/2" # load model path
+  phases: phase2

 # runner will run all the phase in each epoch
 phase:
@@ -68,7 +70,7 @@ phase:
  model: "{workspace}/model.py" # user-defined model
  dataset_name: dataset_train # select dataset by name
  thread_num: 1
-#- name: phase2
-#  model: "{workspace}/model.py" # user-defined model
-#  dataset_name: dataset_infer # select dataset by name
-#  thread_num: 1
+- name: phase2
+  model: "{workspace}/model.py" # user-defined model
+  dataset_name: dataset_infer # select dataset by name
+  thread_num: 1
--- a/models/match/dssm/data/preprocess.py
+++ b/models/match/dssm/data/preprocess.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#encoding=utf-8
+
+import os
+import sys
+import numpy as np
+import random
+
+f = open("./zhidao", "r")
+lines = f.readlines()
+f.close()
+
+# Build the word dictionary
+word_dict = {}
+for line in lines:
+    line = line.strip().split("\t")
+    text = line[0].split(" ") + line[1].split(" ")
+    for word in text:
+        if word in word_dict:
+            continue
+        else:
+            word_dict[word] = len(word_dict) + 1
+
+f = open("./zhidao", "r")
+lines = f.readlines()
+f.close()
+
+lines = [line.strip().split("\t") for line in lines]
+
+# Build a dict keyed by query whose values are the negative examples
+neg_dict = {}
+for line in lines:
+    if line[2] == "0":
+        if line[0] in neg_dict:
+            neg_dict[line[0]].append(line[1])
+        else:
+            neg_dict[line[0]] = [line[1]]
+
+# Build a dict keyed by query whose values are the positive examples
+pos_dict = {}
+for line in lines:
+    if line[2] == "1":
+        if line[0] in pos_dict:
+            pos_dict[line[0]].append(line[1])
+        else:
+            pos_dict[line[0]] = [line[1]]
+
+# Split the queries into training and test sets
+query_list = list(pos_dict.keys())
+#print(len(query))
+random.shuffle(query_list)
+train_query = query_list[:90]
+test_query = query_list[90:]
+
+# Build the training set
+train_set = []
+for query in train_query:
+    for pos in pos_dict[query]:
+        if query not in neg_dict:
+            continue
+        for neg in neg_dict[query]:
+            train_set.append([query, pos, neg])
+random.shuffle(train_set)
+
+# Build the test set
+test_set = []
+for query in test_query:
+    for pos in pos_dict[query]:
+        test_set.append([query, pos, 1])
+    if query not in neg_dict:
+        continue
+    for neg in neg_dict[query]:
+        test_set.append([query, neg, 0])
+random.shuffle(test_set)
+
+# Convert query, pos, and neg in the training set to bag-of-words vectors
+f = open("train.txt", "w")
+for line in train_set:
+    query = line[0].strip().split(" ")
+    pos = line[1].strip().split(" ")
+    neg = line[2].strip().split(" ")
+    query_token = [0] * (len(word_dict) + 1)
+    for word in query:
+        query_token[word_dict[word]] = 1
+    pos_token = [0] * (len(word_dict) + 1)
+    for word in pos:
+        pos_token[word_dict[word]] = 1
+    neg_token = [0] * (len(word_dict) + 1)
+    for word in neg:
+        neg_token[word_dict[word]] = 1
+    f.write(','.join([str(x) for x in query_token]) + "\t" + ','.join([
+        str(x) for x in pos_token
+    ]) + "\t" + ','.join([str(x) for x in neg_token]) + "\n")
+f.close()
+
+# Convert query and pos in the test set to bag-of-words vectors
+f = open("test.txt", "w")
+fa = open("label.txt", "w")
+for line in test_set:
+    query = line[0].strip().split(" ")
+    pos = line[1].strip().split(" ")
+    label = line[2]
+    query_token = [0] * (len(word_dict) + 1)
+    for word in query:
+        query_token[word_dict[word]] = 1
+    pos_token = [0] * (len(word_dict) + 1)
+    for word in pos:
+        pos_token[word_dict[word]] = 1
+    f.write(','.join([str(x) for x in query_token]) + "\t" + ','.join(
+        [str(x) for x in pos_token]) + "\n")
+    fa.write(str(label) + "\n")
+f.close()
+fa.close()
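Review note: the script encodes every sentence as a multi-hot bag-of-words vector of length `len(word_dict) + 1`, with word IDs starting at 1 so that index 0 is never set. A compact sketch of that encoding follows. The helper names are hypothetical, and the toy sentences stand in for the zhidao file.

```python
def build_word_dict(sentences):
    """Assign IDs starting at 1, in order of first appearance."""
    word_dict = {}
    for sentence in sentences:
        for word in sentence.split(" "):
            if word not in word_dict:
                word_dict[word] = len(word_dict) + 1
    return word_dict


def to_multi_hot(sentence, word_dict):
    """Return a 0/1 vector with a 1 at the ID of every word in the sentence."""
    vector = [0] * (len(word_dict) + 1)  # slot 0 is left unused, as in the script
    for word in sentence.strip().split(" "):
        vector[word_dict[word]] = 1
    return vector


word_dict = build_word_dict(["a b", "b c"])
vector = to_multi_hot("b c", word_dict)
print(",".join(str(x) for x in vector))
```

The vector length has to match the model's input dimension, which is presumably why `trigram_d` is set to 1439 in config.yaml for the sample data.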
--- a/models/match/dssm/data/test/test.txt
+++ b/models/match/dssm/data/test/test.txt
--- a/models/match/dssm/data/train/sample_train.txt
+++ b/models/match/dssm/data/train/sample_train.txt
--- a/models/match/dssm/data/train/train.txt
+++ b/models/match/dssm/data/train/train.txt
--- a/models/match/dssm/model.py
+++ b/models/match/dssm/model.py
@@ -73,6 +73,7 @@ class Model(ModelBase):

        query_fc = fc(inputs[0], self.hidden_layers, self.hidden_acts,
                      ['query_l1', 'query_l2', 'query_l3'])
+
        doc_pos_fc = fc(inputs[1], self.hidden_layers, self.hidden_acts,
                        ['doc_pos_l1', 'doc_pos_l2', 'doc_pos_l3'])
        R_Q_D_p = fluid.layers.cos_sim(query_fc, doc_pos_fc)
@@ -93,7 +94,7 @@ class Model(ModelBase):
        prob = fluid.layers.softmax(concat_Rs, axis=1)

        hit_prob = fluid.layers.slice(
-            prob, axes=[0, 1], starts=[0, 0], ends=[4, 1])
+            prob, axes=[0, 1], starts=[0, 0], ends=[8, 1])
        loss = -fluid.layers.reduce_sum(fluid.layers.log(hit_prob))
        avg_cost = fluid.layers.mean(x=loss)
        self._cost = avg_cost
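Review note: the loss here is a softmax over each row of similarities (positive doc in column 0, negatives after it), followed by the negative sum of the log-probabilities in column 0. The pure-Python sketch below reproduces that arithmetic under the assumption that every row of the batch should contribute. `dssm_loss` is a hypothetical name, not the PaddleRec implementation.

```python
import math


def dssm_loss(similarities):
    """similarities: list of rows [sim_pos, sim_neg_1, ...]; returns -sum(log p(pos))."""
    total = 0.0
    for row in similarities:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(value - peak) for value in row]
        hit_prob = exps[0] / sum(exps)  # softmax probability of the positive doc
        total -= math.log(hit_prob)
    return total


loss_one = dssm_loss([[1.0, 1.0]])
loss_two = dssm_loss([[1.0, 1.0], [0.3, 0.3]])
print(loss_one, loss_two)
```

Iterating over all rows avoids tying the computation to a fixed batch size, whereas the hard-coded `ends=[8, 1]` in the diff has to be kept in step with `batch_size` in config.yaml.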

--- a/models/match/dssm/readme.md
+++ b/models/match/dssm/readme.md
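+# DSSM Text Matching Model
+
+The directory structure of this example, with brief descriptions: 
+
+```
+├── data # sample data
+	├── train
+		├── train.txt # sample training data
+	├── test
+    	├── test.txt # sample test data
+	├── preprocess.py # data processing script
+├── __init__.py
+├── README.md # documentation
+├── model.py # model file
+├── config.yaml # configuration file
+├── synthetic_reader.py # reader for the training set
+├── synthetic_evaluate_reader.py # reader for the test set
+├── transform.py # arranges the data into a format suitable for computing metrics
+├── run.sh # training script for the full dataset, from training through prediction to metric computation
+
+```
+
+Note: before reading this example, we recommend that you go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)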
+
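+## Contents
+
+- [Model Overview](#模型简介)
+- [Data Preparation](#数据准备)
+- [Environment](#运行环境)
+- [Quick Start](#快速开始)
+- [Reproducing Results](#效果复现)
+- [Advanced Usage](#进阶使用)
+- [FAQ](#FAQ)
+
+
+## Model Overview
+DSSM stands for Deep Structured Semantic Model, a deep-network-based semantic model. Its core idea is to map the query and the doc into a semantic space of the same dimension and to maximize the cosine similarity between their semantic vectors, thereby training a latent semantic model for retrieval. DSSM is widely used, for example in search engine retrieval, ad relevance, question answering, and machine translation.    
+DSSM takes its input as a BOW (bag of words), which discards word position information: all the words of a sentence are put into a single bag. Each sentence is converted into a vector this way and fed into the DNN.  
+The semantic similarity between a Query and a Doc is represented by the cosine similarity of the two vectors, and a softmax function then selects the Doc that is semantically closest to the Query.  
+
+For details of the model, see the paper [DSSM](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf):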
+<p align="center">
+<img align="center" src="../../../doc/imgs/dssm.png">
+<p>
+
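+## Data Preparation
+We have released our self-built test sets, comprising four datasets: Baidu Zhidao, ECOM, QQSIM, and UNICOM. Here we use the Baidu Zhidao dataset for training. Run the commands below to download these datasets.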
+```
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
+tar xzf simnet_dataset-1.0.0.tar.gz
+rm simnet_dataset-1.0.0.tar.gz
+```
+
+## Environment
+PaddlePaddle>=1.7.2
+
+python 2.7/3.5/3.6/3.7
+
+PaddleRec >=0.1
+
+os : windows/linux/macos
+
+## Quick Start
+Sample data is provided for a quick trial. Run the following command from the paddlerec directory to start training: 
+
+```
+python -m paddlerec.run -m models/match/dssm/config.yaml
+```   
+
+Sample output:
+```
+PaddleRec: Runner train_runner Begin
+Executor Mode: train
+processor_register begin
+Running SingleInstance.
+Running SingleNetwork.
+file_list : ['models/match/dssm/data/train/train.txt']
+Running SingleStartup.
+Running SingleRunner.
+!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.
+CPU_NUM indicates that how many CPUPlace are used in the current task.
+And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.
+export CPU_NUM=32 # for example, set CPU_NUM as number of physical CPU core which is 32.
+!!! The default number of CPU_NUM=1.
+I0821 06:56:26.224299 31061 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
+I0821 06:56:26.231163 31061 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
+I0821 06:56:26.237023 31061 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
+I0821 06:56:26.240788 31061 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
+batch: 2, LOSS: [4.538238]
+batch: 4, LOSS: [4.16424]
+batch: 6, LOSS: [3.8121371]
+batch: 8, LOSS: [3.4250507]
+batch: 10, LOSS: [3.2285979]
+batch: 12, LOSS: [3.2116117]
+batch: 14, LOSS: [3.1406002]
+epoch 0 done, use time: 0.357971906662, global metrics: LOSS=[3.0968776]
+batch: 2, LOSS: [2.6843479]
+batch: 4, LOSS: [2.546976]
+batch: 6, LOSS: [2.4103594]
+batch: 8, LOSS: [2.301374]
+batch: 10, LOSS: [2.264183]
+batch: 12, LOSS: [2.315862]
+batch: 14, LOSS: [2.3409634]
+epoch 1 done, use time: 0.22123003006, global metrics: LOSS=[2.344321]
+batch: 2, LOSS: [2.0882485]
+batch: 4, LOSS: [2.006743]
+batch: 6, LOSS: [1.9231766]
+batch: 8, LOSS: [1.8850241]
+batch: 10, LOSS: [1.8829436]
+batch: 12, LOSS: [1.9336565]
+batch: 14, LOSS: [1.9784685]
+epoch 2 done, use time: 0.212922096252, global metrics: LOSS=[1.9934461]
+PaddleRec Finish
+```
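+## Reproducing Results
+To help users get each model running quickly, sample data is provided with every model. To reproduce the results reported in this readme, follow the steps below in order.  
+1. Make sure your current directory is PaddleRec/models/match/dssm
+2. Download and extract the dataset in the data directory with the following commands:  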
+``` 
+cd data
+wget --no-check-certificate https://baidu-nlp.bj.bcebos.com/simnet_dataset-1.0.0.tar.gz
+tar xzf simnet_dataset-1.0.0.tar.gz
+rm simnet_dataset-1.0.0.tar.gz
+```
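+3. A script is provided to quickly convert the Chinese text in the dataset into a trainable format. After extracting the dataset, you will find a file named zhidao in the directory. Run the provided preprocess.py under Python 3 to generate test.txt, train.txt, and label.txt, which can be used for training directly, then move them into the train and test directories for use during training. The commands are: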
+```
+mv data/zhidao ./
+rm -rf data
+python3 preprocess.py
+rm -f ./train/train.txt
+mv train.txt ./train
+rm -f ./test/test.txt
+mv test.txt test
+cd ..
+```
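+Format after preprocessing:  
+The training set consists of three sparse bag-of-words (BOW) vectors: query, pos, neg  
+The test set consists of two sparse bag-of-words (BOW) vectors: query, pos  
+label.txt holds the labels for the corresponding test set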
+
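+4. Go back to the dssm directory, open config.yaml, and change the following parameters  
+
+Set workspace to the absolute path of your current directory. (You can get it with the pwd command.)  
+Change batch_size in dataset_train from 8 to 128
+In model.py, change hit_prob = fluid.layers.slice(prob, axes=[0, 1], starts=[0, 0], ends=[8, 1])  
+    to hit_prob = fluid.layers.slice(prob, axes=[0, 1], starts=[0, 0], ends=[128, 1]). Whenever you change the batch size, the first element of ends must change with it.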
+
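Why the first element of ends has to follow the batch size can be seen in a plain-Python sketch of what such a two-axis slice selects. The helper `slice_2d` is hypothetical and only mimics a slice with starts=[0, 0] on a list of rows; it is not the fluid API, and the assumption that the first column holds the positive-document probability is made for illustration only.

```python
def slice_2d(rows, starts, ends):
    """Hypothetical helper: keep rows[starts[0]:ends[0]] and, in each kept row,
    the columns starts[1]:ends[1]."""
    return [row[starts[1]:ends[1]] for row in rows[starts[0]:ends[0]]]


batch_size = 128
# Fake softmax output: one row per sample; the first column is assumed to be
# the probability assigned to the positive document.
prob = [[0.5, 0.25, 0.25] for _ in range(batch_size)]

stale = slice_2d(prob, starts=[0, 0], ends=[8, 1])           # ends left at the old batch size
fresh = slice_2d(prob, starts=[0, 0], ends=[batch_size, 1])  # ends follows the batch size

print(len(stale), len(fresh))
```

With ends left at 8, only the first 8 of the 128 rows would reach the loss in this sketch.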
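+5.  Run the script to start training. The script runs python -m paddlerec.run -m ./config.yaml to launch training and writes the output to the result file. It then runs transform.py to merge the data, and finally computes the positive-negative order ratio (pnr):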
+```
+sh run.sh
+```
+
+Sample output:
+```
+................run.................
+!!! The CPU_NUM is not specified, you should set CPU_NUM in the environment variable list.
+CPU_NUM indicates that how many CPUPlace are used in the current task.
+And if this parameter are set as N (equal to the number of physical CPU core) the program may be faster.
+
+export CPU_NUM=32 # for example, set CPU_NUM as number of physical CPU core which is 32.
+
+!!! The default number of CPU_NUM=1.
+I0821 07:16:04.512531 32200 parallel_executor.cc:440] The Program will be executed on CPU using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
+I0821 07:16:04.515708 32200 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
+I0821 07:16:04.518872 32200 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
+I0821 07:16:04.520995 32200 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
+75
+pnr: 2.25581395349
+query_num: 11
+pair_num: 184 184
+equal_num: 44
+正序率： 0.692857142857
+97 43
+```
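+6. Note: because training and testing use a small dataset, the metrics can fluctuate considerably. If the metrics you get fall short of expectations, repeat step 5 a few times to obtain reasonable values.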
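For reference, the positive-negative order ratio can be sketched as follows. This is an illustrative assumption about how such a metric is typically computed, not the code of tools/cal_pos_neg.py: within each query, a pair of documents with different labels counts as correctly ordered when the higher-labelled document also has the higher similarity, and as reversed otherwise. The `(query, similarity, label)` input layout is likewise assumed. In the sample output above, the final line 97 43 is consistent with pnr = 97 / 43 ≈ 2.2558 and a correct-order rate of 97 / 140 ≈ 0.6929.

```python
from collections import defaultdict
from itertools import combinations


def pos_neg_counts(rows):
    """rows: iterable of (query, similarity, label) tuples (assumed layout).
    Returns (correctly ordered pairs, reversed pairs)."""
    by_query = defaultdict(list)
    for query, sim, label in rows:
        by_query[query].append((sim, label))

    right = wrong = 0
    for docs in by_query.values():
        for (sim_a, label_a), (sim_b, label_b) in combinations(docs, 2):
            if label_a == label_b or sim_a == sim_b:
                continue  # assumption: equal labels or tied scores are not counted
            if (sim_a > sim_b) == (label_a > label_b):
                right += 1
            else:
                wrong += 1
    return right, wrong


rows = [
    ("q1", 0.9, 1), ("q1", 0.4, 0), ("q1", 0.7, 0),
    ("q2", 0.2, 1), ("q2", 0.6, 0),
]
right, wrong = pos_neg_counts(rows)
print("pnr:", right / wrong)
```

Returning the two counts rather than the ratio leaves the caller to decide what to do when there are no reversed pairs.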
+
+## Advanced usage
+  
+## FAQ
--- a/models/match/dssm/run.sh
+++ b/models/match/dssm/run.sh
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#!/bin/bash
+echo "................run................."
+python -m paddlerec.run -m ./config.yaml >result1.txt
+grep -i "query_doc_sim" ./result1.txt >./result2.txt
+sed '$d' result2.txt >result.txt
+rm -f result1.txt
+rm -f result2.txt
+python transform.py
+sort -t $'\t' -k1,1 -k 2nr,2 pair.txt >result.txt
+rm -f pair.txt
+python ../../../tools/cal_pos_neg.py result.txt
--- a/models/match/dssm/transform.py
+++ b/models/match/dssm/transform.py
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import random
+import numpy as np
+import sklearn.metrics
+
+label = []
+filename = './data/label.txt'
+f = open(filename, "r")
+f.readline()
+num = 0
+for line in f.readlines():
+    num = num + 1
+    line = line.strip()
+    label.append(line)
+f.close()
+print(num)
+
+filename = './result.txt'
+sim = []
+for line in open(filename):
+    line = line.strip().split(",")
+    line[1] = line[1].split(":")
+    line = line[1][1].strip(" ")
+    line = line.strip("[")
+    line = line.strip("]")
+    sim.append(float(line))
+
+filename = './data/test/test.txt'
+f = open(filename, "r")
+f.readline()
+query = []
+for line in f.readlines():
+    line = line.strip().split("\t")
+    query.append(line[0])
+f.close()
+
+filename = 'pair.txt'
+f = open(filename, "w")
+for i in range(len(sim)):
+    f.write(str(query[i]) + "\t" + str(sim[i]) + "\t" + str(label[i]) + "\n")
+f.close()
--- a/models/match/match-pyramid/readme.md
+++ b/models/match/match-pyramid/readme.md
 # match-pyramid text matching model

-## Introduction
+The directory layout of this example, with brief descriptions: 
+
+```
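+├── data # sample data
+    ├── process.py # data processing script
+    ├── relation.test.fold1.txt # relation file used when computing evaluation metrics
+    ├── train
+    	├── train.txt # sample training data
+    ├── test
+    	├── test.txt # sample test data
+├── __init__.py
+├── README.md # documentation
+├── model.py # model definition
+├── config.yaml # configuration file
+├── data_process.sh # data download and processing script
+├── eval.py # evaluation program that computes the metric
+├── run.sh # one-click run script
+├── test_reader.py # test set reader
+├── train_reader.py # training set reader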
+```
+
+Note: before reading this example, we recommend that you first go through the following:
+
+[PaddleRec getting-started tutorial](https://github.com/PaddlePaddle/PaddleRec/blob/master/README.md)
+
+## Contents
+
+- [Model overview](#模型简介)
+- [Data preparation](#数据准备)
+- [Environment](#运行环境)
+- [Quick start](#快速开始)
+- [Reproducing the paper](#论文复现)
+- [Advanced usage](#进阶使用)
+- [FAQ](#FAQ)
+
+
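+## Model overview
 Matching two texts is a fundamental problem in many natural language processing tasks. An effective approach is to extract meaningful matching patterns from words, phrases and sentences to produce a matching score. Inspired by the success of convolutional neural networks in image recognition, where neurons can capture many complex patterns built from extracted elementary visual patterns (such as oriented edges and corners), we model text matching as an image recognition problem. This model is aligned with the TensorFlow code open-sourced by the original author Liang Pang (庞亮): https://github.com/pl8787/MatchPyramid-TensorFlow/blob/master/model/model_mp.py , and implements the Match-Pyramid model proposed in the following paper: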

 ```text
@@ -19,8 +55,23 @@
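 3. Relation file: stores the relation between two sentences, such as the relation between a query and a document. Examples: relation.train.fold1.txt, relation.test.fold1.txt  
 4. Embedding file: we store the pre-trained word vectors in the embedding file. Example: embed_wiki-pdc_d50_norm  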

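-## Data download and preprocessing
-We provide a script that downloads the dataset and generates the training and test data in one step; just run: bash data_process.sh  
+## Environment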
+PaddlePaddle>=1.7.2
+python 2.7/3.5/3.6/3.7
+PaddleRec >=0.1
+os : windows/linux/macos
+
+## Quick start
+
+Sample data is provided so you can try the model quickly. Run the following command from the paddlerec directory to start training: 
+
+```
+python -m paddlerec.run -m models/match/match-pyramid/config.yaml
+```   
+
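+## Reproducing the paper
+1. Make sure your current directory is PaddleRec/models/match/match-pyramid
+2. We provide a script that downloads the original dataset and generates the training and test data in one step; just run: bash data_process.sh  
 Running this script downloads the Letor07 dataset from a server in China, deletes the existing relation.test.fold1.txt and relation.train.fold1.txt in the data folder, and extracts the full dataset into the data folder. It then runs process.py to place the full training data in `./data/train` and the full test data in `./data/test`, and generates the embedding.npy file used to initialize the embedding layer  
 The expected output of the script is:  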
 ```
@@ -69,9 +120,11 @@ data/embed_wiki-pdc_d50_norm
 [./data/relation.test.fold1.txt]
        Instance size: 13652
 ```
+3. Open config.yaml and change the following parameter  
+
+Set workspace to the absolute path of your current directory. (You can get it with the pwd command.)

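-## One-click training, testing and evaluation
-We provide a script that runs training, testing and evaluation in one step; just run: bash run.sh  
+4. Then run bash run.sh to reproduce the results from the paper  
 The script runs python -m paddlerec.run -m ./config.yaml to train and test the model, saves the test results to result.txt, and finally runs eval.py to compute the MAP metric on the data  
 The expected output of the script is:  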
 ```
@@ -79,16 +132,7 @@ data/embed_wiki-pdc_d50_norm
 13651
 336
 ('map=', 0.420878322843591)
-```
-
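-## What each file does
-paddlerec can:  
-specify the model parameters through config.yaml  
-define the model network through model.py  
-read the training set with train_reader.py  
-read the test set with test_reader.py.  
-In addition, this example provides:  
-data_process.sh to process the data in one step  
-run.sh to start training in one step and produce the test results directly  
-eval.py to compute the MAP metric from the saved test results  
-For details on using paddlerec, see the tutorials at the bottom of https://github.com/PaddlePaddle/PaddleRec/blob/master/README_CN.md .    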
+```  
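The map value printed above is a mean average precision. As a rough Python 3 sketch of that metric (written for illustration; it is not the repository's eval.py, and the input layout is an assumption): for each query, rank the documents by predicted score, average the precision at every rank where a relevant document appears, then average over queries.

```python
def average_precision(scored_docs):
    """scored_docs: list of (score, label) pairs for one query (assumed layout);
    a label greater than 0 marks a relevant document."""
    ranked = sorted(scored_docs, key=lambda d: d[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label > 0:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(queries):
    """queries: dict mapping a query id to its list of (score, label) pairs."""
    return sum(average_precision(docs) for docs in queries.values()) / len(queries)


queries = {
    "q1": [(0.9, 1), (0.8, 0), (0.3, 1)],  # relevant at ranks 1 and 3
    "q2": [(0.2, 1), (0.7, 0)],            # relevant at rank 2
}
print("map=", mean_average_precision(queries))
```

Queries without any relevant document contribute 0.0 here; whether they should instead be skipped is a design choice this sketch does not settle.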
+## Advanced usage
+  
+## FAQ
--- a/models/match/match-pyramid/run.sh
+++ b/models/match/match-pyramid/run.sh
 #!/bin/bash
 echo "................run................."
 python -m paddlerec.run -m ./config.yaml >result1.txt
-grep -A1 "prediction" ./result1.txt >./result.txt
+grep -i "prediction" ./result1.txt >./result.txt
 rm -f result1.txt
 python eval.py
--- a/models/match/readme.md
+++ b/models/match/readme.md
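 # Matching model library

 ## Introduction
-We provide PaddleRec implementations of the models commonly used in matching tasks, along with single-machine training & prediction accuracy metrics and distributed training & prediction performance metrics. The implemented models include [DSSM](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/dssm) and [MultiView-Simnet](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/multiview-simnet).
+We provide PaddleRec implementations of the models commonly used in matching tasks, along with single-machine training & prediction accuracy metrics and distributed training & prediction performance metrics. The implemented models include [DSSM](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/dssm), [MultiView-Simnet](http://gitlab.baidu.com/tangwei12/paddlerec/tree/develop/models/match/multiview-simnet) and [match-pyramid](https://github.com/PaddlePaddle/PaddleRec/tree/master/models/match/match-pyramid).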

 The model library is continuously being extended; stay tuned.

@@ -18,6 +18,8 @@
 | :------------------: | :--------------------: | :---------: |
 | DSSM | Deep Structured Semantic Models | [CIKM 2013][Learning Deep Structured Semantic Models for Web Search using Clickthrough Data](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf) |
 | MultiView-Simnet | Multi-view Simnet for Personalized recommendation | [WWW 2015][A Multi-View Deep Learning Approach for Cross Domain User Modeling in Recommendation Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/frp1159-songA.pdf) |
+| match-pyramid | Text Matching as Image Recognition | [AAAI 2016][Text Matching as Image Recognition](https://arxiv.org/pdf/1602.06359.pdf) |
+

 Below is a brief introduction to each model (note: the figures are taken from the linked papers)

@@ -31,24 +33,26 @@
 <img align="center" src="../../doc/imgs/multiview-simnet.png">
 <p>

-## Tutorial (quick start)
-### Training
-```shell
-git clone https://github.com/PaddlePaddle/PaddleRec.git paddle-rec
-cd paddle-rec
+[match-pyramid](https://arxiv.org/pdf/1602.06359.pdf):
+<p align="center">
+<img align="center" src="../../doc/imgs/match-pyramid.png">
+<p>

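+## Tutorial (quick start)
+### Training & prediction
+Every model ships with sample data so you can try it quickly. Run the following commands from the paddlerec directory to start training: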
+```
 python -m paddlerec.run -m models/match/dssm/config.yaml # dssm
 python -m paddlerec.run -m models/match/multiview-simnet/config.yaml # multiview-simnet
+python -m paddlerec.run -m models/match/match-pyramid/config.yaml # match-pyramid
 ```
+### Reproducing the results
+Each model's readme contains a detailed tutorial on reproducing the results; see the model's directory for details  

-### Prediction
-```shell
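-# Edit the model's config.yaml and set workspace to the absolute path of the current directory
-# Edit the model's config.yaml and set mode to infer_runner
-# Example: mode: train_runner -> mode: infer_runner
-# In infer_runner, set class to class: infer
-# Update the configuration of the phase named infer, following the comments in config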
+### Model results (test)

-# After editing config.yaml, run:
-python -m paddlerec.run -m ./config.yaml # using dssm as an example
-```
+|       Dataset        |       Model       |      Positive-negative ratio (pnr)          |       map       |  
+| :------------------: | :--------------------: | :---------: |:---------: |
+|       zhidao       |       DSSM       |       2.25        |       --          | 
+|       Letor07        |       match-pyramid       |       --        |      0.42          | 
+|       zhidao        |       multiview-simnet       |       1.72        |       --          |