-
-
-
-
-"""
-
-
-def convert_markdown_into_html(argv=None):
-    parser = argparse.ArgumentParser()
-    parser.add_argument('filenames', nargs='*', help='Filenames to fix')
-    args = parser.parse_args(argv)
-
-    retv = 0
-
-    for filename in args.filenames:
-        with open(
-                re.sub(r"README", "index", re.sub(r"\.md$", ".html", filename)),
-                "w") as output:
-            output.write(HEAD)
-            with open(filename) as input:
-                for line in input:
-                    output.write(line)
-            output.write(TAIL)
-
-    return retv
-
-
-if __name__ == '__main__':
-    sys.exit(convert_markdown_into_html())
diff --git a/.travis.yml b/.travis.yml
index 0f67f656fde89e087d1324c2a19db2f506e930d2..52bfd5a1ba02b8ff32ef4248e00530fdd1319174 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -17,20 +17,26 @@ addons:
- python-pip
- python2.7-dev
ssh_known_hosts: 52.76.173.135
+
before_install:
- sudo pip install -U virtualenv pre-commit pip
- docker pull paddlepaddle/paddle:latest
+
script:
- - .travis/precommit.sh
- - docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
- 'cd /py_unittest; sh .travis/unittest.sh'
+ - exit_code=0
+ - .travis/precommit.sh || exit_code=$(( exit_code | $? ))
+ - docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
+ 'cd /py_unittest; sh .travis/unittest.sh' || exit_code=$(( exit_code | $? ))
- |
- if [[ "$TRAVIS_PULL_REQUEST" != "false" ]]; then exit 0; fi;
- if [[ "$TRAVIS_BRANCH" != "develop" && ! "$TRAVIS_BRANCH" =~ ^v[[:digit:]]+\.[[:digit:]]+(\.[[:digit:]]+)?(-\S*)?$ ]]; then echo "not develop branch, no deploy"; exit 0; fi;
+ if [[ "$TRAVIS_PULL_REQUEST" != "false" ]]; then exit $exit_code; fi;
+ if [[ "$TRAVIS_BRANCH" != "develop" && ! "$TRAVIS_BRANCH" =~ ^v[[:digit:]]+\.[[:digit:]]+(\.[[:digit:]]+)?(-\S*)?$ ]]; then echo "not develop branch, no deploy"; exit $exit_code; fi;
export DEPLOY_DOCS_SH=https://raw.githubusercontent.com/PaddlePaddle/PaddlePaddle.org/master/scripts/deploy/deploy_docs.sh
export MODELS_DIR=`pwd`
cd ..
curl $DEPLOY_DOCS_SH | bash -s $CONTENT_DEC_PASSWD $TRAVIS_BRANCH $MODELS_DIR
+ exit_code=$(( exit_code | $? ))
+ exit $exit_code
+
notifications:
email:
on_success: change
diff --git a/conv_seq2seq/README.md b/conv_seq2seq/README.md
index 64f98adc9f4624dd8aa915644957a994f6908a70..75ea8770266cc277843608de8320d74d54d1e8e4 100644
--- a/conv_seq2seq/README.md
+++ b/conv_seq2seq/README.md
@@ -7,51 +7,51 @@ Jonas Gehring, Micheal Auli, David Grangier, et al. Convolutional Sequence to Se
- In this tutorial, each line in a data file contains one sample, and each sample consists of a source sentence and a target sentence separated by '\t'. To use your own data, it should be organized as follows:
- ```
- <source sentence>\t<target sentence>
- ```
+ ```
+ <source sentence>\t<target sentence>
+ ```
# Training a Model
- Modify the following script if needed and then run:
- ```bash
- python train.py \
- --train_data_path ./data/train_data \
- --test_data_path ./data/test_data \
- --src_dict_path ./data/src_dict \
- --trg_dict_path ./data/trg_dict \
- --enc_blocks "[(256, 3)] * 5" \
- --dec_blocks "[(256, 3)] * 3" \
- --emb_size 256 \
- --pos_size 200 \
- --drop_rate 0.1 \
- --use_gpu False \
- --trainer_count 1 \
- --batch_size 32 \
- --num_passes 20 \
- >train.log 2>&1
- ```
+ ```bash
+ python train.py \
+ --train_data_path ./data/train_data \
+ --test_data_path ./data/test_data \
+ --src_dict_path ./data/src_dict \
+ --trg_dict_path ./data/trg_dict \
+ --enc_blocks "[(256, 3)] * 5" \
+ --dec_blocks "[(256, 3)] * 3" \
+ --emb_size 256 \
+ --pos_size 200 \
+ --drop_rate 0.1 \
+ --use_gpu False \
+ --trainer_count 1 \
+ --batch_size 32 \
+ --num_passes 20 \
+ >train.log 2>&1
+ ```
# Inferring by a Trained Model
- Infer by a trained model by running:
- ```bash
- python infer.py \
- --infer_data_path ./data/infer_data \
- --src_dict_path ./data/src_dict \
- --trg_dict_path ./data/trg_dict \
- --enc_blocks "[(256, 3)] * 5" \
- --dec_blocks "[(256, 3)] * 3" \
- --emb_size 256 \
- --pos_size 200 \
- --drop_rate 0.1 \
- --use_gpu False \
- --trainer_count 1 \
- --max_len 100 \
- --beam_size 1 \
- --model_path ./params.pass-0.tar.gz \
- 1>infer_result 2>infer.log
- ```
+ ```bash
+ python infer.py \
+ --infer_data_path ./data/infer_data \
+ --src_dict_path ./data/src_dict \
+ --trg_dict_path ./data/trg_dict \
+ --enc_blocks "[(256, 3)] * 5" \
+ --dec_blocks "[(256, 3)] * 3" \
+ --emb_size 256 \
+ --pos_size 200 \
+ --drop_rate 0.1 \
+ --use_gpu False \
+ --trainer_count 1 \
+ --max_len 100 \
+ --beam_size 1 \
+ --model_path ./params.pass-0.tar.gz \
+ 1>infer_result 2>infer.log
+ ```
# Notes
diff --git a/conv_seq2seq/model.py b/conv_seq2seq/model.py
index 01dd94288b4bbee2c4099a029ac042cec0fdc53d..85f23862ce53871edc37f2c0a617f0130798a66b 100644
--- a/conv_seq2seq/model.py
+++ b/conv_seq2seq/model.py
@@ -147,7 +147,8 @@ def encoder(token_emb,
encoded_sum = paddle.layer.addto(input=[encoded_vec, embedding])
# halve the variance of the sum
- encoded_sum = paddle.layer.slope_intercept(input=encoded_sum, slope=math.sqrt(0.5))
+ encoded_sum = paddle.layer.slope_intercept(
+ input=encoded_sum, slope=math.sqrt(0.5))
return encoded_vec, encoded_sum
diff --git a/ctr/index.html b/ctr/index.html
deleted file mode 100644
index 78dc8e825928780f6998fbd9f8178479c4c2aa04..0000000000000000000000000000000000000000
--- a/ctr/index.html
+++ /dev/null
@@ -1,403 +0,0 @@
-# Click-Through Rate Prediction
-
-## Introduction
-
-CTR (Click-Through Rate)\[[1](https://en.wikipedia.org/wiki/Click-through_rate)\]
-is the predicted probability that a user clicks on an advertisement. CTR models are widely used in the advertising industry, and accurate click-rate estimates are important for maximizing online advertising revenue.
-
-When there are multiple ad slots, CTR estimates are generally used as a baseline for ranking. For example, in a search engine's ad system, when the user enters a query, the system typically performs the following steps to show relevant ads.
-
-1. Retrieve the ad collection associated with the user's search term.
-2. Filter by business rules and relevance.
-3. Rank by auction mechanism and CTR.
-4. Show the ads.
-
-Here, CTR plays a crucial role.
-
-### Brief history
-Historically, the CTR prediction model has been evolving as follows.
-
-- Logistic Regression(LR) / Gradient Boosting Decision Trees (GBDT) + feature engineering
-- LR + Deep Neural Network (DNN)
-- DNN + feature engineering
-
-In the early stages of development LR dominated, but in recent years DNN-based models have become the mainstream.
-
-
-### LR vs DNN
-
-The following figure shows the structures of the LR and DNN models:
-
-
-
-Figure 1. LR and DNN model structure comparison
-
-
-We can see that LR and DNN share some common structure. However, by adding activation units and further layers, a DNN can model non-linear relations between input and output values, which enables it to achieve better learning results in CTR estimation.
-
-In the following, we demonstrate how to use PaddlePaddle to learn to predict CTR.
-
-## Data and Model formation
-
-Here `click` is the learning objective. There are several ways to learn it:
-
-1. Directly learn the click as a 0/1 binary classification problem.
-2. Learning to rank (pairwise rank or listwise rank).
-3. Estimate the click rate of each ad, then rank by the click rate.
-
-In this example, we use the first method.
-
-We use the Kaggle `Click-through rate prediction` task \[[2](https://www.kaggle.com/c/avazu-ctr-prediction/data)\].
-
-Please see the [data process](./dataset.md) for pre-processing data.
-
-The input data format for the demo model in this tutorial is as follows:
-
-```
-# <dnn input ids>\t<lr input sparse values>\t<click>
-1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0
-23 231 \t 1230:0.12 13421:0.9 \t 1
-```
-
-Description:
-
-- `dnn input ids`: one-hot coded ids.
-- `lr input sparse values`: `ID:VALUE` pairs; values are preferably scaled to the range `[-1, 1]`.
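
The tab-separated format above can be parsed in a few lines of Python. This is an illustrative sketch only; the `parse_ctr_line` helper and the reconstructed field names are ours, not part of the example code:

```python
def parse_ctr_line(line):
    # Fields: <dnn input ids> \t <lr input sparse values> \t <click>
    dnn_field, lr_field, click = line.rstrip("\n").split("\t")
    dnn_ids = [int(tok) for tok in dnn_field.split()]
    lr_values = [(int(k), float(v))
                 for k, v in (tok.split(":") for tok in lr_field.split())]
    return dnn_ids, lr_values, int(click)

print(parse_ctr_line("1 23 190 \t 230:0.12 3421:0.9 23451:0.12 \t 0"))
```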
-
-In addition, training requires a file that describes the input dimensions of the dnn and lr submodels, in the following format:
-
-```
-dnn_input_dim: <int>
-lr_input_dim: <int>
-```
-
-`<int>` represents an integer value.
-
-The script `avazu_data_processer.py` downloads the dataset \[[2](#references)\] and pre-processes the data.
-
-```
-usage: avazu_data_processer.py [-h] --data_path DATA_PATH --output_dir
- OUTPUT_DIR
- [--num_lines_to_detect NUM_LINES_TO_DETECT]
- [--test_set_size TEST_SET_SIZE]
- [--train_size TRAIN_SIZE]
-
-PaddlePaddle CTR example
-
-optional arguments:
- -h, --help show this help message and exit
- --data_path DATA_PATH
- path of the Avazu dataset
- --output_dir OUTPUT_DIR
- directory to output
- --num_lines_to_detect NUM_LINES_TO_DETECT
- number of records to detect dataset's meta info
- --test_set_size TEST_SET_SIZE
- size of the validation dataset(default: 10000)
- --train_size TRAIN_SIZE
- size of the trainset (default: 100000)
-```
-
-- `data_path` The path of the data to be processed
-- `output_dir` The directory to output to
-- `num_lines_to_detect` The number of records used to detect the dataset's meta info
-- `test_set_size` The number of rows in the test set
-- `train_size` The number of rows in the training set
-
-## Wide & Deep Learning Model
-
-Google proposed the Wide & Deep Learning framework to combine the advantages of DNNs, which are suited to learning abstract features, and LR models, which handle large sparse features well.
-
-
-### Introduction to the model
-
-Wide & Deep Learning Model\[[3](#References)\] is a relatively mature model and is still widely used for CTR prediction. Here we demonstrate how to use it to complete the CTR prediction task.
-
-The model structure is as follows:
-
-
-
-Figure 2. Wide & Deep Model
-
-
-The Wide part on the left of the model can accommodate large-scale sparse features and can memorize specific information (such as IDs), while the Deep part on the right can learn implicit relationships between features.
-
-
-### Model Input
-
-The model has three inputs as follows.
-
-- `dnn_input`: the input of the Deep part
-- `lr_input`: the input of the Wide part
-- `click`: clicked or not
-
-```python
-dnn_merged_input = layer.data(
-    name='dnn_input',
-    type=paddle.data_type.sparse_binary_vector(self.dnn_input_dim))
-
-lr_merged_input = layer.data(
-    name='lr_input',
-    type=paddle.data_type.sparse_vector(self.lr_input_dim))
-
-click = paddle.layer.data(name='click', type=dtype.dense_vector(1))
-```
-
-### Wide part
-
-The Wide part uses the LR model, but with the activation function changed to `ReLU` to speed up training.
-
-```python
-def build_lr_submodel():
-    fc = layer.fc(
-        input=lr_merged_input, size=1, name='lr', act=paddle.activation.Relu())
-    return fc
-```
-
-### Deep part
-
-The Deep part uses a standard multi-layer DNN.
-
-```python
-def build_dnn_submodel(dnn_layer_dims):
-    dnn_embedding = layer.fc(input=dnn_merged_input, size=dnn_layer_dims[0])
-    _input_layer = dnn_embedding
-    for i, dim in enumerate(dnn_layer_dims[1:]):
-        fc = layer.fc(
-            input=_input_layer,
-            size=dim,
-            act=paddle.activation.Relu(),
-            name='dnn-fc-%d' % i)
-        _input_layer = fc
-    return _input_layer
-```
-
-### Combine
-
-The output section uses the `sigmoid` function to produce a prediction value in (0, 1).
-
-```python
-# combine DNN and LR submodels
-def combine_submodels(dnn, lr):
-    merge_layer = layer.concat(input=[dnn, lr])
-    fc = layer.fc(
-        input=merge_layer,
-        size=1,
-        name='output',
-        # use the sigmoid function to approximate the CTR, which is a float value between 0 and 1.
-        act=paddle.activation.Sigmoid())
-    return fc
-```
-
-### Training
-```python
-dnn = build_dnn_submodel(dnn_layer_dims)
-lr = build_lr_submodel()
-output = combine_submodels(dnn, lr)
-
-# ==============================================================================
-# cost and train period
-# ==============================================================================
-classification_cost = paddle.layer.multi_binary_label_cross_entropy_cost(
-    input=output, label=click)
-
-
-paddle.init(use_gpu=False, trainer_count=11)
-
-params = paddle.parameters.create(classification_cost)
-
-optimizer = paddle.optimizer.Momentum(momentum=0)
-
-trainer = paddle.trainer.SGD(
-    cost=classification_cost, parameters=params, update_equation=optimizer)
-
-dataset = AvazuDataset(train_data_path, n_records_as_test=test_set_size)
-
-
-def event_handler(event):
-    if isinstance(event, paddle.event.EndIteration):
-        if event.batch_id % 100 == 0:
-            logging.warning("Pass %d, Samples %d, Cost %f" % (
-                event.pass_id, event.batch_id * batch_size, event.cost))
-
-        if event.batch_id % 1000 == 0:
-            result = trainer.test(
-                reader=paddle.batch(dataset.test, batch_size=1000),
-                feeding=field_index)
-            logging.warning("Test %d-%d, Cost %f" %
-                            (event.pass_id, event.batch_id, result.cost))
-
-
-trainer.train(
-    reader=paddle.batch(
-        paddle.reader.shuffle(dataset.train, buf_size=500),
-        batch_size=batch_size),
-    feeding=field_index,
-    event_handler=event_handler,
-    num_passes=100)
-```
-
-## Run training and testing
-Training and testing go through the following steps:
-
-1. Prepare training data
- 1. Download train.gz from [Kaggle CTR](https://www.kaggle.com/c/avazu-ctr-prediction/data) .
- 2. Unzip train.gz to get train.txt
- 3. `mkdir -p output; python avazu_data_processer.py --data_path train.txt --output_dir output --num_lines_to_detect 1000 --test_set_size 100` to generate the demo data.
-2. Execute `python train.py --train_data_path ./output/train.txt --test_data_path ./output/test.txt --data_meta_file ./output/data.meta.txt --model_type=0` to start training.
-
-The argument options for `train.py` are as follows.
-
-```
-usage: train.py [-h] --train_data_path TRAIN_DATA_PATH
- [--test_data_path TEST_DATA_PATH] [--batch_size BATCH_SIZE]
- [--num_passes NUM_PASSES]
- [--model_output_prefix MODEL_OUTPUT_PREFIX] --data_meta_file
- DATA_META_FILE --model_type MODEL_TYPE
-
-PaddlePaddle CTR example
-
-optional arguments:
- -h, --help show this help message and exit
- --train_data_path TRAIN_DATA_PATH
- path of training dataset
- --test_data_path TEST_DATA_PATH
- path of testing dataset
- --batch_size BATCH_SIZE
- size of mini-batch (default:10000)
- --num_passes NUM_PASSES
- number of passes to train
- --model_output_prefix MODEL_OUTPUT_PREFIX
- prefix of path for model to store (default:
- ./ctr_models)
- --data_meta_file DATA_META_FILE
- path of data meta info file
- --model_type MODEL_TYPE
- model type, classification: 0, regression 1 (default
- classification)
-```
-
-- `train_data_path` : The path of the training set
-- `test_data_path` : The path of the testing set
-- `num_passes` : The number of passes to train
-- `data_meta_file` : The path of the data meta info file; see the data format description above
-- `model_type` : Classification (0) or regression (1)
-
-
-## Use the trained model for prediction
-The trained model can be used to predict new data; the input format is as follows.
-
-
-```
-# <dnn input ids>\t<lr input sparse values>
-1 23 190 \t 230:0.12 3421:0.9 23451:0.12
-23 231 \t 1230:0.12 13421:0.9
-```
-
-The only difference from the training data is that there are no labels (i.e. no `click` values).
-
-We can now use `infer.py` to perform inference.
-
-```
-usage: infer.py [-h] --model_gz_path MODEL_GZ_PATH --data_path DATA_PATH
- --prediction_output_path PREDICTION_OUTPUT_PATH
- [--data_meta_path DATA_META_PATH] --model_type MODEL_TYPE
-
-PaddlePaddle CTR example
-
-optional arguments:
- -h, --help show this help message and exit
- --model_gz_path MODEL_GZ_PATH
- path of model parameters gz file
- --data_path DATA_PATH
- path of the dataset to infer
- --prediction_output_path PREDICTION_OUTPUT_PATH
- path to output the prediction
- --data_meta_path DATA_META_PATH
- path of trainset's meta info, default is ./data.meta
- --model_type MODEL_TYPE
- model type, classification: 0, regression 1 (default
- classification)
-```
-
-- `model_gz_path` : path of the `gz` compressed model file
-- `data_path` : path of the dataset to infer
-- `prediction_output_path` : path to output the predictions
-- `data_meta_path` : path of the trainset's meta info file; see the data format description above
-- `model_type` : classification (0) or regression (1)
-
-The sample data can be predicted with the following command:
-
-```
-python infer.py --model_gz_path --data_path output/infer.txt --prediction_output_path predictions.txt --data_meta_path data.meta.txt
-```
-
-The final predictions are written to `predictions.txt`.
-
-## References
-1. <https://en.wikipedia.org/wiki/Click-through_rate>
-2. <https://www.kaggle.com/c/avazu-ctr-prediction/data>
-3. Cheng H T, Koc L, Harmsen J, et al. [Wide & deep learning for recommender systems](https://arxiv.org/pdf/1606.07792.pdf)[C]//Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016: 7-10.
-
-
-# Deep Structured Semantic Models (DSSM)
-Deep Structured Semantic Models (DSSM) is a simple but powerful DNN-based model for matching web search queries with URL-based documents. This example demonstrates how to use PaddlePaddle to implement a generic DSSM model for modeling the semantic similarity between two strings.
-
-## Background Introduction
-DSSM \[[1](#References)\] is a classic semantic model proposed by Microsoft Research. It measures the semantic distance between two texts. Typical applications of DSSM include:
-
-1. CTR prediction: measuring the degree of association between a user search query and a candidate web page.
-2. Text relevance: measuring the degree of semantic correlation between two strings.
-3. Automatic recommendation: measuring the degree of association between a user and a recommended item.
-
-
-## Model Architecture
-
-In the original paper \[[1](#References)\], the DSSM model uses the implicit semantic relation between the user search query and the document as the metric. The model structure is as follows:
-
-
-
-Figure 1. DSSM in the original paper
-
-
-
-Subsequent work simplified the DSSM structure \[[3](#References)\], and the model becomes:
-
-
-
-Figure 2. DSSM generic structure
-
-
-The blank box in the figure can be replaced by any model, such as a fully connected network (FC), a convolutional network (CNN), or an RNN. The structure is designed to measure the semantic distance between two elements (such as strings).
-
-In practice, the DSSM model serves as a basic building block, combined with different loss functions to achieve specific tasks, for example:
-
-- In a ranking system, a pairwise rank loss function.
-- In CTR estimation, a cross-entropy loss for binary classification on the click.
-- In a regression model, cosine similarity is used to compute the similarity.
-
-## Model Implementation
-At a high level, the DSSM model is composed of three components: the left DNN, the right DNN, and a loss function on top of them. In complex tasks the structures of the left and right DNN can differ; in this example we keep the two structures the same, and any of FC, CNN, and RNN can be chosen as the DNN architecture.
-
-PaddlePaddle supports loss functions for classification, regression, and ranking. In all of them, the distance between the left and right DNN outputs is computed by cosine similarity; in the classification task, the predicted distribution is computed by softmax.
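
As an illustration of the distance computation, the cosine similarity between the two DNN output vectors can be sketched with numpy (a toy sketch, not the PaddlePaddle layer):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between the left and right DNN output vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

left = np.array([0.2, 0.5, 0.1])
right = np.array([0.1, 0.6, 0.0])
print(cosine_similarity(left, right))
```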
-
-Here we demonstrate:
-
-- For how CNN and FC extract text information, refer to [text classification](https://github.com/PaddlePaddle/models/blob/develop/text_classification/README.md#模型详解)
-- For RNN / GRU, see [Machine Translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.md#gated-recurrent-unit-gru)
-- For pairwise rank learning, refer to [learn to rank](https://github.com/PaddlePaddle/models/blob/develop/ltr/README.md)
-
-Figure 3 shows the general architecture for both regression and classification models.
-
-
-
-Figure 3. DSSM for REGRESSION or CLASSIFICATION
-
-
-The structure of the Pairwise Rank is more complex, as shown in Figure 4.
-
-
-
-Figure 4. DSSM for Pairwise Rank
-
-
-Below, we describe how to train the DSSM model in PaddlePaddle. All the code is included in `./network_conf.py`.
-
-
-### Create a word vector table for the text
-```python
-def create_embedding(self, input, prefix=''):
-    """
-    Create word embedding. The `prefix` is added in front of the name of
-    the embedding's learnable parameter.
-    """
-    logger.info("Create embedding table [%s] whose dimension is %d" %
-                (prefix, self.dnn_dims[0]))
-    emb = paddle.layer.embedding(
-        input=input,
-        size=self.dnn_dims[0],
-        param_attr=ParamAttr(name='%s_emb.w' % prefix))
-    return emb
-```
-
-Since the input to the embedding table is the list of IDs of the words in a sentence, it outputs the corresponding sequence of word vectors.
-
-### CNN implementation
-```python
-def create_cnn(self, emb, prefix=''):
-    """
-    A multi-layer CNN.
-    :param emb: The word embedding.
-    :type emb: paddle.layer
-    :param prefix: The prefix added to the layers' names.
-    :type prefix: str
-    """
-
-    def create_conv(context_len, hidden_size, prefix):
-        key = "%s_%d_%d" % (prefix, context_len, hidden_size)
-        conv = paddle.networks.sequence_conv_pool(
-            input=emb,
-            context_len=context_len,
-            hidden_size=hidden_size,
-            # set parameter attrs for parameter sharing
-            context_proj_param_attr=ParamAttr(name=key + "contex_proj.w"),
-            fc_param_attr=ParamAttr(name=key + "_fc.w"),
-            fc_bias_attr=ParamAttr(name=key + "_fc.b"),
-            pool_bias_attr=ParamAttr(name=key + "_pool.b"))
-        return conv
-
-    conv_3 = create_conv(3, self.dnn_dims[1], "cnn")
-    conv_4 = create_conv(4, self.dnn_dims[1], "cnn")
-    return conv_3, conv_4
-```
-
-The CNN takes the word vector sequence from the embedding table, processes it with convolution and pooling, and finally outputs a semantic vector.
-
-### RNN implementation
-
-An RNN is suitable for learning from sequences of variable length.
-
-```python
-def create_rnn(self, emb, prefix=''):
-    """
-    A GRU sentence vector learner.
-    """
-    gru = paddle.networks.simple_gru(
-        input=emb,
-        size=self.dnn_dims[1],
-        mixed_param_attr=ParamAttr(name='%s_gru_mixed.w' % prefix),
-        mixed_bias_param_attr=ParamAttr(name="%s_gru_mixed.b" % prefix),
-        gru_param_attr=ParamAttr(name='%s_gru.w' % prefix),
-        gru_bias_attr=ParamAttr(name="%s_gru.b" % prefix))
-    sent_vec = paddle.layer.last_seq(gru)
-    return sent_vec
-```
-
-### FC implementation
-
-```python
-def create_fc(self, emb, prefix=''):
-    """
-    A multi-layer fully connected neural network.
-    :param emb: The output of the embedding layer
-    :type emb: paddle.layer
-    :param prefix: A prefix will be added to the layers' names.
-    :type prefix: str
-    """
-    _input_layer = paddle.layer.pooling(
-        input=emb, pooling_type=paddle.pooling.Max())
-    fc = paddle.layer.fc(
-        input=_input_layer,
-        size=self.dnn_dims[1],
-        param_attr=ParamAttr(name='%s_fc.w' % prefix),
-        bias_attr=ParamAttr(name="%s_fc.b" % prefix))
-    return fc
-```
-
-In `create_fc`, we use `paddle.layer.pooling` to apply max pooling over the word vector sequence, which transforms the variable-length sequence into a vector of fixed dimension.
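
The effect of the max pooling step can be illustrated with a small numpy sketch (toy values; in the model the pooling is done by `paddle.layer.pooling`):

```python
import numpy as np

# A sequence of 3 word vectors of dimension 4; max pooling keeps the
# per-dimension maximum, giving one fixed-size vector for any sequence length.
seq = np.array([[0.1, 0.5, -0.2, 0.0],
                [0.4, 0.2,  0.3, -0.1],
                [0.0, 0.9, -0.5, 0.2]])
pooled = seq.max(axis=0)
print(pooled)  # [0.4 0.9 0.3 0.2]
```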
-
-### Multi-layer DNN implementation
-
-```python
-def create_dnn(self, sent_vec, prefix):
-    if len(self.dnn_dims) > 1:
-        _input_layer = sent_vec
-        for id, dim in enumerate(self.dnn_dims[1:]):
-            name = "%s_fc_%d_%d" % (prefix, id, dim)
-            fc = paddle.layer.fc(
-                input=_input_layer,
-                size=dim,
-                act=paddle.activation.Tanh(),
-                param_attr=ParamAttr(name='%s.w' % name),
-                bias_attr=ParamAttr(name='%s.b' % name))
-            _input_layer = fc
-    return _input_layer
-```
-
-### Classification / Regression
-The structures of the classification and regression tasks are similar, so the same function can be used for both.
-Please check `_build_classification_or_regression_model` in [network_conf.py](https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for the detailed implementation.
-
-### Pairwise Rank
-
-Please check the function `_build_rank_model` in [network_conf.py]( https://github.com/PaddlePaddle/models/blob/develop/dssm/network_conf.py) for implementation.
-
-## Data Format
-Below are simple examples of the data in `./data`.
-
-### Regression data format
-```
-# 3 fields each line:
-# - source's word ids
-# - target's word ids
-# - target
-<source's word ids>\t<target's word ids>\t<target>
-```
-
-An example in this format:
-
-```
-3 6 10 \t 6 8 33 \t 0.7
-6 0 \t 6 9 330 \t 0.03
-```
-
-### Classification data format
-```
-# 3 fields each line:
-# - source's word ids
-# - target's word ids
-# - target
-<source's word ids>\t<target's word ids>\t<target>
-```
-# Globally Normalized Reader
-
-This model implements the work in the following paper:
-
-Jonathan Raiman and John Miller. Globally Normalized Reader. Empirical Methods in Natural Language Processing (EMNLP), 2017.
-
-If you use the dataset/code in your research, please cite the above paper:
-
-```text
-@inproceedings{raiman2015gnr,
- author={Raiman, Jonathan and Miller, John},
- booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
- title={Globally Normalized Reader},
- year={2017},
-}
-```
-
-You can also visit https://github.com/baidu-research/GloballyNormalizedReader to get more information.
-
-
-# Installation
-
-1. Please use [docker image](http://doc.paddlepaddle.org/develop/doc/getstarted/build_and_install/docker_install_en.html) to install the latest PaddlePaddle, by running:
- ```bash
- docker pull paddledev/paddle
- ```
-2. Download all necessary data by running:
- ```bash
- cd data && ./download.sh && cd ..
- ```
-3. Preprocess and featurize the data:
- ```bash
- python featurize.py --datadir data --outdir data/featurized --glove-path data/glove.840B.300d.txt
- ```
-
-# Training a Model
-
-- Configure the model by modifying `config.py` if needed, and then run:
-
- ```bash
- python train.py 2>&1 | tee train.log
- ```
-
-# Inferring by a Trained Model
-
-- Infer by a trained model by running:
- ```bash
- python infer.py \
- --model_path models/pass_00000.tar.gz \
- --data_dir data/featurized/ \
- --batch_size 2 \
- --use_gpu 0 \
- --trainer_count 1 \
- 2>&1 | tee infer.log
- ```
-
-
-
-Each non-leaf node of the binary tree is a binary (sigmoid) classifier: if the predicted class is 0 we move to the left child, otherwise to the right child, until a leaf is reached. In this way every class corresponds to one path; for example, the path from the root to class 1 is encoded as 0, 1. During training we follow the path of the true class, compute the loss of each classifier along the path, and combine these losses into the final loss. During prediction the model outputs the probability at every non-leaf classifier; from these probabilities we obtain the path code, and traversing the path yields the final predicted class. The computational complexity of a traditional softmax is N (the dictionary size), while Hsigmoid reduces it to log(N); for the theoretical details, see the paper \[[1](#references)\].
-
-## Data Preparation
-### PTB Data
-This tutorial uses the Penn Treebank (PTB) dataset ([Tomas Mikolov's preprocessed version](http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz)), which contains three files: train, valid, and test. We use train as the training data and valid as the test data. We train a 5-gram model, i.e. the first 4 words of each sample are used to predict the 5th word. PaddlePaddle provides the python package [paddle.dataset.imikolov](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/imikolov.py) for the PTB dataset, which downloads and preprocesses the data automatically. Preprocessing adds the begin symbol `<s>` and the end symbol `<e>` to each sentence, then slides a window of the given size (5 here) over the sentence from left to right, generating one sample per position. For example, "I have a dream that one day" generates `<s> I have a dream`, `I have a dream that`, `have a dream that one`, `a dream that one day`, and `dream that one day <e>`. Finally PaddlePaddle converts the words into ids as the output of preprocessing.
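
The sliding-window generation described above can be sketched in a few lines (a toy illustration, independent of `paddle.dataset.imikolov`; `<s>`/`<e>` are the begin/end markers):

```python
def ngram_windows(sentence, n=5):
    # Add the begin/end markers, then slide an n-word window over the sentence.
    words = ["<s>"] + sentence.split() + ["<e>"]
    return [" ".join(words[i - n:i]) for i in range(n, len(words) + 1)]

for sample in ngram_windows("I have a dream that one day"):
    print(sample)
```

This prints the five samples listed above, from `<s> I have a dream` to `dream that one day <e>`.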
-
-### Custom Data
-You can train the model on your own dataset. The key to a custom dataset is implementing the reader interface for data processing. The reader must produce an iterator that parses each line of the data file and returns a python list, e.g. [1, 2, 3, 4, 5], the dictionary ids of the words in one window. PaddlePaddle then converts such a list into `paddle.data_type.integer_value` as the input of the data layer. A sample wrapper is as follows:
-
-```python
-def reader_creator(filename, word_dict, n):
-    def reader():
-        with open(filename) as f:
-            UNK = word_dict['<unk>']
-            for l in f:
-                l = ['<s>'] + l.strip().split() + ['<e>']
-                if len(l) >= n:
-                    l = [word_dict.get(w, UNK) for w in l]
-                    for i in range(n, len(l) + 1):
-                        yield tuple(l[i - n:i])
-    return reader
-
-
-def train_data(filename, word_dict, n):
-    """
-    Reader interface for training data.
-
-    It returns a reader creator; each sample in the reader is a word ID tuple.
-
-    :param filename: path of data file
-    :type filename: str
-    :param word_dict: word dictionary
-    :type word_dict: dict
-    :param n: sliding window size
-    :type n: int
-    """
-    return reader_creator(filename, word_dict, n)
-```
-
-## Network Structure
-In this tutorial we obtain word vectors by training an N-gram language model; concretely, the previous 4 words are used to predict the current word. The network input is the ids of the words in the dictionary; the embedding table is queried to obtain the word vectors; the 4 word vectors are concatenated and passed through a fully connected hidden layer, followed by the `Hsigmoid` layer. The detailed network structure is shown in Figure 2:
-
-
-
-Figure 2. Network configuration
-
-
-The code is as follows:
-
-```python
-def ngram_lm(hidden_size, embed_size, dict_size, gram_num=4, is_train=True):
-    emb_layers = []
-    embed_param_attr = paddle.attr.Param(
-        name="_proj", initial_std=0.001, learning_rate=1, l2_rate=0)
-    for i in range(gram_num):
-        word = paddle.layer.data(
-            name="__word%02d__" % (i),
-            type=paddle.data_type.integer_value(dict_size))
-        emb_layers.append(
-            paddle.layer.embedding(
-                input=word, size=embed_size, param_attr=embed_param_attr))
-
-    target_word = paddle.layer.data(
-        name="__target_word__", type=paddle.data_type.integer_value(dict_size))
-
-    embed_context = paddle.layer.concat(input=emb_layers)
-
-    hidden_layer = paddle.layer.fc(
-        input=embed_context,
-        size=hidden_size,
-        act=paddle.activation.Sigmoid(),
-        layer_attr=paddle.attr.Extra(drop_rate=0.5),
-        bias_attr=paddle.attr.Param(learning_rate=2),
-        param_attr=paddle.attr.Param(
-            initial_std=1. / math.sqrt(embed_size * 8), learning_rate=1))
-
-    return paddle.layer.hsigmoid(
-        input=hidden_layer,
-        label=target_word,
-        param_attr=paddle.attr.Param(name="sigmoid_w"),
-        bias_attr=paddle.attr.Param(name="sigmoid_b"))
-```
-
-Note that in PaddlePaddle the hsigmoid layer stores its learnable parameters as a matrix of size `[number of classes - 1 × hidden vector width]`. At prediction time, the hsigmoid layer must be replaced by a fully connected operation **with `sigmoid` fixed as the activation**, which outputs a matrix of size `[batch_size × number of classes - 1]` (degenerating to a vector when `batch_size = 1`). Each element of a row vector is the probability that the input vector belongs to the right child of one internal node. **When loading the parameter matrix learned by the hsigmoid layer, the fully connected operation must transpose it once.** The code snippet is as follows:
-
-```python
-return paddle.layer.mixed(
-    size=dict_size - 1,
-    input=paddle.layer.trans_full_matrix_projection(
-        hidden_layer, param_attr=paddle.attr.Param(name="sigmoid_w")),
-    act=paddle.activation.Sigmoid(),
-    bias_attr=paddle.attr.Param(name="sigmoid_b"))
-```
-The `paddle.layer.mixed` in the snippet above must take PaddlePaddle's `paddle.layer.×_projection` layers as input. `paddle.layer.mixed` sums the results of its (possibly multiple) `projection` inputs as its output. `paddle.layer.trans_full_matrix_projection` transposes the parameter $W$ when computing the matrix multiplication.
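
The shape bookkeeping of this replacement can be checked with a numpy sketch (`W`, `b`, and `h` below are random stand-ins for the trained parameters, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, num_classes = 4, 6
rng = np.random.default_rng(0)
W = rng.normal(size=(num_classes - 1, hidden_size))  # as stored by the hsigmoid layer
b = rng.normal(size=num_classes - 1)
h = rng.normal(size=(1, hidden_size))  # hidden-layer output, batch_size = 1

# The fully connected replacement multiplies by the transpose of W.
node_probs = sigmoid(h @ W.T + b)
print(node_probs.shape)  # (1, 5)
```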
-
-## Training
-Training is straightforward: simply run `python train.py`. On the first run, the program checks whether the imikolov dataset is present in the user's cache directory and downloads it automatically if not. During training, model information, mainly the training loss and the test loss, is printed every 100 iterations, and the model is saved once per pass.
-
-## Prediction
-Run on the command line:
-```bash
-python infer.py \
- --model_path "models/XX" \
- --batch_size 1 \
- --use_gpu false \
- --trainer_count 1
-```
-The arguments are as follows:
-- `model_path`: the path of the trained model. Required.
-- `batch_size`: the number of samples predicted in parallel at a time. Optional, defaults to `1`.
-- `use_gpu`: whether to use the GPU for prediction. Optional, defaults to `False`.
-- `trainer_count`: the number of threads used for prediction. Optional, defaults to `1`. **Note: the number of threads used for prediction must be greater than the number of samples predicted in parallel.**
-
-During prediction, the path code is obtained from the multiple binary-classification probabilities, and traversing the path yields the final predicted class. The logic is as follows:
-
-```python
-def decode_res(infer_res, dict_size):
-    """
-    Inferring probabilities are organized as a complete binary tree.
-    The actual labels are leaves (indices counted from the class number).
-    This function travels the paths decoded from the inferring results.
-    If the probability is > 0.5 go to the right child, otherwise go to
-    the left child.
-
-    param infer_res: inferring result
-    param dict_size: class number
-    return predict_lbls: actual classes
-    """
-    predict_lbls = []
-    infer_res = infer_res > 0.5
-    for i, probs in enumerate(infer_res):
-        idx = 0
-        result = 1
-        while idx < len(probs):
-            result <<= 1
-            if probs[idx]:
-                result |= 1
-                idx = idx * 2 + 2  # right child
-            else:
-                idx = idx * 2 + 1  # left child
-
-        predict_lbl = result - dict_size
-        predict_lbls.append(predict_lbl)
-    return predict_lbls
-```
-
-The input format of the prediction program is the same as in training. For an input such as `have a dream that one`, the program generates a set of probabilities from `have a dream that` and decodes them into the predicted word, with `one` kept as the ground truth for evaluation. The decoding function takes the predicted probabilities of a batch of samples and the dictionary size; its inner loop decodes the output probabilities of each sample by following the path according to the left-0 / right-1 rule until a leaf node is reached.
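
A compact, self-contained version of this walk on a toy tree with `dict_size = 4` (left child = 0, right child = 1):

```python
def decode(probs, dict_size):
    # Walk the complete binary tree: probability > 0.5 means go to the right
    # child (bit 1), otherwise the left child (bit 0); the leaf's code minus
    # dict_size is the predicted label.
    idx, result = 0, 1
    while idx < len(probs):
        result <<= 1
        if probs[idx] > 0.5:
            result |= 1
            idx = idx * 2 + 2  # right child
        else:
            idx = idx * 2 + 1  # left child
    return result - dict_size

print(decode([0.8, 0.1, 0.9], dict_size=4))  # 3
```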
-
-## References
-1. Morin, F., & Bengio, Y. (2005, January). [Hierarchical Probabilistic Neural Network Language Model](http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf). In Aistats (Vol. 5, pp. 246-252).
-
-
-
-The PaddlePaddle implementation of this network structure can be found in `network_conf.py`.
-
-To process a two-level (nested) time sequence, the nested sequence is first transformed into a sequence of single-level sequences, and each single-level sequence is then processed. In PaddlePaddle, `recurrent_group` is the main tool for building hierarchical models over nested sequences. Here we use two nested `recurrent_group`s: the outer one splits the paragraph into sentences, so the input received in its `step` function is a sequence of sentences; the inner one splits each sentence into words, so the input received in its `step` function is the individual (non-sequence) words.
-
-At the word level, a CNN network takes the word vectors as input and outputs the learned sentence representation; at the paragraph level, the representations of all sentences are pooled to obtain the paragraph representation.
-
-```python
-nest_group = paddle.layer.recurrent_group(
-    input=[paddle.layer.SubsequenceInput(emb), hidden_size],
-    step=cnn_cov_group)
-```
-
-
-Each single-level sequence obtained from the split is passed through a CNN network to learn its vector representation. The CNN network consists of the following parts:
-
-- **Convolution layer**: convolution in text classification is performed over the time dimension; the width of the convolution kernel matches the matrix produced by the word vector layer, and each convolution produces a "feature map". Using several kernels of different heights yields several feature maps. This example uses kernels of size 3 (red box in Figure 1) and 4 (blue box in Figure 1) by default.
-- **Max pooling layer**: max pooling is applied to each feature map separately. Since each feature map is itself a vector, max pooling simply selects the maximum element of each vector; the maxima are then concatenated into a new vector.
-- **Linear projection layer**: the pooled results of the different convolutions are concatenated into one long vector, which is linearly projected to obtain the representation vector of the single-level sequence.
-
-The CNN is implemented as follows:
-```python
-def cnn_cov_group(group_input, hidden_size):
- """
- Convolution group definition.
- :param group_input: The input of this layer.
- :type group_input: LayerOutput
-    :param hidden_size: The size of the fully connected layer.
- :type hidden_size: int
- """
- conv3 = paddle.networks.sequence_conv_pool(
- input=group_input, context_len=3, hidden_size=hidden_size)
- conv4 = paddle.networks.sequence_conv_pool(
- input=group_input, context_len=4, hidden_size=hidden_size)
-
- linear_proj = paddle.layer.fc(input=[conv3, conv4],
- size=hidden_size,
- param_attr=paddle.attr.ParamAttr(name='_cov_value_weight'),
- bias_attr=paddle.attr.ParamAttr(name='_cov_value_bias'),
- act=paddle.activation.Linear())
-
- return linear_proj
-```
-`paddle.networks.sequence_conv_pool` is a text-sequence convolution module with pooling that is already encapsulated in PaddlePaddle and can be called directly.
-
-After obtaining the representation vector of each sentence, all sentence vectors are passed through an average pooling layer to obtain the representation vector of the sample, which is then fed into a fully connected layer to produce the final prediction. The code is as follows:
-```python
-avg_pool = paddle.layer.pooling(input=nest_group,
- pooling_type=paddle.pooling.Avg(),
- agg_level=paddle.layer.AggregateLevel.TO_NO_SEQUENCE)
-
-prob = paddle.layer.mixed(size=class_num,
- input=[paddle.layer.full_matrix_projection(input=avg_pool)],
- act=paddle.activation.Softmax())
-```
-## Installing Dependencies
-```bash
-pip install -r requirements.txt
-```
-
-## Specifying Training Configuration Parameters
-
-Training and model configuration parameters are modified through the `config.py` script, which contains detailed explanations of each configurable parameter. For example:
-```python
-class TrainerConfig(object):
-
- # whether to use GPU for training
- use_gpu = False
- # the number of threads used in one machine
- trainer_count = 1
-
- # train batch size
- batch_size = 32
-
- ...
-
-
-class ModelConfig(object):
-
- # embedding vector dimension
- emb_size = 28
-
- ...
-```
-Adjust the parameters by modifying `config.py`. For example, set the `use_gpu` parameter to specify whether to train on GPU.
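Since these are plain class attributes, editing `config.py` simply rebinds them; a minimal mock illustrating the idea (the values here are arbitrary, not the project's defaults):

```python
# A minimal mock of the configuration classes above.
class TrainerConfig(object):
    use_gpu = False      # whether to use GPU for training
    trainer_count = 1    # number of threads on one machine
    batch_size = 32      # training batch size

# Editing config.py amounts to changing these class attributes:
TrainerConfig.use_gpu = True
TrainerConfig.batch_size = 64
print(TrainerConfig.use_gpu, TrainerConfig.batch_size)  # True 64
```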
-
-## Running with PaddlePaddle Built-in Data
-
-### Training
-Execute in the terminal:
-```bash
-python train.py
-```
-This runs the example on `imdb`, the sentiment classification dataset built into PaddlePaddle.
-
-### Inference
-After training, the models are stored in the specified directory (`models` by default). Execute in the terminal:
-```bash
-python infer.py --model_path 'models/params_pass_00000.tar.gz'
-```
-By default, the inference script loads the model trained for one pass and tests it on the imdb test set.
-
-## Training and Inference with Custom Data
-
-### Training
-1. Data organization
-
-The input data format is as follows: each line is one sample with two fields separated by `\t`; the first field is the class label and the second field is the input text. Two example lines:
-
-```
-positive This movie is very good. The actor is so handsome.
-negative What a terrible movie. I waste so much time.
-```
-
-2. Writing the data reader
-
-A custom data reader only needs a Python generator that implements the logic of **parsing one training sample from the raw input text**. The following snippet reads the raw data and returns values of types `paddle.data_type.integer_value_sub_sequence` and `paddle.data_type.integer_value`:
-```python
-def train_reader(data_dir, word_dict, label_dict):
- """
- Reader interface for training data
-
-    :param data_dir: data directory
-    :type data_dir: str
-    :param word_dict: the word dictionary; it must
-        contain an entry for unknown words ("UNK").
-    :type word_dict: Python dict
-    :param label_dict: the label dictionary.
-    :type label_dict: Python dict
- """
-
- def reader():
-        UNK_ID = word_dict['<unk>']
- word_col = 1
- lbl_col = 0
-
- for file_name in os.listdir(data_dir):
- file_path = os.path.join(data_dir, file_name)
- if not os.path.isfile(file_path):
- continue
- with open(file_path, "r") as f:
- for line in f:
- line_split = line.strip().split("\t")
- doc = line_split[word_col]
- doc_ids = []
- for sent in doc.strip().split("."):
- sent_ids = [
- word_dict.get(w, UNK_ID)
- for w in sent.split()]
- if sent_ids:
- doc_ids.append(sent_ids)
-
- yield doc_ids, label_dict[line_split[lbl_col]]
-
- return reader
-```
-Note that in this example the English period `'.'` is used as the delimiter that splits a piece of text into sentences, and each sentence is represented as an array of dictionary indices (`sent_ids`). Since the representation of one sample (`doc_ids`) contains all sentences of the text, its type is `paddle.data_type.integer_value_sub_sequence`.
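The per-line parsing logic of the reader can be traced on a single sample; the tiny `word_dict` below is made up for illustration:

```python
# A made-up dictionary; real dictionaries are built from the training data.
word_dict = {'This': 0, 'movie': 1, 'is': 2, 'good': 3, 'So': 4, '<unk>': 5}

line = "positive\tThis movie is good. So touching."
label, doc = line.strip().split("\t")

doc_ids = []
for sent in doc.strip().split("."):
    sent_ids = [word_dict.get(w, word_dict['<unk>']) for w in sent.split()]
    if sent_ids:  # skip the empty string after the final period
        doc_ids.append(sent_ids)

print(label)    # positive
print(doc_ids)  # [[0, 1, 2, 3], [4, 5]]
```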
-
-
-3. Training with command-line arguments
-
-The `train.py` training script supports the following options:
-```
-Options:
-  --train_data_dir TEXT   The path of training dataset (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used.
-  --test_data_dir TEXT    The path of testing dataset (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used.
-  --word_dict_path TEXT   The path of word dictionary (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used. If this parameter is set, but the file does
-                          not exist, word dictionary will be built from the
-                          training data automatically.
-  --label_dict_path TEXT  The path of label dictionary (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used. If this parameter is set, but the file does
-                          not exist, label dictionary will be built from
-                          the training data automatically.
-  --model_save_dir TEXT   The path to save the trained models (default:
-                          'models').
-  --help                  Show this message and exit.
-```
-
-To run this example, modify the launch arguments of `train.py`. Taking the sample data under the `data` directory as an example, execute in the terminal:
-```bash
-python train.py \
- --train_data_dir 'data/train_data' \
- --test_data_dir 'data/test_data' \
- --word_dict_path 'word_dict.txt' \
- --label_dict_path 'label_dict.txt'
-```
-This trains the model on the sample data.
-
-### Inference
-
-1. Command-line arguments
-
-The `infer.py` inference script supports the following options:
-
-```
-Options:
-  --data_path TEXT        The path of data for inference (default: None).
-                          If this parameter is not set, imdb test dataset
-                          will be used.
-  --model_path TEXT       The path of saved model.  [required]
-  --word_dict_path TEXT   The path of word dictionary (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used.
-  --label_dict_path TEXT  The path of label dictionary (default: None). If
-                          this parameter is not set, imdb dataset will be
-                          used.
-  --batch_size INTEGER    The number of examples in one batch (default: 32).
-  --help                  Show this message and exit.
-```
-
-2. Taking the sample data under the `data` directory as an example, execute in the terminal:
-```bash
-python infer.py \
- --data_path 'data/infer.txt' \
- --word_dict_path 'word_dict.txt' \
- --label_dict_path 'label_dict.txt' \
- --model_path 'models/params_pass_00000.tar.gz'
-```
-
-This runs inference on the sample data.
-
-
-# Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering
-
-This model implements the work in the following paper:
-
-Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. [arXiv:1607.06275](https://arxiv.org/abs/1607.06275).
-
-If you use the dataset/code in your research, please cite the above paper:
-
-```text
-@article{li:2016:arxiv,
- author = {Li, Peng and Li, Wei and He, Zhengyan and Wang, Xuguang and Cao, Ying and Zhou, Jie and Xu, Wei},
- title = {Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering},
- journal = {arXiv:1607.06275v2},
- year = {2016},
- url = {https://arxiv.org/abs/1607.06275v2},
-}
-```
-
-
-## Installation
-
-1. Install PaddlePaddle v0.10.5 by the following command. Note that v0.10.0 is not supported.
- ```bash
- # either one is OK
- # CPU
- pip install paddlepaddle
- # GPU
- pip install paddlepaddle-gpu
- ```
-2. Download the [WebQA](http://idl.baidu.com/WebQA.html) dataset by running
- ```bash
- cd data && ./download.sh && cd ..
- ```
-
-## Hyperparameters
-
-All the hyperparameters are defined in `config.py`. The default values are aligned with the paper.
-
-## Training
-
-Training can be launched using the following command:
-
-```bash
-PYTHONPATH=data/evaluation:$PYTHONPATH python train.py 2>&1 | tee train.log
-```
-## Validation and Test
-
-WebQA provides two versions of validation and test sets. Automatic validation and testing can be launched by
-
-```bash
-PYTHONPATH=data/evaluation:$PYTHONPATH python val_and_test.py models [ann|ir]
-```
-
-where
-
-* `models`: the directory where model files are stored. You can use `models` if `config.py` is not changed.
-* `ann`: using the validation and test sets with annotated evidence.
-* `ir`: using the validation and test sets with retrieved evidence.
-
-Note that validation and testing can run simultaneously with training; `val_and_test.py` handles the related synchronization problems.
-
-Intermediate results are stored in the directory `tmp`. You can delete them safely after validation and test.
-
-The results should be comparable with those shown in Table 3 in the paper.
-
-## Inferring using a Trained Model
-
-Infer using a trained model by running:
-```bash
-PYTHONPATH=data/evaluation:$PYTHONPATH python infer.py \
- MODEL_FILE \
- INPUT_DATA \
- OUTPUT_FILE \
- 2>&1 | tee infer.log
-```
-
-where
-
-* `MODEL_FILE`: a trained model produced by `train.py`.
-* `INPUT_DATA`: input data in the same format as the validation/test sets of the WebQA dataset.
-* `OUTPUT_FILE`: results in the format specified in the WebQA dataset for the evaluation scripts.
-
-## Pre-trained Models
-
-We have provided two pre-trained models, one for the validation and test sets with annotated evidence, and one for those with retrieved evidence. These two models are selected according to the performance on the corresponding version of validation set, which is consistent with the paper.
-
-The models can be downloaded with
-```bash
-cd pre-trained-models && ./download-models.sh && cd ..
-```
-
-The evaluation result on the test set with annotated evidence can be achieved by
-
-```bash
-PYTHONPATH=data/evaluation:$PYTHONPATH python infer.py \
- pre-trained-models/params_pass_00010.tar.gz \
- data/data/test.ann.json.gz \
- test.ann.output.txt.gz
-
-PYTHONPATH=data/evaluation:$PYTHONPATH \
- python data/evaluation/evaluate-tagging-result.py \
- test.ann.output.txt.gz \
- data/data/test.ann.json.gz \
- --fuzzy --schema BIO2
-# The result should be
-# chunk_f1=0.739091 chunk_precision=0.686119 chunk_recall=0.800926 true_chunks=3024 result_chunks=3530 correct_chunks=2422
-```
-
-And the evaluation result on the test set with retrieved evidence can be achieved by
-
-```bash
-PYTHONPATH=data/evaluation:$PYTHONPATH python infer.py \
- pre-trained-models/params_pass_00021.tar.gz \
- data/data/test.ir.json.gz \
- test.ir.output.txt.gz
-
-PYTHONPATH=data/evaluation:$PYTHONPATH \
- python data/evaluation/evaluate-voting-result.py \
- test.ir.output.txt.gz \
- data/data/test.ir.json.gz \
- --fuzzy --schema BIO2
-# The result should be
-# chunk_f1=0.749358 chunk_precision=0.727868 chunk_recall=0.772156 true_chunks=3024 result_chunks=3208 correct_chunks=2335
-```
-
-
-# Neural Machine Translation Model
-
-## Background Introduction
-Neural Machine Translation (NMT) is a recently developed architecture for getting machines to learn to translate. Traditional machine translation methods are mainly based on phrase-based statistical approaches built from separately engineered subcomponents (rules or statistical models), whereas NMT models use deep learning and representation learning. This example describes how to construct an end-to-end NMT model with a recurrent neural network (RNN) in PaddlePaddle.
-
-## Model Overview
-RNN-based neural machine translation follows the encoder-decoder architecture. A common choice for the encoder and decoder is the recurrent neural network (RNN), used by most NMT models. Below is an example diagram of a general approach for NMT.
-
-
-Figure 1. Encoder-Decoder framework
-
-The input and output of a neural machine translation model can be characters, words, or phrases. This example illustrates word-based NMT.
-
-- **Encoder**: Encodes the source-language sentence into a vector that serves as input to the decoder. The raw input is the `id` sequence $w = {w_1, w_2, ..., w_T}$ of the words, expressed in one-hot encoding. To reduce the input dimension and to establish semantic associations between words, each one-hot vector is mapped to a word embedding (word vector). For more information about word vectors, please refer to the [word vector](https://github.com/PaddlePaddle/book/blob/develop/04.word2vec/README.cn.md) chapter of PaddleBook. Finally, the RNN unit processes the input word by word to obtain the encoding vector of the complete sentence.
-
-- **Decoder**: Accepts the output of the encoder and decodes the target-language sequence $u = {u_1, u_2, ..., u_{T'}}$ word by word. At each time step, the RNN unit outputs a hidden vector, from which the conditional probability of the next target word, $P(u_i | w, u_1, u_2, ..., u_{i-1})$, is computed by `Softmax` normalization. Thus, given the input $w$, the probability of the corresponding translation $u$ is
-
-$$P(u_1, u_2, ..., u_{T'} | w) = \prod_{t=1}^{T'} p(u_t | w, u_1, u_2, ..., u_{t-1})$$
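In log space this factorization becomes a sum, which is how implementations typically score candidate translations; a quick numeric check with made-up step probabilities:

```python
import math

# Hypothetical per-step conditionals p(u_t | w, u_1, ..., u_{t-1})
step_probs = [0.4, 0.7, 0.9]

# The product of conditionals equals the exponential of the summed logs.
log_p = sum(math.log(p) for p in step_probs)
print(math.exp(log_p))  # ~0.252
```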
-
-In Chinese to English translation, for example, the source language is Chinese, and the target language is English. The following is a sentence after the source language word segmentation.
-
-```
-祝愿 祖国 繁荣 昌盛
-```
-
-The corresponding English translation is:
-
-```
-Wish motherland rich and powerful
-```
-
-In the preprocessing step, we prepare the parallel corpus data of the source language and the target language, and construct dictionaries for both. In the training stage, the model is trained on the paired parallel corpus. In the test stage, the model automatically generates English translations, which are then evaluated against reference translations. BLEU is the most commonly used evaluation metric.
-
-### RNN unit
-The original RNN structure uses a single vector to store the hidden state. An RNN of this structure is prone to the vanishing-gradient problem, which makes it difficult to model long-range dependencies. This issue can be addressed by using LSTM \[[1](#References)] or GRU (Gated Recurrent Unit) \[[2](#References)] units, which mitigate the long-term dependency problem by selectively retaining and forgetting previous information through gating. In this example, we demonstrate a GRU-based model.
-
-
-
-Figure 2. GRU unit
-
-
-In addition to the hidden state, the GRU contains two gates: an update gate and a reset gate. At each time step, the gates and the hidden state are updated by the formulas on the right side of Figure 2; these two gates determine how the state is updated.
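For concreteness, a minimal NumPy sketch of one GRU step (bias terms omitted and parameter shapes chosen arbitrarily; this is an illustration, not PaddlePaddle's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step with input x and previous hidden state h_prev."""
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde        # interpolated new state

rng = np.random.default_rng(0)
in_dim, hid_dim = 4, 3
params = [rng.normal(size=s) for s in
          [(in_dim, hid_dim), (hid_dim, hid_dim)] * 3]
h = gru_step(rng.normal(size=in_dim), np.zeros(hid_dim), *params)
print(h.shape)  # (3,)
```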
-
-### Bi-directional Encoder
-In the basic model above, when the encoder processes the input sentence sequentially, the state at the current time step contains only information about past inputs, not about future ones. For sequence modeling, however, the future context also carries important information. With a bi-directional encoder (Figure 3), we can capture both at the same time:
-
-
-
-Figure 3. Bi-directional encoder structure
-The bi-directional encoder \[[3](#References)\] shown in Figure 3 consists of two independent RNNs that encode the input sequence in the forward and backward directions, respectively; their outputs are then combined to form the final encoding output.
-
-In PaddlePaddle, a bi-directional encoder can be easily built using the API:
-
-```python
-src_word_id = paddle.layer.data(
- name='source_language_word',
- type=paddle.data_type.integer_value_sequence(source_dict_dim))
-
-# source embedding
-src_embedding = paddle.layer.embedding(
- input=src_word_id, size=word_vector_dim)
-
-# bidirectional GRU as encoder
-encoded_vector = paddle.networks.bidirectional_gru(
- input=src_embedding,
- size=encoder_size,
- fwd_act=paddle.activation.Tanh(),
- fwd_gate_act=paddle.activation.Sigmoid(),
- bwd_act=paddle.activation.Tanh(),
- bwd_gate_act=paddle.activation.Sigmoid(),
- return_seq=True)
-```
-
-### Beam Search Algorithm
-After training is completed, the model decodes the corresponding target-language translation for a given source-language input. A direct way to decode is to take, at each step, the word with the largest conditional probability as the output of that time step. But a locally optimal choice does not necessarily lead to a globally optimal sequence, and searching the full space is prohibitively expensive. Beam search is commonly used to address this problem. It is a heuristic graph-search algorithm that controls the search width with a parameter $k$, as follows:
-
-**1**. During decoding, always maintain $k$ decoded sub-sequences;
-
-**2**. At each time step $t$, for each of the $k$ sub-sequences, compute the probability of the next word and take the $k$ words with the largest probability, combining them into $k^2$ new candidate sub-sequences;
-
-**3**. Keep the $k$ candidates with the highest probability among these combinations to update the original sub-sequences;
-
-**4**. Iterate until $k$ complete sentences are obtained as candidates for the translation result.
-
-For more information on beam search, refer to the beam search section of the [machine translation](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md) chapter in PaddleBook.
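The procedure above can be sketched over a toy next-token distribution (the transition table and special tokens `<s>`/`<e>` are made up for illustration):

```python
import math

# Hypothetical next-token distribution P(next | last token).
probs = {
    '<s>': {'a': 0.6, 'b': 0.4},
    'a':   {'a': 0.1, 'b': 0.2, '<e>': 0.7},
    'b':   {'a': 0.5, 'b': 0.1, '<e>': 0.4},
}

def beam_search(k=2, max_len=4):
    beams = [(['<s>'], 0.0)]  # (partial sequence, accumulated log prob)
    finished = []
    for _ in range(max_len):
        # Expand every kept sub-sequence by every possible next word.
        candidates = [(seq + [tok], score + math.log(p))
                      for seq, score in beams
                      for tok, p in probs[seq[-1]].items()]
        # Keep only the k most probable candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            (finished if seq[-1] == '<e>' else beams).append((seq, score))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

best_seq, best_score = beam_search()[0]
print(best_seq)  # ['<s>', 'a', '<e>']
```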
-
-
-### Decoder without Attention mechanism
-The attention mechanism has been introduced in the relevant chapter of [PaddleBook](https://github.com/PaddlePaddle/book/blob/develop/08.machine_translation/README.cn.md). This example demonstrates an Encoder-Decoder structure without the attention mechanism; for the attention mechanism, please refer to PaddleBook and reference \[[3](#References)].
-
-In PaddlePaddle, commonly used RNN units can be conveniently called using APIs. For example, `recurrent_layer_group` can be used to implement custom behavior at each step of the RNN: first define the single-step logic function, then use `recurrent_group()` to apply it to the whole sequence step by step. In this example, the attention-free decoder uses `recurrent_layer_group`, and the single-step logic is implemented by the function `gru_decoder_without_attention()`. The corresponding code is as follows:
-
-
-```python
-# the initialization state for decoder GRU
-encoder_last = paddle.layer.last_seq(input=encoded_vector)
-encoder_last_projected = paddle.layer.fc(
- size=decoder_size, act=paddle.activation.Tanh(), input=encoder_last)
-
-# the step function for decoder GRU
-def gru_decoder_without_attention(enc_vec, current_word):
- '''
- Step function for gru decoder
- :param enc_vec: encoded vector of source language
- :type enc_vec: layer object
- :param current_word: current input of decoder
- :type current_word: layer object
- '''
- decoder_mem = paddle.layer.memory(
- name="gru_decoder",
- size=decoder_size,
- boot_layer=encoder_last_projected)
-
- context = paddle.layer.last_seq(input=enc_vec)
-
- decoder_inputs = paddle.layer.fc(
- size=decoder_size * 3, input=[context, current_word])
-
- gru_step = paddle.layer.gru_step(
- name="gru_decoder",
- act=paddle.activation.Tanh(),
- gate_act=paddle.activation.Sigmoid(),
- input=decoder_inputs,
- output_mem=decoder_mem,
- size=decoder_size)
-
- out = paddle.layer.fc(
- size=target_dict_dim,
- bias_attr=True,
- act=paddle.activation.Softmax(),
- input=gru_step)
- return out
-```
-
-In the model training and testing phase, the behavior of the decoder is different:
-
-- **Training phase**: The word embeddings of the target translation, `trg_embedding`, are passed as input to the step function `gru_decoder_without_attention()`. The function `recurrent_group()` calls the step function in a loop and finally computes the cost against the actual target translation;
-- **Testing phase**: The decoder predicts the next word based on the last generated word. `GeneratedInput()` automatically fetches the embeddings of the $k$ words the model predicts with the highest probability and passes them to the step function; the `beam_search()` function calls `gru_decoder_without_attention()` to complete the beam search and returns the results.
-
-Training and generation return different layers, implemented in the following `if-else` conditional branch:
-
-```python
-group_input1 = paddle.layer.StaticInput(input=encoded_vector)
-group_inputs = [group_input1]
-
-decoder_group_name = "decoder_group"
-if is_generating:
- trg_embedding = paddle.layer.GeneratedInput(
- size=target_dict_dim,
- embedding_name="_target_language_embedding",
- embedding_size=word_vector_dim)
- group_inputs.append(trg_embedding)
-
- beam_gen = paddle.layer.beam_search(
- name=decoder_group_name,
- step=gru_decoder_without_attention,
- input=group_inputs,
- bos_id=0,
- eos_id=1,
- beam_size=beam_size,
- max_length=max_length)
-
- return beam_gen
-else:
- trg_embedding = paddle.layer.embedding(
- input=paddle.layer.data(
- name="target_language_word",
- type=paddle.data_type.integer_value_sequence(target_dict_dim)),
- size=word_vector_dim,
- param_attr=paddle.attr.ParamAttr(name="_target_language_embedding"))
- group_inputs.append(trg_embedding)
-
- decoder = paddle.layer.recurrent_group(
- name=decoder_group_name,
- step=gru_decoder_without_attention,
- input=group_inputs)
-
- lbl = paddle.layer.data(
- name="target_language_next_word",
- type=paddle.data_type.integer_value_sequence(target_dict_dim))
- cost = paddle.layer.classification_cost(input=decoder, label=lbl)
-
- return cost
-```
-
-## Data Preparation
-The data used in this example is from [WMT14](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/), a parallel corpus for French-to-English translation. [bitexts](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/bitexts.tgz) is used as training data, and [dev + test data](http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/data/dev+test.tgz) as validation and test data. PaddlePaddle packages a reader interface for this dataset; on the first run, the program downloads it automatically, so users do not need to prepare the data manually.
-
-## Model Training and Testing
-
-### Model Training
-
-Starting model training is very simple: just execute `python train.py` in a command-line window. The `train()` function in the `train.py` script performs the following logic during the training phase:
-
-**a) Define the network, parse the network structure, initialize the model parameters.**
-
-```python
-# define the network topology.
-cost = seq2seq_net(source_dict_dim, target_dict_dim)
-parameters = paddle.parameters.create(cost)
-```
-
-**b) Set the optimization strategy for training and define the training data `reader`**
-
-```python
-# define optimization method
-optimizer = paddle.optimizer.RMSProp(
- learning_rate=1e-3,
- gradient_clipping_threshold=10.0,
- regularization=paddle.optimizer.L2Regularization(rate=8e-4))
-
-# define the trainer instance
-trainer = paddle.trainer.SGD(
- cost=cost, parameters=parameters, update_equation=optimizer)
-
-# define data reader
-wmt14_reader = paddle.batch(
- paddle.reader.shuffle(
- paddle.dataset.wmt14.train(source_dict_dim), buf_size=8192),
- batch_size=55)
-```
-
-**c) Define the event handler to print intermediate training results and save model snapshots**
-
-```python
-# define the event_handler callback
-def event_handler(event):
- if isinstance(event, paddle.event.EndIteration):
-        if not event.batch_id % 100 and event.batch_id:
-            with gzip.open(
-                    os.path.join(save_path,
-                                 "nmt_without_att_%05d_batch_%05d.tar.gz" %
-                                 (event.pass_id, event.batch_id)), "w") as f:
-                parameters.to_tar(f)
-
- if event.batch_id and not event.batch_id % 10:
- logger.info("Pass %d, Batch %d, Cost %f, %s" % (
- event.pass_id, event.batch_id, event.cost, event.metrics))
-```
-
-**d) Start training**
-
-```python
-# start training
-trainer.train(
- reader=wmt14_reader, event_handler=event_handler, num_passes=2)
-```
-
-A sample of the training output:
-
-```text
-Pass 0, Batch 0, Cost 267.674663, {'classification_error_evaluator': 1.0}
-.........
-Pass 0, Batch 10, Cost 172.892294, {'classification_error_evaluator': 0.953895092010498}
-.........
-Pass 0, Batch 20, Cost 177.989329, {'classification_error_evaluator': 0.9052488207817078}
-.........
-Pass 0, Batch 30, Cost 153.633665, {'classification_error_evaluator': 0.8643803596496582}
-.........
-Pass 0, Batch 40, Cost 168.170543, {'classification_error_evaluator': 0.8348183631896973}
-```
-
-### Generate Translation Results
-In PaddlePaddle, it is also easy to use a trained model to generate translations.
-
-1. First, modify the parameters passed to the `generate` function in the `main` function of the `generate.py` script to choose which saved model to use. The default parameters are as follows:
-
-    ```python
-    generate(
-        source_dict_dim=30000,
-        target_dict_dim=30000,
-        batch_size=20,
-        beam_size=3,
-        model_path="models/nmt_without_att_params_batch_00100.tar.gz")
-    ```
-
-2. Then, execute the `python generate.py` command in the terminal. The `generate()` function in the script executes the following code:
-
- **a) Load the test sample**
-
-    ```python
-    # load data samples for generation
-    gen_creator = paddle.dataset.wmt14.gen(source_dict_dim)
-    gen_data = []
-    for item in gen_creator():
-        gen_data.append((item[0], ))
-    ```
-
-   **b) Initialize the model and run `infer()` on the input samples to generate beam-search translation results**
-
-    ```python
-    beam_gen = seq2seq_net(source_dict_dim, target_dict_dim, True)
-    with gzip.open(init_models_path) as f:
-        parameters = paddle.parameters.Parameters.from_tar(f)
-    # prob is the prediction probabilities, and id is the predicted word.
-    beam_result = paddle.infer(
-        output_layer=beam_gen,
-        parameters=parameters,
-        input=gen_data,
-        field=['prob', 'id'])
-    ```
-
- **c) Next, load the source and target language dictionaries, convert the sentences represented by the `id` sequence into the original language and output the results.**
-
-    ```python
-    gen_sen_idx = np.where(beam_result[1] == -1)[0]
-    assert len(gen_sen_idx) == len(gen_data) * beam_size
-
-    start_pos, end_pos = 1, 0
-    for i, sample in enumerate(gen_data):
-        print(" ".join([
-            src_dict[w] for w in sample[0][1:-1]
-        ]))  # skip the start and end marks when printing the source sentence
-        for j in xrange(beam_size):
-            end_pos = gen_sen_idx[i * beam_size + j]
-            print("%.4f\t%s" % (beam_result[0][i][j], " ".join(
-                trg_dict[w] for w in beam_result[1][start_pos:end_pos])))
-            start_pos = end_pos + 2
-        print("\n")
-    ```
-
-Set the beam search width to 3 and enter a French sentence; the model then automatically generates the corresponding translation results for the test data. The output format is as follows:
-
-```text
-Elles connaissent leur entreprise mieux que personne .
--3.754819 They know their business better than anyone .
--4.445528 They know their businesses better than anyone .
--5.026885 They know their business better than anybody .
-
-```
-- The first line is the input source-language sentence.
-- Lines 2 to beam_size + 1 are the `beam_size` translation results generated by beam search.
-  - Within a row, the output is separated into two columns by `\t`: the first column is the log probability of the sentence, and the second column is the text of the translation result.
-  - The symbol `<s>` marks the beginning of a sentence, the symbol `<e>` marks the end of a sentence, and words not included in the dictionary are replaced with the symbol `<unk>`.
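The two-column rows can be parsed by splitting on the tab character; a small sketch using the sample output above:

```python
lines = [
    "Elles connaissent leur entreprise mieux que personne .",
    "-3.754819\tThey know their business better than anyone .",
    "-4.445528\tThey know their businesses better than anyone .",
]
source = lines[0]
candidates = []
for row in lines[1:]:
    log_prob, text = row.split("\t")
    candidates.append((float(log_prob), text))

# Candidates arrive best-first (log probabilities are negative).
print(candidates[0])
```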
-
-So far, we have implemented a basic machine translation model using PaddlePaddle. As we can see, PaddlePaddle provides a flexible and rich API that enables users to easily choose and use various complex network configurations. NMT itself is a rapidly developing field in which many new ideas continue to emerge. This example is a basic implementation of NMT; users can also implement more complex NMT models with PaddlePaddle.
-
-
-## References
-[1] Sutskever I, Vinyals O, Le Q V. [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)[J]. 2014, 4: 3104-3112.
-
-[2] Cho K, Van Merriënboer B, Gulcehre C, et al. [Learning phrase representations using RNN encoder-decoder for statistical machine translation](http://www.aclweb.org/anthology/D/D14/D14-1179.pdf)[C]. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1724-1734.
-
-[3] Bahdanau D, Cho K, Bengio Y. [Neural machine translation by jointly learning to align and translate](https://arxiv.org/abs/1409.0473)[C]. Proceedings of ICLR 2015, 2015.
-
-
-# Single Shot MultiBox Detector (SSD) Object Detection
-
-## Introduction
-Single Shot MultiBox Detector (SSD) is one of the newer and more accurate algorithms for detecting objects in images \[[1](#References)\]. It is characterized by fast detection speed and high detection accuracy, and it is integrated in PaddlePaddle. This example demonstrates how to use the SSD model in PaddlePaddle for object detection. We first give a brief introduction to the SSD principle, then describe how to train, evaluate, and test on the PASCAL VOC dataset, and finally how to use SSD on a custom dataset.
-
-## SSD Architecture
-SSD uses a convolutional neural network for end-to-end detection: it takes the original image as input and produces the detection results as output, without using external tools or separate processes for feature extraction. A popular base model for SSD is VGG16 \[[2](#References)\]. SSD differs from the VGG16 network model as follows:
-
-1. The final fc6 and fc7 fully connected layers are converted into convolution layers, whose parameters are obtained from the original fc6 and fc7 parameters.
-2. The parameters of the pool5 layer are changed from 2x2-s2 (kernel size 2x2, stride 2) to 3x3-s1-p1 (kernel size 3x3, stride 1, padding 1).
-3. Prediction is performed on the conv4\_3, conv7, conv8\_2, conv9\_2, conv10\_2, and pool11 layers. The main purpose of the priorbox layer is to generate a series of rectangular candidate boxes based on the input feature map. A more detailed introduction to SSD can be found in the paper \[[1](#References)\].
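As a quick check of point 2, the standard convolution output-size formula confirms that 2x2-s2 halves a feature map while 3x3-s1-p1 preserves its size (the input size 19 below is arbitrary):

```python
def out_size(n, kernel, stride, pad):
    # floor((n + 2*pad - kernel) / stride) + 1
    return (n + 2 * pad - kernel) // stride + 1

n = 19
print(out_size(n, 2, 2, 0))  # 9  (2x2-s2 roughly halves the map)
print(out_size(n, 3, 1, 1))  # 19 (3x3-s1-p1 preserves the size)
```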
-
-Below is the overall structure of the model (300x300)
-
-
-
-Figure 1. SSD network architecture
-
-
-Each box in the figure represents a convolution layer, and the last two rectangles represent the summary of each convolution layer output and the post-processing phase. Specifically, the network will output a set of candidate rectangles in the prediction phase. Each rectangle contains two types of information: the position and the category score. The network produces thousands of predictions at various scales and aspect ratios before performing non-maximum suppression, resulting in a handful of final tags.
-
-## Example Overview
-This example contains the following files:
-
-
-Table 1. Directory structure
-
-| File | Description |
-| --- | --- |
-| train.py | Training script |
-| eval.py | Evaluation script |
-| infer.py | Prediction using the trained model |
-| visual.py | Visualization of the detection results |
-| image_util.py | Common functions for image preprocessing |
-| data_provider.py | Data processing script: generates the data needed for training, evaluation, and detection |
-| config/pascal_voc_conf.py | Neural network hyperparameter configuration file |
-| data/label_list | Label list |
-| data/prepare_voc_data.py | Script to prepare the PASCAL VOC training data list |
-
-
-The training phase requires pre-processing of the data, including cropping and sampling; this is done in ```image_util.py``` and ```data_provider.py```, with the network configuration in ```config/vgg_config.py```. ```data/prepare_voc_data.py``` generates the file lists for the training and test sets; users need to download and extract the data themselves. VOC2007 and VOC2012 are used by default.
-
-## PASCAL VOC Data set
-
-### Data Preparation
-First download the dataset. VOC2007\[[3](#References)\] contains both training and test data, while VOC2012\[[4](#References)\] contains only training data. The downloaded data is stored in ```data/VOCdevkit/VOC2007``` and ```data/VOCdevkit/VOC2012```. Next, run ```data/prepare_voc_data.py``` to generate ```trainval.txt``` and ```test.txt```. The relevant function is as follows:
-
-```python
-def prepare_filelist(devkit_dir, years, output_dir):
- trainval_list = []
- test_list = []
- for year in years:
- trainval, test = walk_dir(devkit_dir, year)
- trainval_list.extend(trainval)
- test_list.extend(test)
- random.shuffle(trainval_list)
- with open(osp.join(output_dir, 'trainval.txt'), 'w') as ftrainval:
- for item in trainval_list:
- ftrainval.write(item[0] + ' ' + item[1] + '\n')
-
- with open(osp.join(output_dir, 'test.txt'), 'w') as ftest:
- for item in test_list:
- ftest.write(item[0] + ' ' + item[1] + '\n')
-```
-
-The data in ```trainval.txt``` will look like:
-
-```
-VOCdevkit/VOC2007/JPEGImages/000005.jpg VOCdevkit/VOC2007/Annotations/000005.xml
-VOCdevkit/VOC2007/JPEGImages/000007.jpg VOCdevkit/VOC2007/Annotations/000007.xml
-VOCdevkit/VOC2007/JPEGImages/000009.jpg VOCdevkit/VOC2007/Annotations/000009.xml
-```
-
-The first field is the relative path of the image file, and the second field is the relative path of the corresponding label file.
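These two fields can be recovered with a one-line split; a minimal sketch (the `parse_file_list_line` helper is ours, for illustration only, not part of the example's scripts):

```python
def parse_file_list_line(line):
    """Split one line of trainval.txt / test.txt into (image_path, annotation_path)."""
    image_path, annotation_path = line.strip().split(' ')
    return image_path, annotation_path

line = "VOCdevkit/VOC2007/JPEGImages/000005.jpg VOCdevkit/VOC2007/Annotations/000005.xml"
image_path, annotation_path = parse_file_list_line(line)
```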
-
-
-### To Use Pre-trained Model
-We also provide a pre-trained VGG-16 model with good performance. To use the model, download the file http://paddlepaddle.bj.bcebos.com/model_zoo/detection/ssd_model/vgg_model.tar.gz and place it at ```vgg/vgg_model.tar.gz```.
-
-### Training
-Next, run ```python train.py``` to train the model. Note that this example only supports the CUDA GPU environment and cannot be trained on CPU only, mainly because CPU-only training would be very slow.
-
-```python
-paddle.init(use_gpu=True, trainer_count=4)
-data_args = data_provider.Settings(
- data_dir='./data',
- label_file='label_list',
- resize_h=cfg.IMG_HEIGHT,
- resize_w=cfg.IMG_WIDTH,
- mean_value=[104,117,124])
-train(train_file_list='./data/trainval.txt',
- dev_file_list='./data/test.txt',
- data_args=data_args,
- init_model_path='./vgg/vgg_model.tar.gz')
-```
-
-Below is a description of this script:
-
-1. Call ```paddle.init``` with 4 GPUs.
-2. ```data_provider.Settings()``` passes the configuration parameters. In the ```config/vgg_config.py``` setting, 300x300 is a typical configuration that balances accuracy and efficiency; it can be extended to 512x512 by modifying the configuration file.
-3. In the ```train()``` function, ```train_file_list``` specifies the training data list, ```dev_file_list``` specifies the evaluation data list, and ```init_model_path``` specifies the location of the pre-trained model.
-4. During training, log information is printed: for each batch, the current pass number, the current batch cost, and the mAP (mean Average Precision) are output. After each pass, a model is saved to the default save directory ```checkpoints``` (which needs to be created in advance).
-
-The following shows the mAP curve of SSD300x300 on the PASCAL VOC data set.
-
-
-
-Figure 2. SSD300x300 mAP convergence curve
-
-
-
-### Model Assessment
-Next, run ```python eval.py``` to evaluate the model.
-
-```python
-paddle.init(use_gpu=True, trainer_count=4) # use 4 gpus
-
-data_args = data_provider.Settings(
- data_dir='./data',
- label_file='label_list',
- resize_h=cfg.IMG_HEIGHT,
- resize_w=cfg.IMG_WIDTH,
- mean_value=[104, 117, 124])
-
-eval(
- eval_file_list='./data/test.txt',
- batch_size=4,
- data_args=data_args,
- model_path='models/pass-00000.tar.gz')
-```
-
-### Object Detection
-Run ```python infer.py``` to perform the object detection using the trained model.
-
-```python
-infer(
- eval_file_list='./data/infer.txt',
- save_path='infer.res',
- data_args=data_args,
- batch_size=4,
- model_path='models/pass-00000.tar.gz',
- threshold=0.3)
-```
-
-
-Here ```eval_file_list``` specifies the image path list, and ```save_path``` specifies the directory in which to save the prediction results.
-
-
-```text
-VOCdevkit/VOC2007/JPEGImages/006936.jpg 12 0.997844 131.255611777 162.271582842 396.475315094 334.0
-VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.998557 229.160234332 49.5991278887 314.098775387 312.913876176
-VOCdevkit/VOC2007/JPEGImages/006936.jpg 14 0.372522 187.543615699 133.727034628 345.647156239 327.448492289
-...
-```
-
-Each result line contains four fields separated by tabs: the path of the detected image, the category of the detected bounding box, the confidence score, and the four coordinate values (separated by spaces).
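For downstream processing, each result line can be parsed back into these fields; a minimal sketch assuming the tab-separated layout described above (the helper name is ours, not part of the example):

```python
def parse_detection_line(line):
    """Parse one tab-separated detection result line into its four fields."""
    image_path, label, score, coords = line.strip().split('\t')
    xmin, ymin, xmax, ymax = (float(v) for v in coords.split(' '))
    return image_path, int(label), float(score), (xmin, ymin, xmax, ymax)

line = ("VOCdevkit/VOC2007/JPEGImages/006936.jpg\t14\t0.998557\t"
        "229.160234332 49.5991278887 314.098775387 312.913876176")
image_path, label, score, box = parse_detection_line(line)
```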
-
-Below is an example after running ```python visual.py``` to visualize the model results. By default, the visualized images are saved in ```./visual_res```.
-
-
-
-
-
-
-Figure 3. SSD300x300 Visualization Example
-
-
-
-## To Use a Custom Data Set
-In PaddlePaddle, training an SSD model on a custom data set is also easy: just provide the input in the format that ```train.txt``` expects. Below is the recommended structure of ```train.txt```.
-
-```text
-image00000_file_path image00000_annotation_file_path
-image00001_file_path image00001_annotation_file_path
-image00002_file_path image00002_annotation_file_path
-...
-```
-
-The first column is the image file path, and the second column is the path of the corresponding annotation file. If the annotations use the XML file format, ```data_provider.py``` can process the data as follows.
-
-```python
-bbox_labels = []
-root = xml.etree.ElementTree.parse(label_path).getroot()
-for object in root.findall('object'):
- bbox_sample = []
- # start from 1
- bbox_sample.append(float(settings.label_list.index(
- object.find('name').text)))
- bbox = object.find('bndbox')
- difficult = float(object.find('difficult').text)
- bbox_sample.append(float(bbox.find('xmin').text)/img_width)
- bbox_sample.append(float(bbox.find('ymin').text)/img_height)
- bbox_sample.append(float(bbox.find('xmax').text)/img_width)
- bbox_sample.append(float(bbox.find('ymax').text)/img_height)
- bbox_sample.append(difficult)
- bbox_labels.append(bbox_sample)
-```
-
-Now suppose the annotation data (e.g. image00000\_annotation\_file\_path) is as follows:
-
-```text
-label1 xmin1 ymin1 xmax1 ymax1
-label2 xmin2 ymin2 xmax2 ymax2
-...
-```
-
-Here each row corresponds to one object and has 5 fields. The first is the label (note that 0 is reserved for the background, so labels need to be numbered from 1), and the remaining four are the coordinates.
-
-```python
-bbox_labels = []
-with open(label_path) as flabel:
- for line in flabel:
- bbox_sample = []
- bbox = [float(i) for i in line.strip().split()]
- label = bbox[0]
- bbox_sample.append(label)
- bbox_sample.append(bbox[1]/float(img_width))
- bbox_sample.append(bbox[2]/float(img_height))
- bbox_sample.append(bbox[3]/float(img_width))
- bbox_sample.append(bbox[4]/float(img_height))
- bbox_sample.append(0.0)
- bbox_labels.append(bbox_sample)
-```
-
-Another important thing is that when the image size or the object sizes change, the network structure configuration needs to change accordingly. Use ```config/vgg_config.py``` to create a custom configuration file. For more details, please refer to \[[1](#References)\].
-
-## References
-1. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. [SSD: Single shot multibox detector](https://arxiv.org/abs/1512.02325). European conference on computer vision. Springer, Cham, 2016.
-2. Simonyan, Karen, and Andrew Zisserman. [Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556). arXiv preprint arXiv:1409.1556 (2014).
-3. [The PASCAL Visual Object Classes Challenge 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html)
-4. [Visual Object Classes Challenge 2012 (VOC2012)](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html)
-
-
-
-The PaddlePaddle implementation of this CNN structure can be found in the `convolution_net` function in `network_conf.py`. The model consists of the following parts:
-
-- **Word embedding layer**: as in the DNN model, this layer maps words into fixed-dimension vectors, using the distance between vectors to represent the degree of semantic relatedness between words. As shown in Figure 2, each word vector is defined as a row vector, and the row vectors of all the words in the text are stacked into a matrix. For example, if the word vector dimension is 5 and the sentence "The cat sat on the red mat" contains 7 words, the resulting matrix has dimensions 7×5. For more information about word embeddings, please refer to the [word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec) chapter of PaddleBook.
-
-- **Convolution layer**: in text classification, convolution is performed over the time dimension, i.e. the width of the convolution kernel equals the width of the matrix produced by the word embedding layer, and the convolution slides along the height of the matrix. The result of a convolution is called a "feature map". If the kernel height is $h$, the matrix height is $N$, and the convolution stride is 1, the resulting feature map is a vector of height $N+1-h$. Multiple kernels of different heights can be used simultaneously to obtain multiple feature maps.
-
-- **Max pooling layer**: max pooling is applied to each feature map separately. Since each feature map is itself a vector, max pooling here simply selects the largest element of each vector. The maximum elements are then concatenated into a new vector whose dimension obviously equals the number of feature maps, i.e. the number of convolution kernels. For example, suppose four different kernels produce the feature maps `[2,3,5]`, `[8,2,1]`, `[5,7,7,6]`, and `[4,5,1,8]` (their sizes differ because the kernel heights differ). Max pooling over these four feature maps yields `[5]`, `[8]`, `[7]`, and `[8]`; concatenating the pooling results gives `[5,8,7,8]`.
-
-- **Fully connected and output layer**: the max-pooling result is passed through a fully connected layer to produce the output. As in the DNN model, the number of neurons in the final output layer equals the number of sample classes, and the outputs sum to 1.
-
-The CNN takes the same input data type as the DNN. PaddlePaddle provides a ready-made text sequence convolution module with pooling, `paddle.networks.sequence_conv_pool`, which can be called directly. Its `context_len` parameter specifies the length of text covered by a convolution kernel at one time, i.e. the kernel height in Figure 2; `hidden_size` specifies the number of kernels of that type. By default, this example uses 128 kernels of size 3 and 128 kernels of size 4; after max pooling and concatenation, these convolutions produce a 256-dimensional vector, which is passed through a fully connected layer to produce the final prediction.
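The feature-map sizes and the max-pooling step described above can be checked with a few lines of plain Python (toy numbers only, assuming a stride of 1):

```python
# A kernel of height h sliding with stride 1 over a matrix of height N
# yields a feature map of length N + 1 - h.
N = 7  # a 7-word sentence
feature_map_lengths = [N + 1 - h for h in (3, 4)]  # kernel heights 3 and 4

# Max pooling over each feature map, then concatenation of the maxima:
feature_maps = [[2, 3, 5], [8, 2, 1], [5, 7, 7, 6], [4, 5, 1, 8]]
pooled = [max(fm) for fm in feature_maps]
print(feature_map_lengths, pooled)  # -> [5, 4] [5, 8, 7, 8]
```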
-
-## Running with PaddlePaddle Built-in Data
-
-### Training
-
-Run the command `sh run.sh` in the terminal to run this example directly on PaddlePaddle's built-in sentiment classification data set `paddle.dataset.imdb`. You will see output like the following:
-
-```text
-Pass 0, Batch 0, Cost 0.696031, {'__auc_evaluator_0__': 0.47360000014305115, 'classification_error_evaluator': 0.5}
-Pass 0, Batch 100, Cost 0.544438, {'__auc_evaluator_0__': 0.839249312877655, 'classification_error_evaluator': 0.30000001192092896}
-Pass 0, Batch 200, Cost 0.406581, {'__auc_evaluator_0__': 0.9030032753944397, 'classification_error_evaluator': 0.2199999988079071}
-Test at Pass 0, {'__auc_evaluator_0__': 0.9289745092391968, 'classification_error_evaluator': 0.14927999675273895}
-```
-A log line is printed every 100 batches. The output includes: (1) the pass number; (2) the batch number; (3) the evaluation results of the metrics on the current batch, in order. The evaluation metrics are specified when configuring the network topology; in the output above, the AUC and the classification error rate on the training set are printed.
-
-### Prediction
-
-After training, the model is stored in the current working directory by default. Run `python infer.py` in the terminal; the prediction script loads the trained model and makes predictions.
-
-- By default, it loads the DNN model produced by training for one pass on `paddle.dataset.imdb.train` and tests it on `paddle.dataset.imdb.test`.
-
-You will see output like the following:
-
-```text
-positive 0.9275 0.0725 previous reviewer gave a much better of the films plot details than i could what i recall mostly is that it was just so beautiful in every sense emotionally visually just br if you like movies that are wonderful to look at and also have emotional content to which that beauty is relevant i think you will be glad to have seen this extraordinary and unusual work of br on a scale of 1 to 10 id give it about an the only reason i shy away from 9 is that it is a mood piece if you are in the mood for a really artistic very romantic film then its a 10 i definitely think its a mustsee but none of us can be in that mood all the time so overall
-negative 0.0300 0.9700 i love scifi and am willing to put up with a lot scifi are usually and i tried to like this i really did but it is to good tv scifi as 5 is to star trek the original silly cheap cardboard sets stilted dialogues cg that doesnt match the background and painfully onedimensional characters cannot be overcome with a scifi setting im sure there are those of you out there who think 5 is good scifi tv its not its clichéd and while us viewers might like emotion and character development scifi is a genre that does not take itself seriously star trek it may treat important issues yet not as a serious philosophy its really difficult to care about the characters here as they are not simply just missing a of life their actions and reactions are wooden and predictable often painful to watch the makers of earth know its rubbish as they have to always say gene earth otherwise people would not continue watching must be turning in their as this dull cheap poorly edited watching it without breaks really brings this home of a show into space spoiler so kill off a main character and then bring him back as another actor all over again
-```
-
-Each line of the output is the prediction result for one sample, with 3 columns separated by `\t`: (1) the predicted class label; (2) the probabilities of the sample belonging to each class, separated internally by spaces; (3) the input text.
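A prediction line in this format can be split back into its three columns; a minimal sketch (the helper is illustrative, not part of `infer.py`):

```python
def parse_prediction_line(line):
    """Split one prediction line into (label, class_probabilities, text)."""
    label, probs, text = line.strip().split('\t')
    return label, [float(p) for p in probs.split(' ')], text

line = "positive\t0.9275 0.0725\tan example input text"
label, probs, text = parse_prediction_line(line)
```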
-
-## Training and Predicting with Custom Data
-
-### Training
-
-1. Data organization
-
-   Assume the training data is in the following format: one sample per line, with two columns separated by `\t`. The first column is the class label and the second column is the input text, in which the words are separated by spaces. Two example lines:
-
-   ```
-   positive        PaddlePaddle is good
-   negative        What a terrible weather
-   ```
-
-2. Writing the data reader
-
-   A custom data reader only needs a Python generator that implements the logic of **parsing one training sample from the raw input text**. The following snippet reads the raw data and returns the 2 inputs to the 2 `data_layer`s defined in the network, with types `paddle.data_type.integer_value_sequence` (the indices of the words in the dictionary) and `paddle.data_type.integer_value` (the class label).
- ```python
- def train_reader(data_dir, word_dict, label_dict):
- def reader():
- UNK_ID = word_dict[""]
- word_col = 0
- lbl_col = 1
-
- for file_name in os.listdir(data_dir):
- with open(os.path.join(data_dir, file_name), "r") as f:
- for line in f:
- line_split = line.strip().split("\t")
- word_ids = [
- word_dict.get(w, UNK_ID)
- for w in line_split[word_col].split()
- ]
- yield word_ids, label_dict[line_split[lbl_col]]
-
- return reader
- ```
-
-   - For the input data types accepted by PaddlePaddle's `data_layer`, and the corresponding data format the reader should return, please refer to the [input-types](http://www.paddlepaddle.org/release_doc/0.9.0/doc_cn/ui/data_provider/pydataprovider2.html#input-types) section.
-   - The snippet above is from the `reader.py` script in this example's directory; `reader.py` also provides the complete code for reading the test data.
-
-   Next, simply pass the reader function `train_reader` as an argument to the `paddle.batch` interface in the `train.py` script to read data through the custom interface, as follows:
-
- ```python
- train_reader = paddle.batch(
- paddle.reader.shuffle(
- reader.train_reader(train_data_dir, word_dict, lbl_dict),
- buf_size=1000),
- batch_size=batch_size)
- ```
-
-3. Modifying the command-line arguments
-
-   - If the data is organized in the same format as the example data, you only need to modify the `train.py` launch arguments in the `run.sh` script to set the `train_data_dir` parameter; the example can then be run directly without modifying the data reader `reader.py`.
-   - Run `python train.py --help` for a detailed description of each launch argument of the `train.py` script. The main arguments are:
-     - `nn_type`: the model to use; currently "dnn" and "cnn" are supported.
-     - `train_data_dir`: the directory containing the training data. This argument must be set to train on custom data; otherwise `paddle.dataset.imdb` is used for training, and the `test_data_dir`, `word_dict`, and `label_dict` arguments are ignored.
-     - `test_data_dir`: the directory containing the test data; if not set, no testing is performed.
-     - `word_dict`: the path of the dictionary file; if not set, a dictionary is built automatically from the word frequencies in the training data.
-     - `label_dict`: the class label dictionary, which maps string class labels to integer indices.
-     - `batch_size`: the number of samples per forward run and backward update of the network.
-     - `num_passes`: the number of passes to train.
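For illustration, the arguments listed above could be declared with `argparse`; this is a hypothetical sketch of such a parser, not the actual one in `train.py` (the defaults shown are our own assumptions):

```python
import argparse

# Hypothetical parser mirroring the launch arguments described above.
parser = argparse.ArgumentParser()
parser.add_argument("--nn_type", choices=["dnn", "cnn"], default="dnn")
parser.add_argument("--train_data_dir", default=None)  # None -> use paddle.dataset.imdb
parser.add_argument("--test_data_dir", default=None)   # None -> no testing
parser.add_argument("--word_dict", default=None)       # None -> build dict from word frequencies
parser.add_argument("--label_dict", default=None)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--num_passes", type=int, default=10)

args = parser.parse_args(["--nn_type", "cnn", "--batch_size", "64"])
```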
-
-### Prediction
-
-1. Modify the following variables in `infer.py` to specify the model to use and the test data.
-
-   ```python
-   model_path = "dnn_params_pass_00000.tar.gz" # path of the trained model
-   nn_type = "dnn" # type of model to test
-   test_dir = "./data/test" # directory containing the test files
-   word_dict = "./data/dict/word_dict.txt" # path of the word dictionary
-   label_dict = "./data/dict/label_dict.txt" # path of the class label dictionary
-   ```
-2. Run `python infer.py` in the terminal.
-
-