Running the sample code in this directory requires PaddlePaddle v0.11.0 or later. If the PaddlePaddle version on your device is lower than this, please follow the instructions in the [installation document](http://www.paddlepaddle.org/docs/develop/documentation/en/build_and_install/pip_install_en.html) and update it.


---

# Text Classification Based on Double-Layer Sequence

## Introduction
This example demonstrates how to organize long text (usually paragraphs or chapters) into a double-layer sequence in PaddlePaddle to complete a long-text classification task.

## Model introduction
We treat a text as a sequence of sentences, and each sentence is a sequence of words.

We first use a convolutional neural network to encode each sentence in the paragraph; then we pass the representation vector of each sentence through a pooling layer to obtain the encoded vector of the paragraph; finally, the encoded vector of the paragraph is fed into the classifier (a fully connected layer with softmax activation) to obtain the final classification result.

**The model structure is shown in the figure below**
<p align="center">
<img src="images/model.jpg" width = "60%" align="center"/><br/>
Figure 1. Text classification model based on a double-layer sequence
</p>

The PaddlePaddle implementation of the network structure is in `network_conf.py`.

To process a double-layer sequence, we need to decompose it into single-layer sequences and then process each single-layer sequence. In PaddlePaddle, `recurrent_group` is the main tool for building hierarchical models that handle double-layer sequences. Here, we use two nested `recurrent_group`s. The outer `recurrent_group` decomposes the paragraph into sentences, so the input to its step function is a sequence of sentences. The inner `recurrent_group` decomposes each sentence into words, so the input to its step function is a group of non-sequential words.

At the word level, we obtain the representation of a sentence from its word vectors using a CNN. At the paragraph level, we obtain the representation of a paragraph by pooling the representations of the sentences it contains.

``` python
nest_group = paddle.layer.recurrent_group(input=[paddle.layer.SubsequenceInput(emb),
                                                 hidden_size],
                                          step=cnn_cov_group)
```
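For context, here is a minimal sketch (with assumed names; the actual definitions live in `network_conf.py`) of the layers that feed this nested group, where `dict_dim` and `emb_size` stand in for the vocabulary size and embedding dimension:

```python
# Hedged sketch, not the exact code from network_conf.py:
# `dict_dim` and `emb_size` are assumed configuration values.
import paddle.v2 as paddle

# A double-layer sequence input: a sequence of sentences,
# each of which is a sequence of word ids.
data = paddle.layer.data(
    "word", paddle.data_type.integer_value_sub_sequence(dict_dim))
emb = paddle.layer.embedding(input=data, size=emb_size)
```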

After decomposition, each single-layer sequence is encoded into a representation vector by a CNN, whose structure contains the following parts:

- **Convolution layer**: convolution in text classification is performed over the time dimension. The width of the convolution kernel equals the dimension of the word vectors. After convolution, the result is a "feature map"; multiple feature maps can be obtained by using convolution kernels of different heights. This code uses kernels of height 3 (the red box in Figure 1) and 4 (the blue box in Figure 1) by default.
- **Max pooling layer**: max pooling is performed on each feature map obtained by convolution. Since a feature map is itself a vector, max pooling simply selects the largest element of each vector. All the largest elements are then concatenated into a new vector.
- **Linear projection layer**: concatenates the results of the max pooling operations into a long vector, and applies a linear projection to obtain the representation vector of the corresponding single-layer sequence.

Implementation of CNN network:
```python
def cnn_cov_group(group_input, hidden_size):
    """
    Convolution group definition.
    :param group_input: The input of this layer.
    :type group_input: LayerOutput
    :param hidden_size: The size of the fully connected layer.
    :type hidden_size: int
    """
    conv3 = paddle.networks.sequence_conv_pool(
        input=group_input, context_len=3, hidden_size=hidden_size)
    conv4 = paddle.networks.sequence_conv_pool(
        input=group_input, context_len=4, hidden_size=hidden_size)

    linear_proj = paddle.layer.fc(input=[conv3, conv4],
                                  size=hidden_size,
                                  param_attr=paddle.attr.ParamAttr(name='_cov_value_weight'),
                                  bias_attr=paddle.attr.ParamAttr(name='_cov_value_bias'),
                                  act=paddle.activation.Linear())

    return linear_proj
```
PaddlePaddle provides `paddle.networks.sequence_conv_pool`, a module that combines sequence convolution and pooling for text, which can be called directly.

After obtaining the representation vector of each sentence, all the sentence vectors are passed through an average pooling layer to obtain a vector representation of the sample. This vector is then fed through a fully connected layer to produce the final prediction. The code:
```python
avg_pool = paddle.layer.pooling(input=nest_group,
                                pooling_type=paddle.pooling.Avg(),
                                agg_level=paddle.layer.AggregateLevel.TO_NO_SEQUENCE)

prob = paddle.layer.mixed(size=class_num,
                          input=[paddle.layer.full_matrix_projection(input=avg_pool)],
                          act=paddle.activation.Softmax())
```
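For training, this softmax output is typically paired with a label input and a classification cost; the following is a minimal sketch under the assumption of a label layer named `lbl` (not part of the snippet above):

```python
# Hedged sketch: pairing the softmax output `prob` with a training cost.
lbl = paddle.layer.data("label", paddle.data_type.integer_value(class_num))
cost = paddle.layer.classification_cost(input=prob, label=lbl)
```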
## Install dependencies
```bash
pip install -r requirements.txt
```

## Specify training configuration parameters

The training and model configuration parameters are modified through the `config.py` script, which contains detailed explanations of the configurable parameters. An example:
```python
class TrainerConfig(object):

    # whether to use GPU for training
    use_gpu = False
    # the number of threads used in one machine
    trainer_count = 1

    # train batch size
    batch_size = 32

    ...


class ModelConfig(object):

    # embedding vector dimension
    emb_size = 28

    ...
```
Modify `config.py` to adjust the parameters. For example, we can specify whether or not to use a GPU for training by modifying `use_gpu`.
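As a hedged sketch (assuming the `TrainerConfig` class above), these settings typically reach PaddlePaddle at startup through `paddle.init`:

```python
# Hedged sketch: passing TrainerConfig settings to PaddlePaddle at startup.
import paddle.v2 as paddle

paddle.init(
    use_gpu=TrainerConfig.use_gpu,
    trainer_count=TrainerConfig.trainer_count)
```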
## Train and predict with PaddlePaddle built-in data

### Train
Execute at the terminal:
```bash
python train.py
```
This runs the example with PaddlePaddle's built-in sentiment classification dataset, `imdb`.
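For reference, here is a minimal sketch (not the script's exact code) of loading the built-in IMDB data with the v0.11 API; note that this reader yields each review as one flat sequence of word ids, which still has to be split into sentences to match the nested model:

```python
# Hedged sketch of loading the built-in IMDB dataset.
import paddle.v2 as paddle

word_dict = paddle.dataset.imdb.word_dict()
train_batches = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.imdb.train(word_dict), buf_size=1000),
    batch_size=32)
```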
### Prediction
After training, the model is stored in the specified directory (the `models` directory by default). To run prediction, execute the following command:

```bash
python infer.py --model_path 'models/params_pass_00000.tar.gz'
```
The prediction script loads the model saved after the first training pass and evaluates it on the IMDB test set.
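A hedged sketch (assumed variable names; not the script's exact code) of how such a saved archive can be loaded with the v0.11 API:

```python
# Hedged sketch: loading saved parameters from a .tar.gz archive.
import gzip

import paddle.v2 as paddle

with gzip.open('models/params_pass_00000.tar.gz') as f:
    parameters = paddle.parameters.Parameters.from_tar(f)
```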

## Train and predict with custom data

### Train
1. Data format

Each line is one sample consisting of a class label and the text, separated by `\t`. The following are two samples:

```
positive        This movie is very good. The actor is so handsome.
negative        What a terrible movie. I waste so much time.
```

2. Write the data reading interface

To define a custom data reading interface, we only need to write a Python generator that **parses the input text**. The following code fragment implements the reader; it yields data of types `paddle.data_type.integer_value_sub_sequence` (the document) and `paddle.data_type.integer_value` (the label).
```python
def train_reader(data_dir, word_dict, label_dict):
    """
    Reader interface for training data

    :param data_dir: The directory of the training data.
    :type data_dir: str
    :param word_dict: The word dictionary; it must contain
        the key "<unk>" for out-of-vocabulary words.
    :type word_dict: Python dict
    :param label_dict: The label dictionary.
    :type label_dict: Python dict
    """

    def reader():
        UNK_ID = word_dict['<unk>']
        word_col = 1
        lbl_col = 0

        for file_name in os.listdir(data_dir):
            file_path = os.path.join(data_dir, file_name)
            if not os.path.isfile(file_path):
                continue
            with open(file_path, "r") as f:
                for line in f:
                    line_split = line.strip().split("\t")
                    doc = line_split[word_col]
                    doc_ids = []
                    for sent in doc.strip().split("."):
                        sent_ids = [
                            word_dict.get(w, UNK_ID)
                            for w in sent.split()]
                        if sent_ids:
                            doc_ids.append(sent_ids)

                    yield doc_ids, label_dict[line_split[lbl_col]]

    return reader
```
Note that in this case the English period `'.'` is used as the sentence separator: the text is split into sentences, and each sentence is represented as an array of indices into the word dictionary (`sent_ids`). Since the representation of a sample (`doc_ids`) contains all the sentences of the text, its type is `paddle.data_type.integer_value_sub_sequence`.
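A hedged usage sketch (assuming `word_dict` and `label_dict` have already been built, and `batch_size` taken from `config.py`) of wiring this reader into shuffled mini-batches for the trainer:

```python
# Hedged sketch: turning the custom reader into shuffled mini-batches.
train_batches = paddle.batch(
    paddle.reader.shuffle(
        train_reader('data/train_data', word_dict, label_dict),
        buf_size=1000),
    batch_size=TrainerConfig.batch_size)
```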

3. Specify command-line parameters for training

`train.py` contains the following parameters:
```
Options:
  --train_data_dir TEXT   The path of training dataset (default: None). If
                          this parameter is not set, imdb dataset will be
                          used.
  --test_data_dir TEXT    The path of testing dataset (default: None). If this
                          parameter is not set, imdb dataset will be used.
  --word_dict_path TEXT   The path of word dictionary (default: None). If this
                          parameter is not set, imdb dataset will be used. If
                          this parameter is set, but the file does not exist,
                          word dictionary will be built from the training data
                          automatically.
  --label_dict_path TEXT  The path of label dictionary (default: None). If this
                          parameter is not set, imdb dataset will be used. If
                          this parameter is set, but the file does not exist,
                          label dictionary will be built from the training data
                          automatically.
  --model_save_dir TEXT   The path to save the trained models (default:
                          'models').
  --help                  Show this message and exit.
```

Pass the startup parameters to the `train.py` script to run this example directly. Taking the sample data in the `data` directory as an example, execute at the terminal:
```bash
python train.py \
  --train_data_dir 'data/train_data'  \
  --test_data_dir 'data/test_data' \
  --word_dict_path 'word_dict.txt' \
  --label_dict_path 'label_dict.txt'
```
This trains the model on the sample data.

### Prediction

1. Specify command-line parameters

`infer.py` contains the following parameters:

```
Options:
  --data_path TEXT        The path of data for inference (default: None). If
                          this parameter is not set, imdb test dataset will be
                          used.
  --model_path TEXT       The path of saved model.  [required]
  --word_dict_path TEXT   The path of word dictionary (default: None). If this
                          parameter is not set, imdb dataset will be used.
  --label_dict_path TEXT  The path of label dictionary (default: None). If this
                          parameter is not set, imdb dataset will be used.
  --batch_size INTEGER    The number of examples in one batch (default: 32).
  --help                  Show this message and exit.
```

2. Take the sample data in the `data` directory as an example and execute at the terminal:
```bash
python infer.py \
  --data_path 'data/infer.txt' \
  --word_dict_path 'word_dict.txt' \
  --label_dict_path 'label_dict.txt' \
  --model_path 'models/params_pass_00000.tar.gz'
```

This runs prediction on the sample data.