# Natural Language Processing (NLP) Application

<!-- TOC -->

- [Natural Language Processing (NLP) Application](#natural-language-processing-nlp-application)
    - [Overview](#overview)
    - [Preparation and Design](#preparation-and-design)
        - [Downloading the Dataset](#downloading-the-dataset)
        - [Determining Evaluation Criteria](#determining-evaluation-criteria)
        - [Determining the Network and Process](#determining-the-network-and-process)
    - [Implementation](#implementation)
        - [Importing Library Files](#importing-library-files)
        - [Configuring Environment Information](#configuring-environment-information)
        - [Preprocessing the Dataset](#preprocessing-the-dataset)
        - [Defining the Network](#defining-the-network)
        - [Pre-Training](#pre-training)
        - [Defining the Optimizer and Loss Function](#defining-the-optimizer-and-loss-function)
        - [Training and Saving the Model](#training-and-saving-the-model)
        - [Validating the Model](#validating-the-model)
    - [Experiment Result](#experiment-result)

<!-- /TOC -->

<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced_use/nlp_application.md" target="_blank"><img src="../_static/logo_source.png"></a>

## Overview

Sentiment classification is a subset of text classification in NLP, and is one of the most basic applications of NLP. It is the process of analyzing and inferring affective states and subjective information, that is, determining whether a person's sentiment is positive or negative.

> Generally, sentiments are classified into three categories: positive, negative, and neutral. In most cases, only positive and negative sentiments are used for training regardless of the neutral sentiments. The following dataset is a good example.

[20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) is a typical reference dataset for traditional text classification. It is a collection of approximately 20,000 news documents partitioned across 20 different newsgroups.
Some of the newsgroups are very closely related to each other (such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (such as misc.forsale and soc.religion.christian).

As far as the network itself is concerned, the network structure of text classification is roughly similar to that of sentiment classification. After mastering how to construct the sentiment classification network, it is easy to construct a similar network which can be used in a text classification task with some parameter adjustments.

In a business context, text classification analyzes the objective content discussed in a text, whereas sentiment classification looks for the viewpoint expressed in it. Take the review "Forrest Gump has a clear theme and smooth pacing, which is excellent." Text classification assigns this sentence to the "movie" topic, while sentiment classification uses it to determine whether the reviewer's sentiment is positive or negative.

Compared with traditional text classification, sentiment classification is simpler and more practical. High-quality datasets can be collected on common shopping websites and movie websites to benefit the business domains. For example, based on the domain context, the system can automatically analyze opinions of specific types of customers on the current product, analyze sentiments by subject and user type, and recommend products based on the analysis result, improving the conversion rate and bringing more business benefits.

In special fields, some non-polar words also fully express a sentimental tendency of a user. For example, when an app is downloaded and used, "the app is stuck" and "the download speed is so slow" express users' negative sentiments. In the stock market, "bullish" and "bull market" express users' positive sentiments. Therefore, in essence, we hope that the model can be used to mine special expressions in the vertical field as polarity words for the sentiment classification system.

Vertical polarity word = General polarity word + Domain-specific polarity word

According to the text processing granularity, sentiment analysis can be divided into word, phrase, sentence, paragraph, and chapter levels. A sentiment analysis at paragraph level is used as an example. The input is a paragraph, and the output is information about whether the movie review is positive or negative.

## Preparation and Design
### Downloading the Dataset

The IMDb movie review dataset is used as experimental data.
> Dataset download address: <http://ai.stanford.edu/~amaas/data/sentiment/>

The following are cases of negative and positive reviews.

| Review  | Label  | 
|---|---|
| "Quitting" may be as much about exiting a pre-ordained identity as about drug withdrawal. As a rural guy coming to Beijing, class and success must have struck this young artist face on as an appeal to separate from his roots and far surpass his peasant parents' acting success. Troubles arise, however, when the new man is too new, when it demands too big a departure from family, history, nature, and personal identity. The ensuing splits, and confusion between the imaginary and the real and the dissonance between the ordinary and the heroic are the stuff of a gut check on the one hand or a complete escape from self on the other.  |  Negative |  
| This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it's like they're almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most.  | Positive  |

Download the GloVe file and add the following line at the beginning of the file. It states that a total of 400,000 words are read and that each word is represented by a 300-dimensional word vector.
```
400000 300
```
GloVe file download address: <http://nlp.stanford.edu/data/glove.6B.zip>
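
Adding the header line by hand is awkward for a file of this size, so the step can be scripted. The following is a minimal sketch, not part of the sample code: the helper name `prepend_header` and the file path you pass to it are assumptions, and the helper streams the file through a temporary copy so the whole embedding table is never held in memory.

```python
import os
import shutil
import tempfile


def prepend_header(glove_file, vocab_size=400000, dim=300):
    """Insert '<vocab_size> <dim>' as the first line of glove_file if it is missing.

    Hypothetical helper for illustration; returns True if the file was modified.
    """
    header = "{} {}\n".format(vocab_size, dim)
    with open(glove_file, "r", encoding="utf-8") as src:
        first_line = src.readline()
        if first_line == header:
            return False  # header already present, nothing to do
        # Write header + original content to a temporary file in the same directory.
        target_dir = os.path.dirname(os.path.abspath(glove_file))
        fd, tmp_path = tempfile.mkstemp(dir=target_dir)
        with os.fdopen(fd, "w", encoding="utf-8") as dst:
            dst.write(header)
            dst.write(first_line)
            shutil.copyfileobj(src, dst)
    # Replace the original file only after the copy has completed.
    os.replace(tmp_path, glove_file)
    return True
```

Calling the helper a second time leaves the file untouched, so it is safe to run before every preprocessing pass.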

### Determining Evaluation Criteria

As a typical classification, the evaluation criteria of sentiment classification can be determined by referring to that of the common classification. For example, accuracy, precision, recall, and F_beta scores can be used as references.

Accuracy = Number of accurately classified samples/Total number of samples

Precision = True positives/(True positives + False positives)

Recall = True positives/(True positives + False negatives)

F1 score = (2 x Precision x Recall)/(Precision + Recall)

In the IMDb dataset, the numbers of positive and negative samples are roughly balanced, so accuracy can be used as the evaluation criterion of the classification system.
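
The formulas above can be checked on a small example. The sketch below is illustrative only: the function name `classification_metrics` and the toy labels are assumptions, and label `1` is taken to mean "positive".

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # every metric equals 0.75 for this toy input
```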


### Determining the Network and Process

Currently, MindSpore supports the SentimentNet network, which is based on the long short-term memory (LSTM) network for NLP, on GPU and CPU.

1. Load the dataset in use and process data.
2. Train the LSTM-based SentimentNet network on the data to generate a model.
    Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture suited to processing and predicting events that are separated by long intervals and delays in a time series. For details, refer to online documentation.
3. After the model is obtained, use the validation dataset to check the accuracy of the model.
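
To make the LSTM description in step 2 more concrete, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. It is not the MindSpore `nn.LSTM` implementation: the function name `lstm_step`, the gate ordering (input, forget, candidate, output), and the weight layout are assumptions chosen for readability.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """Run one LSTM time step.

    x: (input_size,) current input; h_prev, c_prev: (hidden,) previous states.
    W: (4 * hidden, input_size), U: (4 * hidden, hidden), b: (4 * hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:hidden])                  # input gate
    f = sigmoid(z[hidden:2 * hidden])         # forget gate
    g = np.tanh(z[2 * hidden:3 * hidden])     # candidate cell state
    o = sigmoid(z[3 * hidden:])               # output gate
    c = f * c_prev + i * g                    # cell state carries long-range memory
    h = o * np.tanh(c)                        # hidden state exposed to the next layer
    return h, c


# Feed a short random sequence through the cell, one step at a time.
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5
W = rng.normal(size=(4 * hidden_size, input_size))
U = rng.normal(size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(seq_len, input_size)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

The forget gate `f` decides how much of the previous cell state survives, which is what lets the network relate events that are far apart in the sequence.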

> The current sample is for the Ascend 910 AI processor. You can find the complete executable sample code at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm>
> - `src/config.py`: network configurations, including the batch size and the number of training epochs.
> - `src/dataset.py`: dataset-related definitions, including MindRecord file conversion and data preprocessing.
> - `src/imdb.py`: the utility class for parsing the IMDb dataset.
> - `src/lstm.py`: the definition of the sentiment network (`SentimentNet`).
> - `train.py`: the training script.
> - `eval.py`: the evaluation script.

## Implementation
### Importing Library Files
The following are the required public modules and MindSpore modules and library files.
```python
import argparse
import os

import numpy as np

from src.config import lstm_cfg as cfg
from src.dataset import convert_to_mindrecord
from src.dataset import lstm_create_dataset
from src.lstm import SentimentNet
from mindspore import Tensor, nn, Model, context
from mindspore.nn import Accuracy
from mindspore.train.callback import LossMonitor, CheckpointConfig, ModelCheckpoint, TimeMonitor
from mindspore.train.serialization import load_param_into_net, load_checkpoint
```

### Configuring Environment Information

1. The `argparse` module is used to pass in the information required for a run, such as the storage paths of the dataset and the GloVe file. Frequently changed settings can then be supplied on the command line, which is more flexible.
    ```python
    parser = argparse.ArgumentParser(description='MindSpore LSTM Example')
    parser.add_argument('--preprocess', type=str, default='false', choices=['true', 'false'],
                        help='whether to preprocess data.')
    parser.add_argument('--aclimdb_path', type=str, default="./aclImdb",
                        help='path where the dataset is stored.')
    parser.add_argument('--glove_path', type=str, default="./glove",
                        help='path where the GloVe is stored.')
    parser.add_argument('--preprocess_path', type=str, default="./preprocess",
                        help='path where the pre-process data is stored.')
    parser.add_argument('--ckpt_path', type=str, default="./",
                        help='the path to save the checkpoint file.')
    parser.add_argument('--pre_trained', type=str, default=None,
                        help='the pretrained checkpoint file path.')
    parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU', 'CPU'],
                        help='the target device to run, support "GPU", "CPU". Default: "GPU".')
    args = parser.parse_args()
    ```

2. Before implementing code, configure necessary information, including the environment information, execution mode, backend information, and hardware information.
   
    ```python
    context.set_context(
        mode=context.GRAPH_MODE,
        save_graphs=False,
        device_target=args.device_target)
    ```
    For details about the configuration options, see the `context.set_context` API description.

### Preprocessing the Dataset

Convert the dataset format to the MindRecord format for MindSpore to read.

```python
if args.preprocess == "true":
    print("============== Starting Data Pre-processing ==============")
    convert_to_mindrecord(cfg.embed_size, args.aclimdb_path, args.preprocess_path, args.glove_path)
```
> After the conversion succeeds, `mindrecord` files are generated in the `preprocess_path` directory. This step does not need to be repeated as long as the dataset is unchanged.

> You can find the complete definition of `convert_to_mindrecord` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/dataset.py>

> It consists of two steps:
> 1. Process the text dataset, including encoding, word segmentation, alignment, and processing the original GloVe data to adapt to the network structure.
> 2. Convert the dataset format to the MindRecord format.
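
The text-processing step (word segmentation, encoding, and alignment) can be pictured with a small, self-contained sketch. It is not the code in `src/imdb.py`: the helper names `build_vocab` and `encode`, whitespace tokenization, the `<pad>`/`<unk>` tokens, and right-padding are all assumptions made for illustration.

```python
def build_vocab(texts):
    """Map each distinct lower-cased token to an integer id."""
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids for padding and unknown words
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Segment, encode, and align one review to exactly max_len ids."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]
    ids = ids[:max_len]                                   # truncate long reviews
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short reviews


reviews = ["This movie is amazing", "This movie is boring and slow"]
vocab = build_vocab(reviews)
print(encode("This movie is amazing", vocab, max_len=6))  # [2, 3, 4, 5, 0, 0]
print(encode("an amazing movie", vocab, max_len=2))       # [1, 5]
```

Aligning every review to the same length is what allows the reviews to be batched into fixed-shape tensors for the network.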


### Defining the Network

```python
embedding_table = np.loadtxt(os.path.join(args.preprocess_path, "weight.txt")).astype(np.float32)
network = SentimentNet(vocab_size=embedding_table.shape[0],
                       embed_size=cfg.embed_size,
                       num_hiddens=cfg.num_hiddens,
                       num_layers=cfg.num_layers,
                       bidirectional=cfg.bidirectional,
                       num_classes=cfg.num_classes,
                       weight=Tensor(embedding_table),
                       batch_size=cfg.batch_size)
```
> You can find the complete definition of `SentimentNet` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/lstm.py>

### Pre-Training

The `pre_trained` parameter specifies a CheckPoint file whose weights are loaded into the network before training. It is empty by default.
```python
if args.pre_trained:
    load_param_into_net(network, load_checkpoint(args.pre_trained))
```

### Defining the Optimizer and Loss Function

The sample code for defining the optimizer and loss function is as follows:

```python
loss = nn.SoftmaxCrossEntropyWithLogits(is_grad=False, sparse=True)
opt = nn.Momentum(network.trainable_params(), cfg.learning_rate, cfg.momentum)
loss_cb = LossMonitor()
```
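
For readers who want to see what these two components compute, the following NumPy sketch reproduces the arithmetic of a sparse softmax cross-entropy loss and a plain momentum update. It is not MindSpore code: the function names are hypothetical, averaging the loss over the batch is an assumption, and the update rule shown is the standard non-Nesterov form (`v = momentum * v + grad`, `p = p - lr * v`).

```python
import numpy as np


def sparse_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels (sparse labels)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def momentum_step(param, grad, velocity, learning_rate, momentum):
    """One momentum update; returns the new parameter and velocity."""
    velocity = momentum * velocity + grad
    param = param - learning_rate * velocity
    return param, velocity


# An untrained two-class model that outputs equal logits has a loss of ln(2).
logits = np.zeros((4, 2))
labels = np.array([0, 1, 1, 0])
print(round(sparse_softmax_cross_entropy(logits, labels), 4))  # 0.6931

# Two updates with a constant gradient show the velocity building up.
p, v = 1.0, 0.0
p, v = momentum_step(p, 0.5, v, learning_rate=0.1, momentum=0.9)
p, v = momentum_step(p, 0.5, v, learning_rate=0.1, momentum=0.9)
print(round(p, 3), round(v, 3))  # 0.855 0.95
```

The value ln(2) ≈ 0.693 is a useful sanity check for binary classification: a loss near it at the first training steps indicates that the network starts from roughly uniform predictions.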

### Training and Saving the Model

Load the corresponding dataset, configure the CheckPoint generation information, and train the model using the `model.train` API.

```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Training ==============")
ds_train = lstm_create_dataset(args.preprocess_path, cfg.batch_size)
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="lstm", directory=args.ckpt_path, config=config_ck)
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
if args.device_target == "CPU":
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb], dataset_sink_mode=False)
else:
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("============== Training Success ==============")
```
> You can find the complete definition of `lstm_create_dataset` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/dataset.py>

### Validating the Model

Load the validation dataset and saved CheckPoint file, perform validation, and view the model quality.

```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Testing ==============")
ds_eval = lstm_create_dataset(args.preprocess_path, cfg.batch_size, training=False)
param_dict = load_checkpoint(args.ckpt_path)
load_param_into_net(network, param_dict)
if args.device_target == "CPU":
    acc = model.eval(ds_eval, dataset_sink_mode=False)
else:
    acc = model.eval(ds_eval)
print("============== {} ==============".format(acc))
```

## Experiment Result
After 20 epochs, the accuracy on the test set is about 84.19%.

**Training Execution**
1. Run the training code and view the running result.
    ```shell
    $ python train.py --preprocess=true --ckpt_path=./ --device_target=GPU
    ```

    As shown in the following output, the loss value decreases gradually as training progresses and reaches about 0.2855.

    ```shell
    ============== Starting Data Pre-processing ==============
    vocab_size:  252192
    ============== Starting Training ==============
    epoch: 1 step: 1, loss is 0.6935
    epoch: 1 step: 2, loss is 0.6924
    ...
    epoch: 10 step: 389, loss is 0.2675
    epoch: 10 step: 390, loss is 0.3232
    ...
    epoch: 20 step: 389, loss is 0.1354
    epoch: 20 step: 390, loss is 0.2855
    ```

2. Check the saved CheckPoint files.
   
    CheckPoint files (model files) are saved during training. You can view all saved files in the file path.

    ```shell
    $ ls ./*.ckpt
    ```

    The output is as follows:

    ```shell
    lstm-11_390.ckpt  lstm-12_390.ckpt  lstm-13_390.ckpt  lstm-14_390.ckpt  lstm-15_390.ckpt  lstm-16_390.ckpt  lstm-17_390.ckpt  lstm-18_390.ckpt  lstm-19_390.ckpt  lstm-20_390.ckpt
    ```
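
When many CheckPoint files are present, the most recent one can be selected programmatically instead of by eye. The sketch below is an illustration only: it assumes file names follow the `<prefix>-<epoch>_<step>.ckpt` pattern seen in the listing above, and `latest_checkpoint` is a hypothetical helper, not a MindSpore API. Comparing the numbers rather than the raw strings matters, because a plain string sort would place `lstm-9_390.ckpt` after `lstm-20_390.ckpt`.

```python
import re

# Assumed naming pattern, inferred from the listing: <prefix>-<epoch>_<step>.ckpt
CKPT_PATTERN = re.compile(r"^(?P<prefix>.+)-(?P<epoch>\d+)_(?P<step>\d+)\.ckpt$")


def latest_checkpoint(file_names, prefix="lstm"):
    """Return the name with the highest (epoch, step), or None if nothing matches."""
    best_key, best_name = None, None
    for name in file_names:
        match = CKPT_PATTERN.match(name)
        if match is None or match.group("prefix") != prefix:
            continue  # skip unrelated files
        key = (int(match.group("epoch")), int(match.group("step")))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


names = ["lstm-{}_390.ckpt".format(epoch) for epoch in range(11, 21)]
print(latest_checkpoint(names))  # lstm-20_390.ckpt
```

In practice the list of names could come from `os.listdir` on the directory passed as `--ckpt_path` during training.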

**Model Validation**

Load the last saved CheckPoint file and use it to validate the model on the test dataset.

```shell
$ python eval.py --ckpt_path=./lstm-20_390.ckpt --device_target=GPU
```

As shown in the following output, the sentiment analysis accuracy on the test set is about 84.19%, which is basically satisfactory.

```shell
============== Starting Testing ==============
============== {'acc': 0.8419471153846154} ==============
```