# Natural Language Processing (NLP) Application

`GPU` `CPU` `Whole Process` `Beginner` `Intermediate` `Expert`

<!-- TOC -->

- [Natural Language Processing (NLP) Application](#natural-language-processing-nlp-application)
    - [Overview](#overview)
    - [Preparation and Design](#preparation-and-design)
        - [Downloading the Dataset](#downloading-the-dataset)
        - [Determining Evaluation Criteria](#determining-evaluation-criteria)
        - [Determining the Network and Process](#determining-the-network-and-process)
    - [Implementation](#implementation)
        - [Importing Library Files](#importing-library-files)
        - [Configuring Environment Information](#configuring-environment-information)
        - [Preprocessing the Dataset](#preprocessing-the-dataset)
        - [Defining the Network](#defining-the-network)
        - [Pre-Training](#pre-training)
        - [Defining the Optimizer and Loss Function](#defining-the-optimizer-and-loss-function)
        - [Training and Saving the Model](#training-and-saving-the-model)
        - [Validating the Model](#validating-the-model)
    - [Experiment Result](#experiment-result)

<!-- /TOC -->

<a href="https://gitee.com/mindspore/docs/blob/master/tutorials/source_en/advanced_use/nlp_application.md" target="_blank"><img src="../_static/logo_source.png"></a>

## Overview

Sentiment classification is a subset of text classification in NLP and one of the most basic NLP applications. It analyzes and infers affective states and subjective information, that is, it determines whether a person's sentiment is positive or negative.

> Generally, sentiments are classified into three categories: positive, negative, and neutral. In most cases, only positive and negative samples are used for training, and neutral samples are left out. The dataset used below is a good example.

[20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) is a typical reference dataset for traditional text classification. It is a collection of approximately 20,000 news documents partitioned across 20 different newsgroups.
Some of the newsgroups are very closely related to each other (such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (such as misc.forsale and soc.religion.christian).

In terms of network structure, text classification is roughly similar to sentiment classification. Once you know how to build a sentiment classification network, you can build a similar network for a text classification task with some parameter adjustments.

In business scenarios, text classification analyzes the objective content discussed in a text, whereas sentiment classification extracts a viewpoint from it. For example, take the sentence "Forrest Gump has a clear theme and smooth pacing, which is excellent." Text classification assigns this sentence to the "movie" topic, whereas sentiment classification determines whether the review is positive or negative.

Compared with traditional text classification, sentiment classification is simpler and more practical. High-quality datasets can be collected on common shopping websites and movie websites to benefit the business domains. For example, based on the domain context, the system can automatically analyze opinions of specific types of customers on the current product, analyze sentiments by subject and user type, and recommend products based on the analysis result, improving the conversion rate and bringing more business benefits.

In specialized fields, some non-polar words can also clearly express a user's sentiment. For example, for a downloaded app, "the app is stuck" and "the download speed is so slow" express negative sentiments. In the stock market, "bullish" and "bull market" express positive sentiments. In essence, we therefore hope that the model can mine such domain-specific expressions in a vertical field and use them as polarity words for the sentiment classification system.

Vertical polarity word = General polarity word + Domain-specific polarity word

According to the text processing granularity, sentiment analysis can be divided into word, phrase, sentence, paragraph, and chapter levels. This tutorial uses paragraph-level sentiment analysis as an example: the input is a paragraph (a movie review), and the output indicates whether the review is positive or negative.

## Preparation and Design
### Downloading the Dataset

The IMDb movie review dataset is used as experimental data.
> Dataset download address: <http://ai.stanford.edu/~amaas/data/sentiment/>

The following are examples of a negative and a positive review.

| Review  | Label  | 
|---|---|
| "Quitting" may be as much about exiting a pre-ordained identity as about drug withdrawal. As a rural guy coming to Beijing, class and success must have struck this young artist face on as an appeal to separate from his roots and far surpass his peasant parents' acting success. Troubles arise, however, when the new man is too new, when it demands too big a departure from family, history, nature, and personal identity. The ensuing splits, and confusion between the imaginary and the real and the dissonance between the ordinary and the heroic are the stuff of a gut check on the one hand or a complete escape from self on the other.  |  Negative |  
| This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it's like they're almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most.  | Positive  |

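As a rough sketch of how the extracted dataset can be read, the snippet below walks the `pos` and `neg` folders of one split and yields `(text, label)` pairs. It assumes that the archive unpacks into `aclImdb/train` and `aclImdb/test`, each with `pos` and `neg` subdirectories of `.txt` files. `read_imdb_split` is a hypothetical helper for illustration, not the parser in `src/imdb.py`.

```python
import os


def read_imdb_split(root, split="train"):
    """Yield (review_text, label) pairs; label is 1 for positive, 0 for negative."""
    # Assumed layout: <root>/<split>/pos/*.txt and <root>/<split>/neg/*.txt
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(root, split, label_name)
        for name in sorted(os.listdir(folder)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                yield f.read().strip(), label
```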
Download the GloVe file and add the following line at the beginning of the file. The line indicates that a total of 400,000 words are read and that each word is represented by a 300-dimensional word vector.
```
400000 300
```
GloVe file download address: <http://nlp.stanford.edu/data/glove.6B.zip>
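
The header line can be added by hand in an editor, but the file is large, so a small script may be more convenient. The following is a minimal sketch, not part of the sample code: `prepend_header` is a hypothetical helper that streams the original file into a temporary copy with the header in front and then replaces the original.

```python
import os
import shutil
import tempfile


def prepend_header(path, header="400000 300"):
    """Insert `header` as the first line of the file at `path` unless it is already there."""
    with open(path, encoding="utf-8") as src:
        first_line = src.readline().rstrip("\n")
        if first_line == header:
            return False  # header already present, nothing to do
        src.seek(0)
        # Write to a temporary file in the same directory, then replace the original.
        directory = os.path.dirname(os.path.abspath(path))
        with tempfile.NamedTemporaryFile("w", encoding="utf-8", dir=directory, delete=False) as dst:
            dst.write(header + "\n")
            shutil.copyfileobj(src, dst)
    os.replace(dst.name, path)
    return True
```

For example, assuming the 300-dimensional vectors were extracted to `./glove/glove.6B.300d.txt` (the path is an assumption), call `prepend_header("./glove/glove.6B.300d.txt")` once before preprocessing.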

### Determining Evaluation Criteria

Sentiment classification is a typical classification task, so its evaluation criteria can follow those of common classification tasks. For example, accuracy, precision, recall, and F_beta scores can be used as references.

Accuracy = Number of accurately classified samples/Total number of samples

Precision = True positives/(True positives + False positives)

Recall = True positives/(True positives + False negatives)

F1 score = (2 x Precision x Recall)/(Precision + Recall)

In the IMDb dataset, the numbers of positive and negative samples are roughly balanced, so accuracy can be used as the evaluation criterion of the classification system.
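
The formulas above can be checked on a handful of labels. The following sketch is independent of MindSpore; `classification_metrics` is a hypothetical helper and the labels are toy values, not IMDb results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against empty denominators when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (1 = positive review, 0 = negative review); not taken from IMDb.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
print(metrics)
```

In this toy case there are two true positives, one false positive, and one false negative, so all four metrics are equal to 2/3.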


### Determining the Network and Process

Currently, MindSpore supports the LSTM-based SentimentNet network for NLP on GPU and CPU. The process is as follows:

1. Load the dataset and preprocess the data.
2. Train the LSTM-based SentimentNet network on the data to generate a model.
    Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture suited to processing and predicting events that are separated by long intervals and delays in a time sequence. For details, refer to online documentation.
3. After the model is obtained, use the validation dataset to check the accuracy of the model.

> The current sample is for GPU and CPU. You can find the complete executable sample code at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm>
> - `src/config.py`: network configurations, including the batch size and the number of training epochs.
> - `src/dataset.py`: dataset-related definitions, including MindRecord file conversion and data preprocessing.
> - `src/imdb.py`: the utility class for parsing the IMDb dataset.
> - `src/lstm.py`: the definition of the sentiment network.
> - `train.py`: the training script.
> - `eval.py`: the evaluation script.

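To make the LSTM step in the process above more concrete, the following NumPy sketch runs a single LSTM cell over a short sequence. It is an illustration only, not the MindSpore `nn.LSTM` implementation: the gate order (input, forget, candidate, output), the weight layout, and the names `lstm_step`, `W`, `U`, and `b` are assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    x: (input_size,), h_prev and c_prev: (hidden_size,)
    W: (4 * hidden_size, input_size), U: (4 * hidden_size, hidden_size), b: (4 * hidden_size,)
    """
    hidden_size = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * hidden_size:1 * hidden_size])   # input gate
    f = sigmoid(z[1 * hidden_size:2 * hidden_size])   # forget gate
    g = np.tanh(z[2 * hidden_size:3 * hidden_size])   # candidate cell state
    o = sigmoid(z[3 * hidden_size:4 * hidden_size])   # output gate
    c = f * c_prev + i * g        # cell state carries information across long intervals
    h = o * np.tanh(c)            # hidden state passed to the next step or layer
    return h, c


rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 8, 4, 5
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size))
U = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(seq_len, input_size)):   # e.g. one word vector per time step
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)
```

The forget gate `f` decides how much of the previous cell state is kept, which is what lets the network relate words that are far apart in a review.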
## Implementation
### Importing Library Files
The following are the required public modules, MindSpore modules, and library files.
```python
import argparse
import os

import numpy as np

from src.config import lstm_cfg as cfg
from src.dataset import convert_to_mindrecord
from src.dataset import lstm_create_dataset
from src.lstm import SentimentNet
from mindspore import Tensor, nn, Model, context
from mindspore.nn import Accuracy
from mindspore.train.callback import LossMonitor, CheckpointConfig, ModelCheckpoint, TimeMonitor
from mindspore.train.serialization import load_param_into_net, load_checkpoint
```

### Configuring Environment Information

1. The `argparse` module is used to pass in the information required for running, such as the storage paths of the dataset and the GloVe file. Frequently changed settings can then be supplied on the command line, which is more flexible.
    ```python
    parser = argparse.ArgumentParser(description='MindSpore LSTM Example')
    parser.add_argument('--preprocess', type=str, default='false', choices=['true', 'false'],
                        help='whether to preprocess data.')
    parser.add_argument('--aclimdb_path', type=str, default="./aclImdb",
                        help='path where the dataset is stored.')
    parser.add_argument('--glove_path', type=str, default="./glove",
                        help='path where the GloVe is stored.')
    parser.add_argument('--preprocess_path', type=str, default="./preprocess",
                        help='path where the pre-process data is stored.')
    parser.add_argument('--ckpt_path', type=str, default="./",
                        help='the path to save the checkpoint file.')
    parser.add_argument('--pre_trained', type=str, default=None,
                        help='the pretrained checkpoint file path.')
    parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU', 'CPU'],
                        help='the target device to run, support "GPU", "CPU". Default: "GPU".')
    args = parser.parse_args()
    ```

2. Before running the code, configure the necessary information, including the environment information, execution mode, backend information, and hardware information.

    ```python
    context.set_context(
        mode=context.GRAPH_MODE,
        save_graphs=False,
        device_target=args.device_target)
    ```
    For details about the API configuration, see `context.set_context`.

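The argument handling can be tried out without MindSpore installed. The sketch below rebuilds only two of the options from the parser above and passes an explicit argument list instead of reading the command line. `build_parser` and `need_preprocess` are hypothetical names used for illustration.

```python
import argparse


def build_parser():
    """Build a reduced parser with two of the options used in the sample."""
    parser = argparse.ArgumentParser(description='MindSpore LSTM Example')
    parser.add_argument('--preprocess', type=str, default='false', choices=['true', 'false'],
                        help='whether to preprocess data.')
    parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU', 'CPU'],
                        help='the target device to run.')
    return parser


# Passing a list to parse_args() avoids depending on sys.argv.
args = build_parser().parse_args(['--preprocess', 'true', '--device_target', 'CPU'])
need_preprocess = args.preprocess == 'true'   # the flag is a string, so compare explicitly
print(need_preprocess, args.device_target)
```

Because `--preprocess` is declared with `type=str`, its value is the string `'true'` or `'false'` rather than a Boolean, which is why the sample code compares it with `"true"`.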
### Preprocessing the Dataset

Convert the dataset to the MindRecord format so that MindSpore can read it.

```python
if args.preprocess == "true":
    print("============== Starting Data Pre-processing ==============")
    convert_to_mindrecord(cfg.embed_size, args.aclimdb_path, args.preprocess_path, args.glove_path)
```
> After the conversion succeeds, the `mindrecord` files are generated in the `preprocess_path` directory. This operation does not need to be repeated as long as the dataset is unchanged.

> You can find the complete definition of `convert_to_mindrecord` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/dataset.py>

> It consists of two steps:
> 1. Process the text dataset, including encoding, word segmentation, alignment, and processing the original GloVe data to adapt to the network structure.
> 2. Convert the dataset format to the MindRecord format.

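The text-processing step (word segmentation, encoding, and alignment) can be pictured with the following simplified sketch. It is not the code in `src/dataset.py` or `src/imdb.py`: the helper names, the regular-expression tokenizer, the use of ID 0 for both padding and unknown words, and left-padding are all assumptions chosen to keep the example short.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (a simplification of real word segmentation)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(token_lists):
    """Map each word to an integer ID; 0 is reserved for padding and unknown words."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode_and_align(tokens, vocab, max_len, pad_id=0):
    """Encode tokens as IDs, then truncate or left-pad to exactly max_len."""
    ids = [vocab.get(token, pad_id) for token in tokens][:max_len]
    return [pad_id] * (max_len - len(ids)) + ids


reviews = ["This movie is amazing", "Not good, not bad"]
token_lists = [tokenize(r) for r in reviews]
vocab = build_vocab(token_lists)
features = [encode_and_align(t, vocab, max_len=6) for t in token_lists]
print(features)
```

After alignment every review has the same length, so the reviews can be stacked into fixed-shape batches for the network.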

### Defining the Network

```python
embedding_table = np.loadtxt(os.path.join(args.preprocess_path, "weight.txt")).astype(np.float32)
network = SentimentNet(vocab_size=embedding_table.shape[0],
                       embed_size=cfg.embed_size,
                       num_hiddens=cfg.num_hiddens,
                       num_layers=cfg.num_layers,
                       bidirectional=cfg.bidirectional,
                       num_classes=cfg.num_classes,
                       weight=Tensor(embedding_table),
                       batch_size=cfg.batch_size)
```
> You can find the complete definition of `SentimentNet` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/lstm.py>

### Pre-Training

The `pre_trained` parameter specifies the CheckPoint file whose parameters are loaded into the network before training. It is empty by default.
```python
if args.pre_trained:
    load_param_into_net(network, load_checkpoint(args.pre_trained))
```

### Defining the Optimizer and Loss Function

The sample code for defining the optimizer and loss function is as follows:

```python
loss = nn.SoftmaxCrossEntropyWithLogits(is_grad=False, sparse=True)
opt = nn.Momentum(network.trainable_params(), cfg.learning_rate, cfg.momentum)
loss_cb = LossMonitor()
```
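
To show what these two components compute, the following NumPy sketch implements a sparse softmax cross-entropy loss and a single momentum update. It is a hedged illustration, not MindSpore's implementation: averaging the loss over the batch and the update rule `velocity = momentum * velocity + grad` are assumptions based on the common formulation, and both function names are hypothetical.

```python
import numpy as np


def sparse_softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def momentum_step(param, grad, velocity, learning_rate, momentum):
    """One momentum update: accumulate the gradient, then move against it."""
    velocity = momentum * velocity + grad
    return param - learning_rate * velocity, velocity


logits = np.array([[2.0, 0.5], [0.2, 1.5]])   # two samples, two classes
labels = np.array([0, 1])                     # integer labels, i.e. the "sparse" form
print(sparse_softmax_cross_entropy(logits, labels))

param, velocity = np.array([1.0]), np.array([0.0])
param, velocity = momentum_step(param, np.array([0.5]), velocity, learning_rate=0.1, momentum=0.9)
print(param, velocity)
```

With `sparse=True`, the labels are class indices rather than one-hot vectors, which matches the integer labels used for positive and negative reviews.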

### Training and Saving the Model

Load the corresponding dataset, configure the CheckPoint generation information, and train the model using the `model.train` API.

```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Training ==============")
ds_train = lstm_create_dataset(args.preprocess_path, cfg.batch_size)
# Save a checkpoint every cfg.save_checkpoint_steps steps and keep at most cfg.keep_checkpoint_max files.
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="lstm", directory=args.ckpt_path, config=config_ck)
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
# Dataset sink mode is turned off when running on CPU.
if args.device_target == "CPU":
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb], dataset_sink_mode=False)
else:
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("============== Training Success ==============")
```
> You can find the complete definition of `lstm_create_dataset` at: <https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/nlp/lstm/src/dataset.py>

### Validating the Model

Load the validation dataset and saved CheckPoint file, perform validation, and view the model quality.

```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Testing ==============")
ds_eval = lstm_create_dataset(args.preprocess_path, cfg.batch_size, training=False)
# Here --ckpt_path is the checkpoint file to load, for example ./lstm-20_390.ckpt.
param_dict = load_checkpoint(args.ckpt_path)
load_param_into_net(network, param_dict)
# Dataset sink mode is turned off when running on CPU.
if args.device_target == "CPU":
    acc = model.eval(ds_eval, dataset_sink_mode=False)
else:
    acc = model.eval(ds_eval)
print("============== {} ==============".format(acc))
```

## Experiment Result
After 20 epochs, the accuracy on the test set is about 84.19%.

**Training Execution**
1. Run the training code and view the running result.
    ```shell
    $ python train.py --preprocess=true --ckpt_path=./ --device_target=GPU
    ```

    As shown in the following output, the loss value gradually decreases during training and is about 0.2855 at the last step.

    ```shell
    ============== Starting Data Pre-processing ==============
    vocab_size:  252192
    ============== Starting Training ==============
    epoch: 1 step: 1, loss is 0.6935
    epoch: 1 step: 2, loss is 0.6924
    ...
    epoch: 10 step: 389, loss is 0.2675
    epoch: 10 step: 390, loss is 0.3232
    ...
    epoch: 20 step: 389, loss is 0.1354
    epoch: 20 step: 390, loss is 0.2855
    ```

2. Check the saved CheckPoint files.
   
   CheckPoint files (model files) are saved during the training. You can view all saved files in the file path.

    ```shell
    $ ls ./*.ckpt
    ```

    The output is as follows:

    ```shell
    lstm-11_390.ckpt  lstm-12_390.ckpt  lstm-13_390.ckpt  lstm-14_390.ckpt  lstm-15_390.ckpt  lstm-16_390.ckpt  lstm-17_390.ckpt  lstm-18_390.ckpt  lstm-19_390.ckpt  lstm-20_390.ckpt
    ```
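
When many CheckPoint files are present, the most recent one can be picked programmatically instead of by eye. The sketch below assumes that the file names follow the `prefix-epoch_step.ckpt` pattern seen in the listing above. `latest_checkpoint` is a hypothetical helper, not a MindSpore API, and `lstm-9_390.ckpt` is an invented name added to show why the epoch is compared numerically rather than as text.

```python
import re


def latest_checkpoint(file_names, prefix="lstm"):
    """Return the file name with the highest (epoch, step), or None if none match."""
    pattern = re.compile(r"^%s-(\d+)_(\d+)\.ckpt$" % re.escape(prefix))
    best_key, best_name = None, None
    for name in file_names:
        match = pattern.match(name)
        if not match:
            continue
        # Compare epochs and steps as integers so that 9 sorts before 20.
        key = (int(match.group(1)), int(match.group(2)))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


names = ["lstm-11_390.ckpt", "lstm-19_390.ckpt", "lstm-20_390.ckpt", "lstm-9_390.ckpt"]
print(latest_checkpoint(names))
```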

**Model Validation**

Load the last saved CheckPoint file and validate the model on the test dataset.

```shell
$ python eval.py --ckpt_path=./lstm-20_390.ckpt --device_target=GPU
```

As shown in the following output, the sentiment classification accuracy on the test set is about 84.19%, which is basically satisfactory.

```shell
============== Starting Testing ==============
============== {'acc': 0.8419471153846154} ==============
```