# Natural Language Processing (NLP) Application

- [Natural Language Processing (NLP) Application](#natural-language-processing-nlp-application)
    - [Overview](#overview)
    - [Preparation and Design](#preparation-and-design)
        - [Downloading the Dataset](#downloading-the-dataset)
        - [Determining Evaluation Criteria](#determining-evaluation-criteria)
        - [Determining the Network and Process](#determining-the-network-and-process)
    - [Implementation](#implementation)
        - [Importing Library Files](#importing-library-files)
        - [Configuring Environment Information](#configuring-environment-information)
        - [Preprocessing the Dataset](#preprocessing-the-dataset)
        - [Defining the Network](#defining-the-network)
        - [Pre-Training](#pre-training)
        - [Defining the Optimizer and Loss Function](#defining-the-optimizer-and-loss-function)
        - [Training and Saving the Model](#training-and-saving-the-model)
        - [Validating the Model](#validating-the-model)
    - [Experiment Result](#experiment-result)

## Overview

Sentiment classification is a subset of text classification in NLP and one of the most basic NLP applications. It is the process of analyzing and inferring affective states and subjective information, that is, determining whether a person's sentiment is positive or negative.

> Generally, sentiments are classified into three categories: positive, negative, and neutral. In most cases, only positive and negative sentiments are used for training, and neutral sentiments are left out. The following dataset is a good example.

[20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) is a typical reference dataset for traditional text classification. It is a collection of approximately 20,000 news documents partitioned across 20 different newsgroups. Some of the newsgroups are very closely related to each other (such as comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (such as misc.forsale and soc.religion.christian).
As far as the network itself is concerned, the structure used for text classification is roughly the same as that used for sentiment classification. Once you know how to construct a sentiment classification network, you can build a similar network for a text classification task with a few parameter adjustments.

In the service context, text classification analyzes the objective content of a text, whereas sentiment classification extracts a viewpoint from it. Take the sentence "Forrest Gump has a clear theme and smooth pacing, which is excellent." as an example. Text classification assigns this sentence to a "movie" topic, while sentiment classification examines whether the review expresses a positive or negative sentiment.

Compared with traditional text classification, sentiment classification is simpler and more practical. High-quality datasets can be collected from common shopping and movie websites to benefit the business domains. For example, based on the domain context, the system can automatically analyze the opinions of specific types of customers on the current product, analyze sentiments by subject and user type, and recommend products based on the analysis result, improving the conversion rate and bringing more business benefits.

In special fields, some non-polar words also clearly express a user's sentiment. For example, when an app is downloaded and used, "the app is stuck" and "the download speed is so slow" express negative sentiments. In the stock market, "bullish" and "bull market" express positive sentiments. Therefore, in essence, we hope that the model can mine such domain-specific expressions in a vertical field and use them as polarity words for the sentiment classification system.
Vertical polarity word = General polarity word + Domain-specific polarity word

According to the text processing granularity, sentiment analysis can be divided into word, phrase, sentence, paragraph, and chapter levels. This tutorial uses paragraph-level sentiment analysis as an example: the input is a paragraph, and the output indicates whether the movie review is positive or negative.

## Preparation and Design

### Downloading the Dataset

The IMDb movie review dataset is used as experimental data.

> Dataset download address:

The following are examples of negative and positive reviews.

| Review | Label |
|---|---|
| "Quitting" may be as much about exiting a pre-ordained identity as about drug withdrawal. As a rural guy coming to Beijing, class and success must have struck this young artist face on as an appeal to separate from his roots and far surpass his peasant parents' acting success. Troubles arise, however, when the new man is too new, when it demands too big a departure from family, history, nature, and personal identity. The ensuing splits, and confusion between the imaginary and the real and the dissonance between the ordinary and the heroic are the stuff of a gut check on the one hand or a complete escape from self on the other. | Negative |
| This movie is amazing because the fact that the real people portray themselves and their real life experience and do such a good job it's like they're almost living the past over again. Jia Hongsheng plays himself an actor who quit everything except music and drugs struggling with depression and searching for the meaning of life while being angry at everyone especially the people who care for him most. | Positive |

Download the GloVe file and add the following line at the beginning of the file. It indicates that a total of 400,000 words are read and that each word is represented by a 300-dimensional word vector.
```
400000 300
```

GloVe file download address:

### Determining Evaluation Criteria

As a typical classification task, sentiment classification can be evaluated with the criteria used for common classification tasks. For example, accuracy, precision, recall, and F_beta scores can be used as references.

Accuracy = Number of accurately classified samples / Total number of samples

Precision = True positives / (True positives + False positives)

Recall = True positives / (True positives + False negatives)

F1 score = (2 x Precision x Recall) / (Precision + Recall)

In the IMDb dataset, the numbers of positive and negative samples are roughly balanced, so accuracy can be used as the evaluation criterion of the classification system.

### Determining the Network and Process

Currently, MindSpore supports the SentimentNet network, which is based on the long short-term memory (LSTM) network for NLP, on GPU and CPU.

1. Load the dataset in use and process the data.
2. Train the LSTM-based SentimentNet network on the data to generate a model.

    Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture suited to processing and predicting important events that are separated by long intervals and delays in a time sequence. For details, refer to online documentation.
3. After the model is obtained, use the validation dataset to check the accuracy of the model.

> The current sample targets GPU and CPU. You can find the complete executable sample code at:
>
> - `src/config.py`: network configurations, including the batch size and number of training epochs.
> - `src/dataset.py`: dataset-related definitions, including MindRecord file conversion and data preprocessing.
> - `src/imdb.py`: the utility class for parsing the IMDb dataset.
> - `src/lstm.py`: the definition of the sentiment network.
> - `train.py`: the training script.
> - `eval.py`: the evaluation script.
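As a quick illustration of the criteria listed under Determining Evaluation Criteria, the following sketch computes accuracy, precision, recall, and F1 for binary labels in plain Python. The function name `binary_metrics` and the toy labels are made up for this example and are not part of the sample code; the sample itself relies on the `Accuracy` metric from `mindspore.nn`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy data: 1 = positive review, 0 = negative review.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(labels, predictions)
print(metrics)
```

With balanced classes, as in this toy data, accuracy and F1 tend to tell a similar story, which is why accuracy alone is a reasonable criterion for IMDb.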
## Implementation

### Importing Library Files

The following are the required public modules, MindSpore modules, and library files.

```python
import argparse
import os

import numpy as np

from src.config import lstm_cfg as cfg
from src.dataset import convert_to_mindrecord
from src.dataset import lstm_create_dataset
from src.lstm import SentimentNet
from mindspore import Tensor, nn, Model, context
from mindspore.nn import Accuracy
from mindspore.train.callback import LossMonitor, CheckpointConfig, ModelCheckpoint, TimeMonitor
from mindspore.train.serialization import load_param_into_net, load_checkpoint
```

### Configuring Environment Information

1. The `argparse` parser is used to pass in the information required for running, such as the storage paths of the dataset and the GloVe file. In this way, frequently changed configurations can be supplied on the command line, which is more flexible.

    ```python
    parser = argparse.ArgumentParser(description='MindSpore LSTM Example')
    parser.add_argument('--preprocess', type=str, default='false', choices=['true', 'false'],
                        help='whether to preprocess data.')
    parser.add_argument('--aclimdb_path', type=str, default="./aclImdb",
                        help='path where the dataset is stored.')
    parser.add_argument('--glove_path', type=str, default="./glove",
                        help='path where the GloVe is stored.')
    parser.add_argument('--preprocess_path', type=str, default="./preprocess",
                        help='path where the pre-process data is stored.')
    parser.add_argument('--ckpt_path', type=str, default="./",
                        help='the path to save the checkpoint file.')
    parser.add_argument('--pre_trained', type=str, default=None,
                        help='the pretrained checkpoint file path.')
    parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU', 'CPU'],
                        help='the target device to run, support "GPU", "CPU". Default: "GPU".')
    args = parser.parse_args()
    ```

2. Before running the code, configure the necessary information, including the environment information, execution mode, backend information, and hardware information.

    ```python
    context.set_context(
        mode=context.GRAPH_MODE,
        save_graphs=False,
        device_target=args.device_target)
    ```

    For details about the API configuration, see `context.set_context`.

### Preprocessing the Dataset

Convert the dataset format to the MindRecord format for MindSpore to read.

```python
if args.preprocess == "true":
    print("============== Starting Data Pre-processing ==============")
    convert_to_mindrecord(cfg.embed_size, args.aclimdb_path, args.preprocess_path, args.glove_path)
```

> After the conversion succeeds, the `mindrecord` files can be found under the directory `preprocess_path`. This operation usually does not need to be repeated as long as the dataset is unchanged.

> `convert_to_mindrecord`: You can find the complete definition at:
>
> It consists of two steps:
>
> 1. Process the text dataset, including encoding, word segmentation, alignment, and processing the original GloVe data to adapt to the network structure.
> 2. Convert the dataset format to the MindRecord format.
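The first preprocessing step (word segmentation, encoding, and alignment) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the actual `convert_to_mindrecord` implementation: the helper names `tokenize`, `build_vocab`, and `encode`, the `<pad>` and `<unk>` tokens, and the fixed length of 6 are all hypothetical.

```python
import re

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(text):
    """Word segmentation: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(token_lists):
    """Encoding table: map each distinct token to an integer index."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Alignment: truncate or pad the index sequence to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


reviews = ["This movie is amazing!", "Not a good movie."]
token_lists = [tokenize(review) for review in reviews]
vocab = build_vocab(token_lists)
encoded = [encode(tokens, vocab, max_len=6) for tokens in token_lists]
print(encoded)
```

Aligning every review to the same length lets the samples be stacked into fixed-shape batches. In the real pipeline, the indices would additionally have to match the rows of the GloVe-derived embedding table.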
### Defining the Network

```python
embedding_table = np.loadtxt(os.path.join(args.preprocess_path, "weight.txt")).astype(np.float32)
network = SentimentNet(vocab_size=embedding_table.shape[0],
                       embed_size=cfg.embed_size,
                       num_hiddens=cfg.num_hiddens,
                       num_layers=cfg.num_layers,
                       bidirectional=cfg.bidirectional,
                       num_classes=cfg.num_classes,
                       weight=Tensor(embedding_table),
                       batch_size=cfg.batch_size)
```

> `SentimentNet`: You can find the complete definition at:

### Pre-Training

The `pre_trained` parameter specifies a CheckPoint file to preload before training. It is empty by default.

```python
if args.pre_trained:
    load_param_into_net(network, load_checkpoint(args.pre_trained))
```

### Defining the Optimizer and Loss Function

The sample code for defining the optimizer and loss function is as follows:

```python
loss = nn.SoftmaxCrossEntropyWithLogits(is_grad=False, sparse=True)
opt = nn.Momentum(network.trainable_params(), cfg.learning_rate, cfg.momentum)
loss_cb = LossMonitor()
```

### Training and Saving the Model

Load the corresponding dataset, configure the CheckPoint generation information, and train the model using the `model.train` API.
```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Training ==============")
ds_train = lstm_create_dataset(args.preprocess_path, cfg.batch_size)
config_ck = CheckpointConfig(save_checkpoint_steps=cfg.save_checkpoint_steps,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
ckpoint_cb = ModelCheckpoint(prefix="lstm", directory=args.ckpt_path, config=config_ck)
time_cb = TimeMonitor(data_size=ds_train.get_dataset_size())
if args.device_target == "CPU":
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb], dataset_sink_mode=False)
else:
    model.train(cfg.num_epochs, ds_train, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("============== Training Success ==============")
```

> `lstm_create_dataset`: You can find the complete definition at:

### Validating the Model

Load the validation dataset and the saved CheckPoint file, perform validation, and view the model quality.

```python
model = Model(network, loss, opt, {'acc': Accuracy()})

print("============== Starting Testing ==============")
ds_eval = lstm_create_dataset(args.preprocess_path, cfg.batch_size, training=False)
param_dict = load_checkpoint(args.ckpt_path)
load_param_into_net(network, param_dict)
if args.device_target == "CPU":
    acc = model.eval(ds_eval, dataset_sink_mode=False)
else:
    acc = model.eval(ds_eval)
print("============== {} ==============".format(acc))
```

## Experiment Result

After 20 epochs, the accuracy on the test set is about 84.19%.

**Training Execution**

1. Run the training code and view the running result.

    ```shell
    $ python train.py --preprocess=true --ckpt_path=./ --device_target=GPU
    ```

    As shown in the following output, the loss value decreases gradually with the training process and reaches about 0.2855.

    ```shell
    ============== Starting Data Pre-processing ==============
    vocab_size:  252192
    ============== Starting Training ==============
    epoch: 1 step: 1, loss is 0.6935
    epoch: 1 step: 2, loss is 0.6924
    ...
    epoch: 10 step: 389, loss is 0.2675
    epoch: 10 step: 390, loss is 0.3232
    ...
    epoch: 20 step: 389, loss is 0.1354
    epoch: 20 step: 390, loss is 0.2855
    ```

2. Check the saved CheckPoint files.

    CheckPoint files (model files) are saved during the training. You can view all saved files in the file path.

    ```shell
    $ ls ./*.ckpt
    ```

    The output is as follows:

    ```shell
    lstm-11_390.ckpt  lstm-12_390.ckpt  lstm-13_390.ckpt  lstm-14_390.ckpt  lstm-15_390.ckpt  lstm-16_390.ckpt  lstm-17_390.ckpt  lstm-18_390.ckpt  lstm-19_390.ckpt  lstm-20_390.ckpt
    ```

**Model Validation**

Use the last saved CheckPoint file to load and validate the dataset.

```shell
$ python eval.py --ckpt_path=./lstm-20_390.ckpt --device_target=GPU
```

As shown in the following output, the sentiment analysis accuracy of the text is about 84.19%, which is basically satisfactory.

```shell
============== Starting Testing ==============
============== {'acc': 0.8419471153846154} ==============
```
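Choosing the last saved CheckPoint file by hand, as in the validation command, can also be scripted. The sketch below assumes that file names follow the `<prefix>-<epoch>_<step>.ckpt` pattern seen in the listing; that pattern is inferred from the sample output rather than from a documented guarantee, and `latest_checkpoint` is a hypothetical helper that is not part of the sample code.

```python
import re

# Assumed naming pattern, inferred from names such as "lstm-20_390.ckpt".
CKPT_PATTERN = re.compile(r"^(?P<prefix>.+)-(?P<epoch>\d+)_(?P<step>\d+)\.ckpt$")


def latest_checkpoint(filenames):
    """Return the name with the highest (epoch, step), or None if nothing matches."""
    best_key, best_name = None, None
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match is None:
            continue  # skip files that are not checkpoints
        # Compare numerically so that epoch 10 ranks above epoch 9.
        key = (int(match["epoch"]), int(match["step"]))
        if best_key is None or key > best_key:
            best_key, best_name = key, name
    return best_name


files = ["lstm-11_390.ckpt", "lstm-19_390.ckpt", "lstm-20_390.ckpt", "notes.txt"]
print(latest_checkpoint(files))
```

Comparing epochs and steps as integers avoids the pitfall of plain string sorting, where `lstm-9_390.ckpt` would sort after `lstm-10_390.ckpt`.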