Commit c62683d0 authored by wangxiao1021

add sentiment classification

Parent f7668155
## Introduction
Sentiment is a high-level form of human intelligence, and identifying the sentiment polarity of a text requires deep semantic modeling. Moreover, sentiment is expressed differently across domains (e.g. catering, sports), so model training needs large-scale data covering a wide range of domains. We address both problems with deep-learning-based semantic models and large-scale data mining. For evaluation, we report results on the open-source sentiment classification dataset ChnSentiCorp:
| Model | dev | test |
| :------| :------ | :------ |
| CNN | 90.6% | 89.7% |
| BOW | 90.1% | 90.3% |
| GRU | 90.0% | 91.1% |
| BIGRU | 89.7% | 89.6% |
For the dygraph (dynamic graph) documentation, see [Dygraph](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/user_guides/howto/dygraph/DyGraph.html).
## Quick Start
This project requires PaddlePaddle 1.7.0 or later; please follow the [installation guide](http://www.paddlepaddle.org/#quick-start) to install it.
Python 2.7, or Python 3.5 and above, is required.
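A quick way to confirm that the installed PaddlePaddle meets the version requirement (a minimal sketch, assuming PaddlePaddle has already been installed via the guide above):
```python
# Minimal installation check; assumes paddlepaddle is already installed.
import paddle
import paddle.fluid as fluid

print(paddle.__version__)        # should print 1.7.0 or later
fluid.install_check.run_check()  # runs a small program to verify the installation works
```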
#### Code download and environment variables
Clone the repository and set the `PYTHONPATH` environment variable:
```bash
git clone https://github.com/PaddlePaddle/hapi
cd hapi
export PYTHONPATH=$PYTHONPATH:`pwd`
cd examples/sentiment_classification
```
#### Data preparation
Download the preprocessed data. After extracting the archive, the senta_data directory contains the training set (train.tsv), the development set (dev.tsv), the test set (test.tsv), and the corresponding vocabulary (word_dict.txt):
```shell
wget https://baidu-nlp.bj.bcebos.com/sentiment_classification-dataset-1.0.0.tar.gz
tar -zxvf sentiment_classification-dataset-1.0.0.tar.gz
```
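A quick sanity check on the extracted files (a minimal sketch, assuming the archive was unpacked in the current directory):
```python
# Quick look at the downloaded data; assumes the archive was extracted into ./senta_data.
import io

with io.open('./senta_data/word_dict.txt', 'r', encoding='utf8') as f:
    vocab_words = [line.strip() for line in f]
print(len(vocab_words), 'words in the vocabulary')

with io.open('./senta_data/train.tsv', 'r', encoding='utf8') as f:
    header = f.readline().strip()
print('train.tsv header:', header)  # expected: text_a<TAB>label
```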
#### Model training
With the example dataset, run the command below to train a model on the training set (train.tsv) and validate it on the development set (dev.tsv). Before training, manually create the directory where the model will be saved and point the `checkpoints` option at it.
Choose `model_type` from `bow_net`, `cnn_net`, `gru_net`, and `bigru_net`.
All model-related parameters are set in `senta.yaml`; to train, make sure the `do_train` option in `senta.yaml` is set to `True`.
```shell
python sentiment_classifier.py
```
#### Model prediction
With a trained model, run the command below to predict labels for the unlabeled data (test.tsv).
To run prediction, make sure the `do_infer` option in `senta.yaml` is set to `True`.
```shell
python sentiment_classifier.py
```
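Predictions are written to `predictions.json` under `output_dir`, one JSON object per line with `index`, `label` (0 = negative, 1 = positive), and `probs` fields (see `infer()` in `sentiment_classifier.py` below). A minimal sketch for inspecting the output, assuming `output_dir` keeps its default value `./output`:
```python
# Read the per-example predictions written by sentiment_classifier.py.
# Assumes output_dir was left at its default "./output".
import json

with open('./output/predictions.json') as f:
    for line in f:
        record = json.loads(line)
        # record['label'] is the predicted class, record['probs'] the two class probabilities.
        print(record['index'], record['label'], record['probs'])
```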
#### Model parameters
Model parameters are configured in `senta.yaml` (a short example of loading them follows the list):
1. `batch_size`: choose according to the model and GPU memory usage; a larger batch size is recommended for cnn/bow and a smaller one for gru/bigru.
2. `padding_size`: defaults to 150.
3. `epoch`: defaults to 5 for training and 1 for inference.
4. `lr` (learning rate): defaults to 0.002.
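The training and prediction scripts load these settings with `hapi.configure.Config`, as shown in `sentiment_classifier.py` below; a minimal sketch of inspecting the parsed configuration, assuming `senta.yaml` is in the working directory:
```python
# Load senta.yaml the same way sentiment_classifier.py does and inspect a few fields.
from hapi.configure import Config

args = Config(yaml_file='./senta.yaml')
args.build()
args.Print()  # dump the full configuration
print(args.model_type, args.batch_size, args.lr)  # e.g. bow_net 20 0.002
```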
## Advanced Usage
#### Task definition
Traditional sentiment classification relies mainly on lexicons or feature engineering. Such methods require tedious manual feature design and prior knowledge, stay at a shallow level of understanding, and generalize poorly. To avoid these limitations, we adopt deep learning, which has advanced rapidly in recent years. Deep-learning-based sentiment classification does not depend on hand-crafted features: it understands the input text end to end and predicts sentiment polarity from the learned semantic representation.
#### Model overview
For the sentiment classification task, this project provides the following models (a minimal construction sketch follows the list):
+ CNN (Convolutional Neural Network): a basic sequence model that handles variable-length input and extracts features within local regions;
+ BOW (Bag of Words): a non-sequential model built from simple fully connected layers;
+ GRU (Gated Recurrent Unit): a sequence model that handles long-distance dependencies in text well;
+ BI-GRU (Bidirectional Gated Recurrent Unit): a sequence model with a bidirectional, two-layer GRU structure that better captures the semantics of a sentence.
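The mapping from `model_type` to these classes mirrors the dispatch in `sentiment_classifier.py` below; a minimal sketch, run in dygraph mode and assuming the vocabulary size, batch size, and padding size from `senta.yaml`:
```python
# Build one of the four networks defined in models.py, keyed by model_type.
import paddle.fluid as fluid
from models import CNN, BOW, GRU, BiGRU

fluid.enable_dygraph()  # the layers are dygraph modules, so enable dygraph mode first

nets = {'cnn_net': CNN, 'bow_net': BOW, 'gru_net': GRU, 'bigru_net': BiGRU}
vocab_size, batch_size, padding_size = 33256, 20, 150  # values taken from senta.yaml
model = nets['bow_net'](vocab_size, batch_size, padding_size)
```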
#### Data format
Training, prediction, and evaluation data can be organized by the user to match the actual application scenario. Each line consists of two tab-separated columns: the first column is the Chinese text, tokenized with words separated by spaces (the word segmentation preprocessing is described in more detail below), encoded in UTF-8; the second column is the sentiment label (0 for negative, 1 for positive). Note that the first line of each data file must be the header "text_a\tlabel". For example:
```text
特 喜欢 这种 好看的 狗狗 1
这 真是 惊艳 世界 的 中国 黑科技 1
环境 特别 差 ,脏兮兮 的,再也 不去 了 0
```
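`data_reader.py` (included below) turns each such line into a fixed-length list of word ids plus a label; a minimal sketch of that conversion, assuming the vocabulary has been downloaded to `./senta_data/word_dict.txt`:
```python
# Convert one data line into (word ids, label), mirroring data_reader() below.
from hapi.text.senta.data_reader import load_vocab

word_dict = load_vocab('./senta_data/word_dict.txt')
unk_id = len(word_dict)  # id used for out-of-vocabulary words and padding, as in data_reader()
padding_size = 150

line = "特 喜欢 这种 好看的 狗狗\t1"
text, label = line.strip().split("\t")
wids = [word_dict.get(w, unk_id) for w in text.split(" ")]
wids = (wids + [unk_id] * padding_size)[:padding_size]  # pad or truncate to padding_size
print(len(wids), int(label))
```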
#### Code structure
```text
.
├── sentiment_classifier.py  # entry point wrapping training, prediction, and evaluation
├── models.py                # network definitions
```
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
from paddle.fluid.dygraph.nn import Conv2D, Pool2D, Linear, Embedding
from paddle.fluid.dygraph.base import to_variable
import numpy as np
from hapi.model import Model
from hapi.text.text import GRUEncoderLayer as BiGRUEncoder
from hapi.text.text import BOWEncoder, CNNEncoder, GRUEncoder
class CNN(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(CNN, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.channels = 1
self.win_size = [3, self.hid_dim]
self.batch_size = batch_size
self.seq_len = seq_len
self._encoder = CNNEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
seq_len=self.seq_len,
filter_size= self.win_size,
num_filters= self.hid_dim,
hidden_dim= self.hid_dim,
padding_idx=None,
act='tanh')
self._fc1 = Linear(input_dim = self.hid_dim*self.seq_len, output_dim=self.fc_hid_dim, act="softmax")
self._fc_prediction = Linear(input_dim = self.fc_hid_dim,
output_dim = self.class_dim,
act="softmax")
def forward(self, inputs):
conv_3 = self._encoder(inputs)
fc_1 = self._fc1(conv_3)
prediction = self._fc_prediction(fc_1)
return prediction
class BOW(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(BOW, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self._encoder = BOWEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
padding_idx=None,
bow_dim=self.hid_dim,
seq_len=self.seq_len)
self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim, act="tanh")
self._fc2 = Linear(input_dim = self.hid_dim, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim = self.fc_hid_dim,
output_dim = self.class_dim,
act="softmax")
def forward(self, inputs):
bow_1 = self._encoder(inputs)
bow_1 = fluid.layers.tanh(bow_1)
fc_1 = self._fc1(bow_1)
fc_2 = self._fc2(fc_1)
prediction = self._fc_prediction(fc_2)
return prediction
class GRU(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(GRU, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self._fc1 = Linear(input_dim=self.hid_dim, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim=self.fc_hid_dim,
output_dim=self.class_dim,
act="softmax")
self._encoder = GRUEncoder(
dict_size=self.dict_dim + 1,
emb_dim=self.emb_dim,
gru_dim=self.hid_dim,
hidden_dim=self.hid_dim,
padding_idx=None,
seq_len=self.seq_len)
def forward(self, inputs):
emb = self._encoder(inputs)
fc_1 = self._fc1(emb)
prediction = self._fc_prediction(fc_1)
return prediction
class BiGRU(Model):
def __init__(self, dict_dim, batch_size, seq_len):
super(BiGRU, self).__init__()
self.dict_dim = dict_dim
self.emb_dim = 128
self.hid_dim = 128
self.fc_hid_dim = 96
self.class_dim = 2
self.batch_size = batch_size
self.seq_len = seq_len
self.embedding = Embedding(
size=[self.dict_dim + 1, self.emb_dim],
dtype='float32',
param_attr=fluid.ParamAttr(learning_rate=30),
is_sparse=False)
h_0 = np.zeros((self.batch_size, self.hid_dim), dtype="float32")
h_0 = to_variable(h_0)
self._fc1 = Linear(input_dim = self.hid_dim, output_dim=self.hid_dim*3)
self._fc2 = Linear(input_dim = self.hid_dim*2, output_dim=self.fc_hid_dim, act="tanh")
self._fc_prediction = Linear(input_dim=self.fc_hid_dim,
output_dim=self.class_dim,
act="softmax")
self._encoder = BiGRUEncoder(
grnn_hidden_dim=self.hid_dim,
input_dim=self.hid_dim * 3,
h_0=h_0,
init_bound=0.1,
is_bidirection=True)
def forward(self, inputs):
emb = self.embedding(inputs)
emb = fluid.layers.reshape(emb, shape=[self.batch_size, -1, self.hid_dim])
fc_1 = self._fc1(emb)
encoded_vector = self._encoder(fc_1)
encoded_vector = fluid.layers.tanh(encoded_vector)
encoded_vector = fluid.layers.reduce_max(encoded_vector, dim=1)
fc_2 = self._fc2(encoded_vector)
prediction = self._fc_prediction(fc_2)
return prediction
checkpoints: "./checkpoints"
epoch: 5
save_freq: 1
eval_freq: 1
lr: 0.002
padding_size: 150
skip_steps: 10
verbose: False
data_dir: "./senta_data/"
vocab_path: "./senta_data/word_dict.txt"
vocab_size: 33256
batch_size: 20
random_seed: 0
use_cuda: True
do_train: True
do_infer: False
model_type: "bow_net"
output_dir: "./output"
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sentiment Classification in Paddle Dygraph Mode. """
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from hapi.model import set_device, Model, CrossEntropy, Input
from hapi.configure import Config
from hapi.text.senta import SentaProcessor, Optimizer
from hapi.metrics import Accuracy
from models import CNN, BOW, GRU, BiGRU
import json
import os
args = Config(yaml_file='./senta.yaml')
args.build()
args.Print()
device = set_device("gpu" if args.use_cuda else "cpu")
dev_count = fluid.core.get_cuda_device_count() if args.use_cuda else 1
def main():
if args.do_train:
train()
elif args.do_infer:
infer()
def train():
fluid.enable_dygraph(device)
processor = SentaProcessor(
data_dir=args.data_dir,
vocab_path=args.vocab_path,
random_seed=args.random_seed)
num_labels = len(processor.get_labels())
num_train_examples = processor.get_num_examples(phase="train")
max_train_steps = args.epoch * num_train_examples // args.batch_size // dev_count
train_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='train',
epoch=args.epoch,
shuffle=False)
eval_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='dev',
epoch=args.epoch,
shuffle=False)
if args.model_type == 'cnn_net':
model = CNN( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bow_net':
model = BOW( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'gru_net':
model = GRU( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bigru_net':
model = BiGRU( args.vocab_size, args.batch_size,
args.padding_size)
optimizer = Optimizer(
num_train_steps=max_train_steps,
model_cls=model,
learning_rate=args.lr,
parameter_list=model.parameters())
inputs = [Input([None, None], 'int64', name='doc')]
labels = [Input([None, 1], 'int64', name='label')]
model.prepare(
optimizer,
CrossEntropy(),
Accuracy(topk=(1,)),
inputs,
labels,
device=device)
model.fit(train_data=train_data_generator,
eval_data=eval_data_generator,
batch_size=args.batch_size,
epochs=args.epoch,
save_dir=args.checkpoints,
eval_freq=args.eval_freq,
save_freq=args.save_freq)
def infer():
fluid.enable_dygraph(device)
processor = SentaProcessor(
data_dir=args.data_dir,
vocab_path=args.vocab_path,
random_seed=args.random_seed)
infer_data_generator = processor.data_generator(
batch_size=args.batch_size,
padding_size=args.padding_size,
places=device,
phase='infer',
epoch=1,
shuffle=False)
if args.model_type == 'cnn_net':
model_infer = CNN( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bow_net':
model_infer = BOW( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'gru_net':
model_infer = GRU( args.vocab_size, args.batch_size,
args.padding_size)
elif args.model_type == 'bigru_net':
model_infer = BiGRU( args.vocab_size, args.batch_size,
args.padding_size)
print('Do inferring ...... ')
inputs = [Input([None, None], 'int64', name='doc')]
model_infer.prepare(
None,
CrossEntropy(),
Accuracy(topk=(1,)),
inputs,
device=device)
model_infer.load(args.checkpoints, reset_optimizer=True)
preds = model_infer.predict(test_data=infer_data_generator)
preds = np.array(preds[0]).reshape((-1, 2))
    if args.output_dir:
        # make sure the output directory exists before writing predictions
        if not os.path.exists(args.output_dir):
            os.makedirs(args.output_dir)
        with open(os.path.join(args.output_dir, 'predictions.json'), 'w') as w:
for p in range(len(preds)):
                label = int(np.argmax(preds[p]))  # cast to int so json can serialize it
result = json.dumps({'index': p, 'label': label, 'probs': preds[p].tolist()})
w.write(result+'\n')
print('Predictions saved at '+os.path.join(args.output_dir, 'predictions.json'))
if __name__ == '__main__':
main()
# -*- coding: UTF-8 -*-
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import print_function
import os
import tarfile
import shutil
from collections import OrderedDict
import sys
import urllib
URLLIB=urllib
if sys.version_info >= (3, 0):
import urllib.request
URLLIB=urllib.request
__all__ = ["download", "ls"]
_pretrain = (('RoBERTa-zh-base', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_ext_L-12_H-768_A-12.tar.gz'),
('RoBERTa-zh-large', 'https://bert-models.bj.bcebos.com/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.tar.gz'),
('ERNIE-v2-en-base', 'https://ernie.bj.bcebos.com/ERNIE_Base_en_stable-2.0.0.tar.gz'),
('ERNIE-v2-en-large', 'https://ernie.bj.bcebos.com/ERNIE_Large_en_stable-2.0.0.tar.gz'),
('XLNet-cased-base', 'https://xlnet.bj.bcebos.com/xlnet_cased_L-12_H-768_A-12.tgz'),
('XLNet-cased-large', 'https://xlnet.bj.bcebos.com/xlnet_cased_L-24_H-1024_A-16.tgz'),
('ERNIE-v1-zh-base', 'https://baidu-nlp.bj.bcebos.com/ERNIE_stable-1.0.1.tar.gz'),
('ERNIE-v1-zh-base-max-len-512', 'https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz'),
('BERT-en-uncased-large-whole-word-masking', 'https://bert-models.bj.bcebos.com/wwm_uncased_L-24_H-1024_A-16.tar.gz'),
('BERT-en-cased-large-whole-word-masking', 'https://bert-models.bj.bcebos.com/wwm_cased_L-24_H-1024_A-16.tar.gz'),
('BERT-en-uncased-base', 'https://bert-models.bj.bcebos.com/uncased_L-12_H-768_A-12.tar.gz'),
('BERT-en-uncased-large', 'https://bert-models.bj.bcebos.com/uncased_L-24_H-1024_A-16.tar.gz'),
('BERT-en-cased-base', 'https://bert-models.bj.bcebos.com/cased_L-12_H-768_A-12.tar.gz'),
('BERT-en-cased-large','https://bert-models.bj.bcebos.com/cased_L-24_H-1024_A-16.tar.gz'),
('BERT-multilingual-uncased-base', 'https://bert-models.bj.bcebos.com/multilingual_L-12_H-768_A-12.tar.gz'),
('BERT-multilingual-cased-base', 'https://bert-models.bj.bcebos.com/multi_cased_L-12_H-768_A-12.tar.gz'),
('BERT-zh-base', 'https://bert-models.bj.bcebos.com/chinese_L-12_H-768_A-12.tar.gz'),)
_items = OrderedDict(_pretrain)
def _download(item, path, silent=False, convert=False):
data_url = _items[item]
    if data_url is None:
return
if not silent:
print('Downloading {} from {}...'.format(item, data_url))
data_dir = path + '/' + item
if not os.path.exists(data_dir):
os.makedirs(os.path.join(data_dir))
data_name = data_url.split('/')[-1]
filename = data_dir + '/' + data_name
# print process
def _reporthook(count, chunk_size, total_size):
bytes_so_far = count * chunk_size
percent = float(bytes_so_far) / float(total_size)
if percent > 1:
percent = 1
if not silent:
print('\r>> Downloading... {:.1%}'.format(percent), end = "")
URLLIB.urlretrieve(data_url, filename, reporthook=_reporthook)
if not silent:
print(' done!')
print ('Extracting {}...'.format(data_name), end=" ")
if os.path.exists(filename):
tar = tarfile.open(filename, 'r')
tar.extractall(path = data_dir)
tar.close()
os.remove(filename)
if len(os.listdir(data_dir))==1:
source_path = data_dir + '/' + data_name.split('.')[0]
fileList = os.listdir(source_path)
for file in fileList:
filePath = os.path.join(source_path, file)
shutil.move(filePath, data_dir)
os.removedirs(source_path)
if not silent:
print ('done!')
if convert:
if not silent:
print ('Converting params...', end=" ")
_convert(data_dir, silent)
def _convert(path, silent=False):
if os.path.isfile(path + '/params/__palminfo__'):
if not silent:
print ('already converted.')
else:
if os.path.exists(path + '/params/'):
os.rename(path + '/params/', path + '/params1/')
os.mkdir(path + '/params/')
tar_model = tarfile.open(path + '/params/' + '__palmmodel__', 'w')
tar_info = open(path + '/params/'+ '__palminfo__', 'w')
for root, dirs, files in os.walk(path + '/params1/'):
for file in files:
src_file = os.path.join(root, file)
tar_model.add(src_file, '__paddlepalm_' + file)
tar_info.write('__paddlepalm_' + file)
os.remove(src_file)
tar_model.close()
tar_info.close()
os.removedirs(path + '/params1/')
if not silent:
print ('done!')
def download(item='all', path='.'):
"""
Args:
item: the item to download.
path: the target dir to download to. Default is `.`, means current dir.
"""
# item = item.lower()
# scope = scope.lower()
if item != 'all':
        assert item in _items, '{} is not found. Support list: {}'.format(item, list(_items.keys()))
_download(item, path)
else:
for item in _items.keys():
_download(item, path)
def _ls():
for item in _items.keys():
print (' => ' + item)
def ls():
print ('Available pretrain models: ')
_ls()
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from hapi.text.senta.data_processer import SentaProcessor
from hapi.text.senta.optimization import Optimizer as Optimizer
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from hapi.text.senta.data_reader import load_vocab
from hapi.text.senta.data_reader import data_reader
from paddle.io import DataLoader
class SentaProcessor(object):
def __init__(self, data_dir, vocab_path, random_seed=None):
self.data_dir = data_dir
self.vocab = load_vocab(vocab_path)
self.num_examples = {"train": -1, "dev": -1, "infer": -1}
np.random.seed(random_seed)
def get_train_examples(self, data_dir, epoch, shuffle, batch_size, places, padding_size):
train_reader = data_reader((self.data_dir + "/train.tsv"), self.vocab,
self.num_examples, "train", epoch, padding_size, shuffle)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(train_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_dev_examples(self, data_dir, epoch, shuffle, batch_size, places, padding_size):
dev_reader = data_reader((self.data_dir + "/dev.tsv"), self.vocab,
self.num_examples, "dev", epoch, padding_size, shuffle)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(dev_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_test_examples(self, data_dir, epoch, batch_size, places, padding_size):
test_reader = data_reader((self.data_dir + "/test.tsv"), self.vocab,
self.num_examples, "infer", epoch, padding_size)
loader = DataLoader.from_generator(capacity=50, return_list=True)
loader.set_sample_generator(test_reader, batch_size=batch_size, drop_last=False, places=places)
return loader
def get_labels(self):
return ["0", "1"]
def get_num_examples(self, phase):
if phase not in ['train', 'dev', 'infer']:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'infer'].")
return self.num_examples[phase]
def get_train_progress(self):
return self.current_train_example, self.current_train_epoch
def data_generator(self, padding_size, batch_size, places, phase='train', epoch=1, shuffle=True):
if phase == "train":
return self.get_train_examples(self.data_dir, epoch, shuffle, batch_size, places, padding_size)
elif phase == "dev":
return self.get_dev_examples(self.data_dir, epoch, shuffle, batch_size, places, padding_size)
elif phase == "infer":
return self.get_test_examples(self.data_dir, epoch, batch_size, places, padding_size)
else:
raise ValueError(
"Unknown phase, which should be in ['train', 'dev', 'infer'].")
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import io
import sys
import random
def str2bool(v):
return v.lower() in ("true", "t", "1")
def data_reader(file_path, word_dict, num_examples, phrase, epoch, padding_size, shuffle=False):
unk_id = len(word_dict)
all_data = []
with io.open(file_path, "r", encoding='utf8') as fin:
for line in fin:
if line.startswith('text_a'):
continue
cols = line.strip().split("\t")
if len(cols) != 2:
sys.stderr.write("[NOTICE] Error Format Line!")
continue
label = [int(cols[1])]
wids = [
word_dict[x] if x in word_dict else unk_id
for x in cols[0].split(" ")
]
wids = wids[:padding_size]
while len(wids) < padding_size:
wids.append(unk_id)
all_data.append((wids, label))
if shuffle:
if phrase == "train":
random.shuffle(all_data)
num_examples[phrase] = len(all_data)
def reader():
for epoch_index in range(epoch):
for doc, label in all_data:
yield doc, label
return reader
def load_vocab(file_path):
vocab = {}
with io.open(file_path, 'r', encoding='utf8') as f:
wid = 0
for line in f:
if line.strip() not in vocab:
vocab[line.strip()] = wid
wid += 1
vocab["<unk>"] = len(vocab)
return vocab
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimization and learning rate scheduling."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph.learning_rate_scheduler import LearningRateDecay
class ConstantLR(LearningRateDecay):
def __init__(self, learning_rate, begin=0, step=1, dtype='float32'):
super(ConstantLR, self).__init__(begin, step, dtype)
self.learning_rate = learning_rate
def step(self):
return self.learning_rate
class LinearDecay(LearningRateDecay):
def __init__(self,
learning_rate,
warmup_steps,
decay_steps,
end_learning_rate=0.0001,
power=1.0,
cycle=False,
begin=0,
step=1,
dtype='float32'):
super(LinearDecay, self).__init__(begin, step, dtype)
self.learning_rate = learning_rate
self.warmup_steps = warmup_steps
self.decay_steps = decay_steps
self.end_learning_rate = end_learning_rate
self.power = power
self.cycle = cycle
def step(self):
if self.step_num < self.warmup_steps:
decayed_lr = self.learning_rate * (self.step_num /
self.warmup_steps)
decayed_lr = self.create_lr_var(decayed_lr)
else:
tmp_step_num = self.step_num
tmp_decay_steps = self.decay_steps
if self.cycle:
div_res = fluid.layers.ceil(
self.create_lr_var(tmp_step_num / float(self.decay_steps)))
if tmp_step_num == 0:
div_res = self.create_lr_var(1.0)
tmp_decay_steps = self.decay_steps * div_res
else:
tmp_step_num = self.create_lr_var(
tmp_step_num
if tmp_step_num < self.decay_steps else self.decay_steps)
decayed_lr = (self.learning_rate - self.end_learning_rate) * \
((1 - tmp_step_num / tmp_decay_steps) ** self.power) + self.end_learning_rate
return decayed_lr
class Optimizer(object):
def __init__(self,
num_train_steps,
learning_rate,
model_cls,
weight_decay=0,
warmup_steps=0,
scheduler='linear_warmup_decay',
loss_scaling=1.0,
parameter_list=None):
self.warmup_steps = warmup_steps
self.num_train_steps = num_train_steps
self.learning_rate = learning_rate
self.model_cls = model_cls
self.weight_decay = weight_decay
self.scheduler = scheduler
self.loss_scaling = loss_scaling
self.parameter_list = parameter_list
self.scheduled_lr = 0.0
self.optimizer = self.lr_schedule()
def lr_schedule(self):
if self.warmup_steps > 0:
if self.scheduler == 'noam_decay':
self.scheduled_lr = fluid.dygraph.NoamDecay(1 / (
self.warmup_steps * (self.learning_rate**2)),
self.warmup_steps)
elif self.scheduler == 'linear_warmup_decay':
self.scheduled_lr = LinearDecay(self.learning_rate,
self.warmup_steps,
self.num_train_steps, 0.0)
else:
raise ValueError("Unkown learning rate scheduler, should be "
"'noam_decay' or 'linear_warmup_decay'")
optimizer = fluid.optimizer.Adam(
learning_rate=self.scheduled_lr,
parameter_list=self.parameter_list)
else:
self.scheduled_lr = ConstantLR(self.learning_rate)
optimizer = fluid.optimizer.Adam(
learning_rate=self.scheduled_lr,
parameter_list=self.parameter_list)
return optimizer
def exclude_from_weight_decay(self, name):
if name.find("layer_norm") > -1:
return True
bias_suffix = ["_bias", "_b", ".b_0"]
for suffix in bias_suffix:
if name.endswith(suffix):
return True
return False
def state_dict(self):
return self.optimizer.state_dict()
def set_dict(self, state_dict):
return self.optimizer.set_dict(state_dict)
def get_opti_var_name_list(self):
return self.optimizer.get_opti_var_name_list()
def current_step_lr(self):
return self.optimizer.current_step_lr()
def minimize(self, loss, use_data_parallel=False, model=None):
param_list = dict()
clip_norm_thres = 1.0
#grad_clip = fluid.clip.GradientClipByGlobalNorm(clip_norm_thres)
if use_data_parallel:
loss = model.scale_loss(loss)
loss.backward()
if self.weight_decay > 0:
for param in self.model_cls.parameters():
param_list[param.name] = param * 1.0
param_list[param.name].stop_gradient = True
if use_data_parallel:
assert model is not None
model.apply_collective_grads()
#_, param_grads = self.optimizer.minimize(loss, grad_clip=grad_clip)
_, param_grads = self.optimizer.minimize(loss)
if self.weight_decay > 0:
for param, grad in param_grads:
if self.exclude_from_weight_decay(param.name):
continue
if isinstance(self.scheduled_lr.step(), float):
updated_param = param.numpy() - param_list[
param.name].numpy(
) * self.weight_decay * self.scheduled_lr.step()
else:
updated_param = param.numpy(
) - param_list[param.name].numpy(
) * self.weight_decay * self.scheduled_lr.step().numpy()
updated_param_var = fluid.dygraph.to_variable(updated_param)
param = updated_param_var
#param = fluid.layers.reshape(x=updated_param_var, shape=list(updated_param_var.shape))