diff --git a/modules/text/language_model/albert-base-v1/README.md b/modules/text/language_model/albert-base-v1/README.md index abef64ad567a5f1446e1a7286298d18d8049045b..b0f68e8b9267e0acf56d01304d1928ff0ca1eaec 100644 --- a/modules/text/language_model/albert-base-v1/README.md +++ b/modules/text/language_model/albert-base-v1/README.md @@ -25,9 +25,9 @@ - ### 1. Environment Dependencies - - paddlepaddle >= 2.0.0 + - paddlepaddle >= 2.2.0 - - paddlehub >= 2.0.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) - ### 2. Installation diff --git a/modules/text/language_model/albert-base-v2/README.md b/modules/text/language_model/albert-base-v2/README.md new file mode 100644 index 0000000000000000000000000000000000000000..d2ad12fcb2b3126f050969bf582476d3f1de4cab --- /dev/null +++ b/modules/text/language_model/albert-base-v2/README.md @@ -0,0 +1,173 @@ +# albert-base-v2 +|Module Name|albert-base-v2| +| :--- | :---: | +|Category|Text - Semantic Model| +|Network|albert-base-v2| +|Dataset|-| +|Fine-tuning supported|Yes| +|Module Size|90MB| +|Latest update date|2022-02-08| +|Metrics|-| + +## I. Basic Information + +- ### Module Introduction + + - To address the excessive parameter counts of current pre-trained models, ALBERT proposes the following improvements: + + - Factorized embedding parameterization. ALBERT factorizes the word-embedding parameters: tokens are first mapped into a low-dimensional embedding space E, which is then projected into the high-dimensional hidden space H. + + - Cross-layer parameter sharing. ALBERT shares all parameters across layers. + +For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942). + +## II. Installation + +- ### 1. Environment Dependencies + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2. Installation + + - ```shell + $ hub install albert-base-v2 + ``` + - If you run into problems during installation, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## III. Module API Prediction + +- ### 1. Prediction Code Example + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + 
['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-base-v2', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +For details, see the PaddleHub demos: +- [Text classification](../../../../demo/text_classification) +- [Sequence labeling](../../../../demo/sequence_labeling) + +- ### 2. API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - Creates a Module object (dynamic-graph version). + + - **Parameters** + + - `task`: Task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling). + - `load_checkpoint`: Path to the model parameter file saved by training with the PaddleHub Fine-tune API. + - `label_map`: Class mapping table used at prediction time. + - `num_classes`: Number of classes for classification tasks. Can be omitted when `label_map` is given; defaults to 2. + - `suffix`: Label format for sequence labeling. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`. + - `**kwargs`: Additional user-specified keyword arguments. + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **Parameters** + + - `data`: Data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match the number used in training. + - `max_seq_len`: Maximum text length the model processes. + - `batch_size`: Batch size. + - `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - **Return** + + - `results`: a list whose contents depend on the task type: + - Text classification: the predicted label of each sentence, in the format \[label\_1, label\_2, …,\] + - Sequence labeling: the predicted label of each token in each sentence, in the format \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - Returns sentence-level and token-level features of the input texts. + + - **Parameters** + + - `data`: List of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. + - `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - **Return** + + 
- `results`: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]. Each element is the feature output for the corresponding sample, and every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature. + +## IV. Server Deployment + +- PaddleHub Serving can deploy an online service for retrieving pre-trained embeddings. + +- ### Step 1: Start PaddleHub Serving + + - ```shell + $ hub serving start -m albert-base-v2 + ``` + + - This deploys a serving API for retrieving pre-trained embeddings; the default port is 8866. + + - **NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set. + +- ### Step 2: Send a prediction request + + - With the server configured, the following few lines of code send a prediction request and fetch the result. + + - ```python + import requests + import json + + # Texts to get embeddings for, in the format [[text_1], [text_2], ...] + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # The key names the prediction-method parameter that receives text; here it is "data" + # For a local deployment this corresponds to module.get_embedding(data=text) + data = {"data": text} + # Send a POST request with a JSON content type; change the IP address in the url to that of the serving machine + url = "http://127.0.0.1:8866/predict/albert-base-v2" + # Set the POST request headers to application/json + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## V. Release Note + +* 1.0.0 + + Initial release diff --git a/modules/text/language_model/albert-base-v2/__init__.py b/modules/text/language_model/albert-base-v2/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-base-v2/module.py b/modules/text/language_model/albert-base-v2/module.py new file mode 100644 index 0000000000000000000000000000000000000000..b3b639ed2eb4065ed3c144773b09a090c268051c --- /dev/null +++ b/modules/text/language_model/albert-base-v2/module.py @@ -0,0 +1,177 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import os +from typing import Dict + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification +from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification +from paddlenlp.transformers.albert.modeling import AlbertModel +from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer + +from paddlehub.module.module import moduleinfo +from paddlehub.module.nlp_module import TransformerModule +from paddlehub.utils.log import logger + + +@moduleinfo(name="albert-base-v2", + version="1.0.0", + summary="", + author="Baidu", + author_email="", + type="nlp/semantic_model", + meta=TransformerModule) +class Albert(nn.Layer): + """ + ALBERT model + """ + + def __init__( + self, + task: str = None, + load_checkpoint: str = None, + label_map: Dict = None, + num_classes: int = 2, + suffix: bool = False, + **kwargs, + ): + super(Albert, self).__init__() + if label_map: + self.label_map = label_map + self.num_classes = len(label_map) + else: + self.num_classes = num_classes + + if task == 'sequence_classification': + task = 'seq-cls' + logger.warning( + "current task name 'sequence_classification' was renamed to 'seq-cls', " + "'sequence_classification' has been deprecated and will be removed in the future.", ) + if task == 'seq-cls': + self.model = AlbertForSequenceClassification.from_pretrained(pretrained_model_name_or_path='albert-base-v2', + num_classes=self.num_classes, + **kwargs) + 
self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task == 'token-cls': + self.model = AlbertForTokenClassification.from_pretrained(pretrained_model_name_or_path='albert-base-v2', + num_classes=self.num_classes, + **kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], + suffix=suffix) + elif task == 'text-matching': + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-base-v2', **kwargs) + self.dropout = paddle.nn.Dropout(0.1) + self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task is None: + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-base-v2', **kwargs) + else: + raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported)) + + self.task = task + + if load_checkpoint is not None and os.path.isfile(load_checkpoint): + state_dict = paddle.load(load_checkpoint) + self.set_state_dict(state_dict) + logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint)) + + def forward(self, + input_ids=None, + token_type_ids=None, + position_ids=None, + attention_mask=None, + query_input_ids=None, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_input_ids=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + seq_lengths=None, + labels=None): + + if self.task != 'text-matching': + result = self.model(input_ids, token_type_ids, position_ids, attention_mask) + else: + query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask) + title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask) + + if self.task == 
'seq-cls': + logits = result + probs = F.softmax(logits, axis=1) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + elif self.task == 'token-cls': + logits = result + token_level_probs = F.softmax(logits, axis=-1) + preds = token_level_probs.argmax(axis=-1) + if labels is not None: + loss = self.criterion(logits, labels.unsqueeze(-1)) + num_infer_chunks, num_label_chunks, num_correct_chunks = \ + self.metric.compute(None, seq_lengths, preds, labels) + self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + _, _, f1_score = map(float, self.metric.accumulate()) + return token_level_probs, loss, {'f1_score': f1_score} + return token_level_probs + elif self.task == 'text-matching': + query_token_embedding, _ = query_result + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = title_result + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = paddle.concat([query_mean, title_mean, sub], 
axis=-1) + logits = self.classifier(projection) + probs = F.softmax(logits) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + else: + sequence_output, pooled_output = result + return sequence_output, pooled_output + + @staticmethod + def get_tokenizer(*args, **kwargs): + """ + Gets the tokenizer that is customized for this module. + """ + return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-base-v2', *args, **kwargs) diff --git a/modules/text/language_model/albert-chinese-base/README.md b/modules/text/language_model/albert-chinese-base/README.md new file mode 100644 index 0000000000000000000000000000000000000000..a3964f44859853101c7e102bf4f40f51d1af893b --- /dev/null +++ b/modules/text/language_model/albert-chinese-base/README.md @@ -0,0 +1,173 @@ +# albert-chinese-base +|Module Name|albert-chinese-base| +| :--- | :---: | +|Category|Text - Semantic Model| +|Network|albert-chinese-base| +|Dataset|-| +|Fine-tuning supported|Yes| +|Module Size|77MB| +|Latest update date|2022-02-08| +|Metrics|-| + +## I. Basic Information + +- ### Module Introduction + + - To address the excessive parameter counts of current pre-trained models, ALBERT proposes the following improvements: + + - Factorized embedding parameterization. ALBERT factorizes the word-embedding parameters: tokens are first mapped into a low-dimensional embedding space E, which is then projected into the high-dimensional hidden space H. + + - Cross-layer parameter sharing. ALBERT shares all parameters across layers. + +For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942). + +## II. Installation + +- ### 1. Environment Dependencies + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2. Installation + + - ```shell + $ hub install albert-chinese-base + ``` + - If you run into problems during installation, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## III. Module API Prediction + +- ### 1. Prediction Code Example + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + 
['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-chinese-base', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +For details, see the PaddleHub demos: +- [Text classification](../../../../demo/text_classification) +- [Sequence labeling](../../../../demo/sequence_labeling) + +- ### 2. API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - Creates a Module object (dynamic-graph version). + + - **Parameters** + + - `task`: Task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling). + - `load_checkpoint`: Path to the model parameter file saved by training with the PaddleHub Fine-tune API. + - `label_map`: Class mapping table used at prediction time. + - `num_classes`: Number of classes for classification tasks. Can be omitted when `label_map` is given; defaults to 2. + - `suffix`: Label format for sequence labeling. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`. + - `**kwargs`: Additional user-specified keyword arguments. + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **Parameters** + + - `data`: Data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match the number used in training. + - `max_seq_len`: Maximum text length the model processes. + - `batch_size`: Batch size. + - `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - **Return** + + - `results`: a list whose contents depend on the task type: + - Text classification: the predicted label of each sentence, in the format \[label\_1, label\_2, …,\] + - Sequence labeling: the predicted label of each token in each sentence, in the format \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - Returns sentence-level and token-level features of the input texts. + + - **Parameters** + + - `data`: List of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. + - `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - 
**Return** + + - `results`: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]. Each element is the feature output for the corresponding sample, and every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature. + +## IV. Server Deployment + +- PaddleHub Serving can deploy an online service for retrieving pre-trained embeddings. + +- ### Step 1: Start PaddleHub Serving + + - ```shell + $ hub serving start -m albert-chinese-base + ``` + + - This deploys a serving API for retrieving pre-trained embeddings; the default port is 8866. + + - **NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set. + +- ### Step 2: Send a prediction request + + - With the server configured, the following few lines of code send a prediction request and fetch the result. + + - ```python + import requests + import json + + # Texts to get embeddings for, in the format [[text_1], [text_2], ...] + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # The key names the prediction-method parameter that receives text; here it is "data" + # For a local deployment this corresponds to module.get_embedding(data=text) + data = {"data": text} + # Send a POST request with a JSON content type; change the IP address in the url to that of the serving machine + url = "http://127.0.0.1:8866/predict/albert-chinese-base" + # Set the POST request headers to application/json + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## V. Release Note + +* 1.0.0 + + Initial release diff --git a/modules/text/language_model/albert-chinese-base/__init__.py b/modules/text/language_model/albert-chinese-base/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-chinese-base/module.py b/modules/text/language_model/albert-chinese-base/module.py new file mode 100644 index 0000000000000000000000000000000000000000..0a7796a0122123225b21ff6fec8b9a2c8368a716 --- /dev/null +++ b/modules/text/language_model/albert-chinese-base/module.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import os +from typing import Dict + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification +from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification +from paddlenlp.transformers.albert.modeling import AlbertModel +from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer + +from paddlehub.module.module import moduleinfo +from paddlehub.module.nlp_module import TransformerModule +from paddlehub.utils.log import logger + + +@moduleinfo(name="albert-chinese-base", + version="1.0.0", + summary="", + author="Baidu", + author_email="", + type="nlp/semantic_model", + meta=TransformerModule) +class Albert(nn.Layer): + """ + ALBERT model + """ + + def __init__( + self, + task: str = None, + load_checkpoint: str = None, + label_map: Dict = None, + num_classes: int = 2, + suffix: bool = False, + **kwargs, + ): + super(Albert, self).__init__() + if label_map: + self.label_map = label_map + self.num_classes = len(label_map) + else: + self.num_classes = num_classes + + if task == 'sequence_classification': + task = 'seq-cls' + logger.warning( + "current task name 'sequence_classification' was renamed to 'seq-cls', " + "'sequence_classification' has been deprecated and will be removed in the future.", ) + if task == 'seq-cls': + self.model = AlbertForSequenceClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-base', num_classes=self.num_classes, 
**kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task == 'token-cls': + self.model = AlbertForTokenClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-base', num_classes=self.num_classes, **kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], + suffix=suffix) + elif task == 'text-matching': + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-base', **kwargs) + self.dropout = paddle.nn.Dropout(0.1) + self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task is None: + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-base', **kwargs) + else: + raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported)) + + self.task = task + + if load_checkpoint is not None and os.path.isfile(load_checkpoint): + state_dict = paddle.load(load_checkpoint) + self.set_state_dict(state_dict) + logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint)) + + def forward(self, + input_ids=None, + token_type_ids=None, + position_ids=None, + attention_mask=None, + query_input_ids=None, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_input_ids=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + seq_lengths=None, + labels=None): + + if self.task != 'text-matching': + result = self.model(input_ids, token_type_ids, position_ids, attention_mask) + else: + query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask) + title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask) 
+ + if self.task == 'seq-cls': + logits = result + probs = F.softmax(logits, axis=1) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + elif self.task == 'token-cls': + logits = result + token_level_probs = F.softmax(logits, axis=-1) + preds = token_level_probs.argmax(axis=-1) + if labels is not None: + loss = self.criterion(logits, labels.unsqueeze(-1)) + num_infer_chunks, num_label_chunks, num_correct_chunks = \ + self.metric.compute(None, seq_lengths, preds, labels) + self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + _, _, f1_score = map(float, self.metric.accumulate()) + return token_level_probs, loss, {'f1_score': f1_score} + return token_level_probs + elif self.task == 'text-matching': + query_token_embedding, _ = query_result + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = title_result + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = 
paddle.concat([query_mean, title_mean, sub], axis=-1) + logits = self.classifier(projection) + probs = F.softmax(logits) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + else: + sequence_output, pooled_output = result + return sequence_output, pooled_output + + @staticmethod + def get_tokenizer(*args, **kwargs): + """ + Gets the tokenizer that is customized for this module. + """ + return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-base', *args, **kwargs) diff --git a/modules/text/language_model/albert-chinese-large/README.md b/modules/text/language_model/albert-chinese-large/README.md new file mode 100644 index 0000000000000000000000000000000000000000..48fec21d1eaf1fbbf36ec2973ea2f44501af75ea --- /dev/null +++ b/modules/text/language_model/albert-chinese-large/README.md @@ -0,0 +1,173 @@ +# albert-chinese-large +|Module Name|albert-chinese-large| +| :--- | :---: | +|Category|Text - Semantic Model| +|Network|albert-chinese-large| +|Dataset|-| +|Fine-tuning supported|Yes| +|Module Size|112MB| +|Latest update date|2022-02-08| +|Metrics|-| + +## I. Basic Information + +- ### Module Introduction + + - To address the excessive parameter counts of current pre-trained models, ALBERT proposes the following improvements: + + - Factorized embedding parameterization. ALBERT factorizes the word-embedding parameters: tokens are first mapped into a low-dimensional embedding space E, which is then projected into the high-dimensional hidden space H. + + - Cross-layer parameter sharing. ALBERT shares all parameters across layers. + +For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942). + +## II. Installation + +- ### 1. Environment Dependencies + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2. Installation + + - ```shell + $ hub install albert-chinese-large + ``` + - If you run into problems during installation, see: [Windows quickstart](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [Linux quickstart](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quickstart](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## III. Module API Prediction + +- ### 1. Prediction Code Example + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + 
['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-chinese-large', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +For details, see the PaddleHub demos: +- [Text classification](../../../../demo/text_classification) +- [Sequence labeling](../../../../demo/sequence_labeling) + +- ### 2. API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - Creates a Module object (dynamic-graph version). + + - **Parameters** + + - `task`: Task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling). + - `load_checkpoint`: Path to the model parameter file saved by training with the PaddleHub Fine-tune API. + - `label_map`: Class mapping table used at prediction time. + - `num_classes`: Number of classes for classification tasks. Can be omitted when `label_map` is given; defaults to 2. + - `suffix`: Label format for sequence labeling. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`. + - `**kwargs`: Additional user-specified keyword arguments. + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **Parameters** + + - `data`: Data to predict, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match the number used in training. + - `max_seq_len`: Maximum text length the model processes. + - `batch_size`: Batch size. + - `use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - **Return** + + - `results`: a list whose contents depend on the task type: + - Text classification: the predicted label of each sentence, in the format \[label\_1, label\_2, …,\] + - Sequence labeling: the predicted label of each token in each sentence, in the format \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - Returns sentence-level and token-level features of the input texts. + + - **Parameters** + + - `data`: List of input texts, in the format \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\]. Each element is one sample, and a sample may contain text\_a and text\_b. + - 
`use_gpu`: Whether to use the GPU; defaults to False. GPU users are advised to enable use_gpu. + + - **Return** + + - `results`: a list in the format \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\]. Each element is the feature output for the corresponding sample, and every sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature. + +## IV. Server Deployment + +- PaddleHub Serving can deploy an online service for retrieving pre-trained embeddings. + +- ### Step 1: Start PaddleHub Serving + + - ```shell + $ hub serving start -m albert-chinese-large + ``` + + - This deploys a serving API for retrieving pre-trained embeddings; the default port is 8866. + + - **NOTE:** To predict on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise it does not need to be set. + +- ### Step 2: Send a prediction request + + - With the server configured, the following few lines of code send a prediction request and fetch the result. + + - ```python + import requests + import json + + # Texts to get embeddings for, in the format [[text_1], [text_2], ...] + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # The key names the prediction-method parameter that receives text; here it is "data" + # For a local deployment this corresponds to module.get_embedding(data=text) + data = {"data": text} + # Send a POST request with a JSON content type; change the IP address in the url to that of the serving machine + url = "http://127.0.0.1:8866/predict/albert-chinese-large" + # Set the POST request headers to application/json + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## V. Release Note + +* 1.0.0 + + Initial release diff --git a/modules/text/language_model/albert-chinese-large/__init__.py b/modules/text/language_model/albert-chinese-large/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-chinese-large/module.py b/modules/text/language_model/albert-chinese-large/module.py new file mode 100644 index 0000000000000000000000000000000000000000..f7aa985d1bdf926416368b8197aba811ee671b14 --- /dev/null +++ b/modules/text/language_model/albert-chinese-large/module.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import os +from typing import Dict + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification +from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification +from paddlenlp.transformers.albert.modeling import AlbertModel +from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer + +from paddlehub.module.module import moduleinfo +from paddlehub.module.nlp_module import TransformerModule +from paddlehub.utils.log import logger + + +@moduleinfo(name="albert-chinese-large", + version="1.0.0", + summary="", + author="Baidu", + author_email="", + type="nlp/semantic_model", + meta=TransformerModule) +class Albert(nn.Layer): + """ + ALBERT model + """ + + def __init__( + self, + task: str = None, + load_checkpoint: str = None, + label_map: Dict = None, + num_classes: int = 2, + suffix: bool = False, + **kwargs, + ): + super(Albert, self).__init__() + if label_map: + self.label_map = label_map + self.num_classes = len(label_map) + else: + self.num_classes = num_classes + + if task == 'sequence_classification': + task = 'seq-cls' + logger.warning( + "current task name 'sequence_classification' was renamed to 'seq-cls', " + "'sequence_classification' has been deprecated and will be removed in the future.", ) + if task == 'seq-cls': + self.model = AlbertForSequenceClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-large', num_classes=self.num_classes, 
**kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task == 'token-cls': + self.model = AlbertForTokenClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-large', num_classes=self.num_classes, **kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], + suffix=suffix) + elif task == 'text-matching': + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-large', **kwargs) + self.dropout = paddle.nn.Dropout(0.1) + self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task is None: + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-large', **kwargs) + else: + raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported)) + + self.task = task + + if load_checkpoint is not None and os.path.isfile(load_checkpoint): + state_dict = paddle.load(load_checkpoint) + self.set_state_dict(state_dict) + logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint)) + + def forward(self, + input_ids=None, + token_type_ids=None, + position_ids=None, + attention_mask=None, + query_input_ids=None, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_input_ids=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + seq_lengths=None, + labels=None): + + if self.task != 'text-matching': + result = self.model(input_ids, token_type_ids, position_ids, attention_mask) + else: + query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask) + title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, 
title_attention_mask) + + if self.task == 'seq-cls': + logits = result + probs = F.softmax(logits, axis=1) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + elif self.task == 'token-cls': + logits = result + token_level_probs = F.softmax(logits, axis=-1) + preds = token_level_probs.argmax(axis=-1) + if labels is not None: + loss = self.criterion(logits, labels.unsqueeze(-1)) + num_infer_chunks, num_label_chunks, num_correct_chunks = \ + self.metric.compute(None, seq_lengths, preds, labels) + self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + _, _, f1_score = map(float, self.metric.accumulate()) + return token_level_probs, loss, {'f1_score': f1_score} + return token_level_probs + elif self.task == 'text-matching': + query_token_embedding, _ = query_result + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = title_result + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = 
paddle.concat([query_mean, title_mean, sub], axis=-1) + logits = self.classifier(projection) + probs = F.softmax(logits) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + else: + sequence_output, pooled_output = result + return sequence_output, pooled_output + + @staticmethod + def get_tokenizer(*args, **kwargs): + """ + Gets the tokenizer that is customized for this module. + """ + return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-large', *args, **kwargs) diff --git a/modules/text/language_model/albert-chinese-small/README.md b/modules/text/language_model/albert-chinese-small/README.md new file mode 100644 index 0000000000000000000000000000000000000000..8a4440ee5aa039024170845efc00a8c07a757b9a --- /dev/null +++ b/modules/text/language_model/albert-chinese-small/README.md @@ -0,0 +1,173 @@ +# albert-chinese-small +|模型名称|albert-chinese-small| +| :--- | :---: | +|类别|文本-语义模型| +|网络|albert-chinese-small| +|数据集|-| +|是否支持Fine-tuning|是| +|模型大小|44MB| +|最新更新日期|2022-02-08| +|数据指标|-| + +## 一、模型基本信息 + +- ### 模型介绍 + + - ALBERT针对当前预训练模型参数量过大的问题,提出了以下改进方案: + + - 嵌入向量参数化的因式分解。ALBERT对词嵌入参数进行了因式分解,先将单词映射到一个低维的词嵌入空间E,然后再将其映射到高维的隐藏空间H。 + + - 跨层参数共享。ALBERT共享了层之间的全部参数。 + +更多详情请参考[ALBERT论文](https://arxiv.org/abs/1909.11942) + +## 二、安装 + +- ### 1、环境依赖 + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [如何安装PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2、安装 + + - ```shell + $ hub install albert-chinese-small + ``` + - 如您安装时遇到问题,可参考:[零基础windows安装](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [零基础Linux安装](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [零基础MacOS安装](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## 三、模型API预测 + +- ### 1、预测代码示例 + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + 
['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-chinese-small', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +详情可参考PaddleHub示例: +- [文本分类](../../../../demo/text_classification) +- [序列标注](../../../../demo/sequence_labeling) + +- ### 2、API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - 创建Module对象(动态图组网版本) + + - **参数** + + - `task`: 任务名称,可为`seq-cls`(文本分类任务)或`token-cls`(序列标注任务)。 + - `load_checkpoint`:使用PaddleHub Fine-tune api训练保存的模型参数文件路径。 + - `label_map`:预测时的类别映射表。 + - `num_classes`:分类任务的类别数,如果指定了`label_map`,此参数可不传,默认2分类。 + - `suffix`: 序列标注任务的标签格式,如果设定为`True`,标签以'-B', '-I', '-E' 或者 '-S'为结尾,此参数默认为`False`。 + - `**kwargs`:用户额外指定的关键字字典类型的参数。 + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **参数** + + - `data`: 待预测数据,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。每个样例文本数量(1个或者2个)需和训练时保持一致。 + - `max_seq_len`:模型处理文本的最大长度 + - `batch_size`:模型批处理大小 + - `use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,不同任务类型的返回结果如下 + - 文本分类:列表里包含每个句子的预测标签,格式为\[label\_1, label\_2, …,\] + - 序列标注:列表里包含每个句子每个token的预测标签,格式为\[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - 用于获取输入文本的句子粒度特征与字粒度特征 + + - **参数** + + - `data`:输入文本列表,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。 + - 
`use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,格式为\[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\],其中每个元素都是对应样例的特征输出,每个样例都有句子粒度特征pooled\_feature与字粒度特征seq\_feature。 + +## 四、服务部署 + +- PaddleHub Serving可以部署一个在线获取预训练词向量。 + +- ### 第一步:启动PaddleHub Serving + + - ```shell + $ hub serving start -m albert-chinese-small + ``` + + - 这样就完成了一个获取预训练词向量服务化API的部署,默认端口号为8866。 + + - **NOTE:** 如使用GPU预测,则需要在启动服务之前,请设置CUDA_VISIBLE_DEVICES环境变量,否则不用设置。 + +- ### 第二步:发送预测请求 + + - 配置好服务端,以下数行代码即可实现发送预测请求,获取预测结果 + + - ```python + import requests + import json + + # 指定用于获取embedding的文本[[text_1], [text_2], ... ]} + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # 以key的方式指定text传入预测方法的时的参数,此例中为"data" + # 对应本地部署,则为module.get_embedding(data=text) + data = {"data": text} + # 发送post请求,content-type类型应指定json方式,url中的ip地址需改为对应机器的ip + url = "http://127.0.0.1:8866/predict/albert-chinese-small" + # 指定post请求的headers为application/json方式 + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## 五、更新历史 + +* 1.0.0 + + 初始发布 diff --git a/modules/text/language_model/albert-chinese-small/__init__.py b/modules/text/language_model/albert-chinese-small/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-chinese-small/module.py b/modules/text/language_model/albert-chinese-small/module.py new file mode 100644 index 0000000000000000000000000000000000000000..f7500c908bf404e066254e9d52135185874c7d5f --- /dev/null +++ b/modules/text/language_model/albert-chinese-small/module.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import os +from typing import Dict + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification +from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification +from paddlenlp.transformers.albert.modeling import AlbertModel +from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer + +from paddlehub.module.module import moduleinfo +from paddlehub.module.nlp_module import TransformerModule +from paddlehub.utils.log import logger + + +@moduleinfo(name="albert-chinese-small", + version="1.0.0", + summary="", + author="Baidu", + author_email="", + type="nlp/semantic_model", + meta=TransformerModule) +class Albert(nn.Layer): + """ + ALBERT model + """ + + def __init__( + self, + task: str = None, + load_checkpoint: str = None, + label_map: Dict = None, + num_classes: int = 2, + suffix: bool = False, + **kwargs, + ): + super(Albert, self).__init__() + if label_map: + self.label_map = label_map + self.num_classes = len(label_map) + else: + self.num_classes = num_classes + + if task == 'sequence_classification': + task = 'seq-cls' + logger.warning( + "current task name 'sequence_classification' was renamed to 'seq-cls', " + "'sequence_classification' has been deprecated and will be removed in the future.", ) + if task == 'seq-cls': + self.model = AlbertForSequenceClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-small', num_classes=self.num_classes, 
**kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task == 'token-cls': + self.model = AlbertForTokenClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-small', num_classes=self.num_classes, **kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], + suffix=suffix) + elif task == 'text-matching': + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-small', **kwargs) + self.dropout = paddle.nn.Dropout(0.1) + self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task is None: + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-small', **kwargs) + else: + raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported)) + + self.task = task + + if load_checkpoint is not None and os.path.isfile(load_checkpoint): + state_dict = paddle.load(load_checkpoint) + self.set_state_dict(state_dict) + logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint)) + + def forward(self, + input_ids=None, + token_type_ids=None, + position_ids=None, + attention_mask=None, + query_input_ids=None, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_input_ids=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + seq_lengths=None, + labels=None): + + if self.task != 'text-matching': + result = self.model(input_ids, token_type_ids, position_ids, attention_mask) + else: + query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask) + title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, 
title_attention_mask) + + if self.task == 'seq-cls': + logits = result + probs = F.softmax(logits, axis=1) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + elif self.task == 'token-cls': + logits = result + token_level_probs = F.softmax(logits, axis=-1) + preds = token_level_probs.argmax(axis=-1) + if labels is not None: + loss = self.criterion(logits, labels.unsqueeze(-1)) + num_infer_chunks, num_label_chunks, num_correct_chunks = \ + self.metric.compute(None, seq_lengths, preds, labels) + self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + _, _, f1_score = map(float, self.metric.accumulate()) + return token_level_probs, loss, {'f1_score': f1_score} + return token_level_probs + elif self.task == 'text-matching': + query_token_embedding, _ = query_result + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = title_result + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = 
paddle.concat([query_mean, title_mean, sub], axis=-1) + logits = self.classifier(projection) + probs = F.softmax(logits) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + else: + sequence_output, pooled_output = result + return sequence_output, pooled_output + + @staticmethod + def get_tokenizer(*args, **kwargs): + """ + Gets the tokenizer that is customized for this module. + """ + return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-small', *args, **kwargs) diff --git a/modules/text/language_model/albert-chinese-tiny/README.md b/modules/text/language_model/albert-chinese-tiny/README.md new file mode 100644 index 0000000000000000000000000000000000000000..08c8f9f8c3d365e1e1f43ddef858b7f8c97866b9 --- /dev/null +++ b/modules/text/language_model/albert-chinese-tiny/README.md @@ -0,0 +1,173 @@ +# albert-chinese-tiny +|模型名称|albert-chinese-tiny| +| :--- | :---: | +|类别|文本-语义模型| +|网络|albert-chinese-tiny| +|数据集|-| +|是否支持Fine-tuning|是| +|模型大小|40MB| +|最新更新日期|2022-02-08| +|数据指标|-| + +## 一、模型基本信息 + +- ### 模型介绍 + + - ALBERT针对当前预训练模型参数量过大的问题,提出了以下改进方案: + + - 嵌入向量参数化的因式分解。ALBERT对词嵌入参数进行了因式分解,先将单词映射到一个低维的词嵌入空间E,然后再将其映射到高维的隐藏空间H。 + + - 跨层参数共享。ALBERT共享了层之间的全部参数。 + +更多详情请参考[ALBERT论文](https://arxiv.org/abs/1909.11942) + +## 二、安装 + +- ### 1、环境依赖 + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [如何安装PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2、安装 + + - ```shell + $ hub install albert-chinese-tiny + ``` + - 如您安装时遇到问题,可参考:[零基础windows安装](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [零基础Linux安装](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [零基础MacOS安装](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## 三、模型API预测 + +- ### 1、预测代码示例 + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + 
['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-chinese-tiny', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +详情可参考PaddleHub示例: +- [文本分类](../../../../demo/text_classification) +- [序列标注](../../../../demo/sequence_labeling) + +- ### 2、API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - 创建Module对象(动态图组网版本) + + - **参数** + + - `task`: 任务名称,可为`seq-cls`(文本分类任务)或`token-cls`(序列标注任务)。 + - `load_checkpoint`:使用PaddleHub Fine-tune api训练保存的模型参数文件路径。 + - `label_map`:预测时的类别映射表。 + - `num_classes`:分类任务的类别数,如果指定了`label_map`,此参数可不传,默认2分类。 + - `suffix`: 序列标注任务的标签格式,如果设定为`True`,标签以'-B', '-I', '-E' 或者 '-S'为结尾,此参数默认为`False`。 + - `**kwargs`:用户额外指定的关键字字典类型的参数。 + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **参数** + + - `data`: 待预测数据,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。每个样例文本数量(1个或者2个)需和训练时保持一致。 + - `max_seq_len`:模型处理文本的最大长度 + - `batch_size`:模型批处理大小 + - `use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,不同任务类型的返回结果如下 + - 文本分类:列表里包含每个句子的预测标签,格式为\[label\_1, label\_2, …,\] + - 序列标注:列表里包含每个句子每个token的预测标签,格式为\[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - 用于获取输入文本的句子粒度特征与字粒度特征 + + - **参数** + + - `data`:输入文本列表,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。 + - 
`use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,格式为\[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\],其中每个元素都是对应样例的特征输出,每个样例都有句子粒度特征pooled\_feature与字粒度特征seq\_feature。 + +## 四、服务部署 + +- PaddleHub Serving可以部署一个在线获取预训练词向量。 + +- ### 第一步:启动PaddleHub Serving + + - ```shell + $ hub serving start -m albert-chinese-tiny + ``` + + - 这样就完成了一个获取预训练词向量服务化API的部署,默认端口号为8866。 + + - **NOTE:** 如使用GPU预测,则需要在启动服务之前,请设置CUDA_VISIBLE_DEVICES环境变量,否则不用设置。 + +- ### 第二步:发送预测请求 + + - 配置好服务端,以下数行代码即可实现发送预测请求,获取预测结果 + + - ```python + import requests + import json + + # 指定用于获取embedding的文本[[text_1], [text_2], ... ]} + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # 以key的方式指定text传入预测方法的时的参数,此例中为"data" + # 对应本地部署,则为module.get_embedding(data=text) + data = {"data": text} + # 发送post请求,content-type类型应指定json方式,url中的ip地址需改为对应机器的ip + url = "http://127.0.0.1:8866/predict/albert-chinese-tiny" + # 指定post请求的headers为application/json方式 + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## 五、更新历史 + +* 1.0.0 + + 初始发布 diff --git a/modules/text/language_model/albert-chinese-tiny/__init__.py b/modules/text/language_model/albert-chinese-tiny/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-chinese-tiny/module.py b/modules/text/language_model/albert-chinese-tiny/module.py new file mode 100644 index 0000000000000000000000000000000000000000..5b5b79ba2b78b7f77a6578a3855f26ebd49aebcf --- /dev/null +++ b/modules/text/language_model/albert-chinese-tiny/module.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import os +from typing import Dict + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F +from paddlenlp.metrics import ChunkEvaluator +from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification +from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification +from paddlenlp.transformers.albert.modeling import AlbertModel +from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer + +from paddlehub.module.module import moduleinfo +from paddlehub.module.nlp_module import TransformerModule +from paddlehub.utils.log import logger + + +@moduleinfo(name="albert-chinese-tiny", + version="1.0.0", + summary="", + author="Baidu", + author_email="", + type="nlp/semantic_model", + meta=TransformerModule) +class Albert(nn.Layer): + """ + ALBERT model + """ + + def __init__( + self, + task: str = None, + load_checkpoint: str = None, + label_map: Dict = None, + num_classes: int = 2, + suffix: bool = False, + **kwargs, + ): + super(Albert, self).__init__() + if label_map: + self.label_map = label_map + self.num_classes = len(label_map) + else: + self.num_classes = num_classes + + if task == 'sequence_classification': + task = 'seq-cls' + logger.warning( + "current task name 'sequence_classification' was renamed to 'seq-cls', " + "'sequence_classification' has been deprecated and will be removed in the future.", ) + if task == 'seq-cls': + self.model = AlbertForSequenceClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-tiny', num_classes=self.num_classes, 
**kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task == 'token-cls': + self.model = AlbertForTokenClassification.from_pretrained( + pretrained_model_name_or_path='albert-chinese-tiny', num_classes=self.num_classes, **kwargs) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], + suffix=suffix) + elif task == 'text-matching': + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-tiny', **kwargs) + self.dropout = paddle.nn.Dropout(0.1) + self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2) + self.criterion = paddle.nn.loss.CrossEntropyLoss() + self.metric = paddle.metric.Accuracy() + elif task is None: + self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-tiny', **kwargs) + else: + raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported)) + + self.task = task + + if load_checkpoint is not None and os.path.isfile(load_checkpoint): + state_dict = paddle.load(load_checkpoint) + self.set_state_dict(state_dict) + logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint)) + + def forward(self, + input_ids=None, + token_type_ids=None, + position_ids=None, + attention_mask=None, + query_input_ids=None, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_input_ids=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + seq_lengths=None, + labels=None): + + if self.task != 'text-matching': + result = self.model(input_ids, token_type_ids, position_ids, attention_mask) + else: + query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask) + title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask) 
+ + if self.task == 'seq-cls': + logits = result + probs = F.softmax(logits, axis=1) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + elif self.task == 'token-cls': + logits = result + token_level_probs = F.softmax(logits, axis=-1) + preds = token_level_probs.argmax(axis=-1) + if labels is not None: + loss = self.criterion(logits, labels.unsqueeze(-1)) + num_infer_chunks, num_label_chunks, num_correct_chunks = \ + self.metric.compute(None, seq_lengths, preds, labels) + self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy()) + _, _, f1_score = map(float, self.metric.accumulate()) + return token_level_probs, loss, {'f1_score': f1_score} + return token_level_probs + elif self.task == 'text-matching': + query_token_embedding, _ = query_result + query_token_embedding = self.dropout(query_token_embedding) + query_attention_mask = paddle.unsqueeze( + (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + query_token_embedding = query_token_embedding * query_attention_mask + query_sum_embedding = paddle.sum(query_token_embedding, axis=1) + query_sum_mask = paddle.sum(query_attention_mask, axis=1) + query_mean = query_sum_embedding / query_sum_mask + + title_token_embedding, _ = title_result + title_token_embedding = self.dropout(title_token_embedding) + title_attention_mask = paddle.unsqueeze( + (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2) + title_token_embedding = title_token_embedding * title_attention_mask + title_sum_embedding = paddle.sum(title_token_embedding, axis=1) + title_sum_mask = paddle.sum(title_attention_mask, axis=1) + title_mean = title_sum_embedding / title_sum_mask + + sub = paddle.abs(paddle.subtract(query_mean, title_mean)) + projection = 
paddle.concat([query_mean, title_mean, sub], axis=-1) + logits = self.classifier(projection) + probs = F.softmax(logits) + if labels is not None: + loss = self.criterion(logits, labels) + correct = self.metric.compute(probs, labels) + acc = self.metric.update(correct) + return probs, loss, {'acc': acc} + return probs + else: + sequence_output, pooled_output = result + return sequence_output, pooled_output + + @staticmethod + def get_tokenizer(*args, **kwargs): + """ + Gets the tokenizer that is customized for this module. + """ + return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-tiny', *args, **kwargs) diff --git a/modules/text/language_model/albert-chinese-xlarge/README.md b/modules/text/language_model/albert-chinese-xlarge/README.md new file mode 100644 index 0000000000000000000000000000000000000000..ee1c99456d6f15e7c0000a4c76f3954cd3948639 --- /dev/null +++ b/modules/text/language_model/albert-chinese-xlarge/README.md @@ -0,0 +1,173 @@ +# albert-chinese-xlarge +|模型名称|albert-chinese-xlarge| +| :--- | :---: | +|类别|文本-语义模型| +|网络|albert-chinese-xlarge| +|数据集|-| +|是否支持Fine-tuning|是| +|模型大小|346MB| +|最新更新日期|2022-02-08| +|数据指标|-| + +## 一、模型基本信息 + +- ### 模型介绍 + + - ALBERT针对当前预训练模型参数量过大的问题,提出了以下改进方案: + + - 嵌入向量参数化的因式分解。ALBERT对词嵌入参数进行了因式分解,先将单词映射到一个低维的词嵌入空间E,然后再将其映射到高维的隐藏空间H。 + + - 跨层参数共享。ALBERT共享了层之间的全部参数。 + +更多详情请参考[ALBERT论文](https://arxiv.org/abs/1909.11942) + +## 二、安装 + +- ### 1、环境依赖 + + - paddlepaddle >= 2.2.0 + + - paddlehub >= 2.2.0 | [如何安装PaddleHub](../../../../docs/docs_ch/get_start/installation.rst) + +- ### 2、安装 + + - ```shell + $ hub install albert-chinese-xlarge + ``` + - 如您安装时遇到问题,可参考:[零基础windows安装](../../../../docs/docs_ch/get_start/windows_quickstart.md) + | [零基础Linux安装](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [零基础MacOS安装](../../../../docs/docs_ch/get_start/mac_quickstart.md) + +## 三、模型API预测 + +- ### 1、预测代码示例 + +```python +import paddlehub as hub + +data = [ + ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'], + 
['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'], + ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'], +] +label_map = {0: 'negative', 1: 'positive'} + +model = hub.Module( + name='albert-chinese-xlarge', + version='1.0.0', + task='seq-cls', + load_checkpoint='/path/to/parameters', + label_map=label_map) +results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False) +for idx, text in enumerate(data): + print('Data: {} \t Label: {}'.format(text, results[idx])) +``` + +详情可参考PaddleHub示例: +- [文本分类](../../../../demo/text_classification) +- [序列标注](../../../../demo/sequence_labeling) + +- ### 2、API + + - ```python + def __init__( + task=None, + load_checkpoint=None, + label_map=None, + num_classes=2, + suffix=False, + **kwargs, + ) + ``` + + - 创建Module对象(动态图组网版本) + + - **参数** + + - `task`: 任务名称,可为`seq-cls`(文本分类任务)或`token-cls`(序列标注任务)。 + - `load_checkpoint`:使用PaddleHub Fine-tune api训练保存的模型参数文件路径。 + - `label_map`:预测时的类别映射表。 + - `num_classes`:分类任务的类别数,如果指定了`label_map`,此参数可不传,默认2分类。 + - `suffix`: 序列标注任务的标签格式,如果设定为`True`,标签以'-B', '-I', '-E' 或者 '-S'为结尾,此参数默认为`False`。 + - `**kwargs`:用户额外指定的关键字字典类型的参数。 + + - ```python + def predict( + data, + max_seq_len=128, + batch_size=1, + use_gpu=False + ) + ``` + + - **参数** + + - `data`: 待预测数据,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。每个样例文本数量(1个或者2个)需和训练时保持一致。 + - `max_seq_len`:模型处理文本的最大长度 + - `batch_size`:模型批处理大小 + - `use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,不同任务类型的返回结果如下 + - 文本分类:列表里包含每个句子的预测标签,格式为\[label\_1, label\_2, …,\] + - 序列标注:列表里包含每个句子每个token的预测标签,格式为\[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\] + + - ```python + def get_embedding( + data, + use_gpu=False + ) + ``` + + - 用于获取输入文本的句子粒度特征与字粒度特征 + + - **参数** + + - `data`:输入文本列表,格式为\[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\],其中每个元素都是一个样例,每个样例可以包含text\_a与text\_b。 + - 
`use_gpu`:是否使用gpu,默认为False。对于GPU用户,建议开启use_gpu。 + + - **返回** + + - `results`:list类型,格式为\[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\],其中每个元素都是对应样例的特征输出,每个样例都有句子粒度特征pooled\_feature与字粒度特征seq\_feature。 + +## 四、服务部署 + +- PaddleHub Serving可以部署一个在线获取预训练词向量。 + +- ### 第一步:启动PaddleHub Serving + + - ```shell + $ hub serving start -m albert-chinese-xlarge + ``` + + - 这样就完成了一个获取预训练词向量服务化API的部署,默认端口号为8866。 + + - **NOTE:** 如使用GPU预测,则需要在启动服务之前,请设置CUDA_VISIBLE_DEVICES环境变量,否则不用设置。 + +- ### 第二步:发送预测请求 + + - 配置好服务端,以下数行代码即可实现发送预测请求,获取预测结果 + + - ```python + import requests + import json + + # 指定用于获取embedding的文本[[text_1], [text_2], ... ]} + text = [["今天是个好日子"], ["天气预报说今天要下雨"]] + # 以key的方式指定text传入预测方法的时的参数,此例中为"data" + # 对应本地部署,则为module.get_embedding(data=text) + data = {"data": text} + # 发送post请求,content-type类型应指定json方式,url中的ip地址需改为对应机器的ip + url = "http://127.0.0.1:8866/predict/albert-chinese-xlarge" + # 指定post请求的headers为application/json方式 + headers = {"Content-Type": "application/json"} + + r = requests.post(url=url, headers=headers, data=json.dumps(data)) + print(r.json()) + ``` + +## 五、更新历史 + +* 1.0.0 + + 初始发布 diff --git a/modules/text/language_model/albert-chinese-xlarge/__init__.py b/modules/text/language_model/albert-chinese-xlarge/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/modules/text/language_model/albert-chinese-xlarge/module.py b/modules/text/language_model/albert-chinese-xlarge/module.py new file mode 100644 index 0000000000000000000000000000000000000000..5e76ee63f8a89f717a9ecae2d338e83aa2214f4b --- /dev/null +++ b/modules/text/language_model/albert-chinese-xlarge/module.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+import os
+from typing import Dict
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from paddlenlp.metrics import ChunkEvaluator
+from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification
+from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification
+from paddlenlp.transformers.albert.modeling import AlbertModel
+from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer
+
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import TransformerModule
+from paddlehub.utils.log import logger
+
+
+@moduleinfo(name="albert-chinese-xlarge",
+            version="1.0.0",
+            summary="",
+            author="Baidu",
+            author_email="",
+            type="nlp/semantic_model",
+            meta=TransformerModule)
+class Albert(nn.Layer):
+    """
+    ALBERT model
+    """
+
+    def __init__(
+        self,
+        task: str = None,
+        load_checkpoint: str = None,
+        label_map: Dict = None,
+        num_classes: int = 2,
+        suffix: bool = False,
+        **kwargs,
+    ):
+        super(Albert, self).__init__()
+        if label_map:
+            self.label_map = label_map
+            self.num_classes = len(label_map)
+        else:
+            self.num_classes = num_classes
+
+        if task == 'sequence_classification':
+            task = 'seq-cls'
+            logger.warning(
+                "current task name 'sequence_classification' was renamed to 'seq-cls', "
+                "'sequence_classification' has been deprecated and will be removed in the future.", )
+        if task == 'seq-cls':
+            self.model = AlbertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-chinese-xlarge', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task == 'token-cls':
+            self.model = AlbertForTokenClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-chinese-xlarge', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())],
+                                         suffix=suffix)
+        elif task == 'text-matching':
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-xlarge', **kwargs)
+            self.dropout = paddle.nn.Dropout(0.1)
+            self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task is None:
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-xlarge', **kwargs)
+        else:
+            raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported))
+
+        self.task = task
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self,
+                input_ids=None,
+                token_type_ids=None,
+                position_ids=None,
+                attention_mask=None,
+                query_input_ids=None,
+                query_token_type_ids=None,
+                query_position_ids=None,
+                query_attention_mask=None,
+                title_input_ids=None,
+                title_token_type_ids=None,
+                title_position_ids=None,
+                title_attention_mask=None,
+                seq_lengths=None,
+                labels=None):
+
+        if self.task != 'text-matching':
+            result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        else:
+            query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask)
+            title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask)
+
+        if self.task == 'seq-cls':
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        elif self.task == 'token-cls':
+            logits = result
+            token_level_probs = F.softmax(logits, axis=-1)
+            preds = token_level_probs.argmax(axis=-1)
+            if labels is not None:
+                loss = self.criterion(logits, labels.unsqueeze(-1))
+                num_infer_chunks, num_label_chunks, num_correct_chunks = \
+                    self.metric.compute(None, seq_lengths, preds, labels)
+                self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy())
+                _, _, f1_score = map(float, self.metric.accumulate())
+                return token_level_probs, loss, {'f1_score': f1_score}
+            return token_level_probs
+        elif self.task == 'text-matching':
+            query_token_embedding, _ = query_result
+            query_token_embedding = self.dropout(query_token_embedding)
+            query_attention_mask = paddle.unsqueeze(
+                (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            query_token_embedding = query_token_embedding * query_attention_mask
+            query_sum_embedding = paddle.sum(query_token_embedding, axis=1)
+            query_sum_mask = paddle.sum(query_attention_mask, axis=1)
+            query_mean = query_sum_embedding / query_sum_mask
+
+            title_token_embedding, _ = title_result
+            title_token_embedding = self.dropout(title_token_embedding)
+            title_attention_mask = paddle.unsqueeze(
+                (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            title_token_embedding = title_token_embedding * title_attention_mask
+            title_sum_embedding = paddle.sum(title_token_embedding, axis=1)
+            title_sum_mask = paddle.sum(title_attention_mask, axis=1)
+            title_mean = title_sum_embedding / title_sum_mask
+
+            sub = paddle.abs(paddle.subtract(query_mean, title_mean))
+            projection = paddle.concat([query_mean, title_mean, sub], axis=-1)
+            logits = self.classifier(projection)
+            probs = F.softmax(logits)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    @staticmethod
+    def get_tokenizer(*args, **kwargs):
+        """
+        Gets the tokenizer that is customized for this module.
+        """
+        return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-xlarge', *args, **kwargs)
diff --git a/modules/text/language_model/albert-chinese-xxlarge/README.md b/modules/text/language_model/albert-chinese-xxlarge/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..8cd47b50ed0f2a0f91aa2918417cf9ea1c92cac2
--- /dev/null
+++ b/modules/text/language_model/albert-chinese-xxlarge/README.md
@@ -0,0 +1,173 @@
+# albert-chinese-xxlarge
+|Module Name|albert-chinese-xxlarge|
+| :--- | :---: |
+|Category|Text - Semantic Model|
+|Network|albert-chinese-xxlarge|
+|Dataset|-|
+|Fine-tuning Supported|Yes|
+|Module Size|1.3GB|
+|Latest Update Date|2022-02-08|
+|Data Indicators|-|
+
+## I. Basic Model Information
+
+- ### Model Introduction
+
+  - To address the excessive parameter count of current pre-trained models, ALBERT proposes the following improvements:
+
+    - Factorized embedding parameterization. ALBERT factorizes the word embedding parameters: it first maps each word into a low-dimensional embedding space E, and then projects it into the high-dimensional hidden space H.
+
+    - Cross-layer parameter sharing. ALBERT shares all parameters across layers.
+
+For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942)
+
+## II. Installation
+
+- ### 1. Environment Dependencies
+
+  - paddlepaddle >= 2.2.0
+
+  - paddlehub >= 2.2.0    | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
+
+- ### 2. Installation
+
+  - ```shell
+    $ hub install albert-chinese-xxlarge
+    ```
+  - If you run into problems during installation, see: [Windows quick start](../../../../docs/docs_ch/get_start/windows_quickstart.md)
+ | [Linux quick start](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quick start](../../../../docs/docs_ch/get_start/mac_quickstart.md)
+
+## III. Module API Prediction
+
+- ### 1. Prediction Code Example
+
+```python
+import paddlehub as hub
+
+data = [
+    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
+    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='albert-chinese-xxlarge',
+    version='1.0.0',
+    task='seq-cls',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+For details, see the PaddleHub demos:
+- [Text classification](../../../../demo/text_classification)
+- [Sequence labeling](../../../../demo/sequence_labeling)
+
+- ### 2. API
+
+  - ```python
+    def __init__(
+        task=None,
+        load_checkpoint=None,
+        label_map=None,
+        num_classes=2,
+        suffix=False,
+        **kwargs,
+    )
+    ```
+
+    - Creates the Module object (dynamic-graph version).
+
+    - **Parameters**
+
+      - `task`: task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling).
+      - `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+      - `label_map`: class mapping table used for prediction.
+      - `num_classes`: number of classes for the classification task; may be omitted if `label_map` is given. Defaults to 2.
+      - `suffix`: label format for the sequence labeling task. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`.
+      - `**kwargs`: additional keyword arguments supplied by the user.
+
+  - ```python
+    def predict(
+        data,
+        max_seq_len=128,
+        batch_size=1,
+        use_gpu=False
+    )
+    ```
+
+    - **Parameters**
+
+      - `data`: data to predict, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match that used in training.
+      - `max_seq_len`: maximum text length the model processes.
+      - `batch_size`: batch size.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list; the content depends on the task type:
+        - Text classification: the predicted label of each sentence, in the form \[label\_1, label\_2, …,\]
+        - Sequence labeling: the predicted label of each token in each sentence, in the form \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\]
+
+  - ```python
+    def get_embedding(
+        data,
+        use_gpu=False
+    )
+    ```
+
+    - Returns the sentence-level and token-level features of the input text.
+
+    - **Parameters**
+
+      - `data`: list of input texts, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
+
+## IV. Service Deployment
+
+- PaddleHub Serving can deploy an online service for obtaining pre-trained word embeddings.
+
+- ### Step 1: Start PaddleHub Serving
+
+  - ```shell
+    $ hub serving start -m albert-chinese-xxlarge
+    ```
+
+  - This deploys a service API for obtaining pre-trained word embeddings; the default port is 8866.
+
+  - **NOTE:** If you use GPU for prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise, no setting is needed.
+
+- ### Step 2: Send a Prediction Request
+
+  - Once the server is configured, the following few lines of code send a prediction request and fetch the result:
+
+  - ```python
+    import requests
+    import json
+
+    # Texts to embed, in the form [[text_1], [text_2], ...]
+    text = [["今天是个好日子"], ["天气预报说今天要下雨"]]
+    # The key names the parameter that receives the text in the prediction method; here it is "data"
+    # The equivalent local call is module.get_embedding(data=text)
+    data = {"data": text}
+    # Send a POST request with a JSON content type; replace the IP in the URL with that of the serving machine
+    url = "http://127.0.0.1:8866/predict/albert-chinese-xxlarge"
+    # Set the POST request headers to application/json
+    headers = {"Content-Type": "application/json"}
+
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    print(r.json())
+    ```
+
+## V. Release Notes
+
+* 1.0.0
+
+  Initial release
diff --git a/modules/text/language_model/albert-chinese-xxlarge/__init__.py b/modules/text/language_model/albert-chinese-xxlarge/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/modules/text/language_model/albert-chinese-xxlarge/module.py b/modules/text/language_model/albert-chinese-xxlarge/module.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fcc3cbf5471a0385bce25fad0d818d023466b46
--- /dev/null
+++ b/modules/text/language_model/albert-chinese-xxlarge/module.py
@@ -0,0 +1,175 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+import os
+from typing import Dict
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from paddlenlp.metrics import ChunkEvaluator
+from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification
+from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification
+from paddlenlp.transformers.albert.modeling import AlbertModel
+from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer
+
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import TransformerModule
+from paddlehub.utils.log import logger
+
+
+@moduleinfo(name="albert-chinese-xxlarge",
+            version="1.0.0",
+            summary="",
+            author="Baidu",
+            author_email="",
+            type="nlp/semantic_model",
+            meta=TransformerModule)
+class Albert(nn.Layer):
+    """
+    ALBERT model
+    """
+
+    def __init__(
+        self,
+        task: str = None,
+        load_checkpoint: str = None,
+        label_map: Dict = None,
+        num_classes: int = 2,
+        suffix: bool = False,
+        **kwargs,
+    ):
+        super(Albert, self).__init__()
+        if label_map:
+            self.label_map = label_map
+            self.num_classes = len(label_map)
+        else:
+            self.num_classes = num_classes
+
+        if task == 'sequence_classification':
+            task = 'seq-cls'
+            logger.warning(
+                "current task name 'sequence_classification' was renamed to 'seq-cls', "
+                "'sequence_classification' has been deprecated and will be removed in the future.", )
+        if task == 'seq-cls':
+            self.model = AlbertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-chinese-xxlarge', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task == 'token-cls':
+            self.model = AlbertForTokenClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-chinese-xxlarge', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())],
+                                         suffix=suffix)
+        elif task == 'text-matching':
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-xxlarge', **kwargs)
+            self.dropout = paddle.nn.Dropout(0.1)
+            self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task is None:
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-chinese-xxlarge', **kwargs)
+        else:
+            raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported))
+
+        self.task = task
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self,
+                input_ids=None,
+                token_type_ids=None,
+                position_ids=None,
+                attention_mask=None,
+                query_input_ids=None,
+                query_token_type_ids=None,
+                query_position_ids=None,
+                query_attention_mask=None,
+                title_input_ids=None,
+                title_token_type_ids=None,
+                title_position_ids=None,
+                title_attention_mask=None,
+                seq_lengths=None,
+                labels=None):
+
+        if self.task != 'text-matching':
+            result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        else:
+            query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask)
+            title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask)
+
+        if self.task == 'seq-cls':
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        elif self.task == 'token-cls':
+            logits = result
+            token_level_probs = F.softmax(logits, axis=-1)
+            preds = token_level_probs.argmax(axis=-1)
+            if labels is not None:
+                loss = self.criterion(logits, labels.unsqueeze(-1))
+                num_infer_chunks, num_label_chunks, num_correct_chunks = \
+                    self.metric.compute(None, seq_lengths, preds, labels)
+                self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy())
+                _, _, f1_score = map(float, self.metric.accumulate())
+                return token_level_probs, loss, {'f1_score': f1_score}
+            return token_level_probs
+        elif self.task == 'text-matching':
+            query_token_embedding, _ = query_result
+            query_token_embedding = self.dropout(query_token_embedding)
+            query_attention_mask = paddle.unsqueeze(
+                (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            query_token_embedding = query_token_embedding * query_attention_mask
+            query_sum_embedding = paddle.sum(query_token_embedding, axis=1)
+            query_sum_mask = paddle.sum(query_attention_mask, axis=1)
+            query_mean = query_sum_embedding / query_sum_mask
+
+            title_token_embedding, _ = title_result
+            title_token_embedding = self.dropout(title_token_embedding)
+            title_attention_mask = paddle.unsqueeze(
+                (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            title_token_embedding = title_token_embedding * title_attention_mask
+            title_sum_embedding = paddle.sum(title_token_embedding, axis=1)
+            title_sum_mask = paddle.sum(title_attention_mask, axis=1)
+            title_mean = title_sum_embedding / title_sum_mask
+
+            sub = paddle.abs(paddle.subtract(query_mean, title_mean))
+            projection = paddle.concat([query_mean, title_mean, sub], axis=-1)
+            logits = self.classifier(projection)
+            probs = F.softmax(logits)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    @staticmethod
+    def get_tokenizer(*args, **kwargs):
+        """
+        Gets the tokenizer that is customized for this module.
+        """
+        return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-chinese-xxlarge', *args, **kwargs)
diff --git a/modules/text/language_model/albert-xxlarge-v1/README.md b/modules/text/language_model/albert-xxlarge-v1/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1b001f48dc6d9a77f77cb5630f3f0bb2b2a217c1
--- /dev/null
+++ b/modules/text/language_model/albert-xxlarge-v1/README.md
@@ -0,0 +1,173 @@
+# albert-xxlarge-v1
+|Module Name|albert-xxlarge-v1|
+| :--- | :---: |
+|Category|Text - Semantic Model|
+|Network|albert-xxlarge-v1|
+|Dataset|-|
+|Fine-tuning Supported|Yes|
+|Module Size|1.3GB|
+|Latest Update Date|2022-02-08|
+|Data Indicators|-|
+
+## I. Basic Model Information
+
+- ### Model Introduction
+
+  - To address the excessive parameter count of current pre-trained models, ALBERT proposes the following improvements:
+
+    - Factorized embedding parameterization. ALBERT factorizes the word embedding parameters: it first maps each word into a low-dimensional embedding space E, and then projects it into the high-dimensional hidden space H.
+
+    - Cross-layer parameter sharing. ALBERT shares all parameters across layers.
+
+For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942)
+
+## II. Installation
+
+- ### 1. Environment Dependencies
+
+  - paddlepaddle >= 2.2.0
+
+  - paddlehub >= 2.2.0    | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
+
+- ### 2. Installation
+
+  - ```shell
+    $ hub install albert-xxlarge-v1
+    ```
+  - If you run into problems during installation, see: [Windows quick start](../../../../docs/docs_ch/get_start/windows_quickstart.md)
+ | [Linux quick start](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quick start](../../../../docs/docs_ch/get_start/mac_quickstart.md)
+
+## III. Module API Prediction
+
+- ### 1. Prediction Code Example
+
+```python
+import paddlehub as hub
+
+data = [
+    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
+    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='albert-xxlarge-v1',
+    version='1.0.0',
+    task='seq-cls',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+For details, see the PaddleHub demos:
+- [Text classification](../../../../demo/text_classification)
+- [Sequence labeling](../../../../demo/sequence_labeling)
+
+- ### 2. API
+
+  - ```python
+    def __init__(
+        task=None,
+        load_checkpoint=None,
+        label_map=None,
+        num_classes=2,
+        suffix=False,
+        **kwargs,
+    )
+    ```
+
+    - Creates the Module object (dynamic-graph version).
+
+    - **Parameters**
+
+      - `task`: task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling).
+      - `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+      - `label_map`: class mapping table used for prediction.
+      - `num_classes`: number of classes for the classification task; may be omitted if `label_map` is given. Defaults to 2.
+      - `suffix`: label format for the sequence labeling task. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`.
+      - `**kwargs`: additional keyword arguments supplied by the user.
+
+  - ```python
+    def predict(
+        data,
+        max_seq_len=128,
+        batch_size=1,
+        use_gpu=False
+    )
+    ```
+
+    - **Parameters**
+
+      - `data`: data to predict, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match that used in training.
+      - `max_seq_len`: maximum text length the model processes.
+      - `batch_size`: batch size.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list; the content depends on the task type:
+        - Text classification: the predicted label of each sentence, in the form \[label\_1, label\_2, …,\]
+        - Sequence labeling: the predicted label of each token in each sentence, in the form \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\]
+
+  - ```python
+    def get_embedding(
+        data,
+        use_gpu=False
+    )
+    ```
+
+    - Returns the sentence-level and token-level features of the input text.
+
+    - **Parameters**
+
+      - `data`: list of input texts, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
+
+## IV. Service Deployment
+
+- PaddleHub Serving can deploy an online service for obtaining pre-trained word embeddings.
+
+- ### Step 1: Start PaddleHub Serving
+
+  - ```shell
+    $ hub serving start -m albert-xxlarge-v1
+    ```
+
+  - This deploys a service API for obtaining pre-trained word embeddings; the default port is 8866.
+
+  - **NOTE:** If you use GPU for prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise, no setting is needed.
+
+- ### Step 2: Send a Prediction Request
+
+  - Once the server is configured, the following few lines of code send a prediction request and fetch the result:
+
+  - ```python
+    import requests
+    import json
+
+    # Texts to embed, in the form [[text_1], [text_2], ...]
+    text = [["今天是个好日子"], ["天气预报说今天要下雨"]]
+    # The key names the parameter that receives the text in the prediction method; here it is "data"
+    # The equivalent local call is module.get_embedding(data=text)
+    data = {"data": text}
+    # Send a POST request with a JSON content type; replace the IP in the URL with that of the serving machine
+    url = "http://127.0.0.1:8866/predict/albert-xxlarge-v1"
+    # Set the POST request headers to application/json
+    headers = {"Content-Type": "application/json"}
+
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    print(r.json())
+    ```
+
+## V. Release Notes
+
+* 1.0.0
+
+  Initial release
diff --git a/modules/text/language_model/albert-xxlarge-v1/__init__.py b/modules/text/language_model/albert-xxlarge-v1/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/modules/text/language_model/albert-xxlarge-v1/module.py b/modules/text/language_model/albert-xxlarge-v1/module.py
new file mode 100644
index 0000000000000000000000000000000000000000..a99c06ee25b5670622a4efeb0e7bc49588fdee5b
--- /dev/null
+++ b/modules/text/language_model/albert-xxlarge-v1/module.py
@@ -0,0 +1,176 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+import os
+from typing import Dict
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from paddlenlp.metrics import ChunkEvaluator
+from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification
+from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification
+from paddlenlp.transformers.albert.modeling import AlbertModel
+from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer
+
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import TransformerModule
+from paddlehub.utils.log import logger
+
+
+@moduleinfo(name="albert-xxlarge-v1",
+            version="1.0.0",
+            summary="",
+            author="Baidu",
+            author_email="",
+            type="nlp/semantic_model",
+            meta=TransformerModule)
+class Albert(nn.Layer):
+    """
+    ALBERT model
+    """
+
+    def __init__(
+        self,
+        task: str = None,
+        load_checkpoint: str = None,
+        label_map: Dict = None,
+        num_classes: int = 2,
+        suffix: bool = False,
+        **kwargs,
+    ):
+        super(Albert, self).__init__()
+        if label_map:
+            self.label_map = label_map
+            self.num_classes = len(label_map)
+        else:
+            self.num_classes = num_classes
+
+        if task == 'sequence_classification':
+            task = 'seq-cls'
+            logger.warning(
+                "current task name 'sequence_classification' was renamed to 'seq-cls', "
+                "'sequence_classification' has been deprecated and will be removed in the future.", )
+        if task == 'seq-cls':
+            self.model = AlbertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-xxlarge-v1', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task == 'token-cls':
+            self.model = AlbertForTokenClassification.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v1',
+                                                                      num_classes=self.num_classes,
+                                                                      **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())],
+                                         suffix=suffix)
+        elif task == 'text-matching':
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v1', **kwargs)
+            self.dropout = paddle.nn.Dropout(0.1)
+            self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task is None:
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v1', **kwargs)
+        else:
+            raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported))
+
+        self.task = task
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self,
+                input_ids=None,
+                token_type_ids=None,
+                position_ids=None,
+                attention_mask=None,
+                query_input_ids=None,
+                query_token_type_ids=None,
+                query_position_ids=None,
+                query_attention_mask=None,
+                title_input_ids=None,
+                title_token_type_ids=None,
+                title_position_ids=None,
+                title_attention_mask=None,
+                seq_lengths=None,
+                labels=None):
+
+        if self.task != 'text-matching':
+            result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        else:
+            query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask)
+            title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask)
+
+        if self.task == 'seq-cls':
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        elif self.task == 'token-cls':
+            logits = result
+            token_level_probs = F.softmax(logits, axis=-1)
+            preds = token_level_probs.argmax(axis=-1)
+            if labels is not None:
+                loss = self.criterion(logits, labels.unsqueeze(-1))
+                num_infer_chunks, num_label_chunks, num_correct_chunks = \
+                    self.metric.compute(None, seq_lengths, preds, labels)
+                self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy())
+                _, _, f1_score = map(float, self.metric.accumulate())
+                return token_level_probs, loss, {'f1_score': f1_score}
+            return token_level_probs
+        elif self.task == 'text-matching':
+            query_token_embedding, _ = query_result
+            query_token_embedding = self.dropout(query_token_embedding)
+            query_attention_mask = paddle.unsqueeze(
+                (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            query_token_embedding = query_token_embedding * query_attention_mask
+            query_sum_embedding = paddle.sum(query_token_embedding, axis=1)
+            query_sum_mask = paddle.sum(query_attention_mask, axis=1)
+            query_mean = query_sum_embedding / query_sum_mask
+
+            title_token_embedding, _ = title_result
+            title_token_embedding = self.dropout(title_token_embedding)
+            title_attention_mask = paddle.unsqueeze(
+                (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            title_token_embedding = title_token_embedding * title_attention_mask
+            title_sum_embedding = paddle.sum(title_token_embedding, axis=1)
+            title_sum_mask = paddle.sum(title_attention_mask, axis=1)
+            title_mean = title_sum_embedding / title_sum_mask
+
+            sub = paddle.abs(paddle.subtract(query_mean, title_mean))
+            projection = paddle.concat([query_mean, title_mean, sub], axis=-1)
+            logits = self.classifier(projection)
+            probs = F.softmax(logits)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    @staticmethod
+    def get_tokenizer(*args, **kwargs):
+        """
+        Gets the tokenizer that is customized for this module.
+        """
+        return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v1', *args, **kwargs)
diff --git a/modules/text/language_model/albert-xxlarge-v2/README.md b/modules/text/language_model/albert-xxlarge-v2/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..af477225b0a71f4d9726a7694ad922ff18a38771
--- /dev/null
+++ b/modules/text/language_model/albert-xxlarge-v2/README.md
@@ -0,0 +1,173 @@
+# albert-xxlarge-v2
+|Module Name|albert-xxlarge-v2|
+| :--- | :---: |
+|Category|Text - Semantic Model|
+|Network|albert-xxlarge-v2|
+|Dataset|-|
+|Fine-tuning Supported|Yes|
+|Module Size|1.3GB|
+|Latest Update Date|2022-02-08|
+|Data Indicators|-|
+
+## I. Basic Model Information
+
+- ### Model Introduction
+
+  - To address the excessive parameter count of current pre-trained models, ALBERT proposes the following improvements:
+
+    - Factorized embedding parameterization. ALBERT factorizes the word embedding parameters: it first maps each word into a low-dimensional embedding space E, and then projects it into the high-dimensional hidden space H.
+
+    - Cross-layer parameter sharing. ALBERT shares all parameters across layers.
+
+For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942)
+
+## II. Installation
+
+- ### 1. Environment Dependencies
+
+  - paddlepaddle >= 2.2.0
+
+  - paddlehub >= 2.2.0    | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)
+
+- ### 2. Installation
+
+  - ```shell
+    $ hub install albert-xxlarge-v2
+    ```
+  - If you run into problems during installation, see: [Windows quick start](../../../../docs/docs_ch/get_start/windows_quickstart.md)
+ | [Linux quick start](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quick start](../../../../docs/docs_ch/get_start/mac_quickstart.md)
+
+## III. Module API Prediction
+
+- ### 1. Prediction Code Example
+
+```python
+import paddlehub as hub
+
+data = [
+    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
+    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
+    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
+]
+label_map = {0: 'negative', 1: 'positive'}
+
+model = hub.Module(
+    name='albert-xxlarge-v2',
+    version='1.0.0',
+    task='seq-cls',
+    load_checkpoint='/path/to/parameters',
+    label_map=label_map)
+results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
+for idx, text in enumerate(data):
+    print('Data: {} \t Label: {}'.format(text, results[idx]))
+```
+
+For details, see the PaddleHub demos:
+- [Text classification](../../../../demo/text_classification)
+- [Sequence labeling](../../../../demo/sequence_labeling)
+
+- ### 2. API
+
+  - ```python
+    def __init__(
+        task=None,
+        load_checkpoint=None,
+        label_map=None,
+        num_classes=2,
+        suffix=False,
+        **kwargs,
+    )
+    ```
+
+    - Creates the Module object (dynamic-graph version).
+
+    - **Parameters**
+
+      - `task`: task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling).
+      - `load_checkpoint`: path to the model parameter file saved by training with the PaddleHub Fine-tune API.
+      - `label_map`: class mapping table used for prediction.
+      - `num_classes`: number of classes for the classification task; may be omitted if `label_map` is given. Defaults to 2.
+      - `suffix`: label format for the sequence labeling task. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`.
+      - `**kwargs`: additional keyword arguments supplied by the user.
+
+  - ```python
+    def predict(
+        data,
+        max_seq_len=128,
+        batch_size=1,
+        use_gpu=False
+    )
+    ```
+
+    - **Parameters**
+
+      - `data`: data to predict, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b. The number of texts per sample (1 or 2) must match that used in training.
+      - `max_seq_len`: maximum text length the model processes.
+      - `batch_size`: batch size.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list; the content depends on the task type:
+        - Text classification: the predicted label of each sentence, in the form \[label\_1, label\_2, …,\]
+        - Sequence labeling: the predicted label of each token in each sentence, in the form \[\[token\_1, token\_2, …,\], \[token\_1, token\_2, …,\], …,\]
+
+  - ```python
+    def get_embedding(
+        data,
+        use_gpu=False
+    )
+    ```
+
+    - Returns the sentence-level and token-level features of the input text.
+
+    - **Parameters**
+
+      - `data`: list of input texts, in the form \[\[sample\_a\_text\_a, sample\_a\_text\_b\], \[sample\_b\_text\_a, sample\_b\_text\_b\],…,\], where each element is one sample and each sample may contain text\_a and text\_b.
+      - `use_gpu`: whether to use GPU; defaults to False. GPU users are advised to enable use_gpu.
+
+    - **Returns**
+
+      - `results`: a list in the form \[\[sample\_a\_pooled\_feature, sample\_a\_seq\_feature\], \[sample\_b\_pooled\_feature, sample\_b\_seq\_feature\],…,\], where each element is the feature output of the corresponding sample; each sample has a sentence-level feature pooled\_feature and a token-level feature seq\_feature.
+
+## IV. Service Deployment
+
+- PaddleHub Serving can deploy an online service for obtaining pre-trained word embeddings.
+
+- ### Step 1: Start PaddleHub Serving
+
+  - ```shell
+    $ hub serving start -m albert-xxlarge-v2
+    ```
+
+  - This deploys a service API for obtaining pre-trained word embeddings; the default port is 8866.
+
+  - **NOTE:** If you use GPU for prediction, set the CUDA_VISIBLE_DEVICES environment variable before starting the service; otherwise, no setting is needed.
+
+- ### Step 2: Send a Prediction Request
+
+  - Once the server is configured, the following few lines of code send a prediction request and fetch the result:
+
+  - ```python
+    import requests
+    import json
+
+    # Texts to embed, in the form [[text_1], [text_2], ...]
+    text = [["今天是个好日子"], ["天气预报说今天要下雨"]]
+    # The key names the parameter that receives the text in the prediction method; here it is "data"
+    # The equivalent local call is module.get_embedding(data=text)
+    data = {"data": text}
+    # Send a POST request with a JSON content type; replace the IP in the URL with that of the serving machine
+    url = "http://127.0.0.1:8866/predict/albert-xxlarge-v2"
+    # Set the POST request headers to application/json
+    headers = {"Content-Type": "application/json"}
+
+    r = requests.post(url=url, headers=headers, data=json.dumps(data))
+    print(r.json())
+    ```
+
+## V. Release Notes
+
+* 1.0.0
+
+  Initial release
diff --git a/modules/text/language_model/albert-xxlarge-v2/__init__.py b/modules/text/language_model/albert-xxlarge-v2/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/modules/text/language_model/albert-xxlarge-v2/module.py b/modules/text/language_model/albert-xxlarge-v2/module.py
new file mode 100644
index 0000000000000000000000000000000000000000..091d1842c949212ab2f4a2ae607529162e6e60f4
--- /dev/null
+++ b/modules/text/language_model/albert-xxlarge-v2/module.py
@@ -0,0 +1,176 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+import os
+from typing import Dict
+
+import paddle
+import paddle.nn as nn
+import paddle.nn.functional as F
+from paddlenlp.metrics import ChunkEvaluator
+from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification
+from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification
+from paddlenlp.transformers.albert.modeling import AlbertModel
+from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer
+
+from paddlehub.module.module import moduleinfo
+from paddlehub.module.nlp_module import TransformerModule
+from paddlehub.utils.log import logger
+
+
+@moduleinfo(name="albert-xxlarge-v2",
+            version="1.0.0",
+            summary="",
+            author="Baidu",
+            author_email="",
+            type="nlp/semantic_model",
+            meta=TransformerModule)
+class Albert(nn.Layer):
+    """
+    ALBERT model
+    """
+
+    def __init__(
+        self,
+        task: str = None,
+        load_checkpoint: str = None,
+        label_map: Dict = None,
+        num_classes: int = 2,
+        suffix: bool = False,
+        **kwargs,
+    ):
+        super(Albert, self).__init__()
+        if label_map:
+            self.label_map = label_map
+            self.num_classes = len(label_map)
+        else:
+            self.num_classes = num_classes
+
+        if task == 'sequence_classification':
+            task = 'seq-cls'
+            logger.warning(
+                "current task name 'sequence_classification' was renamed to 'seq-cls', "
+                "'sequence_classification' has been deprecated and will be removed in the future.", )
+        if task == 'seq-cls':
+            self.model = AlbertForSequenceClassification.from_pretrained(
+                pretrained_model_name_or_path='albert-xxlarge-v2', num_classes=self.num_classes, **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task == 'token-cls':
+            self.model = AlbertForTokenClassification.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v2',
+                                                                      num_classes=self.num_classes,
+                                                                      **kwargs)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = ChunkEvaluator(label_list=[self.label_map[i] for i in sorted(self.label_map.keys())],
+                                         suffix=suffix)
+        elif task == 'text-matching':
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v2', **kwargs)
+            self.dropout = paddle.nn.Dropout(0.1)
+            self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2)
+            self.criterion = paddle.nn.loss.CrossEntropyLoss()
+            self.metric = paddle.metric.Accuracy()
+        elif task is None:
+            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v2', **kwargs)
+        else:
+            raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported))
+
+        self.task = task
+
+        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
+            state_dict = paddle.load(load_checkpoint)
+            self.set_state_dict(state_dict)
+            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))
+
+    def forward(self,
+                input_ids=None,
+                token_type_ids=None,
+                position_ids=None,
+                attention_mask=None,
+                query_input_ids=None,
+                query_token_type_ids=None,
+                query_position_ids=None,
+                query_attention_mask=None,
+                title_input_ids=None,
+                title_token_type_ids=None,
+                title_position_ids=None,
+                title_attention_mask=None,
+                seq_lengths=None,
+                labels=None):
+
+        if self.task != 'text-matching':
+            result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
+        else:
+            query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask)
+            title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask)
+
+        if self.task == 'seq-cls':
+            logits = result
+            probs = F.softmax(logits, axis=1)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        elif self.task == 'token-cls':
+            logits = result
+            token_level_probs = F.softmax(logits, axis=-1)
+            preds = token_level_probs.argmax(axis=-1)
+            if labels is not None:
+                loss = self.criterion(logits, labels.unsqueeze(-1))
+                num_infer_chunks, num_label_chunks, num_correct_chunks = \
+                    self.metric.compute(None, seq_lengths, preds, labels)
+                self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy())
+                _, _, f1_score = map(float, self.metric.accumulate())
+                return token_level_probs, loss, {'f1_score': f1_score}
+            return token_level_probs
+        elif self.task == 'text-matching':
+            query_token_embedding, _ = query_result
+            query_token_embedding = self.dropout(query_token_embedding)
+            query_attention_mask = paddle.unsqueeze(
+                (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            query_token_embedding = query_token_embedding * query_attention_mask
+            query_sum_embedding = paddle.sum(query_token_embedding, axis=1)
+            query_sum_mask = paddle.sum(query_attention_mask, axis=1)
+            query_mean = query_sum_embedding / query_sum_mask
+
+            title_token_embedding, _ = title_result
+            title_token_embedding = self.dropout(title_token_embedding)
+            title_attention_mask = paddle.unsqueeze(
+                (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
+            title_token_embedding = title_token_embedding * title_attention_mask
+            title_sum_embedding = paddle.sum(title_token_embedding, axis=1)
+            title_sum_mask = paddle.sum(title_attention_mask, axis=1)
+            title_mean = title_sum_embedding / title_sum_mask
+
+            sub = paddle.abs(paddle.subtract(query_mean, title_mean))
+            projection = paddle.concat([query_mean, title_mean, sub], axis=-1)
+            logits = self.classifier(projection)
+            probs = F.softmax(logits)
+            if labels is not None:
+                loss = self.criterion(logits, labels)
+                correct = self.metric.compute(probs, labels)
+                acc = self.metric.update(correct)
+                return probs, loss, {'acc': acc}
+            return probs
+        else:
+            sequence_output, pooled_output = result
+            return sequence_output, pooled_output
+
+    @staticmethod
+    def get_tokenizer(*args, **kwargs):
+        """
+        Gets the tokenizer that is customized for this module.
+        """
+        return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-xxlarge-v2', *args, **kwargs)
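
Note on the `text-matching` branch of `forward`: it averages the token embeddings of the query and the title over non-padding positions, then feeds the concatenation of the two means and their absolute difference to a linear classifier, which is why the classifier input width is `hidden_size * 3`. The sketch below illustrates that feature construction in plain Python on toy lists. It is an illustration under assumptions, not the module's code: the helper names `masked_mean` and `matching_features` and the pad id `0` are hypothetical, dropout is omitted, and there is no batch dimension.

```python
from typing import List

PAD_TOKEN_ID = 0  # assumption: illustrative pad id, not read from the real tokenizer


def masked_mean(token_embeddings: List[List[float]],
                input_ids: List[int],
                pad_token_id: int = PAD_TOKEN_ID) -> List[float]:
    """Average the embeddings of the non-padding tokens of one sequence."""
    hidden = len(token_embeddings[0])
    total = [0.0] * hidden
    count = 0
    for vec, tok in zip(token_embeddings, input_ids):
        if tok == pad_token_id:
            continue  # padded positions contribute nothing to the sum
        count += 1
        for i, x in enumerate(vec):
            total[i] += x
    if count == 0:
        raise ValueError("sequence contains only padding tokens")
    return [x / count for x in total]


def matching_features(query_mean: List[float], title_mean: List[float]) -> List[float]:
    """Concatenate [query, title, |query - title|], giving 3 * hidden values."""
    diff = [abs(q - t) for q, t in zip(query_mean, title_mean)]
    return query_mean + title_mean + diff


# Toy example with hidden size 2; the last token of each sequence is padding.
query = masked_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [5, 6, 0])
title = masked_mean([[2.0, 2.0], [0.0, 0.0]], [7, 0])
features = matching_features(query, title)
print(features)  # [2.0, 3.0, 2.0, 2.0, 0.0, 1.0]
```

Unlike this sketch, the module does not guard against a sequence made up entirely of padding; there the mask sum would be zero.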