PaddlePaddle
PaddleHub
Commit 05d93eaf
Authored by KP on March 12, 2022; committed by GitHub on March 12, 2022
Merge pull request #1780 from linjieccc/add_plm
Add albert-base-v1
Parents: 393a36c4, c0802fc5
Showing 3 changed files with 350 additions and 0 deletions
modules/text/language_model/albert-base-v1/README.md (+173 -0)
modules/text/language_model/albert-base-v1/__init__.py (+0 -0)
modules/text/language_model/albert-base-v1/module.py (+177 -0)
modules/text/language_model/albert-base-v1/README.md
0 → 100644
# albert-base-v1
|Module Name|albert-base-v1|
| :--- | :---: |
|Category|Text - Semantic Model|
|Network|albert-base-v1|
|Dataset|-|
|Fine-tuning supported|Yes|
|Module Size|90MB|
|Latest update date|2022-02-08|
|Data indicators|-|
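## I. Basic Information

### Module Introduction

ALBERT addresses the excessive parameter count of current pretrained models with the following improvements:

- Factorized embedding parameterization. ALBERT factorizes the word embedding parameters: words are first mapped into a low-dimensional embedding space E, which is then projected into the high-dimensional hidden space H.
- Cross-layer parameter sharing. ALBERT shares all parameters across layers.

For more details, see the [ALBERT paper](https://arxiv.org/abs/1909.11942).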
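To make the effect of the embedding factorization concrete, the sketch below compares the parameter count of a direct V×H embedding table with the factorized V×E + E×H form. The sizes used (V = 30,000, E = 128, H = 768) are assumptions taken from the base configuration reported in the ALBERT paper, not from this README, and the function names are illustrative only.

```python
def direct_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters of a single V x H embedding table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size: int, embedding_size: int, hidden_size: int) -> int:
    """Parameters of a V x E lookup table followed by an E x H projection."""
    return vocab_size * embedding_size + embedding_size * hidden_size


# Assumed sizes (ALBERT-base as described in the paper).
V, E, H = 30_000, 128, 768

direct = direct_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)
print(f"direct: {direct:,}  factorized: {factorized:,}  ratio: {direct / factorized:.2f}x")
```

The saving grows with the gap between E and H, which is why the factorization matters most for wide hidden layers.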
## II. Installation

### 1. Environmental Dependence

- paddlepaddle >= 2.0.0
- paddlehub >= 2.0.0 | [How to install PaddleHub](../../../../docs/docs_ch/get_start/installation.rst)

### 2. Installation

```shell
$ hub install albert-base-v1
```

If you run into problems during installation, see: [Windows quick start](../../../../docs/docs_ch/get_start/windows_quickstart.md) | [Linux quick start](../../../../docs/docs_ch/get_start/linux_quickstart.md) | [MacOS quick start](../../../../docs/docs_ch/get_start/mac_quickstart.md)
## III. Module API Prediction

### 1. Prediction Code Example

```python
import paddlehub as hub

# Each inner list is one sample holding a single text.
data = [
    ['这个宾馆比较陈旧了,特价的房间也很一般。总体来说一般'],
    ['怀着十分激动的心情放映,可是看着看着发现,在放映完毕后,出现一集米老鼠的动画片'],
    ['作为老的四星酒店,房间依然很整洁,相当不错。机场接机服务很好,可以在车上办理入住手续,节省时间。'],
]
label_map = {0: 'negative', 1: 'positive'}

model = hub.Module(
    name='albert-base-v1',
    version='1.0.0',
    task='seq-cls',
    load_checkpoint='/path/to/parameters',
    label_map=label_map)
results = model.predict(data, max_seq_len=50, batch_size=1, use_gpu=False)
for idx, text in enumerate(data):
    print('Data: {} \t Label: {}'.format(text, results[idx]))
```
For details, see the PaddleHub examples:

- [Text classification](../../../../demo/text_classification)
- [Sequence labeling](../../../../demo/sequence_labeling)
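### 2. API

```python
def __init__(
    task=None,
    load_checkpoint=None,
    label_map=None,
    num_classes=2,
    suffix=False,
    **kwargs,
)
```

- Creates a Module object (dynamic-graph version).
- **Parameters**
  - `task`: task name, either `seq-cls` (text classification) or `token-cls` (sequence labeling). The `module.py` in this commit also accepts `text-matching`; with `task=None` the bare `AlbertModel` is loaded.
  - `load_checkpoint`: path to a parameter file saved by training with the PaddleHub Fine-tune API.
  - `label_map`: mapping from class id to label name, used at prediction time.
  - `num_classes`: number of classes for classification tasks. It can be omitted when `label_map` is given; the default is 2.
  - `suffix`: label format for sequence labeling. If `True`, labels end with '-B', '-I', '-E' or '-S'. Defaults to `False`.
  - `**kwargs`: additional user-specified keyword arguments.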
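The interaction between `label_map` and `num_classes` can be sketched in plain Python. The two helpers below are hypothetical names that mirror what `module.py` in this commit does: a non-empty `label_map` overrides `num_classes`, and for `token-cls` the label list handed to the chunk evaluator is ordered by class id. The tag names in the example are made up for illustration.

```python
from typing import Dict, List, Optional


def resolve_num_classes(label_map: Optional[Dict[int, str]], num_classes: int = 2) -> int:
    """A non-empty label_map takes precedence over the num_classes argument."""
    return len(label_map) if label_map else num_classes


def ordered_label_list(label_map: Dict[int, str]) -> List[str]:
    """Label names sorted by class id, as used to build the token-cls metric."""
    return [label_map[i] for i in sorted(label_map.keys())]


# Illustrative label map, deliberately given out of order.
example_map = {2: "I-PER", 0: "O", 1: "B-PER"}
print(resolve_num_classes(example_map))
print(ordered_label_list(example_map))
print(resolve_num_classes(None, num_classes=5))
```

Because the list is built by sorting the keys, the class ids in `label_map` should match the ids used when the checkpoint was fine-tuned.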
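```python
def predict(
    data,
    max_seq_len=128,
    batch_size=1,
    use_gpu=False
)
```

- **Parameters**
  - `data`: data to predict, in the format `[[sample_a_text_a, sample_a_text_b], [sample_b_text_a, sample_b_text_b], ...]`. Each element is one sample, and each sample may contain text_a and text_b. The number of texts per sample (1 or 2) must match what was used during training.
  - `max_seq_len`: maximum text length the model processes.
  - `batch_size`: batch size used for prediction.
  - `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
- **Returns**
  - `results`: a list whose contents depend on the task type:
    - Text classification: the predicted label of each sentence, in the format `[label_1, label_2, ...]`.
    - Sequence labeling: the predicted label of every token in each sentence, in the format `[[token_1, token_2, ...], [token_1, token_2, ...], ...]`.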
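As a rough illustration of the `data` format, the following sketch checks that every sample carries the same number of texts and that this number is 1 or 2. `texts_per_sample` is a hypothetical helper, not part of PaddleHub, and it only checks internal consistency; whether the count matches the one used in training still has to be verified by the caller.

```python
from typing import Sequence


def texts_per_sample(data: Sequence[Sequence[str]]) -> int:
    """Return the number of texts per sample (1 or 2), or raise ValueError."""
    if not data:
        raise ValueError("data must contain at least one sample")
    counts = {len(sample) for sample in data}
    if len(counts) != 1:
        raise ValueError(f"samples mix different text counts: {sorted(counts)}")
    count = counts.pop()
    if count not in (1, 2):
        raise ValueError(f"each sample needs 1 or 2 texts, got {count}")
    for sample in data:
        if not all(isinstance(text, str) for text in sample):
            raise ValueError("every text must be a str")
    return count


single_text = [["text a"], ["text b"]]
text_pairs = [["query 1", "title 1"], ["query 2", "title 2"]]
print(texts_per_sample(single_text), texts_per_sample(text_pairs))
```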
```python
def get_embedding(
    data,
    use_gpu=False
)
```

- Returns sentence-level and token-level features for the input texts.
- **Parameters**
  - `data`: list of input texts, in the format `[[sample_a_text_a, sample_a_text_b], [sample_b_text_a, sample_b_text_b], ...]`. Each element is one sample, and each sample may contain text_a and text_b.
  - `use_gpu`: whether to use the GPU; defaults to False. GPU users are advised to enable it.
- **Returns**
  - `results`: a list in the format `[[sample_a_pooled_feature, sample_a_seq_feature], [sample_b_pooled_feature, sample_b_seq_feature], ...]`. Each element holds the features of the corresponding sample: the sentence-level feature pooled_feature and the token-level feature seq_feature.
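One common use of the sentence-level feature is comparing two samples. The sketch below unpacks a result in the documented `[[pooled_feature, seq_feature], ...]` layout and computes the cosine similarity of the pooled features. The `results` value here is dummy data standing in for a real `get_embedding` call, and it assumes each feature is a (nested) list of floats; `cosine_similarity` is an illustrative helper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Dummy stand-in for the output of get_embedding: tiny 3-dimensional features.
results = [
    [[1.0, 0.0, 1.0], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
    [[1.0, 1.0, 0.0], [[0.7, 0.8, 0.9]]],
]

pooled_a, seq_a = results[0]
pooled_b, seq_b = results[1]
similarity = cosine_similarity(pooled_a, pooled_b)
print(f"tokens: {len(seq_a)} vs {len(seq_b)}, pooled similarity: {similarity:.3f}")
```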
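## IV. Server Deployment

PaddleHub Serving can deploy an online service for obtaining pretrained word embeddings.

### Step 1: Start PaddleHub Serving

```shell
$ hub serving start -m albert-base-v1
```

This deploys a service API for obtaining pretrained word embeddings. The default port is 8866.

**NOTE:** To run prediction on a GPU, set the CUDA_VISIBLE_DEVICES environment variable before starting the service. Otherwise it does not need to be set.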
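If the service is launched from a script, the command and environment can be prepared as in the sketch below. `serving_launch_spec` is a hypothetical helper; it only assembles the `hub serving start -m` command shown above and copies the current environment, adding CUDA_VISIBLE_DEVICES when GPU ids are given. Actually starting the process is left as a commented-out line.

```python
import os
from typing import Dict, List, Optional, Tuple


def serving_launch_spec(module_name: str, gpu_ids: Optional[str] = None) -> Tuple[List[str], Dict[str, str]]:
    """Build the serving command and the environment it should run in."""
    env = dict(os.environ)
    if gpu_ids is not None:
        # Restrict the visible GPUs before the service starts.
        env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    command = ["hub", "serving", "start", "-m", module_name]
    return command, env


command, env = serving_launch_spec("albert-base-v1", gpu_ids="0")
print(" ".join(command))
# To launch for real (requires PaddleHub to be installed):
# import subprocess; subprocess.run(command, env=env, check=True)
```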
### Step 2: Send a prediction request

With the server running, the following few lines of code send a prediction request and print the result.

```python
import requests
import json

# Texts to embed, in the format [[text_1], [text_2], ...]
text = [["今天是个好日子"], ["天气预报说今天要下雨"]]
# The key names the argument passed to the prediction method; here it is "data".
# The equivalent local call would be module.get_embedding(data=text).
data = {"data": text}
# Send a POST request with a JSON body; replace the IP in the URL with the server's address.
url = "http://127.0.0.1:8866/predict/albert-base-v1"
# Declare the request body as application/json.
headers = {"Content-Type": "application/json"}

r = requests.post(url=url, headers=headers, data=json.dumps(data))
print(r.json())
```
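For environments without `requests`, the same call can be sketched with the standard library's `urllib` instead. The helper names, the default host and the timeout value below are assumptions for illustration; the URL path and payload shape follow the example above. Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request
from typing import Any, List


def build_embedding_request(texts: List[List[str]],
                            host: str = "127.0.0.1",
                            port: int = 8866,
                            module: str = "albert-base-v1") -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the serving endpoint."""
    url = f"http://{host}:{port}/predict/{module}"
    body = json.dumps({"data": texts}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def fetch_embeddings(texts: List[List[str]], timeout: float = 10.0) -> Any:
    """Send the request and decode the JSON response; needs a running server."""
    request = build_embedding_request(texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_embedding_request([["text_1"], ["text_2"]])
print(request.get_method(), request.full_url)
```

Calling `fetch_embeddings([["text_1"], ["text_2"]])` performs the actual request once the service from Step 1 is up.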
## V. Release Note

* 1.0.0

  Initial release
modules/text/language_model/albert-base-v1/__init__.py
0 → 100644
modules/text/language_model/albert-base-v1/module.py
0 → 100644
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import os
from typing import Dict

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddlenlp.metrics import ChunkEvaluator
from paddlenlp.transformers.albert.modeling import AlbertForSequenceClassification
from paddlenlp.transformers.albert.modeling import AlbertForTokenClassification
from paddlenlp.transformers.albert.modeling import AlbertModel
from paddlenlp.transformers.albert.tokenizer import AlbertTokenizer

from paddlehub.module.module import moduleinfo
from paddlehub.module.nlp_module import TransformerModule
from paddlehub.utils.log import logger


@moduleinfo(name="albert-base-v1",
            version="1.0.0",
            summary="",
            author="Baidu",
            author_email="",
            type="nlp/semantic_model",
            meta=TransformerModule)
class Albert(nn.Layer):
    """
    ALBERT model
    """

    def __init__(
            self,
            task: str = None,
            load_checkpoint: str = None,
            label_map: Dict = None,
            num_classes: int = 2,
            suffix: bool = False,
            **kwargs,
    ):
        super(Albert, self).__init__()
        if label_map:
            self.label_map = label_map
            self.num_classes = len(label_map)
        else:
            self.num_classes = num_classes

        if task == 'sequence_classification':
            task = 'seq-cls'
            logger.warning(
                "current task name 'sequence_classification' was renamed to 'seq-cls', "
                "'sequence_classification' has been deprecated and will be removed in the future.", )
        if task == 'seq-cls':
            self.model = AlbertForSequenceClassification.from_pretrained(
                pretrained_model_name_or_path='albert-base-v1', num_classes=self.num_classes, **kwargs)
            self.criterion = paddle.nn.loss.CrossEntropyLoss()
            self.metric = paddle.metric.Accuracy()
        elif task == 'token-cls':
            self.model = AlbertForTokenClassification.from_pretrained(
                pretrained_model_name_or_path='albert-base-v1', num_classes=self.num_classes, **kwargs)
            self.criterion = paddle.nn.loss.CrossEntropyLoss()
            self.metric = ChunkEvaluator(
                label_list=[self.label_map[i] for i in sorted(self.label_map.keys())], suffix=suffix)
        elif task == 'text-matching':
            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-base-v1', **kwargs)
            self.dropout = paddle.nn.Dropout(0.1)
            self.classifier = paddle.nn.Linear(self.model.config['hidden_size'] * 3, 2)
            self.criterion = paddle.nn.loss.CrossEntropyLoss()
            self.metric = paddle.metric.Accuracy()
        elif task is None:
            self.model = AlbertModel.from_pretrained(pretrained_model_name_or_path='albert-base-v1', **kwargs)
        else:
            raise RuntimeError("Unknown task {}, task should be one in {}".format(task, self._tasks_supported))

        self.task = task

        if load_checkpoint is not None and os.path.isfile(load_checkpoint):
            state_dict = paddle.load(load_checkpoint)
            self.set_state_dict(state_dict)
            logger.info('Loaded parameters from %s' % os.path.abspath(load_checkpoint))

    def forward(self,
                input_ids=None,
                token_type_ids=None,
                position_ids=None,
                attention_mask=None,
                query_input_ids=None,
                query_token_type_ids=None,
                query_position_ids=None,
                query_attention_mask=None,
                title_input_ids=None,
                title_token_type_ids=None,
                title_position_ids=None,
                title_attention_mask=None,
                seq_lengths=None,
                labels=None):

        if self.task != 'text-matching':
            result = self.model(input_ids, token_type_ids, position_ids, attention_mask)
        else:
            query_result = self.model(query_input_ids, query_token_type_ids, query_position_ids, query_attention_mask)
            title_result = self.model(title_input_ids, title_token_type_ids, title_position_ids, title_attention_mask)

        if self.task == 'seq-cls':
            logits = result
            probs = F.softmax(logits, axis=1)
            if labels is not None:
                loss = self.criterion(logits, labels)
                correct = self.metric.compute(probs, labels)
                acc = self.metric.update(correct)
                return probs, loss, {'acc': acc}
            return probs
        elif self.task == 'token-cls':
            logits = result
            token_level_probs = F.softmax(logits, axis=-1)
            preds = token_level_probs.argmax(axis=-1)
            if labels is not None:
                loss = self.criterion(logits, labels.unsqueeze(-1))
                num_infer_chunks, num_label_chunks, num_correct_chunks = \
                    self.metric.compute(None, seq_lengths, preds, labels)
                self.metric.update(num_infer_chunks.numpy(), num_label_chunks.numpy(), num_correct_chunks.numpy())
                _, _, f1_score = map(float, self.metric.accumulate())
                return token_level_probs, loss, {'f1_score': f1_score}
            return token_level_probs
        elif self.task == 'text-matching':
            query_token_embedding, _ = query_result
            query_token_embedding = self.dropout(query_token_embedding)
            query_attention_mask = paddle.unsqueeze(
                (query_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
            query_token_embedding = query_token_embedding * query_attention_mask
            query_sum_embedding = paddle.sum(query_token_embedding, axis=1)
            query_sum_mask = paddle.sum(query_attention_mask, axis=1)
            query_mean = query_sum_embedding / query_sum_mask

            title_token_embedding, _ = title_result
            title_token_embedding = self.dropout(title_token_embedding)
            title_attention_mask = paddle.unsqueeze(
                (title_input_ids != self.model.pad_token_id).astype(self.model.pooler.dense.weight.dtype), axis=2)
            title_token_embedding = title_token_embedding * title_attention_mask
            title_sum_embedding = paddle.sum(title_token_embedding, axis=1)
            title_sum_mask = paddle.sum(title_attention_mask, axis=1)
            title_mean = title_sum_embedding / title_sum_mask

            sub = paddle.abs(paddle.subtract(query_mean, title_mean))
            projection = paddle.concat([query_mean, title_mean, sub], axis=-1)
            logits = self.classifier(projection)
            probs = F.softmax(logits)
            if labels is not None:
                loss = self.criterion(logits, labels)
                correct = self.metric.compute(probs, labels)
                acc = self.metric.update(correct)
                return probs, loss, {'acc': acc}
            return probs
        else:
            sequence_output, pooled_output = result
            return sequence_output, pooled_output

    @staticmethod
    def get_tokenizer(*args, **kwargs):
        """
        Gets the tokenizer that is customized for this module.
        """
        return AlbertTokenizer.from_pretrained(pretrained_model_name_or_path='albert-base-v1', *args, **kwargs)