未验证 提交 5b5bf8f3 编写于 作者: L lugimzzz 提交者: GitHub

Add ERNIE UIE in modelcenter (#5574)

* add_uie

* fill_yaml

* modify_info

* add_model_size

* imgae_size

* image_size

* modify

* change_en

* change_en

* Update info.yaml
Co-authored-by: NliuTINA0907 <65896652+liuTINA0907@users.noreply.github.com>
上级 53d7e873
import gradio as gr
from paddlenlp import Taskflow
schema = ['时间', '选手', '赛事名称']
ie = Taskflow('information_extraction', schema=schema)
# UGC: Define the inference fn() for your models
def model_inference(schema, text):
ie.set_schema(eval(schema))
res = ie(text)
json_out = {"text": text, "result": res}
return json_out
def clear_all():
return None, None, None
with gr.Blocks() as demo:
gr.Markdown("ERNIE-UIE")
with gr.Column(scale=1, min_width=100):
schema = gr.Textbox(
placeholder="ex. ['时间', '选手', '赛事名称']",
label="Schema (You can type any schema.)",
lines=2)
text = gr.Textbox(
placeholder="ex. 2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!",
label="Text (You can type any input sequence.)",
lines=2)
with gr.Row():
btn1 = gr.Button("Clear")
btn2 = gr.Button("Submit")
json_out = gr.JSON(label="Information Extraction Output")
btn1.click(fn=clear_all, inputs=None, outputs=[schema, text, json_out])
btn2.click(fn=model_inference, inputs=[schema, text], outputs=[json_out])
gr.Button.style(1)
demo.launch()
【ERNIE-UIE-App-YAML】
APP_Info:
title: ERNIE-UIE-App
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 3.4.1
app_file: app.py
license: apache-2.0
device: cpu
\ No newline at end of file
gradio
paddlepaddle
paddlenlp==2.4.2
# 信息抽取预训练模型:
| 模型名称 | 模型结构 | 使用场景 | 适配语言 | 模型大小 | 下载地址 |
| :---: | :--------: | :--------: | :--------: |:--------: |:--------: |
| `uie-base` (默认)| 12-layers, 768-hidden, 12-heads | 文本抽取 | 中文 | 450MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_base.pdparams) |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | 文本抽取 | 英文 | 418MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_base_en.pdparams) |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | 文本抽取 | 中文 | 288MB|[预训练模型](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium_v1.0/model_state.pdparams) |
| `uie-medium`| 6-layers, 768-hidden, 12-heads | 文本抽取 | 中文 | 288MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams) |
| `uie-mini`| 6-layers, 384-hidden, 12-heads | 文本抽取 | 中文 | 103MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_mini.pdparams) |
| `uie-micro`| 4-layers, 384-hidden, 12-heads | 文本抽取 | 中文 | 90MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_micro.pdparams) |
| `uie-nano`| 4-layers, 312-hidden, 12-heads | 文本抽取 | 中文 | 69MB|[预训练模型](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_nano.pdparams) |
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | 文本抽取 | 中、英文 | 2.1G|[预训练模型](https://paddlenlp.bj.bcebos.com/models/transformers/uie_m/uie_m_large.pdparams) |
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | 文本抽取 | 中、英文 | 1.1G|[预训练模型](https://paddlenlp.bj.bcebos.com/models/transformers/uie_m/uie_m_base.pdparams) |
# Information Extraction Pretrained Model
| model name | model details | application | language | model size |download |
| :---: | :--------: | :--------: | :--------: | :--------: |:--------: |
| `uie-base` (default)| 12-layers, 768-hidden, 12-heads | Information Extraction | ZH |450MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_base.pdparams) |
| `uie-base-en` | 12-layers, 768-hidden, 12-heads | Information Extraction | EN |418MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_base_en.pdparams) |
| `uie-medical-base` | 12-layers, 768-hidden, 12-heads | Information Extraction | ZH |288MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/taskflow/information_extraction/uie_medium_v1.0/model_state.pdparams) |
| `uie-medium`| 6-layers, 768-hidden, 12-heads | Information Extraction | ZH |288MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_medium.pdparams) |
| `uie-mini`| 6-layers, 384-hidden, 12-heads | Information Extraction | ZH |103MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_mini.pdparams) |
| `uie-micro`| 4-layers, 384-hidden, 12-heads | Information Extraction | ZH |90MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_micro.pdparams) |
| `uie-nano`| 4-layers, 312-hidden, 12-heads | Information Extraction | ZH |69MB|[Pretrained Model](https://bj.bcebos.com/paddlenlp/models/transformers/uie/uie_nano.pdparams) |
| `uie-m-large`| 24-layers, 1024-hidden, 16-heads | Information Extraction | ZH,EN |2.1G|[Pretrained Model](https://paddlenlp.bj.bcebos.com/models/transformers/uie_m/uie_m_large.pdparams) |
| `uie-m-base`| 12-layers, 768-hidden, 12-heads | Information Extraction | ZH,EN |1.1G|[Pretrained Model](https://paddlenlp.bj.bcebos.com/models/transformers/uie_m/uie_m_base.pdparams) |
---
Model_Info:
name: "ERNIE-UIE"
description: "信息抽取模型"
description_en: "Information Extraction Model"
icon: "@后续UE统一设计之后,会存到bos上某个位置"
from_repo: "PaddleNLP"
Task:
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Relationship Extraction"
sub_tag: "关系抽取"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Named Entity Recognition"
sub_tag: "命名实体识别"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Emotional Classification"
sub_tag: "情感分类"
- tag_en: "Natural Language Processing"
tag: "自然语言处理"
sub_tag_en: "Pos labeling"
sub_tag: "词性标注"
Example:
- tag_en: "Intelligent Finance"
tag: "智慧金融"
sub_tag_en: "Key Word Extraction"
title: "使用PaddleNLP UIE模型抽取PDF版上市公司公告"
sub_tag: "关键字段抽取"
url: "https://aistudio.baidu.com/aistudio/projectdetail/4497591"
- tag_en: "Intelligent Retail"
tag: "智慧零售"
sub_tag_en: "Express Bill Information Extraction"
title: "五条标注数据搞定快递单信息抽取,PaddleNLP信息抽取技术重磅升级!"
sub_tag: "快递单信息抽取"
url: "https://aistudio.baidu.com/aistudio/projectdetail/4038499"
Datasets: ""
Pulisher: "Baidu"
License: "apache.2.0"
Paper:
- title: "Unified Structure Generation for Universal Information Extraction"
url: "https://arxiv.org/pdf/2203.12277.pdf"
IfTraining: 0
IfOnlineDemo: 1
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. UIE模型简介\n",
"\n",
"[UIE(Universal Information Extraction)](https://arxiv.org/pdf/2203.12277.pdf):Yaojie Lu等人在ACL-2022中提出了通用信息抽取统一框架UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模,并使得不同任务间具备良好的迁移和泛化能力。为了方便大家使用UIE的强大能力,PaddleNLP借鉴该论文的方法,基于ERNIE 3.0知识增强预训练模型,训练并开源了首个中文通用信息抽取模型UIE。该模型可以支持不限定行业领域和抽取目标的关键信息抽取,实现零样本快速冷启动,并具备优秀的小样本微调能力,快速适配特定的抽取目标。\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/167236006-66ed845d-21b8-4647-908b-e1c6e7613eb1.png width=\"800\"/>\n",
"</div>\n",
"\n",
"- 法律场景-判决书抽取/>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. 应用示例&模型效果\n",
"\n",
"UIE不限定行业领域和抽取目标,以下是一些零样本行业示例:\n",
"\n",
"- 医疗场景-专病结构化\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- 法律场景-判决书抽取\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- 金融场景-收入证明、招股书抽取\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- 公安场景-事故报告抽取\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png width=\"700\"/>\n",
"</div align=\"center\">\n",
"\n",
"- 旅游场景-宣传册、手册抽取\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png width=\"700\"/>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们在互联网、医疗、金融三大垂类自建测试集上进行了实验:\n",
"\n",
"<table>\n",
"<tr><th row_span='2'><th colspan='2'>金融<th colspan='2'>医疗<th colspan='2'>互联网\n",
"<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot\n",
"<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86\n",
"<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68\n",
"<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38\n",
"<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22\n",
"<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35\n",
"<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>\n",
"<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13\n",
"</table>\n",
"\n",
"0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.模型使用\n",
"\n",
"```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 环境准备\n",
"安装PaddleNLP"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install --upgrade paddlenlp\n",
"! pip show paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from paddlenlp import Taskflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 实体抽取\n",
"\n",
" 命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。\n",
"\n",
" 例如抽取的目标实体类型是\"时间\"、\"选手\"和\"赛事名称\", 调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction\n",
"ie = Taskflow('information_extraction', schema=schema, model='uie-base')\n",
"ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')\n",
"pprint(ie(\"2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 关系抽取\n",
"\n",
"关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。\n",
"\n",
"例如以\"竞赛名称\"作为抽取主体,抽取关系类型为\"主办方\"、\"承办方\"和\"已举办次数\", 调用示例如下:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 事件抽取\n",
"\n",
"事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。\n",
"\n",
"例如抽取的目标是\"地震\"事件的\"地震强度\"、\"时间\"、\"震中位置\"和\"震源深度\"这些信息,调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 评论观点抽取\n",
"\n",
"评论观点抽取,是指抽取文本中包含的评价维度、观点词。\n",
"\n",
"例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"pprint(ie(\"店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"英文模型调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]\n",
"ie_en.set_schema(schema)\n",
"pprint(ie_en(\"The teacher is very nice.\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.5 情感分类\n",
"\n",
"句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification\n",
"ie.set_schema(schema) # Reset schema\n",
"ie('这个产品用起来真的很流畅,我非常喜欢')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"英文模型调用示例如下:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = 'Sentiment classification [negative, positive]'\n",
"ie_en.set_schema(schema)\n",
"ie_en('I am sorry but this is the worst film I have ever seen in my life.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.6 跨任务抽取\n",
"\n",
"例如在法律场景同时对文本进行实体抽取和关系抽取,调用示例如:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]\n",
"ie.set_schema(schema)\n",
"pprint(ie(\"北京市海淀区人民法院\\n民事判决书\\n(199x)建初字第xxx号\\n原告:张三。\\n委托代理人李四,北京市 A律师事务所律师。\\n被告:B公司,法定代表人王五,开发公司总经理。\\n委托代理人赵六,北京市 C律师事务所律师。\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.7 模型训练\n",
"对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用轻定制功能(标注少量数据进行模型微调)以进一步提升效果。模型训练细节请参考[UIE训练定制](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. 相关论文以及引用信息\n",
"\n",
"- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)**\n",
"- **[Quantizing deep convolutional networks for efficient inference: A whitepaper](https://arxiv.org/pdf/1806.08342.pdf)**\n",
"- **[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085)**"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. UIE Introduction\n",
"\n",
"[UIE(Universal Information Extraction)](https://arxiv.org/pdf/2203.12277.pdf):Yaojie Lu et al. proposed the universal information extraction unified framework UIE in ACL-2022. The framework achieves the unified modeling of entity extraction, relationship extraction, event extraction, sentiment analysis and other tasks, and makes different tasks have good migration and generalization capabilities. In order to facilitate the use of UIE's powerful capabilities, PaddleNLP released the first Chinese general information extraction model, UIE, based on the ERNIE 3.0 knowledge enhancement pretrained model. The model supports key information extraction without limitation of the industry field and extraction domain, achieving excellent few-shot finetune ability to adapt to specific extraction domain.\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/167236006-66ed845d-21b8-4647-908b-e1c6e7613eb1.png width=\"800\"/>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Application & Model Effects\n",
"\n",
"\n",
"- Medical Scenario - Specialized Disease Structured\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017581-93c8ee44-856d-4d17-970c-b6138d10f8bc.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- Legal Scenario - Judgment Extraction\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017863-442c50f1-bfd4-47d0-8d95-8b1d53cfba3c.png width=\"700\"/>\n",
"</div>\n",
"\n",
"\n",
"- Financial scenario - income certificate and prospectus extraction\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169017982-e521ddf6-d233-41f3-974e-6f40f8f2edbc.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- Social scenario - accident report extraction\n",
"\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169018340-31efc1bf-f54d-43f7-b62a-8f7ce9bf0536.png width=\"700\"/>\n",
"</div>\n",
"\n",
"- Tourism scenario - brochure extraction\n",
"<div align=\"center\">\n",
" <img src=https://user-images.githubusercontent.com/40840292/169018113-c937eb0b-9fd7-4ecc-8615-bcdde2dac81d.png width=\"700\"/>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We conducted experiments on three major private testsets: Internet, medical and finance:\n",
"\n",
"<table>\n",
"<tr><th row_span='2'><th colspan='2'>Finance<th colspan='2'>Medical<th colspan='2'>Internet\n",
"<tr><td><th>0-shot<th>5-shot<th>0-shot<th>5-shot<th>0-shot<th>5-shot\n",
"<tr><td>uie-base (12L768H)<td>46.43<td>70.92<td><b>71.83</b><td>85.72<td>78.33<td>81.86\n",
"<tr><td>uie-medium (6L768H)<td>41.11<td>64.53<td>65.40<td>75.72<td>78.32<td>79.68\n",
"<tr><td>uie-mini (6L384H)<td>37.04<td>64.65<td>60.50<td>78.36<td>72.09<td>76.38\n",
"<tr><td>uie-micro (4L384H)<td>37.53<td>62.11<td>57.04<td>75.92<td>66.00<td>70.22\n",
"<tr><td>uie-nano (4L312H)<td>38.94<td>66.83<td>48.29<td>76.74<td>62.86<td>72.35\n",
"<tr><td>uie-m-large (24L1024H)<td><b>49.35</b><td><b>74.55</b><td>70.50<td><b>92.66</b><td><b>78.49</b><td><b>83.02</b>\n",
"<tr><td>uie-m-base (12L768H)<td>38.46<td>74.31<td>63.37<td>87.32<td>76.27<td>80.13\n",
"</table>\n",
"\n",
"0-shot refers to the predicted results directly from ``` paddlelp. Taskflow ```, and 5-shot refers to the predicted results from fine-tuned model with 5 annotated data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3.Usage\n",
"\n",
"```paddlenlp.Taskflow``` provides the ability to extract general information, evaluation opinions, etc. It can extract various types of information, including but not limited to named entity recognition (such as person name, place name, organization name, etc.), relationship (such as film director, song release time, etc.), events (such as an accident at an intersection, an earthquake at a place, etc.), evaluation dimensions, opinion words, emotional orientation, and other information. Users can use natural language to customize the extraction target, and can uniformly extract the corresponding information in the input text without training。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1 Environment\n",
"Install PaddleNLP"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! pip install --upgrade paddlenlp\n",
"! pip show paddlenlp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pprint import pprint\n",
"from paddlenlp import Taskflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Entity extraction\n",
"\n",
"Named Entity Recognition (NER) refers to identifying entities with specific meanings in text. In open domain information extraction, there is no limit to the categories extracted, and users can define them themselves.\n",
"\n",
"For example, the extracted target entity types are \"时间\", \"选手\" and \"赛事名称\". The example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction\n",
"ie = Taskflow('information_extraction', schema=schema, model='uie-base')\n",
"ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')\n",
"pprint(ie(\"2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 Relationship Extraction\n",
"Relation Extraction (RE) refers to identifying entities from text and extracting semantic relationships between entities to obtain triple information: subject,predicate and object.\n",
"\n",
"For example, \"竞赛名称\" is taken as the extraction subject, and the extraction relationship types are \"主办方\", \"承办方\" and \"已举办次数\". The example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.3 Event extraction\n",
"\n",
"Event Extraction (EE) refers to extracting predefined event trigger words and event arguments from natural language texts and combining them into corresponding event structured information.\n",
"\n",
"For example, the target of extraction is \"地震强度\", \"时间\", \"震中位置\" and \"震源深度\" of the \"地震\" event. The example is as follows:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.4 Extraction of comments\n",
"\n",
"Comment and opinion extraction refers to the extraction of evaluation dimensions and opinion words contained in the text.\n",
"\n",
"For example, the target of extraction is the evaluation dimension contained in the text and its corresponding opinion words and sentiment tendencies. The example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction\n",
"ie.set_schema(schema) # Reset schema\n",
"pprint(ie(\"店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The english example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]\n",
"ie_en.set_schema(schema)\n",
"pprint(ie_en(\"The teacher is very nice.\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.5 Sentiment classification\n",
"Sentence level sentiment classification, that is, to judge whether the sentence's sentimentorientation is \"positive\" or \"negative\". The example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification\n",
"ie.set_schema(schema) # Reset schema\n",
"ie('这个产品用起来真的很流畅,我非常喜欢')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The english example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = 'Sentiment classification [negative, positive]'\n",
"ie_en.set_schema(schema)\n",
"ie_en('I am sorry but this is the worst film I have ever seen in my life.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.6 Cross scenario extraction\n",
"\n",
"For example, in a legal scenario, both entity extraction and relationship extraction are performed on the text. The example is as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]\n",
"ie.set_schema(schema)\n",
"pprint(ie(\"北京市海淀区人民法院\\n民事判决书\\n(199x)建初字第xxx号\\n原告:张三。\\n委托代理人李四,北京市 A律师事务所律师。\\n被告:B公司,法定代表人王五,开发公司总经理。\\n委托代理人赵六,北京市 C律师事务所律师。\")) # Better print results using pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.7 Model training\n",
"For simple extraction targets, you can directly use ```paddlenlp.Taskflow```implements zero shot extraction. For private scenarios, we recommend using the customization function (mark a small amount of data for model fine-tuning) to further improve the effect. For details of model training, please refer to [UIE Training](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Related papers and citations\n",
"\n",
"- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)**\n",
"- **[Quantizing deep convolutional networks for efficient inference: A whitepaper](https://arxiv.org/pdf/1806.08342.pdf)**\n",
"- **[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085)**"
]
}
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册