{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. UIE模型简介\n", "\n", "[UIE(Universal Information Extraction)](https://arxiv.org/pdf/2203.12277.pdf):Yaojie Lu等人在ACL-2022中提出了通用信息抽取统一框架UIE。该框架实现了实体抽取、关系抽取、事件抽取、情感分析等任务的统一建模,并使得不同任务间具备良好的迁移和泛化能力。为了方便大家使用UIE的强大能力,PaddleNLP借鉴该论文的方法,基于ERNIE 3.0知识增强预训练模型,训练并开源了首个中文通用信息抽取模型UIE。该模型可以支持不限定行业领域和抽取目标的关键信息抽取,实现零样本快速冷启动,并具备优秀的小样本微调能力,快速适配特定的抽取目标。\n", "\n", "
\n", " \n", "
\n", "\n", "- 法律场景-判决书抽取/>\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. 应用示例&模型效果\n", "\n", "UIE不限定行业领域和抽取目标,以下是一些零样本行业示例:\n", "\n", "- 医疗场景-专病结构化\n", "\n", "
\n", " \n", "
\n", "\n", "- 法律场景-判决书抽取\n", "\n", "
\n", " \n", "
\n", "\n", "- 金融场景-收入证明、招股书抽取\n", "\n", "
\n", " \n", "
\n", "\n", "- 公安场景-事故报告抽取\n", "\n", "
\n", " \n", "
\n", "\n", "- 旅游场景-宣传册、手册抽取\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "我们在互联网、医疗、金融三大垂类自建测试集上进行了实验:\n", "\n", "\n", "
金融医疗互联网\n", "
0-shot5-shot0-shot5-shot0-shot5-shot\n", "
uie-base (12L768H)46.4370.9271.8385.7278.3381.86\n", "
uie-medium (6L768H)41.1164.5365.4075.7278.3279.68\n", "
uie-mini (6L384H)37.0464.6560.5078.3672.0976.38\n", "
uie-micro (4L384H)37.5362.1157.0475.9266.0070.22\n", "
uie-nano (4L312H)38.9466.8348.2976.7462.8672.35\n", "
uie-m-large (24L1024H)49.3574.5570.5092.6678.4983.02\n", "
uie-m-base (12L768H)38.4674.3163.3787.3276.2780.13\n", "
\n", "\n", "0-shot表示无训练数据直接通过```paddlenlp.Taskflow```进行预测,5-shot表示每个类别包含5条标注数据进行模型微调。**实验表明UIE在垂类场景可以通过少量数据(few-shot)进一步提升效果**。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.模型使用\n", "\n", "```paddlenlp.Taskflow```提供通用信息抽取、评价观点抽取等能力,可抽取多种类型的信息,包括但不限于命名实体识别(如人名、地名、机构名等)、关系(如电影的导演、歌曲的发行时间等)、事件(如某路口发生车祸、某地发生地震等)、以及评价维度、观点词、情感倾向等信息。用户可以使用自然语言自定义抽取目标,无需训练即可统一抽取输入文本中的对应信息。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.1 环境准备\n", "安装PaddleNLP" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! pip install --upgrade paddlenlp\n", "! pip show paddlenlp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pprint import pprint\n", "from paddlenlp import Taskflow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 实体抽取\n", "\n", " 命名实体识别(Named Entity Recognition,简称NER),是指识别文本中具有特定意义的实体。在开放域信息抽取中,抽取的类别没有限制,用户可以自己定义。\n", "\n", " 例如抽取的目标实体类型是\"时间\"、\"选手\"和\"赛事名称\", 调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction\n", "ie = Taskflow('information_extraction', schema=schema, model='uie-base')\n", "ie_en = Taskflow('information_extraction', schema=schema, model='uie-base-en')\n", "pprint(ie(\"2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!\")) # Better print results using pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 关系抽取\n", "\n", "关系抽取(Relation Extraction,简称RE),是指从文本中识别实体并抽取实体之间的语义关系,进而获取三元组信息,即<主体,谓语,客体>。\n", "\n", "例如以\"竞赛名称\"作为抽取主体,抽取关系类型为\"主办方\"、\"承办方\"和\"已举办次数\", 调用示例如下:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {'竞赛名称': ['主办方', '承办方', '已举办次数']} # Define the schema for relation extraction\n", "ie.set_schema(schema) # Reset schema\n", "pprint(ie('2022语言与智能技术竞赛由中国中文信息学会和中国计算机学会联合主办,百度公司、中国中文信息学会评测工作委员会和中国计算机学会自然语言处理专委会承办,已连续举办4届,成为全球最热门的中文NLP赛事之一。'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.3 事件抽取\n", "\n", "事件抽取 (Event Extraction, 简称EE),是指从自然语言文本中抽取预定义的事件触发词(Trigger)和事件论元(Argument),组合为相应的事件结构化信息。\n", "\n", "例如抽取的目标是\"地震\"事件的\"地震强度\"、\"时间\"、\"震中位置\"和\"震源深度\"这些信息,调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {'地震触发词': ['地震强度', '时间', '震中位置', '震源深度']} # Define the schema for event extraction\n", "ie.set_schema(schema) # Reset schema\n", "ie('中国地震台网正式测定:5月16日06时08分在云南临沧市凤庆县(北纬24.34度,东经99.98度)发生3.5级地震,震源深度10千米。')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.4 评论观点抽取\n", "\n", "评论观点抽取,是指抽取文本中包含的评价维度、观点词。\n", "\n", "例如抽取的目标是文本中包含的评价维度及其对应的观点词和情感倾向,调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = {'评价维度': ['观点词', '情感倾向[正向,负向]']} # Define the schema for opinion extraction\n", "ie.set_schema(schema) # Reset schema\n", "pprint(ie(\"店面干净,很清静,服务员服务热情,性价比很高,发现收银台有排队\")) # Better print results using pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "英文模型调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = [{'Aspect': ['Opinion', 'Sentiment classification [negative, positive]']}]\n", "ie_en.set_schema(schema)\n", "pprint(ie_en(\"The teacher is very nice.\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.5 情感分类\n", "\n", "句子级情感倾向分类,即判断句子的情感倾向是“正向”还是“负向”,调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = '情感倾向[正向,负向]' # Define the schema for sentence-level sentiment classification\n", "ie.set_schema(schema) # Reset schema\n", "ie('这个产品用起来真的很流畅,我非常喜欢')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "英文模型调用示例如下:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = 'Sentiment classification [negative, positive]'\n", "ie_en.set_schema(schema)\n", "ie_en('I am sorry but this is the worst film I have ever seen in my life.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.6 跨任务抽取\n", "\n", "例如在法律场景同时对文本进行实体抽取和关系抽取,调用示例如:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = ['法院', {'原告': '委托代理人'}, {'被告': '委托代理人'}]\n", "ie.set_schema(schema)\n", "pprint(ie(\"北京市海淀区人民法院\\n民事判决书\\n(199x)建初字第xxx号\\n原告:张三。\\n委托代理人李四,北京市 A律师事务所律师。\\n被告:B公司,法定代表人王五,开发公司总经理。\\n委托代理人赵六,北京市 C律师事务所律师。\")) # Better print results using pprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.7 模型训练\n", "对于简单的抽取目标可以直接使用```paddlenlp.Taskflow```实现零样本(zero-shot)抽取,对于细分场景我们推荐使用轻定制功能(标注少量数据进行模型微调)以进一步提升效果。模型训练细节请参考[UIE训练定制](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/uie)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. 相关论文以及引用信息\n", "\n", "- **[Unified Structure Generation for Universal Information Extraction](https://arxiv.org/pdf/2203.12277.pdf)**\n", "- **[Quantizing deep convolutional networks for efficient inference: A whitepaper](https://arxiv.org/pdf/1806.08342.pdf)**\n", "- **[PACT: Parameterized Clipping Activation for Quantized Neural Networks](https://arxiv.org/abs/1805.06085)**" ] } ], "metadata": { "language_info": { "name": "python" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }