Unverified commit 73f54a75, authored by chenxiaozeng, committed by GitHub

update ERNIE-Layout introduction and ERNIE's tags (#5669)

Parent fa3a1912
@@ -10,6 +10,10 @@ Task:
     tag: "自然语言处理"
     sub_tag_en: "Pretrained Model"
     sub_tag: "预训练模型"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Pretrained Model"
+    sub_tag: "预训练模型"
 Example:
   - title: "【快速上手ERNIE 3.0】中文情感分析实战"
     url: "https://aistudio.baidu.com/aistudio/projectdetail/3955163"
@@ -7,6 +7,10 @@ Model_Info:
   icon: "https://user-images.githubusercontent.com/11793384/203492273-64f38a22-a347-464b-9d87-6583ce5c3121.png"
   from_repo: PaddleNLP
 Task:
+  - tag: 自然语言处理
+    tag_en: Natural Language Processing
+    sub_tag: 文档分析
+    sub_tag_en: Document Analysis
   - tag: 文心大模型
     tag_en: Wenxin Big Models
     sub_tag: 文档分析
@@ -4,14 +4,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# 1.ERNIE-Layout模型简介\n",
+    "# 1.ERNIE-Layout 模型简介\n",
     "随着众多行业的数字化转型,电子文档的结构化分析和内容提取成为一项热门的研究课题。电子文档包括扫描图像文件和计算机生成的数字文档两大类,涉及单据、行业报告、合同、雇佣协议、发票、简历等多种类型。智能文档理解任务以理解格式、布局、内容多种多样的文档为目标,包括了文档分类、文档信息抽取、文档问答等任务。与纯文本文档不同的是,文档包含表格、图片等多种内容,包含丰富的视觉信息。因为文档内容丰富、布局复杂、字体样式多样、数据存在噪声,文档理解任务极具挑战性。随着ERNIE等预训练语言模型在NLP领域取得了巨大的成功,人们开始关注在文档理解领域进行大规模预训练。百度提出跨模态文档理解模型 ERNIE-Layout,首次将布局知识增强技术融入跨模态文档预训练,在4项文档理解任务上刷新世界最好效果,登顶 DocVQA 榜首。同时,ERNIE-Layout已集成至百度智能文档分析平台 TextMind,助力企业数字化升级。\n",
     "\n",
+    "ERNIE-Layout以文心文本大模型ERNIE为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解耦注意力机制,在各数据集上效果取得大幅度提升,相关工作 [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](https://arxiv.org/abs/2210.06155) 已被EMNLP 2022 Findings会议收录。考虑到文档智能需求广泛,PaddleNLP对外开源了业界领先的多语言跨模态文档预训练模型ERNIE-Layout。\n",
     "\n",
-    "ERNIE-Layout以文心文本大模型ERNIE为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解偶注意力机制,在各数据集上效果取得大幅度提升,相关工作[ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](https://arxiv.org/abs/2210.06155)已被EMNLP 2022 Findings会议收录。考虑到文档智能在多语种上商用广泛,依托PaddleNLP对外开源业界最强的多语言跨模态文档预训练模型ERNIE-Layout。\n",
-    "ERNIE-Layout是由飞浆官方出品的跨模态大模型,更多有关PaddleNLP的详情请访问<https://github.com/PaddlePaddle/PaddleNLP/>了解详情。<br/>\n",
+    "此外,开放文档抽取问答模型 DocPrompt,以 ERNIE-Layout 为底座,可精准理解图文信息,推理学习附加知识,准确捕捉图片、PDF 等多模态文档中的每个细节。可前往 [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) 了解详情。\n",
     "\n",
-    "<img src=\"https://user-images.githubusercontent.com/40840292/195091552-86a2d174-24b0-4ddf-825a-4503e0bc390b.png\" width = 95% align=center />"
+    "<img src=\"https://user-images.githubusercontent.com/40840292/195091552-86a2d174-24b0-4ddf-825a-4503e0bc390b.png\" width=\"60%\" height=\"60%\"> <br />"
    ]
   },
   {
@@ -19,22 +19,54 @@
    "metadata": {},
    "source": [
     "# 2.模型效果及应用场景\n",
-    "ERNIE-Layout可以用于处理但不限于带布局数据(文档、图片等)的文档分类、信息抽取、文档问答等任务,应用场景包括但不限于发票抽取问答、海报抽取问答、网页抽取问答、表格抽取问答、试卷抽取问答、英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答、中文票据多语种(中简、中繁、英、日、法语)抽取问答等。以文档信息抽取和文档视觉问答为例,使用ERNIE-Layout模型效果速览如下。\n",
-    "## 2.1文档信息抽取任务:\n",
-    "### 2.1.1数据集:\n",
-    "数据集有FUNSD、XFUND-ZH等。其中FUNSD是在噪声很多的扫描文档上进行表单理解的英文数据集,数据集包含199个真实的、完全注释的、扫描的表单。文档有很多噪声,而且各种表单的外观差异很大,因此理解表单是一项很有挑战性的任务。该数据集可用于各种任务,包括文本检测、光学字符识别、空间布局分析和实体标记/链接。XFUND是一个多语言表单理解基准数据集,包括7种语言(汉语、日语、西班牙语、法语、意大利语、德语、葡萄牙语)的人为标注键值对表单,XFUND-ZH为中文版本XFUND。\n",
-    "### 2.1.2模型效果速览:\n",
-    "ERNIE-Layout在FUNSD上的模型效果为:\n",
-    "\n",
-    "<img src=\"https://gitee.com/doubleguy/typora/raw/master/img/202211082019436.png\" width = 95% align=center />\n",
-    "\n",
-    "## 2.2文档视觉问答任务:\n",
-    "### 2.2.1数据集:\n",
-    "数据集为DocVQA-ZH,DocVQA-ZH已停止榜单提交,因此我们将原始训练集进行重新划分以评估模型效果,划分后训练集包含4,187张图片,验证集包含500张图片,测试集包含500张图片。\n",
-    "### 2.2.2模型效果速览:\n",
-    "ERNIE-Layout在DocVQA-ZH上的模型效果为:\n",
-    "\n",
-    "![](https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png)"
+    "ERNIE-Layout 可以用于多模态文档的分类、信息抽取、文档问答等各个任务,应用场景包括但不限于发票抽取问答、海报抽取问答、网页抽取问答、表格抽取问答、试卷抽取问答、英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答、中文票据多语种(中简、中繁、英、日、法语)抽取问答等。\n",
+    "\n",
+    "DocPrompt 以 ERNIE-Layout 为底座,在开放域文档抽取问答任务中效果强悍,例如:\n",
+    "\n",
+    "- 发票抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/196118171-fd3e49a0-b9f1-4536-a904-c48f709a2dec.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "- 海报抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610368-04230855-62de-439e-b708-2c195b70461f.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>\n",
+    "\n",
+    "- 网页抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195611613-bdbe692e-d7f2-4a2b-b548-1a933463b0b9.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- 表格抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610692-8367f1c8-32c2-4b5d-9514-a149795cf609.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- 试卷抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195823294-d891d95a-2ef8-4519-be59-0fedb96c00de.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610820-7fb88608-b317-45fc-a6ab-97bf3b20a4ac.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>\n",
+    "\n",
+    "- 中文票据多语种(中简、中繁、英、日、法语)抽取问答\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>"
    ]
   },
   {
@@ -42,14 +74,14 @@
    "metadata": {},
    "source": [
     "# 3.模型如何使用\n",
-    "## 3.1模型推理\n",
-    "我们已经在[huggingface网页](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout)集成了ERNIE-Layout DocPrompt Engine,可一键进行体验。 \n",
-    "<br/><br/> \n",
-    "**Taskflow**\n",
-    "<br/><br/>\n",
-    "当然,也可以使用Taskflow进行推理。通过`paddlenlp.Taskflow`三行代码调用DocPrompt功能,具备多语言文档抽取问答能力,部分应用场景展示如下:\n",
+    "## 3.1 模型推理\n",
     "\n",
+    "通过`paddlenlp.Taskflow`三行代码即可调用DocPrompt功能,实现多语言文档抽取问答。\n",
+    " \n",
     "* 输入格式\n",
+    " * 支持单条、批量预测\n",
+    " * 支持本地图片路径输入\n",
+    " * 支持http图片链接输入\n",
     "\n",
     "```python\n",
     "[\n",
@@ -66,54 +98,52 @@
     "]\n",
     "```\n",
     "\n",
-    "* 支持单条、批量预测\n",
-    "\n",
-    " * 支持本地图片路径输入\n",
-    "\n",
-    " ![](https://user-images.githubusercontent.com/40840292/194748579-f9e8aa86-7f65-4827-bfae-824c037228b3.png)\n",
-    "\n",
-    " ```python \n",
-    " from pprint import pprint\n",
-    " from paddlenlp import Taskflow\n",
-    " docprompt = Taskflow(\"document_intelligence\")\n",
-    " pprint(docprompt([{\"doc\": \"./resume.png\", \"prompt\": [\"五百丁本次想要担任的是什么职位?\", \"五百丁是在哪里上的大学?\", \"大学学的是什么专业?\"]}]))\n",
-    " [{'prompt': '五百丁本次想要担任的是什么职位?',\n",
-    " 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},\n",
-    " {'prompt': '五百丁是在哪里上的大学?',\n",
-    " 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},\n",
-    " {'prompt': '大学学的是什么专业?',\n",
-    " 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]\n",
-    " ```\n",
-    "\n",
-    " * http图片链接输入\n",
-    "\n",
-    " ![](https://user-images.githubusercontent.com/40840292/194748592-e20b2a5f-d36b-46fb-8057-86755d188af0.jpg)\n",
-    "\n",
-    " ```python \n",
-    " from pprint import pprint\n",
-    " from paddlenlp import Taskflow\n",
-    "\n",
-    " docprompt = Taskflow(\"document_intelligence\")\n",
-    " pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))\n",
-    " [{'prompt': '发票号码是多少?',\n",
-    " 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]},\n",
-    " {'prompt': '校验码是多少?',\n",
-    " 'result': [{'end': 233,\n",
-    " 'prob': 1.0,\n",
-    " 'start': 231,\n",
-    " 'value': '01107 555427109891646'}]}]\n",
-    " ```\n",
+    " \n",
     "\n",
     "* 可配置参数说明\n",
     " * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。\n",
     " * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。\n",
-    " * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。\n",
-    "\n",
-    "## 3.2模型微调与部署\n",
-    "ERNIE-Layout是依托文心ERNIE,基于布局知识增强技术,融合文本、图像、布局等信息进行联合建模的跨模态通用文档预训练模型,能够在包括但不限于文档信息抽取、文档视觉问答、文档图像分类等任务上表现出优秀的跨模态语义对齐能力和布局理解能力。\n",
+    " * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# 安装最新版PaddleNLP和PaddleOCR\n",
+    "!pip install --upgrade paddlenlp\n",
+    "!pip install --upgrade paddleocr "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "from paddlenlp import Taskflow\n",
+    "\n",
+    "docprompt = Taskflow(\"document_intelligence\")\n",
+    "pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3.2 模型微调与部署\n",
     "\n",
-    "有关使用ERNIE-Layout进行上述任务的微调与部署详情请参考:[ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
+    "PaddleNLP开源了ERNIE-Layout在文档信息抽取、文档视觉问答、文档图像分类等任务上的微调与部署示例,详情可参考:[ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
     ")"
    ]
   },
  {
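The cells added above install the dependencies and run a single URL-based query, while `batch_size`, `lang` and `topn` are only described in prose. Below is a minimal sketch of combining those parameters with batch prediction; the local file `./resume.png` is carried over from the removed inline example and is assumed to exist:

```python
from pprint import pprint

from paddlenlp import Taskflow

# Configure DocPrompt: two documents per batch, Chinese/English OCR,
# and keep the 2 highest-probability answers per prompt.
docprompt = Taskflow(
    "document_intelligence",
    batch_size=2,  # adjust to the available memory
    lang="ch",     # "en" works better on English-only images
    topn=2,        # return the top-2 candidates instead of 1
)

# Batch prediction mixing a local path and an HTTP link.
pprint(docprompt([
    {"doc": "./resume.png",
     "prompt": ["五百丁本次想要担任的是什么职位?"]},
    {"doc": "https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg",
     "prompt": ["发票号码是多少?", "校验码是多少?"]},
]))
```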
@@ -121,20 +151,12 @@
    "metadata": {},
    "source": [
     "# 4.模型原理\n",
-    "* 布局知识增强技术<br/><br/>\n",
-    "* 融合文本、图像、布局等信息进行联合建模<br/><br/>\n",
-    "* 阅读顺序预测 + 细粒度图文匹配两个自监督预训练任务<br/><br/>\n",
-    "\n",
-    "<figure>\n",
-    "对文档理解来说,文档中的文字阅读顺序至关重要,目前主流的基于 OCR(Optical Character Recognition,文字识别)技术的模型大多遵循「从左到右、从上到下」的原则,然而对于文档中分栏、文本图片表格混杂的复杂布局,根据 OCR 结果获取的阅读顺序多数情况下都是错误的,从而导致模型无法准确地进行文档内容的理解。\n",
     "\n",
-    "而人类通常会根据文档结构和布局进行层次化分块阅读,受此启发,百度研究者提出在文档预训模型中对阅读顺序进行校正的布局知识增强创新思路。TextMind 平台上业界领先的文档解析工具(Document Parser)能够准确识别文档中的分块信息,产出正确的文档阅读顺序,将阅读顺序信号融合到模型的训练中,从而增强对布局信息的有效利用,提升模型对于复杂文档的理解能力。\n",
+    "文心 ERNIE-Layout 以文心 ERNIE 为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解耦注意力机制。输入基于 VIMER-StrucTexT 大模型提供的 OCR 结果,在各数据集上效果取得大幅度提升。\n",
     "\n",
-    "基于布局知识增强技术,同时依托文心 ERNIE,百度研究者提出了融合文本、图像、布局等信息进行联合建模的跨模态通用文档预训练模型 ERNIE-Layout。如下图所示,ERNIE-Layout 创新性地提出了阅读顺序预测和细粒度图文匹配两个自监督预训练任务,有效提升模型在文档任务上跨模态语义对齐能力和布局理解能力。\n",
+    "文心 ERNIE-mmLayout 为进一步探索不同粒度元素关系对文档理解的价值,在文心 ERNIE-Layout 的基础上引入基于 GNN 的多粒度、多模态 Transformer 层,实现文档图聚合(Document Graph Aggregation)表示。最终,在多个信息抽取任务上以更少的模型参数量超过 SOTA 成绩。\n",
     "\n",
-    "\n",
-    "![](https://bce.bdstatic.com/doc/ai-doc/wenxin/image%20%2814%29_59cc6c8.png)\n",
-    "</figure>"
+    "附:文档智能(DI,Document Intelligence)主要指对于网页、数字文档或扫描文档所包含的文本以及丰富的排版格式等信息,通过人工智能技术进行理解、分类、提取以及信息归纳的过程。百度文档智能技术体系立足于强大的 NLP 与 OCR 技术积累,以多语言跨模态布局增强文档智能大模型文心 ERNIE-Layout 为核心底座,结合图神经网络技术,支撑文档布局分析、抽取问答、表格理解、语义表示多个核心模块,满足上层应用各类文档智能分析功能需求。"
    ]
   },
  {
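The rewritten section 4 argues that naive "left-to-right, top-to-bottom" OCR serialization breaks on multi-column layouts, which is what the reading-order work corrects. The toy sketch below reproduces that failure mode; it is an illustration only, not the TextMind Document Parser or the model's actual serialization logic:

```python
# Hypothetical OCR output: (text, x, y) top-left corners of word boxes
# in a two-column page layout.
boxes = [
    ("left col, line 1", 0, 0), ("right col, line 1", 500, 0),
    ("left col, line 2", 0, 40), ("right col, line 2", 500, 40),
]

# Naive serialization: strictly top-to-bottom, then left-to-right.
naive = sorted(boxes, key=lambda b: (b[2], b[1]))

# Layout-aware serialization: split into columns first (a fixed x
# threshold here), then read each column top to bottom.
def column_aware(boxes, split_x=250):
    left = [b for b in boxes if b[1] < split_x]
    right = [b for b in boxes if b[1] >= split_x]
    order = lambda b: (b[2], b[1])
    return sorted(left, key=order) + sorted(right, key=order)

print([b[0] for b in naive])                # columns interleaved: wrong
print([b[0] for b in column_aware(boxes)])  # column by column: correct
```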
@@ -142,13 +164,8 @@
    "metadata": {},
    "source": [
     "# 5.注意事项\n",
-    "## 5.1参数配置\n",
-    "* batch_size:批处理大小,请结合机器情况进行调整,默认为1。\n",
-    "\n",
-    "* lang:选择PaddleOCR的语言,ch可在中英混合的图片中使用,en在英文图片上的效果更好,默认为ch。\n",
     "\n",
-    "* topn: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。\n",
-    "## 5.2使用技巧\n",
+    "### DocPrompt 使用技巧\n",
     "\n",
     "* Prompt设计:在DocPrompt中,Prompt可以是陈述句(例如,文档键值对中的Key),也可以是疑问句。因为是开放域的抽取问答,DocPrompt对Prompt的设计没有特殊限制,只要符合自然语言语义即可。如果对当前的抽取结果不满意,可以多尝试一些不同的Prompt。 \n",
     "\n",
@@ -205,15 +222,22 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3.8.5 ('base')",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "python3"
+   "name": "py35-paddle1.2.0"
   },
   "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
    "name": "python",
-   "version": "3.8.5"
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
   },
-  "orig_nbformat": 4,
   "vscode": {
    "interpreter": {
     "hash": "a5f44439766e47113308a61c45e3ba0ce79cefad900abb614d22e5ec5db7fbe0"
@@ -221,5 +245,5 @@
    }
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
@@ -5,36 +5,66 @@
    "metadata": {},
    "source": [
     "# 1.ERNIE-Layout Introduction\n",
-    "With the digital transformation of many industries, the structural analysis and content extraction of electronic documents have become a hot research topic. Electronic documents include scanned image documents and computer-generated digital documents, involving documents, industry reports, contracts, employment agreements, invoices, resumes and other types. The intelligent document understanding task aims to understand documents with various formats, layouts and contents, including document classification, document information extraction, document question answering and other tasks. Different from plain text documents, documents contain tables, pictures and other contents, and contain rich visual information. Because the document is rich in content, complex in layout, diverse in font style, and noisy in data, the task of document understanding is extremely challenging. With the great success of pre training language models such as ERNIE in the NLP field, people began to focus on large-scale pre training in the field of document understanding. Baidu put forward the cross modal document understanding model ERNIE-Layout, which is the first time to integrate the layout knowledge enhancement technology into the cross modal document pre training, refreshing the world's best results in four document understanding tasks, and topping the DocVQA list. At the same time, ERNIE Layout has been integrated into Baidu's intelligent document analysis platform TextMind to help enterprises upgrade digitally.\n",
     "\n",
+    "Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.\n",
     "\n",
-    "ERNIE-Layout takes the Wenxin text big model ERNIE as the base, integrates text, image, layout and other information for cross modal joint modeling, innovatively introduces layout knowledge enhancement, proposes self-monitoring pre training tasks such as reading order prediction, fine grain image text matching, upgrades spatial decoupling attention mechanism, and greatly improves the effect on each data set. Related work [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](https://arxiv.org/abs/2210.06155) has been included in the EMNLP 2022 Findings Conference. Considering that document intelligence is widely commercially available in multiple languages, it relies on PaddleNLP to open source the strongest multilingual cross modal document pre training model ERNIE Layout in the industry.\n",
-    "ERNIE-Layout is a large cross modal model officially produced by the Flying Slurry. For more details about PaddleNLP, please visit <https://github.com/PaddlePaddle/PaddleNLP/> for details.<br/>\n",
+    "The work was accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout in PaddleNLP. You can visit [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) for more details.<br/>\n",
     "\n",
-    "<img src=\"https://user-images.githubusercontent.com/40840292/195091552-86a2d174-24b0-4ddf-825a-4503e0bc390b.png\" width = 95% align=center />"
+    "<img src=\"https://user-images.githubusercontent.com/40840292/195091552-86a2d174-24b0-4ddf-825a-4503e0bc390b.png\" width=\"60%\" height=\"60%\"> <br />"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# 2.Model Effect and Application Scenario\n",
-    "ERNIE-Layout can be used to process but not limited to tasks such as document classification, information extraction, document Q&A with layout data (documents, pictures, etc.). Application scenarios include but not limited to invoice extraction Q&A, poster extraction Q&A, web page extraction Q&A, table extraction Q&A, test paper extraction Q&A, English bill multilingual (Chinese, English, Japanese, Thai, Spanish, Russian) extraction Q&A Chinese bills in multiple languages (simplified, traditional, English, Japanese, French). Taking document information extraction and document visual Q&A as examples, the effect of using ERNIE-Layout model is shown below.\n",
-    "## 2.1Document Information Extraction Task:\n",
-    "### 2.1.1Dataset:\n",
-    "Data sets include FUNSD, XFUND-ZH, etc. FUNSD is an English data set for form understanding on noisy scanned documents. The data set contains 199 real, fully annotated and scanned forms. Documents are noisy, and the appearance of various forms varies greatly, so understanding forms is a challenging task. The dataset can be used for a variety of tasks, including text detection, optical character recognition, spatial layout analysis, and entity tagging/linking. XFUND is a multilingual form understanding benchmark dataset, including manually labeled key value pair forms in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). XFUND-ZH is the Chinese version of XFUND.\n",
-    "### 2.1.2Quick View Of Model Effect:\n",
-    "The model effect of ERNIE-Layout on FUNSD is:\n",
-    "\n",
-    "<img src=\"https://gitee.com/doubleguy/typora/raw/master/img/202211082019436.png\" width = 95% align=center />\n",
-    "\n",
-    "## 2.2Document Visual Question And Answer Task:\n",
-    "### 2.2.1Dataset:\n",
-    "The data set is DocVQA-ZH, and DocVQA-ZH has stopped submitting the list. Therefore, we will re divide the original training set to evaluate the model effect. After division, the training set contains 4187 pictures, the verification set contains 500 pictures, and the test set contains 500 pictures.\n",
-    "### 2.2.2Quick View Of Model Effect:\n",
-    "The model effect of ERNIE-Layout on DocVQA-ZH is:\n",
-    "\n",
-    "![](https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png)\n"
+    "# 2.Model Performance\n",
+    "\n",
+    "ERNIE-Layout can be used to process and analyze multimodal documents. It is effective in tasks such as document classification, information extraction, and document VQA with layout data (documents, pictures, etc.).\n",
+    "\n",
+    "- Invoice VQA\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/196118171-fd3e49a0-b9f1-4536-a904-c48f709a2dec.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "- Poster VQA\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610368-04230855-62de-439e-b708-2c195b70461f.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>\n",
+    "\n",
+    "- WebPage VQA\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195611613-bdbe692e-d7f2-4a2b-b548-1a933463b0b9.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- Table VQA\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610692-8367f1c8-32c2-4b5d-9514-a149795cf609.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- Exam Paper VQA\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195823294-d891d95a-2ef8-4519-be59-0fedb96c00de.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
+    "</div>\n",
+    "\n",
+    "\n",
+    "- English invoice VQA by multilingual (CH, EN, JP, TH, ES, RUS) prompt\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195610820-7fb88608-b317-45fc-a6ab-97bf3b20a4ac.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>\n",
+    "\n",
+    "- Chinese invoice VQA by multilingual (CHS, CHT, EN, JP, FR) prompt\n",
+    "\n",
+    "<div align=\"center\">\n",
+    " <img src=https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
+    "</div>"
    ]
   },
   {
@@ -42,14 +72,15 @@
    "metadata": {},
    "source": [
     "# 3.How To Use The Model\n",
-    "## 3.1Model Reasoning\n",
-    "We have integrated the ERNIE-Layout DocPrompt Engine on the [huggingface page](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout), which can be experienced with one click.\n",
     "\n",
-    "**Taskflow**\n",
-    "\n",
-    "Of course, you can also use Taskflow for reasoning. Through `paddlenlp.Taskflow` calls DocPrompt with three lines of code, and has the ability to extract questions and answers from multilingual documents. Some application scenarios are shown below:\n",
+    "## 3.1 Model Inference\n",
+    "\n",
+    "You can use DocPrompt through `paddlenlp.Taskflow` for model inference.\n",
     "\n",
     "* Input Format\n",
+    " * Supports single and batch prediction\n",
+    " * Supports local image paths as input\n",
+    " * Supports HTTP image links as input\n",
     "\n",
     "```python\n",
     "[\n",
@@ -66,51 +97,53 @@
     "]\n",
     "```\n",
     "\n",
-    "* Support single and batch forecasting\n",
-    "\n",
-    " * Support local image path input\n",
-    "\n",
-    " ![](https://user-images.githubusercontent.com/40840292/194748579-f9e8aa86-7f65-4827-bfae-824c037228b3.png)\n",
-    "\n",
-    " ```python \n",
-    " from pprint import pprint\n",
-    " from paddlenlp import Taskflow\n",
-    " docprompt = Taskflow(\"document_intelligence\")\n",
-    " pprint(docprompt([{\"doc\": \"./resume.png\", \"prompt\": [\"五百丁本次想要担任的是什么职位?\", \"五百丁是在哪里上的大学?\", \"大学学的是什么专业?\"]}]))\n",
-    " [{'prompt': '五百丁本次想要担任的是什么职位?',\n",
-    " 'result': [{'end': 7, 'prob': 1.0, 'start': 4, 'value': '客户经理'}]},\n",
-    " {'prompt': '五百丁是在哪里上的大学?',\n",
-    " 'result': [{'end': 37, 'prob': 1.0, 'start': 31, 'value': '广州五百丁学院'}]},\n",
-    " {'prompt': '大学学的是什么专业?',\n",
-    " 'result': [{'end': 44, 'prob': 0.82, 'start': 38, 'value': '金融学(本科)'}]}]\n",
-    " ```\n",
-    "\n",
-    " * http image link input\n",
-    "\n",
-    " ![](https://user-images.githubusercontent.com/40840292/194748592-e20b2a5f-d36b-46fb-8057-86755d188af0.jpg)\n",
-    "\n",
-    " ```python \n",
-    " from pprint import pprint\n",
-    " from paddlenlp import Taskflow\n",
-    "\n",
-    " docprompt = Taskflow(\"document_intelligence\")\n",
-    " pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))\n",
-    " [{'prompt': '发票号码是多少?',\n",
-    " 'result': [{'end': 2, 'prob': 0.74, 'start': 2, 'value': 'No44527206'}]},\n",
-    " {'prompt': '校验码是多少?',\n",
-    " 'result': [{'end': 233,\n",
-    " 'prob': 1.0,\n",
-    " 'start': 231,\n",
-    " 'value': '01107 555427109891646'}]}]\n",
-    " ```\n",
+    " \n",
     "\n",
     "* Description of configurable parameters\n",
     " * `batch_size`:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
     " * `lang`:Select the language of PaddleOCR. `ch` can be used in Chinese English mixed pictures. `en` is better in English pictures. The default is `ch`.\n",
-    " * `topn`: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default.\n",
+    " * `topn`: If the model identifies multiple results, the top n results with the highest probability are returned. The default is 1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "# Install PaddleNLP and PaddleOCR\n",
+    "!pip install --upgrade paddlenlp\n",
+    "!pip install --upgrade paddleocr "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true,
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "from pprint import pprint\n",
+    "from paddlenlp import Taskflow\n",
+    "\n",
+    "docprompt = Taskflow(\"document_intelligence\")\n",
+    "pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3.2 Model Fine-tuning And Deployment\n",
     "\n",
-    "## 3.2Model Fine-tuning And Deployment\n",
-    "ERNIE-Layout is a cross modal general document pre training model that relies on Wenxin ERNIE, based on layout knowledge enhancement technology, and integrates text, image, layout and other information for joint modeling. It can show excellent cross modal semantic alignment and layout understanding ability on tasks including but not limited to document information extraction, document visual question answering, document image classification and so on.\n",
+    "ERNIE-Layout is a multimodal pretrained model based on layout knowledge enhancement technology; it integrates text, image, and layout information for joint modeling, and shows excellent cross-modal semantic alignment and layout understanding ability on tasks such as document information extraction, document visual question answering, and document image classification.\n",
     "\n",
     "For details about the fine-tuning and deployment of the above tasks using ERNIE-Layout, please refer to: [ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
     ")"
@@ -120,23 +153,13 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# 4.Model Principle\n",
-    "* Layout knowledge enhancement technology\n",
-    "\n",
-    "* Fusion of text, image, layout and other information for joint modeling\n",
+    "# 4. Model Principle\n",
     "\n",
-    "* Reading order prediction+fine-grained image text matching: two self-monitoring pre training tasks\n",
-    "\n",
-    "<figure>\n",
     "For document understanding, the text reading order in the document is very important. At present, most mainstream models based on OCR (Optical Character Recognition) technology follow the principle of \"from left to right, from top to bottom\". However, for the complex layout of the document with a mixture of columns, text, graphics and tables, the reading order obtained according to the OCR results is wrong in most cases, As a result, the model cannot accurately understand the content of the document.\n",
     "\n",
     "Humans usually read in hierarchies and blocks according to the document structure and layout. Inspired by this, Baidu researchers proposed an innovative idea of layout knowledge enhancement to correct the reading order in the document pre training model. The industry-leading document parsing tool (Document Parser) on the TextMind platform can accurately identify the block information in the document, produce the correct document reading order, and integrate the reading order signal into the model training, thus enhancing the effective use of layout information and improving the model's understanding of complex documents.\n",
     "\n",
-    "Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks.\n",
-    "\n",
-    "\n",
-    "![](https://bce.bdstatic.com/doc/ai-doc/wenxin/image%20%2814%29_59cc6c8.png)\n",
-    "</figure>"
+    "Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks."
    ]
   },
  {
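Section 4 also mentions a spatial-aware disentangled attention that adds layout signals to the usual content attention. The NumPy toy below sketches the general idea of adding per-axis relative-position biases to content scores; the bucketing scheme and shapes are simplified assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq, dim, buckets = 4, 8, 16  # tokens, hidden size, relative-position buckets

q = rng.normal(size=(seq, dim))   # content queries
k = rng.normal(size=(seq, dim))   # content keys
x = np.array([0, 6, 0, 6])        # bucketised x coordinate of each token box
y = np.array([0, 0, 3, 3])        # bucketised y coordinate of each token box

# Separate biases per spatial axis (random here; learned in practice).
bias_x = rng.normal(size=buckets)
bias_y = rng.normal(size=buckets)

def rel(pos):
    """Pairwise relative distances, shifted into the bucket range."""
    return np.clip(pos[:, None] - pos[None, :] + buckets // 2, 0, buckets - 1)

# Disentangled score: content-content attention plus additive spatial
# biases computed independently for the x and y axes.
scores = q @ k.T / np.sqrt(dim) + bias_x[rel(x)] + bias_y[rel(y)]
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(attn.round(2))
```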
@@ -144,14 +167,8 @@
    "metadata": {},
    "source": [
     "# 5.Matters Needing Attention\n",
-    "## 5.1Parameter Configuration\n",
-    "* batch_size:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
-    "\n",
-    "* lang:Choose the language of PaddleOCR. ch can be used in Chinese English mixed pictures. en has better effect on English pictures. The default is ch.\n",
-    "\n",
-    "* topn: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default.\n",
     "\n",
-    "## 5.2Tips\n",
+    "## DocPrompt Tips\n",
     "\n",
     "* Prompt design: In DocPrompt, Prompt can be a statement (for example, the Key in the document key value pair) or a question. Because it is an open domain extracted question and answer, DocPrompt has no special restrictions on the design of Prompt, as long as it conforms to natural language semantics. If you are not satisfied with the current extraction results, you can try some different Prompts.\n",
     "\n",
@@ -208,15 +225,22 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3.8.5 ('base')",
+   "display_name": "Python 3",
    "language": "python",
-   "name": "python3"
+   "name": "py35-paddle1.2.0"
   },
   "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
    "name": "python",
-   "version": "3.8.5"
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
   },
-  "orig_nbformat": 4,
   "vscode": {
    "interpreter": {
     "hash": "a5f44439766e47113308a61c45e3ba0ce79cefad900abb614d22e5ec5db7fbe0"
@@ -224,5 +248,5 @@
    }
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }
@@ -10,7 +10,10 @@ Task:
     tag: "自然语言处理"
     sub_tag_en: "Pretrained Model"
     sub_tag: "预训练模型"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Pretrained Model"
+    sub_tag: "预训练模型"
 Example:
   Datasets: ""
   Publisher: "Baidu"
@@ -24,11 +24,11 @@ with gr.Blocks() as demo:
         with gr.Column(scale=1, min_width=100):
             schema = gr.Textbox(
                 placeholder="ex. ['时间', '选手', '赛事名称']",
-                label="Schema (You can type any schema.)",
+                label="Type any schema you want:",
                 lines=2)
             text = gr.Textbox(
                 placeholder="ex. 2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!",
-                label="Text (You can type any input sequence.)",
+                label="Input Sequence:",
                 lines=2)
         with gr.Row():
@@ -6,28 +6,52 @@ Model_Info:
   icon: "https://user-images.githubusercontent.com/11793384/203492521-8d09d089-5576-41d3-8bc7-eec7a5385c35.png"
   from_repo: "PaddleNLP"
 Task:
+  - tag_en: "Natural Language Processing"
+    tag: "自然语言处理"
+    sub_tag_en: "Named Entity Recognition"
+    sub_tag: "命名实体识别"
   - tag_en: "Natural Language Processing"
     tag: "自然语言处理"
     sub_tag_en: "Relationship Extraction"
     sub_tag: "关系抽取"
   - tag_en: "Natural Language Processing"
     tag: "自然语言处理"
-    sub_tag_en: "Named Entity Recognition"
-    sub_tag: "命名实体识别"
+    sub_tag_en: "Event Extraction"
+    sub_tag: "事件抽取"
   - tag_en: "Natural Language Processing"
     tag: "自然语言处理"
-    sub_tag_en: "Emotional Classification"
-    sub_tag: "情感分类"
+    sub_tag_en: "Opinion Extraction"
+    sub_tag: "评论观点抽取"
   - tag_en: "Natural Language Processing"
     tag: "自然语言处理"
-    sub_tag_en: "Pos labeling"
-    sub_tag: "词性标注"
+    sub_tag_en: "Emotional Classification"
+    sub_tag: "情感分类"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Named Entity Recognition"
+    sub_tag: "命名实体识别"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Relationship Extraction"
+    sub_tag: "关系抽取"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Event Extraction"
+    sub_tag: "事件抽取"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Opinion Extraction"
+    sub_tag: "评论观点抽取"
+  - tag_en: "Wenxin Big Models"
+    tag: "文心大模型"
+    sub_tag_en: "Emotional Classification"
+    sub_tag: "情感分类"
 Example:
   - tag_en: "Intelligent Finance"
     tag: "智慧金融"
     sub_tag_en: "Key Word Extraction"
     title: "使用PaddleNLP UIE模型抽取PDF版上市公司公告"
-    sub_tag: "关键字段抽取"
+    sub_tag: "上市公司公告信息抽取"
     url: "https://aistudio.baidu.com/aistudio/projectdetail/4497591"
   - tag_en: "Intelligent Retail"
     tag: "智慧零售"