{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.ERNIE-Layout 模型简介\n", "随着众多行业的数字化转型,电子文档的结构化分析和内容提取成为一项热门的研究课题。电子文档包括扫描图像文件和计算机生成的数字文档两大类,涉及单据、行业报告、合同、雇佣协议、发票、简历等多种类型。智能文档理解任务以理解格式、布局、内容多种多样的文档为目标,包括了文档分类、文档信息抽取、文档问答等任务。与纯文本文档不同的是,文档包含表格、图片等多种内容,包含丰富的视觉信息。因为文档内容丰富、布局复杂、字体样式多样、数据存在噪声,文档理解任务极具挑战性。随着ERNIE等预训练语言模型在NLP领域取得了巨大的成功,人们开始关注在文档理解领域进行大规模预训练。百度提出跨模态文档理解模型 ERNIE-Layout,首次将布局知识增强技术融入跨模态文档预训练,在4项文档理解任务上刷新世界最好效果,登顶 DocVQA 榜首。同时,ERNIE-Layout已集成至百度智能文档分析平台 TextMind,助力企业数字化升级。\n", "\n", "ERNIE-Layout以文心文本大模型ERNIE为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解偶注意力机制,在各数据集上效果取得大幅度提升,相关工作 [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](https://arxiv.org/abs/2210.06155) 已被EMNLP 2022 Findings会议收录。考虑到文档智能需求广泛,PaddleNLP对外开源了业界领先的多语言跨模态文档预训练模型ERNIE-Layout。\n", "\n", "此外,开放文档抽取问答模型 DocPrompt,以 ERNIE-Layout 为底座,可精准理解图文信息,推理学习附加知识,准确捕捉图片、PDF 等多模态文档中的每个细节。可前往 [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) 了解详情。\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.模型效果及应用场景\n", "ERNIE-Layout 可以用于多模态文档的分类、信息抽取、文档问答等各个任务,应用场景包括但不限于发票抽取问答、海报抽取问答、网页抽取问答、表格抽取问答、试卷抽取问答、英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答、中文票据多语种(中简、中繁、英、日、法语)抽取问答等。\n", "\n", "DocPrompt 以 ERNIE-Layout 为底座,在开放域文档抽取问答任务中效果强悍,例如:\n", "\n", "- 发票抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "- 海报抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "- 网页抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- 表格抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- 试卷抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- 英文票据多语种(中、英、日、泰、西班牙、俄语)抽取问答\n", "\n", "
\n", " \n", "
\n", "\n", "- 中文票据多语种(中简、中繁、英、日、法语)抽取问答\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.模型如何使用\n", "## 3.1 模型推理\n", "\n", "通过`paddlenlp.Taskflow`三行代码即可调用DocPrompt功能,实现多语言文档抽取问答。\n", " \n", "* 输入格式\n", " * 支持单条、批量预测\n", " * 支持本地图片路径输入\n", " * 支持http图片链接输入\n", "\n", "```python\n", "[\n", " {\"doc\": \"./invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]},\n", " {\"doc\": \"./resume.png\", \"prompt\": [\"五百丁本次想要担任的是什么职位?\", \"五百丁是在哪里上的大学?\", \"大学学的是什么专业?\"]}\n", "]\n", "```\n", "\n", "默认使用PaddleOCR进行OCR识别,同时支持用户通过`word_boxes`传入自己的OCR结果,格式为`List[str, List[float, float, float, float]]`。\n", "\n", "```python \n", "[\n", " {\"doc\": doc_path, \"prompt\": prompt, \"word_boxes\": word_boxes}\n", "]\n", "```\n", "\n", " \n", "\n", "* 可配置参数说明\n", " * `batch_size`:批处理大小,请结合机器情况进行调整,默认为1。\n", " * `lang`:选择PaddleOCR的语言,`ch`可在中英混合的图片中使用,`en`在英文图片上的效果更好,默认为`ch`。\n", " * `topn`: 如果模型识别出多个结果,将返回前n个概率值最高的结果,默认为1。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# 安装最新版PaddleNLP和PaddleOCR\n", "!pip install --upgrade paddlenlp\n", "!pip install --upgrade paddleocr " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "from pprint import pprint\n", "from paddlenlp import Taskflow\n", "\n", "docprompt = Taskflow(\"document_intelligence\")\n", "pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 模型微调与部署\n", "\n", "PaddleNLP开源了ERNIE-Layout在文档信息抽取、文档视觉问答、文档图像分类等任务上的微调与部署示例,详情可参考:[ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n", ")。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.模型原理\n", "\n", "文心 ERNIE-Layout 以文心 ERNIE 为底座,融合文本、图像、布局等信息进行跨模态联合建模,创新性引入布局知识增强,提出阅读顺序预测、细粒度图文匹配等自监督预训练任务,升级空间解耦注意力机制。输入基于 VIMER-StrucTexT 大模型提供的 OCR 结果,在各数据集上效果取得大幅度提升。\n", "\n", "文心 ERNIE-mmLayout 为进一步探索不同粒度元素关系对文档理解的价值,在文心 ERNIE-Layout 的基础上引入基于 GNN 的多粒度、多模态 Transformer 层,实现文档图聚合(Document Graph Aggregation)表示。最终,在多个信息抽取任务上以更少的模型参数量超过 SOTA 成绩,\n", "\n", "附:文档智能(DI,Document Intelligence)主要指对于网页、数字文档或扫描文档所包含的文本以及丰富的排版格式等信息,通过人工智能技术进行理解、分类、提取以及信息归纳的过程。百度文档智能技术体系立足于强大的 NLP 与 OCR 技术积累,以多语言跨模态布局增强文档智能大模型文心 ERNIE-Layout 为核心底座,结合图神经网络技术,支撑文档布局分析、抽取问答、表格理解、语义表示多个核心模块,满足上层应用各类文档智能分析功能需求。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.注意事项\n", "\n", "### DocPrompt 使用技巧\n", "\n", "* Prompt设计:在DocPrompt中,Prompt可以是陈述句(例如,文档键值对中的Key),也可以是疑问句。因为是开放域的抽取问答,DocPrompt对Prompt的设计没有特殊限制,只要符合自然语言语义即可。如果对当前的抽取结果不满意,可以多尝试一些不同的Prompt。 \n", "\n", "* 支持的语言:支持本地路径或者HTTP链接的中英文图片输入,Prompt支持多种不同语言,参考以上不同场景的例子。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.相关论文以及引用信息\n", "#### ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding\n", "```\n", "@misc{https://doi.org/10.48550/arxiv.2210.06155,\n", " doi = {10.48550/ARXIV.2210.06155},\n", " \n", " url = {https://arxiv.org/abs/2210.06155},\n", " \n", " author = {Peng, Qiming and Pan, Yinxu and Wang, Wenjin and Luo, Bin and Zhang, Zhenyu and Huang, Zhengjie and Hu, Teng and Yin, Weichong and Chen, Yongfeng and Zhang, Yin and Feng, Shikun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},\n", " \n", " keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},\n", " \n", " title = {ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding},\n", " \n", " publisher = {arXiv},\n", " \n", " year = {2022},\n", " \n", " copyright = {arXiv.org perpetual, non-exclusive license}\n", "}\n", "```\n", "#### ICDAR 2019 Competition on Scene Text Visual Question Answering\n", "```\n", "@misc{https://doi.org/10.48550/arxiv.1907.00490,\n", " doi = {10.48550/ARXIV.1907.00490},\n", " \n", " url = {https://arxiv.org/abs/1907.00490},\n", " \n", " author = {Biten, Ali Furkan and Tito, Rubèn and Mafla, Andres and Gomez, Lluis and Rusiñol, Marçal and Mathew, Minesh and Jawahar, C. V. and Valveny, Ernest and Karatzas, Dimosthenis},\n", " \n", " keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n", " \n", " title = {ICDAR 2019 Competition on Scene Text Visual Question Answering},\n", " \n", " publisher = {arXiv},\n", " \n", " year = {2019},\n", " \n", " copyright = {arXiv.org perpetual, non-exclusive license}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "py35-paddle1.2.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "vscode": { "interpreter": { "hash": "a5f44439766e47113308a61c45e3ba0ce79cefad900abb614d22e5ec5db7fbe0" } } }, "nbformat": 4, "nbformat_minor": 4 }