{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1.ERNIE-Layout Introduction\n", "\n", "Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.\n", "\n", "The work is accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout in PaddleNLP. You can visit [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) for more details.
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2.Model Performance\n", "\n", "ERNIE-Layout can be used to process and analyze multimodal documents. ERNIE-Layout is effective in tasks such as document classification, information extraction, document VQA with layout data (documents, pictures, etc). \n", "\n", "- Invoice VQA\n", "\n", "
\n", " \n", "
\n", "\n", "- Poster VQA\n", "\n", "
\n", " \n", "
\n", "\n", "- WebPage VQA\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- Table VQA\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- Exam Paper VQA\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "- English invoice VQA by multilingual(CH, EN, JP, Th, ES, RUS) prompt\n", "\n", "
\n", " \n", "
\n", "\n", "- Chinese invoice VQA by multilingual(CHS, CHT, EN, JP, FR) prompt\n", "\n", "
\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3.How To Use The Model\n", "\n", "## 3.1 Model Inference\n", "\n", "You can use DocPrompt through `paddlenlp.Taskflow` for model inference.\n", "\n", "* Input Format\n", " * Support single and batch forecasting\n", " * Support local image path input\n", " * Support http image link input\n", "\n", "```python\n", "[\n", " {\"doc\": \"./invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]},\n", " {\"doc\": \"./resume.png\", \"prompt\": [\"五百丁本次想要担任的是什么职位?\", \"五百丁是在哪里上的大学?\", \"大学学的是什么专业?\"]}\n", "]\n", "```\n", "\n", "By default, PaddleOCR is used for OCR identification, and users can use the `word_ boxes` Pass in your own OCR results in the format `List[str, List[float, float, float, float]]`.\n", "\n", "```python \n", "[\n", " {\"doc\": doc_path, \"prompt\": prompt, \"word_boxes\": word_boxes}\n", "]\n", "```\n", "\n", "\n", "\n", " \n", "\n", "* Description of configurable parameters\n", " * `batch_size`:Please adjust the batch size according to the machine conditions. The default value is 1.\n", " * `lang`:Select the language of PaddleOCR. `ch` can be used in Chinese English mixed pictures. `en` is better in English pictures. The default is `ch`.\n", " * `topn`: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "# Install PaddleNLP and PaddleOCR\n", "!pip install --upgrade paddlenlp\n", "!pip install --upgrade paddleocr " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true, "tags": [] }, "outputs": [], "source": [ "from pprint import pprint\n", "from paddlenlp import Taskflow\n", "\n", "docprompt = Taskflow(\"document_intelligence\")\n", "pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3.2 Model Fine-tuning And Deployment\n", "\n", "ERNIE-Layout is a multimodal pretrained model based on layout knowledge enhancement technology, and it integrates text, image, layout and other information for joint modeling. It can show excellent cross modal semantic alignment and layout understanding ability on tasks including but not limited to document information extraction, document visual question answering, document image classification and so on.\n", "\n", "For details about the fine-tuning and deployment of the above tasks using ERNIE-Layout, please refer to: [ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4. Model Principle\n", "\n", "For document understanding, the text reading order in the document is very important. At present, most mainstream models based on OCR (Optical Character Recognition) technology follow the principle of \"from left to right, from top to bottom\". However, for the complex layout of the document with a mixture of columns, text, graphics and tables, the reading order obtained according to the OCR results is wrong in most cases, As a result, the model cannot accurately understand the content of the document.\n", "\n", "Humans usually read in hierarchies and blocks according to the document structure and layout. Inspired by this, Baidu researchers proposed an innovative idea of layout knowledge enhancement to correct the reading order in the document pre training model. The industry-leading document parsing tool (Document Parser) on the TextMind platform can accurately identify the block information in the document, produce the correct document reading order, and integrate the reading order signal into the model training, thus enhancing the effective use of layout information and improving the model's understanding of complex documents.\n", "\n", "Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5.Matters Needing Attention\n", "\n", "## DocPrompt Tips\n", "\n", "* Prompt design: In DocPrompt, Prompt can be a statement (for example, the Key in the document key value pair) or a question. Because it is an open domain extracted question and answer, DocPrompt has no special restrictions on the design of Prompt, as long as it conforms to natural language semantics. If you are not satisfied with the current extraction results, you can try some different Prompts.\n", "\n", "* Languages supported:Support Chinese and English image input of local path or HTTP link. Prompt supports multiple languages. Refer to the examples of different scenarios above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6.Relevant Papers And Citations\n", "#### ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding\n", "```\n", "@misc{https://doi.org/10.48550/arxiv.2210.06155,\n", " doi = {10.48550/ARXIV.2210.06155},\n", " \n", " url = {https://arxiv.org/abs/2210.06155},\n", " \n", " author = {Peng, Qiming and Pan, Yinxu and Wang, Wenjin and Luo, Bin and Zhang, Zhenyu and Huang, Zhengjie and Hu, Teng and Yin, Weichong and Chen, Yongfeng and Zhang, Yin and Feng, Shikun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},\n", " \n", " keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},\n", " \n", " title = {ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding},\n", " \n", " publisher = {arXiv},\n", " \n", " year = {2022},\n", " \n", " copyright = {arXiv.org perpetual, non-exclusive license}\n", "}\n", "```\n", "#### ICDAR 2019 Competition on Scene Text Visual Question Answering\n", "```\n", "@misc{https://doi.org/10.48550/arxiv.1907.00490,\n", " doi = {10.48550/ARXIV.1907.00490},\n", " \n", " url = {https://arxiv.org/abs/1907.00490},\n", " \n", " author = {Biten, Ali Furkan and Tito, Rubèn and Mafla, Andres and Gomez, Lluis and Rusiñol, Marçal and Mathew, Minesh and Jawahar, C. V. and Valveny, Ernest and Karatzas, Dimosthenis},\n", " \n", " keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n", " \n", " title = {ICDAR 2019 Competition on Scene Text Visual Question Answering},\n", " \n", " publisher = {arXiv},\n", " \n", " year = {2019},\n", " \n", " copyright = {arXiv.org perpetual, non-exclusive license}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "py35-paddle1.2.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "vscode": { "interpreter": { "hash": "a5f44439766e47113308a61c45e3ba0ce79cefad900abb614d22e5ec5db7fbe0" } } }, "nbformat": 4, "nbformat_minor": 4 }