{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1.ERNIE-Layout Introduction\n",
    "\n",
    "Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.\n",
    "\n",
    "The work is accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout in PaddleNLP. You can visit [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) for more details.<br/>\n",
    "\n",
    "<img src=\"https://user-images.githubusercontent.com/40840292/195091552-86a2d174-24b0-4ddf-825a-4503e0bc390b.png\" width=\"60%\" height=\"60%\"> <br />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2.Model Performance\n",
    "\n",
    "ERNIE-Layout can be used to process and analyze multimodal documents. ERNIE-Layout is effective in tasks such as document classification, information extraction, document VQA with layout data (documents, pictures, etc). \n",
    "\n",
    "- Invoice VQA\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/196118171-fd3e49a0-b9f1-4536-a904-c48f709a2dec.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
    "</div>\n",
    "\n",
    "- Poster VQA\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195610368-04230855-62de-439e-b708-2c195b70461f.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
    "</div>\n",
    "\n",
    "- WebPage VQA\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195611613-bdbe692e-d7f2-4a2b-b548-1a933463b0b9.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
    "</div>\n",
    "\n",
    "\n",
    "- Table VQA\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195610692-8367f1c8-32c2-4b5d-9514-a149795cf609.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
    "</div>\n",
    "\n",
    "\n",
    "- Exam Paper VQA\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195823294-d891d95a-2ef8-4519-be59-0fedb96c00de.png width=\"60%\" height=\"60%\" hspace='10'/>\n",
    "</div>\n",
    "\n",
    "\n",
    "- English invoice VQA by multilingual(CH, EN, JP, Th, ES, RUS) prompt\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195610820-7fb88608-b317-45fc-a6ab-97bf3b20a4ac.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
    "</div>\n",
    "\n",
    "- Chinese invoice VQA by multilingual(CHS, CHT, EN, JP, FR) prompt\n",
    "\n",
    "<div align=\"center\">\n",
    "    <img src=https://user-images.githubusercontent.com/40840292/195611075-9323ce9f-134b-4657-ab1c-f4892075d909.png width=\"60%\" height=\"60%\" hspace='15'/>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3.How To Use The Model\n",
    "\n",
    "## 3.1 Model Inference\n",
    "\n",
    "You can use DocPrompt through `paddlenlp.Taskflow` for model inference.\n",
    "\n",
    "* Input Format\n",
    "    * Support single and batch forecasting\n",
    "    * Support local image path input\n",
    "    * Support http image link input\n",
    "\n",
    "```python\n",
    "[\n",
    "  {\"doc\": \"./invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]},\n",
    "  {\"doc\": \"./resume.png\", \"prompt\": [\"五百丁本次想要担任的是什么职位?\", \"五百丁是在哪里上的大学?\", \"大学学的是什么专业?\"]}\n",
    "]\n",
    "```\n",
    "\n",
    "By default, PaddleOCR is used for OCR identification, and users can use the `word_ boxes` Pass in your own OCR results in the format `List[str, List[float, float, float, float]]`.\n",
    "\n",
    "```python \n",
    "[\n",
    "  {\"doc\": doc_path, \"prompt\": prompt, \"word_boxes\": word_boxes}\n",
    "]\n",
    "```\n",
    "\n",
    "\n",
    "\n",
    "  \n",
    "\n",
    "* Description of configurable parameters\n",
    "  * `batch_size`:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
    "  * `lang`:Select the language of PaddleOCR. `ch` can be used in Chinese English mixed pictures. `en` is better in English pictures. The default is `ch`.\n",
    "  * `topn`: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Install PaddleNLP and PaddleOCR\n",
    "!pip install --upgrade paddlenlp\n",
    "!pip install --upgrade paddleocr "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "from paddlenlp import Taskflow\n",
    "\n",
    "docprompt = Taskflow(\"document_intelligence\")\n",
    "pprint(docprompt([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))"
   ]
  },
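  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As referenced in the parameter list above, the cell below is a minimal sketch of passing the configurable parameters (`batch_size`, `lang`, `topn`) when creating the `Taskflow`; the chosen values are only illustrative, and the image is the same sample invoice used in the quick-start cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# A configured DocPrompt Taskflow (illustrative values):\n",
    "#   batch_size=2 -> inference batch size, adjust to your machine\n",
    "#   lang=\"ch\"    -> PaddleOCR language for mixed Chinese/English images\n",
    "#   topn=2       -> return the two highest-probability answers per prompt\n",
    "docprompt_cfg = Taskflow(\"document_intelligence\", batch_size=2, lang=\"ch\", topn=2)\n",
    "pprint(docprompt_cfg([{\"doc\": \"https://bj.bcebos.com/paddlenlp/taskflow/document_intelligence/images/invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"校验码是多少?\"]}]))"
   ]
  },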
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Model Fine-tuning And Deployment\n",
    "\n",
    "ERNIE-Layout is a multimodal pretrained model based on layout knowledge enhancement technology, and it integrates text, image, layout and other information for joint modeling. It can show excellent cross modal semantic alignment and layout understanding ability on tasks including but not limited to document information extraction, document visual question answering, document image classification and so on.\n",
    "\n",
    "For details about the fine-tuning and deployment of the above tasks using ERNIE-Layout, please refer to: [ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Model Principle\n",
    "\n",
    "For document understanding, the text reading order in the document is very important. At present, most mainstream models based on OCR (Optical Character Recognition) technology follow the principle of \"from left to right, from top to bottom\". However, for the complex layout of the document with a mixture of columns, text, graphics and tables, the reading order obtained according to the OCR results is wrong in most cases, As a result, the model cannot accurately understand the content of the document.\n",
    "\n",
    "Humans usually read in hierarchies and blocks according to the document structure and layout. Inspired by this, Baidu researchers proposed an innovative idea of layout knowledge enhancement to correct the reading order in the document pre training model. The industry-leading document parsing tool (Document Parser) on the TextMind platform can accurately identify the block information in the document, produce the correct document reading order, and integrate the reading order signal into the model training, thus enhancing the effective use of layout information and improving the model's understanding of complex documents.\n",
    "\n",
    "Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5.Matters Needing Attention\n",
    "\n",
    "## DocPrompt Tips\n",
    "\n",
    "* Prompt design: In DocPrompt, Prompt can be a statement (for example, the Key in the document key value pair) or a question. Because it is an open domain extracted question and answer, DocPrompt has no special restrictions on the design of Prompt, as long as it conforms to natural language semantics. If you are not satisfied with the current extraction results, you can try some different Prompts.\n",
    "\n",
    "* Languages supported:Support Chinese and English image input of local path or HTTP link. Prompt supports multiple languages. Refer to the examples of different scenarios above."
   ]
  },
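  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the prompt-design tip above, assuming the local `./invoice.jpg` from the earlier input-format example: the same field is queried once as a question and once as a statement-style prompt (the key of a key-value pair).\n",
    "\n",
    "```python\n",
    "# Question-style and statement-style prompts for the same field.\n",
    "docprompt([{\"doc\": \"./invoice.jpg\", \"prompt\": [\"发票号码是多少?\", \"发票号码\"]}])\n",
    "```"
   ]
  },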
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6.Relevant Papers And Citations\n",
    "#### ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding\n",
    "```\n",
    "@misc{https://doi.org/10.48550/arxiv.2210.06155,\n",
    "  doi = {10.48550/ARXIV.2210.06155},\n",
    "  \n",
    "  url = {https://arxiv.org/abs/2210.06155},\n",
    "  \n",
    "  author = {Peng, Qiming and Pan, Yinxu and Wang, Wenjin and Luo, Bin and Zhang, Zhenyu and Huang, Zhengjie and Hu, Teng and Yin, Weichong and Chen, Yongfeng and Zhang, Yin and Feng, Shikun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},\n",
    "  \n",
    "  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},\n",
    "  \n",
    "  title = {ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding},\n",
    "  \n",
    "  publisher = {arXiv},\n",
    "  \n",
    "  year = {2022},\n",
    "  \n",
    "  copyright = {arXiv.org perpetual, non-exclusive license}\n",
    "}\n",
    "```\n",
    "#### ICDAR 2019 Competition on Scene Text Visual Question Answering\n",
    "```\n",
    "@misc{https://doi.org/10.48550/arxiv.1907.00490,\n",
    "  doi = {10.48550/ARXIV.1907.00490},\n",
    "  \n",
    "  url = {https://arxiv.org/abs/1907.00490},\n",
    "  \n",
    "  author = {Biten, Ali Furkan and Tito, Rubèn and Mafla, Andres and Gomez, Lluis and Rusiñol, Marçal and Mathew, Minesh and Jawahar, C. V. and Valveny, Ernest and Karatzas, Dimosthenis},\n",
    "  \n",
    "  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},\n",
    "  \n",
    "  title = {ICDAR 2019 Competition on Scene Text Visual Question Answering},\n",
    "  \n",
    "  publisher = {arXiv},\n",
    "  \n",
    "  year = {2019},\n",
    "  \n",
    "  copyright = {arXiv.org perpetual, non-exclusive license}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "vscode": {
   "interpreter": {
    "hash": "a5f44439766e47113308a61c45e3ba0ce79cefad900abb614d22e5ec5db7fbe0"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}