"With the digital transformation of many industries, the structural analysis and content extraction of electronic documents have become a hot research topic. Electronic documents include scanned image documents and computer-generated digital documents, involving documents, industry reports, contracts, employment agreements, invoices, resumes and other types. The intelligent document understanding task aims to understand documents with various formats, layouts and contents, including document classification, document information extraction, document question answering and other tasks. Different from plain text documents, documents contain tables, pictures and other contents, and contain rich visual information. Because the document is rich in content, complex in layout, diverse in font style, and noisy in data, the task of document understanding is extremely challenging. With the great success of pre training language models such as ERNIE in the NLP field, people began to focus on large-scale pre training in the field of document understanding. Baidu put forward the cross modal document understanding model ERNIE-Layout, which is the first time to integrate the layout knowledge enhancement technology into the cross modal document pre training, refreshing the world's best results in four document understanding tasks, and topping the DocVQA list. At the same time, ERNIE Layout has been integrated into Baidu's intelligent document analysis platform TextMind to help enterprises upgrade digitally.\n",
"\n",
"\n",
"Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets.\n",
"\n",
"\n",
"ERNIE-Layout takes the Wenxin text big model ERNIE as the base, integrates text, image, layout and other information for cross modal joint modeling, innovatively introduces layout knowledge enhancement, proposes self-monitoring pre training tasks such as reading order prediction, fine grain image text matching, upgrades spatial decoupling attention mechanism, and greatly improves the effect on each data set. Related work [ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding](https://arxiv.org/abs/2210.06155) has been included in the EMNLP 2022 Findings Conference. Considering that document intelligence is widely commercially available in multiple languages, it relies on PaddleNLP to open source the strongest multilingual cross modal document pre training model ERNIE Layout in the industry.\n",
"The work is accepted by EMNLP 2022 (Findings). To expand the scope of commercial applications for document intelligence, we release the multilingual model of ERNIE-Layout in PaddleNLP. You can visit [https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout) for more details.<br/>\n",
"ERNIE-Layout is a large cross modal model officially produced by the Flying Slurry. For more details about PaddleNLP, please visit <https://github.com/PaddlePaddle/PaddleNLP/> for details.<br/>\n",
"ERNIE-Layout can be used to process but not limited to tasks such as document classification, information extraction, document Q&A with layout data (documents, pictures, etc.). Application scenarios include but not limited to invoice extraction Q&A, poster extraction Q&A, web page extraction Q&A, table extraction Q&A, test paper extraction Q&A, English bill multilingual (Chinese, English, Japanese, Thai, Spanish, Russian) extraction Q&A Chinese bills in multiple languages (simplified, traditional, English, Japanese, French). Taking document information extraction and document visual Q&A as examples, the effect of using ERNIE-Layout model is shown below.\n",
"\n",
"## 2.1Document Information Extraction Task:\n",
"ERNIE-Layout can be used to process and analyze multimodal documents. ERNIE-Layout is effective in tasks such as document classification, information extraction, document VQA with layout data (documents, pictures, etc). \n",
"### 2.1.1Dataset:\n",
"\n",
"Data sets include FUNSD, XFUND-ZH, etc. FUNSD is an English data set for form understanding on noisy scanned documents. The data set contains 199 real, fully annotated and scanned forms. Documents are noisy, and the appearance of various forms varies greatly, so understanding forms is a challenging task. The dataset can be used for a variety of tasks, including text detection, optical character recognition, spatial layout analysis, and entity tagging/linking. XFUND is a multilingual form understanding benchmark dataset, including manually labeled key value pair forms in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese). XFUND-ZH is the Chinese version of XFUND.\n",
"- Invoice VQA\n",
"### 2.1.2Quick View Of Model Effect:\n",
"\n",
"The model effect of ERNIE-Layout on FUNSD is:\n",
"## 2.2Document Visual Question And Answer Task:\n",
"- Poster VQA\n",
"### 2.2.1Dataset:\n",
"\n",
"The data set is DocVQA-ZH, and DocVQA-ZH has stopped submitting the list. Therefore, we will re divide the original training set to evaluate the model effect. After division, the training set contains 4187 pictures, the verification set contains 500 pictures, and the test set contains 500 pictures.\n",
"We have integrated the ERNIE-Layout DocPrompt Engine on the [huggingface page](https://huggingface.co/spaces/PaddlePaddle/ERNIE-Layout), which can be experienced with one click.\n",
"\n",
"\n",
"**Taskflow**\n",
"## 3.1 Model Inference\n",
"\n",
"\n",
"Of course, you can also use Taskflow for reasoning. Through `paddlenlp.Taskflow` calls DocPrompt with three lines of code, and has the ability to extract questions and answers from multilingual documents. Some application scenarios are shown below:\n",
"You can use DocPrompt through `paddlenlp.Taskflow` for model inference.\n",
" * `batch_size`:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
" * `batch_size`:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
" * `lang`:Select the language of PaddleOCR. `ch` can be used in Chinese English mixed pictures. `en` is better in English pictures. The default is `ch`.\n",
" * `lang`:Select the language of PaddleOCR. `ch` can be used in Chinese English mixed pictures. `en` is better in English pictures. The default is `ch`.\n",
" * `topn`: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default.\n",
" * `topn`: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default."
"ERNIE-Layout is a cross modal general document pre training model that relies on Wenxin ERNIE, based on layout knowledge enhancement technology, and integrates text, image, layout and other information for joint modeling. It can show excellent cross modal semantic alignment and layout understanding ability on tasks including but not limited to document information extraction, document visual question answering, document image classification and so on.\n",
"ERNIE-Layout is a multimodal pretrained model based on layout knowledge enhancement technology, and it integrates text, image, layout and other information for joint modeling. It can show excellent cross modal semantic alignment and layout understanding ability on tasks including but not limited to document information extraction, document visual question answering, document image classification and so on.\n",
"\n",
"\n",
"For details about the fine-tuning and deployment of the above tasks using ERNIE-Layout, please refer to: [ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
"For details about the fine-tuning and deployment of the above tasks using ERNIE-Layout, please refer to: [ERNIE-Layout](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout\n",
")"
")"
...
@@ -120,23 +153,13 @@
...
@@ -120,23 +153,13 @@
"cell_type": "markdown",
"cell_type": "markdown",
"metadata": {},
"metadata": {},
"source": [
"source": [
"# 4.Model Principle\n",
"# 4. Model Principle\n",
"* Layout knowledge enhancement technology\n",
"\n",
"* Fusion of text, image, layout and other information for joint modeling\n",
"\n",
"* Reading order prediction+fine-grained image text matching: two self-monitoring pre training tasks\n",
"\n",
"\n",
"<figure>\n",
"For document understanding, the text reading order in the document is very important. At present, most mainstream models based on OCR (Optical Character Recognition) technology follow the principle of \"from left to right, from top to bottom\". However, for the complex layout of the document with a mixture of columns, text, graphics and tables, the reading order obtained according to the OCR results is wrong in most cases, As a result, the model cannot accurately understand the content of the document.\n",
"For document understanding, the text reading order in the document is very important. At present, most mainstream models based on OCR (Optical Character Recognition) technology follow the principle of \"from left to right, from top to bottom\". However, for the complex layout of the document with a mixture of columns, text, graphics and tables, the reading order obtained according to the OCR results is wrong in most cases, As a result, the model cannot accurately understand the content of the document.\n",
"\n",
"\n",
"Humans usually read in hierarchies and blocks according to the document structure and layout. Inspired by this, Baidu researchers proposed an innovative idea of layout knowledge enhancement to correct the reading order in the document pre training model. The industry-leading document parsing tool (Document Parser) on the TextMind platform can accurately identify the block information in the document, produce the correct document reading order, and integrate the reading order signal into the model training, thus enhancing the effective use of layout information and improving the model's understanding of complex documents.\n",
"Humans usually read in hierarchies and blocks according to the document structure and layout. Inspired by this, Baidu researchers proposed an innovative idea of layout knowledge enhancement to correct the reading order in the document pre training model. The industry-leading document parsing tool (Document Parser) on the TextMind platform can accurately identify the block information in the document, produce the correct document reading order, and integrate the reading order signal into the model training, thus enhancing the effective use of layout information and improving the model's understanding of complex documents.\n",
"\n",
"\n",
"Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks.\n",
"Based on the layout knowledge enhancement technology, and relying on Wenxin ERNIE, Baidu researchers proposed a cross modal general document pre training model ERNIE-Layout, which integrates text, image, layout and other information for joint modeling. As shown in the figure below, ERNIE-Layout innovatively proposed two self-monitoring pre training tasks: reading order prediction and fine-grained image text matching, which effectively improved the model's cross modal semantic alignment ability and layout understanding ability in document tasks."
"* batch_size:Please adjust the batch size according to the machine conditions. The default value is 1.\n",
"\n",
"* lang:Choose the language of PaddleOCR. ch can be used in Chinese English mixed pictures. en has better effect on English pictures. The default is ch.\n",
"\n",
"\n",
"* topn: If the model identifies multiple results, it will return the first n results with the highest probability value, which is 1 by default.\n",
"## DocPrompt Tips\n",
"\n",
"## 5.2Tips\n",
"\n",
"\n",
"* Prompt design: In DocPrompt, Prompt can be a statement (for example, the Key in the document key value pair) or a question. Because it is an open domain extracted question and answer, DocPrompt has no special restrictions on the design of Prompt, as long as it conforms to natural language semantics. If you are not satisfied with the current extraction results, you can try some different Prompts.\n",
"* Prompt design: In DocPrompt, Prompt can be a statement (for example, the Key in the document key value pair) or a question. Because it is an open domain extracted question and answer, DocPrompt has no special restrictions on the design of Prompt, as long as it conforms to natural language semantics. If you are not satisfied with the current extraction results, you can try some different Prompts.\n",