"VIMER-StrucTexT is a joint segment-level and token-level representation enhancement model for document image understanding, covering documents such as PDFs, invoices, and receipts. Due to the complexity of content and layout in visually rich documents (VRDs), structured text understanding is a challenging task. Most existing studies decouple this problem into two sub-tasks: entity labeling and entity linking, which require a full understanding of document context at both the token and segment levels. VIMER-StrucTexT is a unified framework that can flexibly and effectively handle both sub-tasks at their required representation granularity. In addition, we design a novel pre-training strategy with two self-supervised tasks, named **Segment Length Prediction** and **Paired Box Direction**, to learn a richer multi-modal representation.\n",
"We fine-tune VIMER-StrucTexT on three structured text understanding tasks, i.e., Token-based Entity Labeling (**T-ELB**), Segment-based Entity Labeling (**S-ELB**), and Segment-based Entity Linking (**S-ELK**). **Note:** the performance figures reported below for all **ELB** tasks are **entity-level F1 scores**, computed as the Macro F1 score over the related entity types.\n",
"\n",
"### 2.1 Token-based Entity Labeling\n",
"\n",
"#### 2.1.1 Datasets\n",
"\n",
"- [EPHOIE](https://github.com/HCIILAB/EPHOIE) is collected from scanned Chinese examination papers.\n",
"There are 10 text types, annotated at the character level, which means a single text segment may be composed of characters of different categories.\n",
"The required token-based entity types are as follows: `Subject, Test Time, Name, School, Examination Number, Seat Number, Class, Student Number, Grade, and Score`.\n",
"- [SROIE](https://rrc.cvc.uab.es/?ch=13&com=introduction) is a public dataset for receipt information extraction from the ICDAR 2019 Challenge. It contains 626 receipts for training and 347 receipts for testing. Every receipt is annotated with four predefined values: `company, date, address, and total`.\n",
"- [FUNSD](https://guillaumejaume.github.io/FUNSD/) is a form understanding benchmark with 199 real, fully annotated, scanned form images, such as marketing, advertising, and scientific reports, split into 149 training samples and 50 testing samples. The FUNSD dataset is suitable for a variety of tasks; we focus on the sub-tasks of **T-ELB** and **S-ELK** for the semantic entities `question, answer, and header`.\n",
"- [XFUND](https://github.com/doc-analysis/XFUND) is a multilingual form understanding benchmark dataset that includes human-labeled forms with key-value pairs in 7 languages. Each language includes 199 forms: 149 for training and 50 for testing. We evaluate on the Chinese section of XFUND.\n",
"- [FUNSD](https://guillaumejaume.github.io/FUNSD) annotates links as `(entity_from, entity_to)` pairs, each forming a question–answer pair, which guides the task of predicting the relations between semantic entities.\n",
"- [XFUND](https://github.com/doc-analysis/XFUND) uses the same setting as FUNSD. We evaluate on the Chinese section of the dataset.\n",
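"The `(entity_from, entity_to)` links are stored per entity in the annotation JSON. A simplified sketch of one FUNSD-style entity record (fields abridged for illustration; see the dataset page for the full schema):\n",
"```json\n",
"{\n",
"  \"id\": 0,\n",
"  \"text\": \"DATE:\",\n",
"  \"box\": [84, 109, 136, 119],\n",
"  \"label\": \"question\",\n",
"  \"linking\": [[0, 1]]\n",
"}\n",
"```\n",
"Here `\"linking\": [[0, 1]]` records one `(entity_from, entity_to)` pair: entity `0` (this question) links to entity `1` (its answer).\n",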
"This code base requires `PaddlePaddle 2.1.0+`. You can follow the [paddlepaddle-quick](https://www.paddlepaddle.org.cn/install/quick) guide to prepare the environment, or use pip:\n",
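"For example, to install the CPU build (GPU wheels depend on your CUDA version; see the guide linked above for the matching package name):\n",
"```shell\n",
"# CPU build shown; pick the matching GPU wheel for your CUDA version\n",
"python -m pip install paddlepaddle==2.1.0\n",
"```\n",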
"The following visualized images are sampled from practical domestic applications of StrucTexT. *Different mask colors represent different entity categories. Black lines connect segments or text lines that belong to the same entity, and orange lines indicate the link relationships between entities.*\n",
"- For more information and applications, please visit [Baidu OCR](https://ai.baidu.com/tech/ocr) open platform."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Citation\n",
"You can cite our paper as follows:\n",
"```\n",
"@inproceedings{li2021structext,\n",
" title={StrucTexT: Structured Text Understanding with Multi-Modal Transformers},\n",
" author={Li, Yulin and Qian, Yuxi and Yu, Yuechen and Qin, Xiameng and Zhang, Chengquan and Liu, Yan and Yao, Kun and Han, Junyu and Liu, Jingtuo and Ding, Errui},\n",
" booktitle={Proceedings of the 29th ACM International Conference on Multimedia},\n",