Unverified commit 7ee76192, authored by Evezerest, committed by GitHub

Merge pull request #6236 from Evezerest/release2.5

remove notebooks
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 1. OCR技术背景\n",
"## 1.1 OCR技术的应用场景\n",
"* **<font color=red>OCR是什么</font>**\n",
"OCR(Optical Character Recognition,光学字符识别)是计算机视觉重要方向之一。传统定义的OCR一般面向扫描文档类对象,现在我们常说的OCR一般指场景文字识别(Scene Text Recognition,STR),主要面向自然场景,如下图中所示的牌匾等各种自然场景可见的文字。\n",
"<center>图1 文档场景文字识别 VS. 自然场景文字识别</center>\n",
"* **<font color=red>OCR有哪些应用场景?</font>**\n",
"<center>图2 OCR技术的应用场景</center>\n",
"<center>图3 多模态场景中的通用OCR</center>\n",
"## 1.2 OCR技术挑战\n",
"* **<font color=red>算法层</font>**\n",
"<center>图4 OCR算法层技术难点</center>\n",
"* **<font color=red>应用层</font>**\n",
"1. **海量数据要求OCR能够实时处理。** OCR应用常对接海量数据,我们要求或希望数据能够得到实时处理,模型的速度做到实时是一个不小的挑战。\n",
"2. **端侧应用要求OCR模型足够轻量,识别速度足够快。** OCR应用常部署在移动端或嵌入式硬件,端侧OCR应用一般有两种模式:上传到服务器 vs. 端侧直接识别,考虑到上传到服务器的方式对网络有要求,实时性较低,并且请求量过大时服务器压力大,以及数据传输的安全性问题,我们希望能够直接在端侧完成OCR识别,而端侧的存储空间和计算能力有限,因此对OCR模型的大小和预测速度有很高的要求。\n",
"<center>图5 OCR应用层技术难点</center>\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 2. OCR前沿算法\n",
"## 2.1 文本检测\n",
"文本检测的任务是定位出输入图像中的文字区域。近年来学术界关于文本检测的研究非常丰富,一类方法将文本检测视为目标检测中的一个特定场景,基于通用目标检测算法进行改进适配,如TextBoxes[1]基于一阶段目标检测器SSD[2]算法,调整目标框使之适合极端长宽比的文本行,CTPN[3]则是基于Faster RCNN[4]架构改进而来。但是文本检测与目标检测在目标信息以及任务本身上仍存在一些区别,如文本一般长宽比较大,往往呈“条状”,文本行之间可能比较密集,弯曲文本等,因此又衍生了很多专用于文本检测的算法,如EAST[5]、PSENet[6]、DBNet[7]等等。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/548b50212935402abb2e671c158c204737c2c64b9464442a8f65192c8a31b44d\" width=\"500\"></center>\n",
"<center>图6 文本检测任务示例</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/4f4ea65578384900909efff93d0b7386e86ece144d8c4677b7bc94b4f0337cfb\" width=\"800\"></center>\n",
"<center>图7 文本检测算法概览</center>\n",
"<center>图8 (左)基于回归的CTPN[3]算法优化anchor (中)基于分割的DB[7]算法优化后处理 (右)回归+分割的SAST[10]算法</center>\n",
"## 2.2 文本识别\n",
"<center>图9 (左)规则文本 VS. (右)不规则文本</center>\n",
"规则文本识别的算法根据解码方式的不同可以大致分为基于CTC和Sequence2Sequence两种,将网络学习到的序列特征 转化为 最终的识别结果 的处理方式不同。基于CTC的算法以经典的CRNN[11]为代表。\n",
"<center>图10 基于CTC的识别算法 VS. 基于Attention的识别算法</center>\n",
"<center>图11 基于字符分割的识别算法[15]</center>\n",
"## 2.3 文档结构化识别\n",
"* **版面分析**\n",
"版面分析(Layout Analysis)主要是对文档图像进行内容分类,类别一般可分为纯文本、标题、表格、图片等。现有方法一般将文档中不同的板式当做不同的目标进行检测或分割,如Soto Carlos[16]在目标检测算法Faster R-CNN的基础上,结合上下文信息并利用文档内容的固有位置信息来提高区域检测性能;Sarkar Mausoom[17]等人提出了一种基于先验的分割机制,在非常高的分辨率的图像上训练文档分割模型,解决了过度缩小原始图像导致的密集区域不同结构无法区分进而合并的问题。\n",
"<center>图12 版面分析任务示意图</center>\n",
"* **表格识别**\n",
"表格识别(Table Recognition)的任务就是将文档里的表格信息进行识别和转换到excel文件中。文本图像中表格种类和样式复杂多样,例如不同的行列合并,不同的内容文本类型等,除此之外文档的样式和拍摄时的光照环境等都为表格识别带来了极大的挑战。这些挑战使得表格识别一直是文档理解领域的研究难点。\n",
"<center>图13 表格识别任务示意图</center>\n",
"表格识别的方法种类较为丰富,早期的基于启发式规则的传统算法,如Kieninger[18]等人提出的T-Rect等算法,一般通过人工设计规则,连通域检测分析处理;近年来随着深度学习的发展,开始涌现一些基于CNN的表格结构识别算法,如Siddiqui Shoaib Ahmed[19]等人提出的DeepTabStR,Raja Sachin[20]等人提出的TabStruct-Net等;此外,随着图神经网络(Graph Neural Network)的兴起,也有一些研究者尝试将图神经网络应用到表格结构识别问题上,基于图神经网络,将表格识别看作图重建问题,如Xue Wenyuan[21]等人提出的TGRNet;基于端到端的方法直接使用网络完成表格结构的HTML表示输出,端到端的方法大多采用Seq2Seq方法来完成表格结构的预测,如一些基于Attention或Transformer的方法,如TableMaster[22]。\n",
"<center>图14 表格识别方法示意图</center>\n",
"* **关键信息提取**\n",
"关键信息提取(Key Information Extraction,KIE)是Document VQA中的一个重要任务,主要从图像中提取所需要的关键信息,如从身份证中提取出姓名和公民身份号码信息,这类信息的种类往往在特定任务下是固定的,但是在不同任务间是不同的。\n",
"<center>图15 DocVQA任务示意图</center>\n",
"- SER: 语义实体识别 (Semantic Entity Recognition),对每一个检测到的文本进行分类,如将其分为姓名,身份证。如下图中的黑色框和红色框。\n",
"- RE: 关系抽取 (Relation Extraction),对每一个检测到的文本进行分类,如将其分为问题和的答案。然后对每一个问题找到对应的答案。如下图中的红色框和黑色框分别代表问题和答案,黄色线代表问题和答案之间的对应关系。\n",
"<center>图16 ser与re任务</center>\n",
"一般的KIE方法基于命名实体识别(Named Entity Recognition,NER)[4]来研究,但是这类方法只利用了图像中的文本信息,缺少对视觉和结构信息的使用,因此精度不高。在此基础上,近几年的方法都开始将视觉和结构信息与文本信息融合到一起,按照对多模态信息进行融合时所采用的原理可以将这些方法分为下面四种:\n",
"- 基于Grid的方法\n",
"- 基于Token的方法\n",
"- 基于GCN的方法\n",
"- 基于End to End 的方法\n",
"## 2.4 其他相关技术\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 3. OCR技术的产业实践\n",
"> 你是小王,该怎么办? \n",
"> 1. 我不会,我不行,我不干了😭\n",
"> 2. 建议老板找外包公司或者商业化方案,反正花老板的钱😊\n",
"> 3. 网上找找类似项目,面向Github编程😏\n",
"## 3.1 产业实践难点\n",
"<center>图17 OCR技术产业实践三大难题</center>\n",
"**1. 找不到、选不出**\n",
"**2. 不适用产业场景**\n",
"**3. 优化难、训练部署问题多**\n",
"## 3.2 产业级OCR开发套件PaddleOCR\n",
"<center>图18 PaddleOCR开发套件全景图</center>\n",
"**在工业级部署层面**,PaddleOCR提供了基于Paddle Inference的服务器端预测方案,基于Paddle Serving的服务化部署方案,以及基于Paddle-Lite的端侧部署方案,满足不同硬件环境下的部署需求,同时提供了基于PaddleSlim的模型压缩方案,可以进一步压缩模型大小。以上部署方式都完成了训推一体全流程打通,以保障开发者可以高效部署,稳定可靠。\n",
"<center>图19 PPOCRLabel使用示意图</center>\n",
"<center>图20 Style-Text合成效果示例</center>\n",
"### 3.2.1 PP-OCR与PP-Structrue\n",
"<center>图21 PP-OCR中英文模型识别结果示例</center>\n",
"<center>图22 PP-OCR系统pipeline示意图</center>\n",
"- 文本检测模块:核心是一个基于[DB](https://arxiv.org/abs/1911.08947)检测算法训练的文本检测模型,检测出图像中的文字区域;\n",
"- 检测框矫正模块:将检测到的文本框输入检测框矫正模块,在这一阶段,将四点表示的文本框矫正为矩形框,方便后续进行文本识别,另一方面会进行文本方向判断和校正,例如如果判断文本行是倒立的情况,则会进行转正,该功能通过训练一个文本方向分类器实现;\n",
"- 文本识别模块:最后文本识别模块对矫正后的检测框进行文本识别,得到每个文本框内的文字内容,PP-OCR中使用的经典文本识别算法[CRNN](https://arxiv.org/abs/1507.05717)。\n",
"PP-OCR模型分为mobile版(轻量版)和server版(通用版),其中mobile版模型主要基于轻量级骨干网络MobileNetV3进行优化,优化后模型(检测模型+文本方向分类模型+识别模型)大小仅8.1M,CPU上平均单张图像预测耗时350ms,T4 GPU上约110ms,裁剪量化后,可在精度不变的情况下进一步压缩到3.5M,便于端侧部署,在骁龙855上测试预测耗时仅260ms。更多的PP-OCR评估数据可参考[benchmark](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.2/doc/doc_ch/benchmark.md)。\n",
"- 在模型效果上,相对于PP-OCR mobile版本提升超7%;\n",
"- 在速度上,相对于PP-OCR server版本提升超过220%;\n",
"- 在模型大小上,11.6M的总大小,服务器端和移动端都可以轻松部署。\n",
"<center>图23 PP-OCR的英文数字模型和多语言模型识别效果示意图</center>\n",
"PP-Structure支持版面分析(layout analysis)、表格识别(table recognition)、文档视觉问答(DocVQA)三种子任务。\n",
"- 支持对图片形式的文档进行版面分析,可以划分文字、标题、表格、图片以及列表5类区域(与Layout-Parser联合使用)\n",
"- 支持文字、标题、图片以及列表区域提取为文字字段(与PP-OCR联合使用)\n",
"- 支持表格区域进行结构化分析,最终结果输出Excel文件\n",
"- 支持Python whl包和命令行两种方式,简单易用\n",
"- 支持版面分析和表格结构化两类任务自定义训练\n",
"- 支持VQA任务-SER和RE\n",
"<center>图24 PP-Structure系统示意图(本图仅含版面分析+表格识别)</center>\n",
"### 3.2.2 工业级部署方案\n",
"飞桨支持全流程、全场景推理部署,模型来源主要分为三种,第一种使用PaddlePaddle API构建网络结构进行训练所得,第二种是基于飞桨套件系列,飞桨套件提供了丰富的模型库、简洁易用的API,具备开箱即用,包括视觉模型库PaddleCV、智能语音库PaddleSpeech以及自然语言处理库PaddleNLP等,第三种采用X2Paddle工具从第三方框架(PyTorh、ONNX、TensorFlow等)产出的模型。\n",
"飞桨模型可以选用PaddleSlim工具进行压缩、量化以及蒸馏,支持五种部署方案,分别为服务化Paddle Serving、服务端/云端Paddle Inference、移动端/边缘端Paddle Lite、网页前端Paddle.js, 对于Paddle不支持的硬件,比如MCU、地平线、鲲云等国产芯片,可以借助Paddle2ONNX转化为支持ONNX的第三方框架。\n",
"<center>图25 飞桨支持部署方式</center>\n",
"Paddle Inference支持服务端和云端部署,具备高性能与通用性,针对不同平台和不同应用场景进行了深度的适配和优化,Paddle Inference是飞桨的原生推理库,保证模型在服务器端即训即用,快速部署,适用于高性能硬件上使用多种应用语言环境部署算法复杂的模型,硬件覆盖x86 CPU、Nvidia GPU、以及百度昆仑XPU、华为昇腾等AI加速器。\n",
"Paddle Lite 是端侧推理引擎,具有轻量化和高性能特点,针对端侧设备和各应用场景进行了深度的设配和优化。当前支持Android、IOS、嵌入式Linux设备、macOS 等多个平台,硬件覆盖ARM CPU和GPU、X86 CPU和新硬件如百度昆仑、华为昇腾与麒麟、瑞芯微等。\n",
"Paddle Serving是一套高性能服务框架,旨在帮助用户几个步骤快速将模型在云端服务化部署。目前Paddle Serving支持自定义前后处理、模型组合、模型热加载更新、多机多卡多模型、分布式推理、K8S部署、安全网关和模型加密部署、支持多语言多客户端访问等功能,Paddle Serving官方还提供了包括PaddleOCR在内的40多种模型的部署示例,以帮助用户更快上手。\n",
"<center>图26 飞桨支持部署方式</center>\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 4. 总结\n",
"* 第二、三章分别介绍检测、识别技术并实践;\n",
"* 第四章介绍PP-OCR优化策略; \n",
"* 第五章进行预测部署实战; \n",
"* 第六章介绍文档结构化; \n",
"* 第七章介绍端到端、数据预处理、数据合成等其他OCR相关算法; \n",
"* 第八章介绍OCR相关数据集和数据合成工具。\n",
"# 参考文献\n",
"[1] Liao, Minghui, et al. \"Textboxes: A fast text detector with a single deep neural network.\" Thirty-first AAAI conference on artificial intelligence. 2017.\n",
"[2] Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//European conference on computer vision. Springer, Cham, 2016: 21-37.\n",
"[3] Tian, Zhi, et al. \"Detecting text in natural image with connectionist text proposal network.\" European conference on computer vision. Springer, Cham, 2016.\n",
"[4] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[J]. Advances in neural information processing systems, 2015, 28: 91-99.\n",
"[5] Zhou, Xinyu, et al. \"East: an efficient and accurate scene text detector.\" Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.\n",
"[6] Wang, Wenhai, et al. \"Shape robust text detection with progressive scale expansion network.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"[7] Liao, Minghui, et al. \"Real-time scene text detection with differentiable binarization.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.\n",
"[8] Deng, Dan, et al. \"Pixellink: Detecting scene text via instance segmentation.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.\n",
"[9] He K, Gkioxari G, Dollár P, et al. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.\n",
"[10] Wang P, Zhang C, Qi F, et al. A single-shot arbitrarily-shaped text detector based on context attended multi-task \n",
"learning[C]//Proceedings of the 27th ACM international conference on multimedia. 2019: 1277-1285.\n",
"[11] Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.\n",
"[12] Star-Net Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spa- tial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.\n",
"[13] Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4168-4176).\n",
"[14] Sheng, F., Chen, Z., & Xu, B. (2019, September). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 781-786). IEEE.\n",
"[15] Lyu P, Liao M, Yao C, et al. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 67-83.\n",
"[16] Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.\n",
"[17] Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.\n",
"[18] Kieninger T, Dengel A. A paper-to-HTML table converting system[C]//Proceedings of document analysis systems (DAS). 1998, 98: 356-365.\n",
"[19] Siddiqui S A, Fateh I A, Rizvi S T R, et al. Deeptabstr: Deep learning based table structure recognition[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1403-1409.\n",
"[20] Raja S, Mondal A, Jawahar C V. Table structure recognition using top-down and bottom-up cues[C]//European Conference on Computer Vision. Springer, Cham, 2020: 70-86.\n",
"[21] Xue W, Yu B, Wang W, et al. TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition[J]. arXiv preprint arXiv:2106.10598, 2021.\n",
"[22] Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.\n",
"[23] Du Y, Li C, Guo R, et al. PP-OCR: A practical ultra lightweight OCR system[J]. arXiv preprint arXiv:2009.09941, 2020.\n",
"[24] Du Y, Li C, Guo R, et al. PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System[J]. arXiv preprint arXiv:2109.03144, 2021.\n",
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 文本检测FAQ\n",
" - 文本检测训练相关\n",
" - 文本检测预测相关"
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 1. 文本检测训练相关FAQ\n",
"**1.1 PaddleOCR提供的文本检测算法包括哪些?**\n",
"**1.3 文本检测训练标签是否需要具体文本标注,标签中的”###”是什么意思?**\n",
" \n",
"**1.4 对于文本行较紧密的情况下训练的文本检测模型效果较差?**\n",
"**1.5 对于一些尺寸较大的文档类图片, DB在检测时会有较多的漏检,怎么避免这种漏检的问题呢?**\n",
"**1.6 DB模型弯曲文本(如略微形变的文档图像)漏检问题?**\n",
"**A**: DB后处理中计算文本框平均得分时,是求rectangle区域的平均分数,容易造成弯曲文本漏检,已新增求polygon区域的平均分数,会更准确,但速度有所降低,可按需选择,在相关pr中可查看[可视化对比效果](https://github.com/PaddlePaddle/PaddleOCR/pull/2604)。该功能通过参数 [det_db_score_mode](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/tools/infer/utility.py#L51)进行选择,参数值可选[`fast`(默认)、`slow`],`fast`对应原始的rectangle方式,`slow`对应polygon方式。感谢用户[buptlihang](https://github.com/buptlihang)提[pr](https://github.com/PaddlePaddle/PaddleOCR/pull/2574)帮助解决该问题。\n",
"**1.7 简单的对于精度要求不高的OCR任务,数据集需要准备多少张呢?**\n",
"**1.8 当训练数据量少时,如何获取更多的数据?**\n",
"**A**:当训练数据量少时,可以尝试以下三种方式获取更多的数据:(1)人工采集更多的训练数据,最直接也是最有效的方式。(2)基于PIL和opencv基本图像处理或者变换。例如PIL中ImageFont, Image, ImageDraw三个模块将文字写到背景中,opencv的旋转仿射变换,高斯滤波等。(3)利用数据生成算法合成数据,例如pix2pix等算法。\n",
"**1.9 如何更换文本检测/识别的backbone?**\n",
"**1.10 如何对检测模型finetune,比如冻结前面的层或某些层使用小的学习率学习?**\n",
"**1.11 DB的预处理部分,图片的长和宽为什么要处理成32的倍数?**\n",
"**1.12 在PP-OCR系列的模型中,文本检测的骨干网络为什么没有使用SEBlock?**\n",
"**1.13 PP-OCR检测效果不好,该如何优化?**\n",
"**A**: 具体问题具体分析:\n",
"- 如果在你的场景上检测效果不可用,首选是在你的数据上做finetune训练;\n",
"- 如果图像过大,文字过于密集,建议不要过度压缩图像,可以尝试修改检测预处理的resize逻辑,防止图像被过度压缩;\n",
"- 检测框大小过于紧贴文字或检测框过大,可以调整db_unclip_ratio这个参数,加大参数可以扩大检测框,减小参数可以减小检测框大小;\n",
"- 检测框存在很多漏检问题,可以减小DB检测后处理的阈值参数det_db_box_thresh,防止一些检测框被过滤掉,也可以尝试设置det_db_score_mode为'slow';\n",
"- 其他方法可以选择use_dilation为True,对检测输出的feature map做膨胀处理,一般情况下,会有效果改善;\n",
"## 2. 文本检测预测相关FAQ\n",
"**2.1 DB有些框太贴文本了反而去掉了一些文本的边角影响识别,这个问题有什么办法可以缓解吗?**\n",
"**2.2 为什么PaddleOCR检测预测是只支持一张图片测试?即test_batch_size_per_card=1**\n",
"**2.3 在CPU上加速PaddleOCR的文本检测模型预测?**\n",
"**A**:x86 CPU可以使用mkldnn(OneDNN)进行加速;在支持mkldnn加速的CPU上开启[enable_mkldnn](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L105)参数。另外,配合增加CPU上预测使用的[线程数num_threads](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L106),可以有效加快CPU上的预测速度。\n",
"**2.4 在GPU上加速PaddleOCR的文本检测模型预测?**\n",
"- 1. 从[链接](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html)下载带TensorRT的Paddle安装包或者预测库。\n",
"- 2. 从Nvidia官网下载[TensorRT](https://developer.nvidia.com/tensorrt),注意下载的TensorRT版本与paddle安装包中编译的TensorRT版本一致。\n",
"- 3. 设置环境变量`LD_LIBRARY_PATH`,指向TensorRT的lib文件夹\n",
"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>\n",
"- 4. 开启PaddleOCR预测的[tensorrt选项](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L38)。\n",
"**2.5 如何在移动端部署PaddleOCR模型?**\n",
"**A**: 飞桨Paddle有专门针对移动端部署的工具[PaddleLite](https://github.com/PaddlePaddle/Paddle-Lite),并且PaddleOCR提供了DB+CRNN为demo的android arm部署代码,参考[链接](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/deploy/lite/readme.md)。\n",
"**2.6 如何使用PaddleOCR多进程预测?**\n",
"**A**: 近期PaddleOCR新增了[多进程预测控制参数](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L111),`use_mp`表示是否使用多进程,`total_process_num`表示在使用多进程时的进程数。具体使用方式请参考[文档](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/inference.md#1-%E8%B6%85%E8%BD%BB%E9%87%8F%E4%B8%AD%E6%96%87ocr%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86)。\n",
"**2.7 预测时显存爆炸、内存泄漏问题?**\n",
"**A**: 如果是训练模型的预测,由于模型太大或者输入图像太大导致显存不够用,可以参考代码在主函数运行前加上paddle.no_grad(),即可减小显存占用。如果是inference模型预测时显存占用过高,可以配置Config时,加入[config.enable_memory_optim()](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L267)用于减小内存占用。\n",
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 文本检测算法理论\n"
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 1 文本检测\n",
"- 目标检测:给定图像或者视频,找出目标的位置(box),并给出目标的类别;\n",
"- 文本检测:给定输入图像或者视频,找出文本的区域,可以是单字符位置或者整个文本行位置;\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/af2d8eca913a4d5a968945ae6cac180b009c6cc94abc43bfbaf1ba6a3de98125\" width=\"400\" ></center>\n",
"<br><center>图1 目标检测示意图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/400b9100573b4286b40b0a668358bcab9627f169ab934133a1280361505ddd33\" width=\"1000\" ></center>\n",
"<br><center>图2 文本检测示意图</center>\n",
"1. 自然场景中文本具有多样性:文本检测受到文字颜色、大小、字体、形状、方向、语言、以及文本长度的影响;\n",
"2. 复杂的背景和干扰;文本检测受到图像失真,模糊,低分辨率,阴影,亮度等因素的影响;\n",
"3. 文本密集甚至重叠会影响文字的检测;\n",
"4. 文字存在局部一致性:文本行的一小部分,也可视为是独立的文本。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/072f208f2aff47e886cf2cf1378e23c648356686cf1349c799b42f662d8ced00\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图3 文本检测场景</center>\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 2 文本检测方法介绍\n",
"1. 基于回归的文本检测方法\n",
"2. 基于分割的文本检测方法\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/22314238b70b486f942701107ffddca48b87235a473c4d8db05b317f132daea0\"\n",
"width=\"600\" ></center>\n",
"<br><center>图4 文本检测算法</center>\n",
"### 2.1 基于回归的文本检测\n",
"#### 2.1.1 水平文本检测\n",
"- 采用更大长宽比的预选框\n",
"- 卷积核从3x3变成了1x5,更适合长文本检测\n",
"- 采用多尺度输入\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/3864ccf9d009467cbc04225daef0eb562ac0c8c36f9b4f5eab036c319e5f05e7\" width=\"1000\" ></center>\n",
"<br><center>图5 textbox框架图</center>\n",
"CTPN[3]基于Fast-RCNN算法,扩展RPN模块并且设计了基于CRNN的模块让整个网络从卷积特征中检测到文本序列,二阶段的方法通过ROI Pooling获得了更准确的特征定位。但是TextBoxes和CTPN只支持检测横向文本。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/452833c2016e4cf7b35291efd09740c13c4bfb8f7c56446b8f7a02fc7eb3e901\" width=\"1000\" ></center>\n",
"<br><center>图6 CTPN框架图</center>\n",
"#### 2.1.2 任意角度文本检测\n",
"TextBoxes++[2]在TextBoxes基础上进行改进,支持检测任意角度的文本。从结构上来说,不同于TextBoxes,TextBoxes++针对多角度文本进行检测,首先修改预选框的宽高比,调整宽高比aspect ratio为1、2、3、5、1/2、1/3、1/5。其次是将$1*5$的卷积核改为$3*5$,更好的学习倾斜文本的特征;最后,TextBoxes++的输出旋转框的表示信息。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/ae96e3acbac04be296b6d54a4d72e5881d592fcc91f44882b24bc7d38b9d2658\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图7 TextBoxes++框架图</center>\n",
"EAST[4]针对倾斜文本的定位问题,提出了two-stage的文本检测方法,包含 FCN特征提取和NMS部分。EAST提出了一种新的文本检测pipline结构,可以端对端训练并且支持检测任意朝向的文本,并且具有结构简单,性能高的特点。FCN支持输出倾斜的矩形框和水平框,可以自由选择输出格式。\n",
"- 如果输出检测形状为RBox,则输出Box旋转角度以及AABB文本形状信息,AABB表示到文本框上下左右边的偏移。RBox可以旋转矩形的文本。\n",
"- 如果输出检测框为四点框,则输出的最后一个维度为8个数字,表示从四边形的四个角顶点的位置偏移。该输出方式可以预测不规则四边形的文本。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/d7411ada08714adab73fa0edf7555a679327b71e29184446a33d81cdd910e4fc\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图8 EAST框架图</center> \n",
"MOST[15]提出TFAM模块动态的调整粗粒度的检测结果的感受野,另外提出PA-NMS根据位置信息合并可靠的检测预测结果。此外,训练中还提出 Instance-wise IoU 损失函数,用于平衡训练,以处理不同尺度的文本实例。该方法可以和EAST方法结合,在检测极端长宽比和不同尺度的文本有更好的检测效果和性能。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/73052d9439714bba86ffe4a959d58c523b07baf3f1d74882b4517e71f5a645fe\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图9 MOST框架图</center>\n",
"#### 2.1.3 弯曲文本检测\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e33d76ebb814cac9ebb2942b779054af160857125294cd69481680aca2fa98a\"\n",
"width=\"600\" ></center>\n",
"<br><center>图10 CTD框架图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e90adf3ca25a45a0af0b84a181fbe2c4954be1fcca8f4049957128548b7131ef\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图11 LOMO框架图</center>\n",
"Contournet[18]基于提出对文本轮廓点建模获取弯曲文本检测框,该方法首先使用Adaptive-RPN获取文本区域的proposal特征,然后设计了局部正交纹理感知LOTM模块学习水平与竖直方向的纹理特征,并用轮廓点表示,最后,通过同时考虑两个正交方向上的特征响应,利用Point Re-Scoring算法可以有效地滤除强单向或弱正交激活的预测,最终文本轮廓可以用一组高质量的轮廓点表示出来。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1f59ab5db899412f8c70ba71e8dd31d4ea9480d6511f498ea492c97dd2152384\"\n",
"width=\"600\" ></center>\n",
"<br><center>图12 Contournet框架图</center>\n",
"PCR[14]提出渐进式的坐标回归处理弯曲文本检测问题,总体分为三个阶段,首先大致检测到文本区域,获得文本框,另外通过所设计的Contour Localization Mechanism预测文本最小包围框的角点坐标,然后通过叠加多个CLM模块和RCLM模块预测得到弯曲文本。该方法利用文本轮廓信息聚合得到丰富的文本轮廓特征表示,不仅能抑制冗余的噪声点对坐标回归的影响,还能更精确的定位文本区域。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/c677c4602cee44999ae4b38bd780b69795887f2ae10747968bb084db6209b6cc\"\n",
"width=\"600\" ></center>\n",
"<br><center>图13 PCR框架图</center>\n"
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"### 2.2 基于分割的文本检测\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fb9e50c410984c339481869ba11c1f39f80a4d74920b44b084601f2f8a23099f\"\n",
"width=\"600\" ></center>\n",
"<br><center>图14 文本分割算法示意图</center>\n",
"Pixellink[7]采用分割的方法解决文本检测问题,分割对象为文本区域,将同属于一个文本行(单词)中的像素链接在一起来分割文本,直接从分割结果中提取文本边界框,无需位置回归就能达到基于回归的文本检测的效果。但是基于分割的方法存在一个问题,对于位置相近的文本,文本分割区域容易出现“粘连“问题。Wu, Yue等人[8]提出分割文本的同时,学习文本的边界位置,用于更好的区分文本区域。另外Tian等人[9]提出将同一文本的像素映射到映射空间,在映射空间中令统一文本的映射向量距离相近,不同文本的映射向量距离变远。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/462b5e1472824452a2c530939cda5e59ada226b2d0b745d19dd56068753a7f97\"\n",
"width=\"600\" ></center>\n",
"<br><center>图15 PixelLink框架图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/9597efd68a224d60b74d7c51c99f7ff0ba9939e5cdb84fb79209b7e213f7d039\"\n",
"width=\"600\" ></center>\n",
"<br><center>图16 MSR框架图</center>\n",
" \n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fa870b69a2a5423cad7422f64c32e0645dfc31a4ecc94a52832cf8742cded5ba\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图17 PSENet框架图</center>\n",
"假设PSENet后处理用了3个不同尺度的kernel,如上图s1,s2,s3所示。首先,从最小kernel s1开始,计算文本分割区域的连通域,得到(b),然后,对连通域沿着上下左右做尺度扩张,对于扩张区域属于s2但不属于s1的像素,进行归类,遇到冲突点时,采用“先到先得”原则,重复尺度扩张的操作,最终可以得到不同文本行的独立的分割区域。\n",
"Seglink++[17]针对弯曲文本和密集文本问题,提出了一种文本块单元之间的吸引关系和排斥关系的表征,然后设计了一种最小生成树算法进行单元组合得到最终的文本检测框,并提出instance-aware 损失函数使Seglink++方法可以端对端训练。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1a16568361c0468db537ac25882eed096bca83f9c1544a92aee5239890f9d8d9\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图18 Seglink++框架图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/a76771f91db246ee8be062f96fa2a8abc7598dd87e6d4755b63fac71a4ebc170\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图19 PAN框架图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/0d6423e3c79448f8b09090cf2dcf9d0c7baa0f6856c645808502678ae88d2917\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图20 DB框架图</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/45e9a374d97145689a961977f896c8f9f470a66655234c1498e1c8477e277954\"\n",
"width=\"1000\" ></center>\n",
"<br><center>图21 FCENet框架图</center>\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 3 总结\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 参考文献\n",
"1. Liao, Minghui, et al. \"Textboxes: A fast text detector with a single deep neural network.\" Thirty-first AAAI conference on artificial intelligence. 2017.\n",
"2. Liao, Minghui, Baoguang Shi, and Xiang Bai. \"Textboxes++: A single-shot oriented scene text detector.\" IEEE transactions on image processing 27.8 (2018): 3676-3690.\n",
"3. Tian, Zhi, et al. \"Detecting text in natural image with connectionist text proposal network.\" European conference on computer vision. Springer, Cham, 2016.\n",
"4. Zhou, Xinyu, et al. \"East: an efficient and accurate scene text detector.\" Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.\n",
"5. Wang, Fangfang, et al. \"Geometry-aware scene text detection with instance transformation network.\" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.\n",
"6. Yuliang, Liu, et al. \"Detecting curve text in the wild: New dataset and new solution.\" arXiv preprint arXiv:1712.02170 (2017).\n",
"7. Deng, Dan, et al. \"Pixellink: Detecting scene text via instance segmentation.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.\n",
"8. Wu, Yue, and Prem Natarajan. \"Self-organized text detection with minimal post-processing via border learning.\" Proceedings of the IEEE International Conference on Computer Vision. 2017.\n",
"9. Tian, Zhuotao, et al. \"Learning shape-aware embedding for scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"10. Wang, Wenhai, et al. \"Shape robust text detection with progressive scale expansion network.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"11. Wang, Wenhai, et al. \"Efficient and accurate arbitrary-shaped text detection with pixel aggregation network.\" Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.\n",
"12. Liao, Minghui, et al. \"Real-time scene text detection with differentiable binarization.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.\n",
"13. Hochreiter, Sepp, and Jürgen Schmidhuber. \"Long short-term memory.\" Neural computation 9.8 (1997): 1735-1780.\n",
"14. Dai, Pengwen, et al. \"Progressive Contour Regression for Arbitrary-Shape Scene Text Detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"15. He, Minghang, et al. \"MOST: A Multi-Oriented Scene Text Detector with Localization Refinement.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"16. Zhu, Yiqin, et al. \"Fourier contour embedding for arbitrary-shaped text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
"17. Tang, Jun, et al. \"Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping.\" Pattern recognition 96 (2019): 106954.\n",
"18. Wang, Yuxin, et al. \"Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.\n",
"19. Zhang, Chengquan, et al. \"Look more than once: An accurate detector for text of arbitrary shapes.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
"20. Xue C, Lu S, Zhang W. Msr: Multi-scale shape regression for scene text detection[J]. arXiv preprint arXiv:1901.02596, 2019. \n"
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 文本识别算法理论\n",
"1. 文本识别的目标\n",
"2. 文本识别算法的分类\n",
"3. 各类算法的典型思想\n",
"## 1 背景介绍\n",
"文本识别是OCR(Optical Character Recognition)的一个子任务,其任务为识别一个固定区域的文本内容。在OCR的两阶段方法里,它接在文本检测后面,将图像信息转换为文字信息。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a7c3404f778b489db9c1f686c7d2ff4d63b67c429b454f98b91ade7b89f8e903 width=\"600\"></center>\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e72b1d6f80c342ac951d092bc8c325149cebb3763ec849ec8a2f54e7c8ad60ca width=\"600\"></center>\n",
"* 规则文本识别:主要指印刷字体、扫描文本等,认为文本大致处在水平线位置\n",
"* 不规则文本识别: 往往出现在自然场景中,且由于文本曲率、方向、变形等方面差异巨大,文字往往不在水平位置,存在弯曲、遮挡、模糊等问题。\n",
"下图展示的是 IC15 和 IC13 的数据样式,它们分别代表了不规则文本和规则文本。可以看出不规则文本往往存在扭曲、模糊、字体差异大等问题,更贴近真实场景,也存在更大的挑战性。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/bae4fce1370b4751a3779542323d0765a02a44eace7b44d2a87a241c13c6f8cf width=\"400\">\n",
"<br><center>IC15 图片样例(不规则文本)</center>\n",
"<img src=https://ai-studio-static-online.cdn.bcebos.com/b55800d3276f4f5fad170ea1b567eb770177fce226f945fba5d3247a48c15c34 width=\"400\"></center>\n",
"<br><center>IC13 图片样例(规则文本)</center>\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4d0aada261064031a16816b39a37f2ff6af70dbb57004cb7a106ae6485f14684 width=\"600\"></center>\n",
"## 2 文本识别算法分类\n",
" \n",
"| 算法类别 | 主要思路 | 主要论文 |\n",
"| -------- | --------------- | -------- |\n",
"| 传统算法 | 滑动窗口、字符提取、动态规划 | - |\n",
"| ctc | 基于ctc的方法,序列不对齐,更快速识别 | CRNN, Rosetta |\n",
"| Attention | 基于attention的方法,应用于非常规文本 | RARE, DAN, PREN |\n",
"| Transformer | 基于transformer的方法 | SRN, NRTR, Master, ABINet |\n",
"| 校正 | 校正模块学习文本边界并校正成水平方向 | RARE, ASTER, SAR | \n",
"| 分割 | 基于分割的方法,提取字符位置再做分类 | Text Scanner, Mask TextSpotter |\n",
" \n",
"### 2.1 规则文本识别\n",
"文本识别的主流算法有两种,分别是基于 CTC (Conectionist Temporal Classification) 的算法和 Sequence2Sequence 算法,区别主要在解码阶段。\n",
"基于 CTC 的算法是将编码产生的序列接入 CTC 进行解码;基于 Sequence2Sequence 的方法则是把序列接入循环神经网络(Recurrent Neural Network, RNN)模块进行循环解码,两种方式都验证有效也是主流的两大做法。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f64eee66e4a6426f934c1befc3b138629324cf7360c74f72bd6cf3c0de9d49bd width=\"600\"></center>\n",
"<br><center>左:基于CTC的方法,右:基于Sequece2Sequence的方法 </center>\n",
"#### 2.1.1 基于CTC的算法\n",
"基于 CTC 最典型的算法是CRNN (Convolutional Recurrent Neural Network)[1],它的特征提取部分使用主流的卷积结构,常用的有ResNet、MobileNet、VGG等。由于文本识别任务的特殊性,输入数据中存在大量的上下文信息,卷积神经网络的卷积核特性使其更关注于局部信息,缺乏长依赖的建模能力,因此仅使用卷积网络很难挖掘到文本之间的上下文联系。为了解决这一问题,CRNN文本识别算法引入了双向 LSTM(Long Short-Term Memory) 用来增强上下文建模,通过实验证明双向LSTM模块可以有效的提取出图片中的上下文信息。最终将输出的特征序列输入到CTC模块,直接解码序列结果。该结构被验证有效,并广泛应用在文本识别任务中。Rosetta[2]是FaceBook提出的识别网络,由全卷积模型和CTC组成。Gao Y[3]等人使用CNN卷积替代LSTM,参数更少,性能提升精度持平。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/d3c96dd9e9794fddb12fa16f926abdd3485194f0a2b749e792e436037490899b width=\"600\"></center>\n",
"<center> CRNN 结构图 </center>\n",
"#### 2.1.2 Sequence2Sequence 算法\n",
"Sequence2Sequence 算法是由编码器 Encoder 把所有的输入序列都编码成一个统一的语义向量,然后再由解码器Decoder解码。在解码器Decoder解码的过程中,不断地将前一个时刻的输出作为后一个时刻的输入,循环解码,直到输出停止符为止。一般编码器是一个RNN,对于每个输入的词,编码器输出向量和隐藏状态,并将隐藏状态用于下一个输入的单词,循环得到语义向量;解码器是另一个RNN,它接收编码器输出向量并输出一系列字以创建转换。受到 Sequence2Sequence 在翻译领域的启发, Shi[4]提出了一种基于注意的编解码框架来识别文本,通过这种方式,rnn能够从训练数据中学习隐藏在字符串中的字符级语言模型。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f575333696b7438d919975dc218e61ccda1305b638c5497f92b46a7ec3b85243 width=\"400\" hight=\"500\"></center>\n",
"<center> Sequence2Sequence 结构图 </center>\n",
"### 2.2 不规则文本识别\n",
"* 不规则文本识别算法可以被分为4大类:基于校正的方法;基于 Attention 的方法;基于分割的方法;基于 Transformer 的方法。\n",
"#### 2.2.1 基于校正的方法\n",
"RARE[4]模型首先提出了对不规则文本的校正方案,整个网络分为两个主要部分:一个空间变换网络STN(Spatial Transformer Network) 和一个基于Sequence2Squence的识别网络。其中STN就是校正模块,不规则文本图像进入STN,通过TPS(Thin-Plate-Spline)变换成一个水平方向的图像,该变换可以一定程度上校正弯曲、透射变换的文本,校正后送入序列识别网络进行解码。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/66406f89507245e8a57969b9bed26bfe0227a8cf17a84873902dd4a464b97bb5 width=\"600\"></center>\n",
"<center> RARE 结构图 </center>\n",
"#### 2.2.2 基于Attention的方法\n",
"基于 Attention 的方法主要关注的是序列之间各部分的相关性,该方法最早在机器翻译领域提出,认为在文本翻译的过程中当前词的结果主要由某几个单词影响的,因此需要给有决定性的单词更大的权重。在文本识别领域也是如此,将编码后的序列解码时,每一步都选择恰当的context来生成下一个状态,这样有利于得到更准确的结果。\n",
"R^2AM [7] 首次将 Attention 引入文本识别领域,该模型首先将输入图像通过递归卷积层提取编码后的图像特征,然后利用隐式学习到的字符级语言统计信息通过递归神经网络解码输出字符。在解码过程中引入了Attention 机制实现了软特征选择,以更好地利用图像特征,这一有选择性的处理方式更符合人类的直觉。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a64ef10d4082422c8ac81dcda4ab75bf1db285d6b5fd462a8f309240445654d5 width=\"600\"></center>\n",
"<center> R^2AM 结构图 </center>\n",
"后续有大量算法在Attention领域进行探索和更新,例如SAR[8]将1D attention拓展到2D attention上,校正模块提到的RARE也是基于Attention的方法。实验证明基于Attention的方法相比CTC的方法有很好的精度提升。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4e2507fb58d94ec7a9b4d17151a986c84c5053114e05440cb1e7df423d32cb02 width=\"600\"></center>\n",
"#### 2.2.3 基于分割的方法\n",
"基于分割的方法是将文本行的各字符作为独立个体,相比与对整个文本行做矫正后识别,识别分割出的单个字符更加容易。它试图从输入的文本图像中定位每个字符的位置,并应用字符分类器来获得这些识别结果,将复杂的全局问题简化成了局部问题解决,在不规则文本场景下有比较不错的效果。然而这种方法需要字符级别的标注,数据获取上存在一定的难度。Lyu[9]等人提出了一种用于单词识别的实例分词模型,该模型在其识别部分使用了基于 FCN(Fully Convolutional Network) 的方法。[10]从二维角度考虑文本识别问题,设计了一个字符注意FCN来解决文本识别问题,当文本弯曲或严重扭曲时,该方法对规则文本和非规则文本都具有较优的定位结果。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/fd3e8ef0d6ce4249b01c072de31297ca5d02fc84649846388f890163b624ff10 width=\"800\"></center>\n",
"<center> Mask TextSpotter 结构图 </center>\n",
"#### 2.2.4 基于Transformer的方法\n",
"随着 Transformer 的快速发展,分类和检测领域都验证了 Transformer 在视觉任务中的有效性。如规则文本识别部分所说,CNN在长依赖建模上存在局限性,Transformer 结构恰好解决了这一问题,它可以在特征提取器中关注全局信息,并且可以替换额外的上下文建模模块(LSTM)。\n",
"一部分文本识别算法使用 Transformer 的 Encoder 结构和卷积共同提取序列特征,Encoder 由多个 MultiHeadAttentionLayer 和 Positionwise Feedforward Layer 堆叠而成的block组成。MulitHeadAttention 中的 self-attention 利用矩阵乘法模拟了RNN的时序计算,打破了RNN中时序长时依赖的障碍。也有一部分算法使用 Transformer 的 Decoder 模块解码,相比传统RNN可获得更强的语义信息,同时并行计算具有更高的效率。\n",
"SRN[11] 算法将Transformer的Encoder模块接在ResNet50后,增强了2D视觉特征。并提出了一个并行注意力模块,将读取顺序用作查询,使得计算与时间无关,最终并行输出所有时间步长的对齐视觉特征。此外SRN还利用Transformer的Eecoder作为语义模块,将图片的视觉信息和语义信息做融合,在遮挡、模糊等不规则文本上有较大的收益。\n",
"NRTR[12] 使用了完整的Transformer结构对输入图片进行编码和解码,只使用了简单的几个卷积层做高层特征提取,在文本识别上验证了Transformer结构的有效性。\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e7859f4469a842f0bd450e7e793a679d6e828007544241d09785c9b4ea2424a2 width=\"800\"></center>\n",
"<center> NRTR 结构图 </center>\n",
"## 3 总结\n",
"## 4 参考文献\n",
"[1]Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11), 2298-2304.\n",
"[2]Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79. ACM, 2018.\n",
"[3]Gao, Y., Chen, Y., Wang, J., & Lu, H. (2017). Reading scene text with attention convolutional sequence modeling. arXiv preprint arXiv:1709.04303.\n",
"[4]Shi, B., Wang, X., Lyu, P., Yao, C., & Bai, X. (2016). Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4168-4176).\n",
"[5] Star-Net Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spa- tial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.\n",
"[6]Baoguang Shi, Mingkun Yang, XingGang Wang, Pengyuan Lyu, Xiang Bai, and Cong Yao. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 31(11):855–868, 2018.\n",
"[7] Lee C Y , Osindero S . Recursive Recurrent Nets with Attention Modeling for OCR in the Wild[C]// IEEE Conference on Computer Vision & Pattern Recognition. IEEE, 2016.\n",
"[8]Li, H., Wang, P., Shen, C., & Zhang, G. (2019, July). Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8610-8617).\n",
"[9]P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai. Multi-oriented scene text detection via corner localization and region segmentation. In Proc. CVPR, pages 7553–7563, 2018.\n",
"[10] Liao, M., Zhang, J., Wan, Z., Xie, F., Liang, J., Lyu, P., ... & Bai, X. (2019, July). Scene text recognition from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 8714-8721).\n",
"[11] Yu, D., Li, X., Zhang, C., Liu, T., Han, J., Liu, J., & Ding, E. (2020). Towards accurate scene text recognition with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12113-12122).\n",
"[12] Sheng, F., Chen, Z., & Xu, B. (2019, September). NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR) (pp. 781-786). IEEE.\n",
"[13]Yang, L., Wang, P., Li, H., Li, Z., & Zhang, Y. (2020). A holistic representation guided attention network for scene text recognition. Neurocomputing, 414, 67-75.\n",
"[14]Wang, T., Zhu, Y., Jin, L., Luo, C., Chen, X., Wu, Y., ... & Cai, M. (2020, April). Decoupled attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12216-12224).\n",
"[15] Wang, Y., Xie, H., Fang, S., Wang, J., Zhu, S., & Zhang, Y. (2021). From two to one: A new scene text recognizer with visual language modeling network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14194-14203).\n",
"[16] Fang, S., Xie, H., Wang, Y., Mao, Z., & Zhang, Y. (2021). Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7098-7107).\n",
"[17] Yan, R., Peng, L., Xiao, S., & Yao, G. (2021). Primitive Representation Learning for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 284-293)."
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 文档分析技术\n",
"1. 版面分析的分类和典型思想\n",
"2. 表格识别的分类和典型思想\n",
"3. 信息提取的分类和典型思想"
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"1. 版面分析模块: 将每个文档页面划分为不同的内容区域。该模块不仅可用于划定相关区域和不相关区域,还可用于对其识别的内容类型进行分类。\n",
"2. 光学字符识别 (OCR) 模块: 定位并识别文档中存在的所有文本。\n",
"3. 表格识别模块: 将文档里的表格信息进行识别和转换到excel文件中。\n",
"4. 信息提取模块: 借助OCR结果和图像信息来理解和识别文档中表达的特定信息或信息之间的关系。\n",
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 1. 版面分析\n",
"### 1.1 背景介绍\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2510dc76c66c49b8af079f25d08a9dcba726b2ce53d14c8ba5cd9bd57acecf19\" width=\"1000\"/></center>\n",
"<center>图 1:版面分析效果图</center>\n",
"| 类别 | 主要论文 |\n",
"| ---------------- | -------- |\n",
"| 基于目标检测的方法 | [Visual Detection with Context](https://aclanthology.org/D19-1348.pdf),[Object Detection](https://arxiv.org/pdf/2003.13197v1.pdf),[VSR](https://arxiv.org/pdf/2105.06220v1.pdf)|\n",
"| 基于语义分割的方法 |[Semantic Segmentation](https://arxiv.org/pdf/1911.12170v2.pdf) |\n",
"### 1.2 基于目标检测的方法 \n",
"Soto Carlos[1]在目标检测算法Faster R-CNN的基础上,结合上下文信息并利用文档内容的固有位置信息来提高区域检测性能。Li Kai [2]等人也提出了一种基于目标检测的文档分析方法,通过引入了特征金字塔对齐模块,区域对齐模块,渲染层对齐模块来解决跨域的问题,这三个模块相互补充,并从一般的图像角度和特定的文档图像角度调整域,从而解决了大型标记训练数据集与目标域不同的问题。下图是一个基于目标检测Faster R-CNN算法进行版面分析的流程图。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d396e0d6183243898c0961250ee7a49bc536677079fb4ba2ac87c653f5472f01\" width=\"800\"/></center>\n",
"<center>图 2:基于Faster R-CNN的版面分析流程图</center>\n",
"### 1.3 基于语义分割的方法 \n",
"Sarkar Mausoom[3]等人提出了一种基于先验的分割机制,在非常高的分辨率的图像上训练文档分割模型,解决了过度缩小原始图像导致的密集区域不同结构无法区分进而合并的问题。Zhang Peng[4]等人结合文档中的视觉、语义和关系提出了一个统一的框架VSR(Vision, Semantics and Relations)用于文档布局分析,该框架使用一个双流网络来提取特定模态的视觉和语义特征,并通过自适应聚合模块自适应地融合这些特征,解决了现有基于CV的方法不同模态融合效率低下和布局组件之间缺乏关系建模的局限性。\n",
"### 1.4 数据集\n",
"1. PubLayNet[5]: 该数据集包含50万张文档图像,其中40万用于训练,5万用于验证,5万用于测试,共标记了表格,文本,图像,标题和列表五种形式\n",
"2. HJDataset[6]: 数据集包含2271张文档图像, 除了内容区域的边界框和掩码之外,它还包括布局元素的层次结构和阅读顺序。\n",
"<center class=\"two\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4b153117c9384f98a0ce5a6c6e7c205a4b1c57e95c894ccb9688cbfc94e68a1c\" width=\"400\"/><img src=\"https://ai-studio-static-online.cdn.bcebos.com/efb9faea39554760b280f9e0e70631d2915399fa97774eecaa44ee84411c4994\" width=\"400\"/>\n",
"<center>图 3:PubLayNet样例</center>\n",
"[1]:Soto C, Yoo S. Visual detection with context for document layout analysis[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3464-3470.\n",
"[2]:Li K, Wigington C, Tensmeyer C, et al. Cross-domain document object detection: Benchmark suite and method[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 12915-12924.\n",
"[3]:Sarkar M, Aggarwal M, Jain A, et al. Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 649-666.\n",
"[4]:Zhang P, Li C, Qiao L, et al. VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations[J]. arXiv preprint arXiv:2105.06220, 2021.\n",
"[5]:Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022.\n",
"[6]:Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020.\n",
"[7]:Shen Z, Zhang K, Dell M. A large dataset of historical japanese documents with complex layouts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 548-549."
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 2. 表格识别\n",
"### 2.1 背景介绍\n",
"1. 表格种类和样式复杂多样,例如*不同的行列合并,不同的内容文本类型*等。\n",
"2. 文档的样式本身的样式多样。\n",
"3. 拍摄时的光照环境等\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/99faa017e28b4928a408573406870ecaa251b626e0e84ab685e4b6f06f601a5f\" width=\"1600\"/></center>\n",
"<center>图 4:表格识别示例图,其中左边为原图,右边为表格识别后的结果图,以Excel形式呈现</center>\n",
"1. 基于启发式规则的方法\n",
"2. 基于CNN的方法\n",
"3. 基于GCN的方法\n",
"4. 基于End to End的方法\n",
"| 类别 | 思路 | 主要论文 |\n",
"| ---------------- | ---- | -------- |\n",
"| 基于CNN的方法 | 目标检测,语义分割 | [CascadeTabNet](https://arxiv.org/pdf/2004.12629v2.pdf), [Multi-Type-TD-TSR](https://arxiv.org/pdf/2105.11021v1.pdf), [LGPMA](https://arxiv.org/pdf/2105.06224v2.pdf), [tabstruct-net](https://arxiv.org/pdf/2010.04565v1.pdf), [CDeC-Net](https://arxiv.org/pdf/2008.10831v1.pdf), [TableNet](https://arxiv.org/pdf/2001.01469v1.pdf), [TableSense](https://arxiv.org/pdf/2106.13500v1.pdf), [Deepdesrt](https://www.dfki.de/fileadmin/user_upload/import/9672_PID4966073.pdf), [Deeptabstr](https://www.dfki.de/fileadmin/user_upload/import/10649_DeepTabStR.pdf), [GTE](https://arxiv.org/pdf/2005.00589v2.pdf), [Cycle-CenterNet](https://arxiv.org/pdf/2109.02199v1.pdf), [FCN](https://www.researchgate.net/publication/339027294_Rethinking_Semantic_Segmentation_for_Table_Structure_Recognition_in_Documents)|\n",
"| 基于GCN的方法 | 基于图神经网络,将表格识别看作图重建问题 | [GNN](https://arxiv.org/pdf/1905.13391v2.pdf), [TGRNet](https://arxiv.org/pdf/2106.10598v3.pdf), [GraphTSR](https://arxiv.org/pdf/1908.04729v2.pdf)|\n",
"| 基于End to End的方法 | 利用attention机制 | [Table-Master](https://arxiv.org/pdf/2105.01848v1.pdf)|\n",
"### 2.2 基于启发式规则的传统算法\n",
"早期的表格识别研究主要是基于启发式规则的方法。例如由Kieninger[1]等人提出的T-Rect系统使用自底向上的方法对文档图像进行连通域分析,然后按照定义的规则进行合并,得到逻辑文本块。而之后由Yildiz[2]等人提出的pdf2table则是第一个在PDF文档上进行表格识别的方法,它利用了PDF文件的一些特有信息(例如文字、绘制路径等图像文档中难以获取的信息)来协助表格识别。而在最近的工作中,Koci[3]等人将页面中的布局区域表示为图(Graph)的形式,然后使用了Remove and Conquer(RAC)算法从中将表格作为一个子图识别出来。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/66aeedb3f0924d80aee15f185e6799cc687b51fc20b74b98b338ca2ea25be3f3\" width=\"1000\"/></center>\n",
"<center>图 5:启发式算法示意图</center>\n",
"### 2.3 基于深度学习CNN的方法\n",
"Siddiqui Shoaib Ahmed[12]等人在DeepTabStR算法中,将表格结构识别问题表述为对象检测问题,并利用可变形卷积来进更好的进行表格单元格的检测。Raja Sachin[6]等人提出TabStruct-Net将单元格检测和结构识别在视觉上结合起来进行表格结构识别,解决了现有方法由于表格布局发生较大变化而识别错误的问题,但是该方法无法处理行列出现较多空单元格的问题。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/838be28836444bc1835ac30a25613d8b045a1b5aedd44b258499fe9f93dd298f\" width=\"1600\"/></center>\n",
"<center>图 6:基于深度学习CNN的算法示意图</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4c40dda737bd44b09a533e1b1dd2e4c6a90ceea083bf4238b7f3c7b21087f409\" width=\"1600\"/></center>\n",
"<center>图 7:基于深度学习CNN的算法错误示例</center>\n",
"之前的表格结构识别方法一般是从不同粒度(行/列、文本区域)的元素开始处理问题,容易忽略空单元格合并的问题。Qiao Liang[10]等人提出了一个新框架LGPMA,通过掩码重评分策略充分利用来自局部和全局特征的信息,进而可以获得更可靠的对齐单元格区域,最后引入了包括单元格匹配、空单元格搜索和空单元格合并的表格结构复原pipeline来处理表格结构识别问题。\n",
"除了以上单独做表格识别的算法外,也有部分方法将表格检测和表格识别在一个模型里完成,Schreiber Sebastian[11]等人提出了DeepDeSRT,通过Faster RCNN进行表格检测,通过FCN语义分割模型用于表格结构行列检测,但是该方法是用两个独立的模型来解决这两个问题。Prasad Devashish[4]等人提出了一种基于端到端深度学习的方法CascadeTabNet,使用Cascade Mask R-CNN HRNet模型同时进行表格检测和结构识别,解决了以往方法使用独立的两个方法处理表格识别问题的不足。Paliwal Shubham[8]等人提出一种新颖的端到端深度多任务架构TableNet,用于表格检测和结构识别,同时在训练期间向TableNet添加额外的空间语义特征,进一步提高了模型性能。Zheng Xinyi[13]等人提出了表格识别的系统框架GTE,利用单元格检测网络来指导表格检测网络的训练,同时提出了一种层次网络和一种新的基于聚类的单元格结构识别算法,该框架可以接入到任何目标检测模型的后面,方便训练不同的表格识别算法。之前的研究主要集中在从扫描的PDF文档中解析具有简单布局的,对齐良好的表格图像,但是现实场景中的表格一般很复杂,可能存在严重变形,弯曲或者遮挡等问题,因此Long Rujiao[14]等人同时构造了一个现实复杂场景下的表格识别数据集WTW,并提出了一种Cycle-CenterNet方法,它利用循环配对模块优化和提出的新配对损失,将离散单元精确地分组到结构化表中,提高了表格识别的性能。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a01f714cbe1f42fc9c45c6658317d9d7da2cec9726844f6b9fa75e30cadc9f76\" width=\"1600\"/></center>\n",
"<center>图 8:端到端算法示意图</center>\n",
"### 2.4 基于深度学习GCN的方法\n",
"近些年来,随着图卷积神经网络(Graph Convolutional Network)的兴起,也有一些研究者尝试将图神经网络应用到表格结构识别问题上。Qasim Shah Rukh[20]等人将表格结构识别问题转换为与图神经网络兼容的图问题,并设计了一种新颖的可微架构,该架构既可以利用卷积神经网络提取特征的优点,也可以利用图神经网络顶点之间有效交互的优点,但是该方法只使用了单元格的位置特征,没有利用语义特征。Chi Zewen[19]等人提出了一种新颖的图神经网络GraphTSR,用于PDF文件中的表格结构识别,它以表格中的单元格为输入,然后通过利用图的边和节点相连的特性来预测单元格之间的关系来识别表格结构,一定程度上解决了跨行或者跨列的单元格识别问题。Xue Wenyuan[21]等人将表格结构识别的问题重新表述为表图重建,并提出了一种用于表格结构识别的端到端方法TGRNet,该方法包含单元格检测分支和单元格逻辑位置分支,这两个分支共同预测不同单元格的空间位置和逻辑位置,解决了之前方法没有关注单元格逻辑位置的问题。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ff89661142045a8aef54f8a7a2c69b1d243f8269034406a9e66bee2149f730f\" width=\"1600\"/></center>\n",
"<center>图 9:GraphTSR表格识别算法示意图</center>\n",
"### 2.5 基于端到端的方法\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/7865e58a83824facacfaa91bec12ccf834217cb706454dc5a0c165c203db79fb) | ![](https://ai-studio-static-online.cdn.bcebos.com/77d913b1b92f4a349b8f448e08ba78458d687eef4af142678a073830999f3edc))\n",
"图 10:端到端方法的输入输出|图 11:Image Caption示例\n",
"端到端的方法大多采用Image Caption(看图说话)的Seq2Seq方法来完成表格结构的预测,如一些基于Attention或Transformer的方法。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/3571280a9c364d3499a062e3edc724294fb5eaef8b38440991941e87f0af0c3b\" width=\"800\"/></center>\n",
"<center>图 12:Seq2Seq示意图</center>\n",
"Ye Jiaquan[22]在TableMaster中通过改进基于Transformer的Master文字算法来得到表格结构输出模型。此外,还添加了一个分支进行框的坐标回归,作者并没有在最后一层将模型拆分为两个分支,而是在第一个 Transformer 解码层之后就将序列预测和框回归解耦为两个分支。其网络结构和原始Master网络的对比如下图所示:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f573709447a848b4ba7c73a2e297f0304caaca57c5c94588aada1f4cd893946c\" width=\"800\"/></center>\n",
"<center>图 13:左:master网络图,右:TableMaster网络图</center>\n",
"### 2.6 数据集\n",
"1. PubTabNet[16]: 包含568k表格图像和相应的结构化HTML表示。\n",
"2. PubMed Tables(PubTables-1M)[17]:表格结构识别数据集,包含高度详细的结构注释,460,589张pdf图像用于表格检测任务, 947,642张表格图像用于表格识别任务。\n",
"3. TableBank[18]: 表格检测和识别数据集,使用互联网上Word和Latex文档构建了包含417K高质量标注的表格数据。\n",
"4. SciTSR[19]: 表格结构识别数据集,图像大部分从论文中转换而来,其中包含来自PDF文件的15,000个表格及其相应的结构标签。\n",
"5. TabStructDB[12]: 包括1081个表格区域,这些区域用行和列信息密集标记。\n",
"6. WTW[14]: 大规模数据集场景表格检测识别数据集,该数据集包含各种变形,弯曲和遮挡等情况下的表格数据,共包含14,581 张图像。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c9763df56e67434f97cd435100d50ded71ba66d9d4f04d7f8f896d613cdf02b0\" /></center>\n",
"<center>图 14:PubTables-1M数据集样例图</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/64de203bbe584642a74f844ac4b61d1ec3c5a38cacb84443ac961fbcc54a66ce\" width=\"600\"/></center>\n",
"<center>图 15:WTW数据集样例图</center>\n",
"[1]:Kieninger T, Dengel A. A paper-to-HTML table converting system[C]//Proceedings of document analysis systems (DAS). 1998, 98: 356-365.\n",
"[2]:Yildiz B, Kaiser K, Miksch S. pdf2table: A method to extract table information from pdf files[C]//IICAI. 2005: 1773-1785.\n",
"[3]:Koci E, Thiele M, Lehner W, et al. Table recognition in spreadsheets via a graph representation[C]//2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018: 139-144.\n",
"[4]:Prasad D, Gadpal A, Kapadni K, et al. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020: 572-573.\n",
"[5]:Fischer P, Smajic A, Abrami G, et al. Multi-Type-TD-TSR–Extracting Tables from Document Images Using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: From OCR to Structured Table Representations[C]//German Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2021: 95-108.\n",
"[6]:Raja S, Mondal A, Jawahar C V. Table structure recognition using top-down and bottom-up cues[C]//European Conference on Computer Vision. Springer, Cham, 2020: 70-86.\n",
"[7]:Agarwal M, Mondal A, Jawahar C V. Cdec-net: Composite deformable cascade network for table detection in document images[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 9491-9498.\n",
"[8]:Paliwal S S, Vishwanath D, Rahul R, et al. Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 128-133.\n",
"[9]:Dong H, Liu S, Han S, et al. Tablesense: Spreadsheet table detection with convolutional neural networks[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33(01): 69-76.\n",
"[10]:Qiao L, Li Z, Cheng Z, et al. LGPMA: Complicated Table Structure Recognition with Local and Global Pyramid Mask Alignment[J]. arXiv preprint arXiv:2105.06224, 2021.\n",
"[11]:Schreiber S, Agne S, Wolf I, et al. Deepdesrt: Deep learning for detection and structure recognition of tables in document images[C]//2017 14th IAPR international conference on document analysis and recognition (ICDAR). IEEE, 2017, 1: 1162-1167.\n",
"[12]:Siddiqui S A, Fateh I A, Rizvi S T R, et al. Deeptabstr: Deep learning based table structure recognition[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1403-1409.\n",
"[13]:Zheng X, Burdick D, Popa L, et al. Global table extractor (gte): A framework for joint table identification and cell structure recognition using visual context[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 697-706.\n",
"[14]:Long R, Wang W, Xue N, et al. Parsing Table Structures in the Wild[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 944-952.\n",
"[15]:Siddiqui S A, Khan P I, Dengel A, et al. Rethinking semantic segmentation for table structure recognition in documents[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1397-1402.\n",
"[16]:Zhong X, ShafieiBavani E, Jimeno Yepes A. Image-based table recognition: data, model, and evaluation[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer International Publishing, 2020: 564-580.\n",
"[17]:Smock B, Pesala R, Abraham R. PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models[J]. arXiv preprint arXiv:2110.00061, 2021.\n",
"[18]:Li M, Cui L, Huang S, et al. Tablebank: Table benchmark for image-based table detection and recognition[C]//Proceedings of the 12th Language Resources and Evaluation Conference. 2020: 1918-1925.\n",
"[19]:Chi Z, Huang H, Xu H D, et al. Complicated table structure recognition[J]. arXiv preprint arXiv:1908.04729, 2019.\n",
"[20]:Qasim S R, Mahmood H, Shafait F. Rethinking table recognition using graph neural networks[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 142-147.\n",
"[21]:Xue W, Yu B, Wang W, et al. TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition[J]. arXiv preprint arXiv:2106.10598, 2021.\n",
"[22]:Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.\n"
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 3. Document VQA\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/63bbe893465e4f98b3aec80a042758b520d43e1a993a47e39bce1123c2d29b3f\" width=\"1600\"/></center>\n",
"> 如何选择方案 \n",
"> 1. 文字检测之后用规则来进行信息提取\n",
"> 2. 文字检测之后用规模型来进行信息提取\n",
"> 3. 外包出去\n",
"### 3.1 背景介绍\n",
"在VQA(Visual Question Answering)任务中,主要针对图像内容进行提问和回答,但是对于文本图像来说,关注的内容是图像中的文字信息,因此这类方法可以分为自然场景的Text-VQA和扫描文档场景的DocVQA,三者的关系如下图所示。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a91cfd5152284152b020ca8a396db7a21fd982e3661540d5998cc19c17d84861\" width=\"600\"/></center>\n",
"<center>图 16: VQA层级</center>\n",
"|任务类型|VQA | Text-VQA | DocVQA| \n",
"1. 公民身份号码是什么?\n",
"2. 姓名是什么?\n",
"3. 名族是什么?\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2d2b86468daf47c98be01f44b8d6efa64bc09e43cd764298afb127f19b07aede\" width=\"800\"/></center>\n",
"<center>图 17: 身份证示例</center>\n",
"基于这样的先验知识,DocVQA的 研究开始偏向Key Information Extraction(KIE)任务,本次我们也主要讨论KIE相关的研究,KIE任务主要从图像中提取所需要的关键信息,如从身份证中提取出姓名和公民身份号码信息。\n",
"1. SER: 语义实体识别 (Semantic Entity Recognition),对每一个检测到的文本进行分类,如将其分为姓名,身份证。如下图中的黑色框和红色框。\n",
"2. RE: 关系抽取 (Relation Extraction),对每一个检测到的文本进行分类,如将其分为问题和的答案。然后对每一个问题找到对应的答案。如下图中的红色框和黑色框分别代表问题和答案,黄色线代表问题和答案之间的对应关系。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/899470ba601349fbbc402a4c83e6cdaee08aaa10b5004977b1f684f346ebe31f\" width=\"800\"/></center>\n",
"<center>图 18: SER,RE任务示例</center>\n",
"一般的KIE方法基于命名实体识别(Named Entity Recognition,NER)[4]来研究,但是这类方法只利用了图像中的文本信息,缺少对视觉和结构信息的使用,因此精度不高。在此基础上,近几年的方法都开始将视觉和结构信息与文本信息融合到一起,按照对多模态信息进行融合时所采用的原理可以将这些方法分为下面三种:\n",
"1. 基于Grid的方法\n",
"1. 基于Token的方法\n",
"2. 基于GCN的方法\n",
"3. 基于End to End 的方法\n",
"| 类别 | 思路 | 主要论文 |\n",
"| ---------------- | ---- | -------- |\n",
"| 基于Grid的方法 |在图像上多模态信息的融合(文本,布局,图像)| [Chargrid](https://arxiv.org/pdf/1809.08799) |\n",
"| 基于Token的方法 |利用Bert这类方法进行多模态信息的融合|[LayoutLM](https://arxiv.org/pdf/1912.13318), [LayoutLMv2](https://arxiv.org/pdf/2012.14740), [StrucText](https://arxiv.org/pdf/2108.02923), |\n",
"| 基于GCN的方法 |利用图网络结构进行多模态信息的融合 |[GCN](https://arxiv.org/pdf/1903.11279), [PICK](https://arxiv.org/pdf/2004.07464), [SDMG-R](https://arxiv.org/pdf/2103.14470),[SERA](https://arxiv.org/pdf/2110.09915) |\n",
"| 基于End to End的方法 |将OCR和关键信息提取统一到一个网络 |[Trie](https://arxiv.org/pdf/2005.13118) |\n",
"### 3.2 基于Grid的方法\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f248841769ec4312a9015b4befda37bf29db66226431420ca1faad517783875e\" width=\"800\"/></center>\n",
"<center>图 19: Chargrid数据示例</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0682e52e275b4187a0e74f54961a50091fd3a0cdff734e17bedcbc993f6e29f9\" width=\"800\"/></center>\n",
"<center>图 20: Chargrid网络</center>\n",
"### 3.3 基于Token的方法\n",
"LayoutLM[6]将2D位置信息和文本信息一起编码到BERT模型中,并且借鉴NLP中Bert的预训练思想,在大规模的数据集上进行预训练,在下游任务中,LayoutLM还加入了图像信息来进一步提升模型性能。LayoutLM虽然将文本,位置和图像信息做了融合,但是图像信息是在下游任务的训练中进行融合,这样对三种信息的多模态融合并不充分。LayoutLMv2[7]在LayoutLM的基础上,通过transformers在预训练阶段将图像信息和文本,layout信息进行融合,还在Transformer中加入空间感知自注意力机制辅助模型更好地融合视觉和文本特征。LayoutLMv2虽然在预训练阶段对文本,位置和图像信息做了融合,但是由于预训练任务的限制,模型学到的视觉特征不够精细。StrucTexT[8]在以往多模态方法的基础上,在预训练任务提出Sentence Length Prediction (SLP) 和Paired Boxes Direction (PBD)两个新任务来帮助网络学习精细的视觉特征,其中SLP任务让模型学习文本段的长度,PDB任务让模型学习Box方向之间的匹配关系。通过这两个新的预训练任务,能够加速文本、视觉和布局信息之间的深度跨模态融合。\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/17a26ade09ee4311b90e49a1c61d88a72a82104478434f9dabd99c27a65d789b) | ![](https://ai-studio-static-online.cdn.bcebos.com/d75addba67ef4b06a02ae40145e609d3692d613ff9b74cec85123335b465b3cc))\n",
"图 21:transformer算法流程图|图 22:LayoutLMv2算法流程图\n",
"### 3.4 基于GCN的方法\n",
"现有的基于GCN的方法[10]虽然利用了文字和结构信息,但是没有对图像信息进行很好的利用。PICK[11]在GCN网络中加入了图像信息并且提出graph learning module来自动学习edge的类型。SDMG-R [12]将图像编码为双模态图,图的节点为文字区域的视觉和文本信息,边表示相邻文本直接的空间关系,通过迭代地沿边传播信息和推理图节点类别,SDMG-R解决了现有的方法对没见过的模板无能为力的问题。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d3282959e6b2448c89b762b3b9bbf6197a0364b101214a1f83cf01a28623c01c\" width=\"800\"/></center>\n",
"<center>图 23:PICK算法流程图</center>\n",
"SERA[10]将依存句法分析里的biaffine parser引入到文档关系抽取中,并且使用GCN来融合文本和视觉信息。\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a97b7647968a4fa59e7b14b384dd7ffe812f158db8f741459b6e6bb0e8b657c7\" width=\"800\"/></center>\n",
"<center>图 24:SERA算法流程图</center>\n",
"### 3.5 基于End to End 的方法\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e4a3b0f65254f6b9d40cea0875854d4f47e1dca6b1e408cad435b3629600608\" width=\"1300\"/></center>\n",
"<center>图 25: Trie算法流程图</center>\n",
"### 3.6 数据集\n",
"1. SROIE: SROIE数据集[2]的任务3旨在从扫描收据中提取四个预定义的信息:公司、日期、地址或总数。数据集中有626个样本用于训练,347个样本用于测试。\n",
"2. FUNSD: FUNSD数据集[3]是一个用于从扫描文档中提取表单信息的数据集。它包含199个标注好的真实扫描表单。199个样本中149个用于训练,50个用于测试。FUNSD数据集为每个单词分配一个语义实体标签:问题、答案、标题或其他。\n",
"3. XFUN: XFUN数据集是微软提出的一个多语言数据集,包含7种语言,每种语言包含149张训练集,50张测试集。\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/dfdf530d79504761919c1f093f9a86dac21e6db3304c4892998ea1823f3187c6) | ![](https://ai-studio-static-online.cdn.bcebos.com/3b2a9f9476be4e7f892b73bd7096ce8d88fe98a70bae47e6ab4c5fcc87e83861))\n",
"图 26: sroie示例图|图 27: xfun示例图\n",
"[1]:Mathew M, Karatzas D, Jawahar C V. Docvqa: A dataset for vqa on document images[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021: 2200-2209.\n",
"[2]:Huang Z, Chen K, He J, et al. Icdar2019 competition on scanned receipt ocr and information extraction[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1516-1520.\n",
"[3]:Jaume G, Ekenel H K, Thiran J P. Funsd: A dataset for form understanding in noisy scanned documents[C]//2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, 2: 1-6.\n",
"[4]:Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[J]. arXiv preprint arXiv:1603.01360, 2016.\n",
"[5]:Katti A R, Reisswig C, Guder C, et al. Chargrid: Towards understanding 2d documents[J]. arXiv preprint arXiv:1809.08799, 2018.\n",
"[6]:Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.\n",
"[7]:Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020.\n",
"[8]:Li Y, Qian Y, Yu Y, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1912-1920.\n",
"[9]:Zhang P, Xu Y, Cheng Z, et al. Trie: End-to-end text reading and information extraction for document understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1413-1422.\n",
"[10]:Liu X, Gao F, Zhang Q, et al. Graph convolution for multimodal information extraction from visually rich documents[J]. arXiv preprint arXiv:1903.11279, 2019.\n",
"[11]:Yu W, Lu N, Qi X, et al. Pick: Processing key information extraction from documents using improved graph learning-convolutional networks[C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 4363-4370.\n",
"[12]:Sun H, Kuang Z, Yue X, et al. Spatial Dual-Modality Graph Reasoning for Key Information Extraction[J]. arXiv preprint arXiv:2103.14470, 2021."
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"## 4. 总结\n",
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {
"collapsed": false
"source": [
"# 1. 课程预备知识\n",
"### 1.1 预备知识\n",
"### 1.2 基础环境准备\n",
"如果你没有本地资源,可以通过AI Studio实训平台完成代码运行,其中的每个项目都通过Notebook的方式呈现,方便开发者学习。若对Notebook的相关操作不熟悉,可以参考[AI Studio项目说明](https://ai.baidu.com/ai-doc/AISTUDIO/0k3e2tfzm)。\n",
"### 1.3 获取和运行代码\n",
"git clone https://github.com/PaddlePaddle/PaddleOCR\n",
"# 如果因为网络问题无法pull成功,也可选择使用码云上的托管:\n",
"git clone https://gitee.com/paddlepaddle/PaddleOCR\n",
"> 注:码云托管代码可能无法实时同步本github项目更新,存在3~5天延时,请优先使用推荐方式。\n",
"> ​\t\t如果你不熟悉git操作,可以直接在PaddleOCR的首页的 `Code` 中下载压缩包\n",
"cd PaddleOCR\n",
"pip3 install -r requirements.txt\n",
"### 1.4 查阅资料\n",
"[PaddleOCR使用文档](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/README_ch.md#%E6%96%87%E6%A1%A3%E6%95%99%E7%A8%8B) (中文) 中详细介绍了如何使用PaddleOCR完成模型应用、训练和部署。文档内容丰富,大多数用户的问题都在文档或FAQ中有所描述,尤其在[FAQ(中文)](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_ch/FAQ.md)中,按照深度学习的应用过程沉淀了用户的常见问题,建议大家仔细阅读。\n",
"### 1.5 寻求帮助\n",
"如果你在使用PaddleOCR的过程中遇到BUG、易用性或者文档相关的问题,可通过[Github issue](https://github.com/PaddlePaddle/PaddleOCR/issues)与官方联系,请按照issue模板尽可能多的提供信息,以便官方人员迅速定位问题。同时,微信群是广大PaddleOCR用户的日常交流阵地,更适合提问一些咨询类问题,除了有PaddleOCR团队成员以外,还会有热心开发者回答大家的问题。"
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
"nbformat": 4,
"nbformat_minor": 1
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"*Note: The above pictures are from the Internet*\n",
"# 1. OCR Technical Background\n",
"## 1.1 Application Scenarios of OCR Technology\n",
"* **<font color=red>What is OCR</font>**\n",
"OCR (Optical Character Recognition) is one of the key directions in computer vision. The traditional definition of OCR is generally oriented to scanned document objects. Now we often say OCR generally refers to scene text recognition (Scene Text Recognition, STR), mainly for natural scenes, such as plaques and other visible texts in various natural scenes as shown in the figure below.\n",
"<center>Figure 1: Document scene text recognition VS. Natural scene text recognition</center>\n",
"* **<font color=red>What are the application scenarios of OCR? </font>**\n",
"OCR technology has a wealth of application scenarios. A typical scenario is vertically-oriented structured text recognition widely used in daily life, such as license plate recognition, bank card information recognition, ID card information recognition, train ticket information recognition, and so on. The common feature of these small verticals is that the format is fixed. Therefore, it is very suitable to use OCR technology for automation, greatly reducing labor costs and improving.\n",
"This vertically-oriented structured text recognition is currently the most widely used and relatively mature technology scene in OCR.\n",
"<center>Figure 2: Application scenarios of OCR technology</center>\n",
"In addition to vertically-oriented structured text recognition, general OCR technology also has a wide range of applications and is often combined with other technologies to complete multi-modal tasks. For example, in video scenes, OCR technology is often used for subtitle automatic translation, content security monitoring, etc., Or combined with visual features to complete tasks such as video understanding and video search.\n",
"<center>Figure 3: General OCR in a multi-modal scene</center>\n",
"## 1.2 OCR Technical Challenge\n",
"The technical difficulties of OCR can be divided into two aspects: the algorithm layer and the application layer.\n",
"* **<font color=red>Algorithm layer</font>**\n",
"The rich application scenarios of OCR determine that it will have many technical difficulties. Here are 8 common problems:\n",
"<center>Figure 4: Technical difficulties of OCR algorithm layer</center>\n",
"These problems bring huge technical challenges to both text detection and text recognition. It can be seen that these challenges are mainly oriented to natural scenes. At present, research in academia mainly focuses on natural scenes, and the commonly used academic datasets in the OCR field are also natural scenes. There are many studies on these issues. Relatively speaking, identification is more challenging than detection.\n",
"* **<font color=red>Application layer</font>**\n",
"In practical applications, especially in a wide range of general scenarios, in addition to the technical difficulties at the algorithm level such as affine transformation, scale problems, insufficient lighting, and shooting blur summarized in the previous section, OCR technology also faces two major difficulties:\n",
"1. **Massive data requires OCR to be able to process in real time.** OCR applications are often connected to massive data. Real-time processing of the data is required or hoped for. Real-time model speed is a big challenge.\n",
"2. **The end-side application requires that the OCR model is light enough and the recognition speed is fast enough.** OCR applications are often deployed on mobile terminals or embedded hardware. There are generally two modes for terminal-side OCR applications: upload to server vs. terminal-side direct recognition. Considering that the method of uploading to the server has requirements on the network, the real-time performance is low, and the server pressure is high when the request volume is too large, as well as the security of data transmission, we hope to complete the OCR identification directly on the terminal side. However, the storage space and computing power of the terminal side are limited, so there are high requirements for the size and prediction speed of the OCR model.\n",
"<center>Figure 5: Technical difficulties of OCR application layer</center>"
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. OCR Cutting-edge Algorithm\n",
"Although OCR is a relatively specific task, it involves many aspects of technology, including text detection, text recognition, end-to-end text recognition, document analysis, and so on. Academic research on various related technologies of OCR emerges endlessly. The following will briefly introduce the related work of several key technologies in the OCR task.\n",
"## 2.1 Text Detection\n",
"The task of text detection is to locate text regions in the input image. In recent years, research on text detection in academia has been very rich. A class of methods regard text detection as a specific scene in target detection, and improve and adapt based on general target detection algorithms. For example, TextBoxes[1] is based on one-stage target detector SSD. The algorithm [2] adjusts the target frame to fit text lines with extreme aspect ratios, while CTPN [3] is improved based on the Faster RCNN [4] architecture. However, there are still some differences between text detection and target detection in the target information and the task itself. For example, the text is generally larger in length and width, often in the shape of \"stripes\", and the text lines may be denser, curved text, etc. Therefore, many algorithms dedicated to text detection have been derived, such as EAST[5], PSENet[6], DBNet[7] and so on.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/548b50212935402abb2e671c158c204737c2c64b9464442a8f65192c8a31b44d\" width=\"500\"></center>\n",
"<center>Figure 6: Example of text detection task</center>\n",
"At present, the more popular text detection algorithms can be roughly divided into two categories: **based on regression** and **based on segmentation**. There are also some algorithms that combine the two. Algorithms based on regression draw on general object detection algorithms, by setting the anchor regression detection frame, or directly doing pixel regression. This type of method has a better detection effect on regular-shaped text, but the detection effect on irregularly-shaped text will be relatively poor. For example, CTPN [3] has better detection effect on horizontal text, but poor detection effect on oblique and curved text. SegLink [8] is more effective for long text, but has limited effect on sparsely distributed text; algorithm based on segmentation Introduced Mask-RCNN [9], this type of algorithm can reach a higher level in various scenes and texts of various shapes, but the disadvantage is that the post-processing is generally more complicated, so there are often speed problems. And it cannot solve the problem of detecting overlapping text.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/4f4ea65578384900909efff93d0b7386e86ece144d8c4677b7bc94b4f0337cfb\" width=\"800\"></center>\n",
"<center>Figure 7: Overview of text detection algorithms</center>\n",
"<center>Figure 8: (left) CTPN[3] algorithm optimization based on regression anchor (middle) DB[7] algorithm optimization post-processing based on segmentation (right) SAST[10] algorithm of regression + segmentation</center>\n",
"The related technology of text detection will be interpreted and actual combat in detail in Chapter 2.\n",
"## 2.2 Text Recognition\n",
"The task of text recognition is to recognize the text content in the image, and the input generally comes from the text area of the image cut out by the text box obtained by text detection. Text recognition can generally be divided into two categories: **Regular Text Recognition** and **Irregular Text Recognition** according to the shape of the text to be recognized. Regular text mainly refers to printed fonts, scanned text, etc., and the text is roughly in the horizontal line position. Irregular text is often not in a horizontal position, and has problems such as bending, occlusion, and blurring. Irregular text scenes are very challenging, and it is also the main research direction in the field of text recognition.\n",
"<center>Figure 9: (Left) Regular text VS. (Right) Irregular text</center>\n",
"The algorithm of regular text recognition can be roughly divided into two types based on CTC and Sequence2Sequence according to the different decoding methods. The processing methods of converting the sequence features learned by the network into the final recognition result are different. The algorithm based on CTC is represented by the classic CRNN [11].\n",
"<center>Figure 10: CTC-based recognition algorithm VS. Attention-based recognition algorithm</center>\n",
"The recognition algorithms for irregular texts are more abundant. Methods such as STAR-Net [12] correct the irregular texts into regular rectangles by adding correction modules such as TPS before recognition. Attention-based methods such as RARE [13] enhance the attention to the correlation of parts between sequences. The segmentation-based method treats each character of a text line as an independent individual, and it is easier to recognize a single segmented character than to recognize the entire text line after correction. In addition, with the rapid development of Transformer [14] and its effectiveness in various tasks in recent years, a number of Transformer-based text recognition algorithms have also appeared. These methods use the transformer structure to solve the long-dependency modeling of CNN. The limitations of the problem, but also achieved good results.\n",
"<center>Figure 11: Recognition algorithm based on character segmentation [15]</center>\n",
"The related technologies of text recognition will be interpreted and actual combat in detail in Chapter 3.\n",
"## 2.3 Document Structure Recognition\n",
"OCR technology in the traditional sense can solve the detection and recognition needs of text. However, in practical application scenarios, structured information is often needed in the end, such as information formatting and extraction of ID cards and invoices, structured identification of tables, and so on. The application scenarios of OCR technology are mostly express document extraction, contract content comparison, financial factoring document information comparison, and logistics document identification. OCR result + post-processing is a commonly used structuring scheme, but the process is often complicated, and post-processing requires fine design and poor generalization. Under the background of the gradual maturity of OCR technology and the growing demand for structured information extraction, various technologies related to intelligent document analysis, such as layout analysis, table recognition, and key information extraction, have received more and more attention and research.\n",
"* **Layout Analysis**\n",
"Layout Analysis is mainly used to classify the content of document images. The categories can generally be divided into plain text, titles, tables, pictures, etc. Existing methods generally regard different plates in the document as different targets for detection or segmentation. For example, Soto Carlos [16], based on the target detection algorithm Faster R-CNN, combines context information and uses the inherent position information of the document content to improve the performance. Region detection performance. Sarkar Mausoom et al.[17] proposed a priori-based segmentation mechanism to train a document segmentation model on very high-resolution images, solving the problem that different structures in dense regions cannot be distinguished and merged due to excessive reduction of the original image.\n",
"<center>Figure 12: Schematic diagram of layout analysis tasks</center>\n",
"* **Table Recognition**\n",
"The task of table recognition is to identify and convert the table information in the document into an excel file. The types and styles of tables in text images are complex and diverse, such as different row and column combinations, different content text types, etc. In addition, the style of the document and the lighting environment during shooting have brought great challenges to table recognition. These challenges make table recognition always a research difficulty in the field of document understanding.\n",
"<center>Figure 13: Schematic diagram of form recognition task</center>\n",
"There are many types of table recognition methods. The early traditional algorithms based on heuristic rules, such as the T-Rect algorithm proposed by Kieninger [18] and others, generally use manual design rules and connected domain detection and analysis. In recent years, with the development of deep learning, some CNN-based table structure recognition algorithms have emerged, such as DeepTabStR proposed by Siddiqui Shoaib Ahmed [19] and others, and TabStruct-Net proposed by Raja Sachin [20] and others. In addition, with the rise of *Graph Neural Network*, some researchers try to apply *Graph Neural Network* to the problem of table structure recognition. Based on the *Graph Neural Network*, table recognition is regarded as a graph reconstruction problem, such as Xue Wenyuan [21] TGRNet proposed by et al. The end-to-end method directly uses the network to complete the HTML representation output of the table structure. Most of the end-to-end methods use the Seq2Seq method to complete the prediction of the table structure, such as some methods based on Attention or Transformer, such as TableMaster [22].\n",
"<center>Figure 14: Schematic diagram of form identification method</center>\n",
"* **Key Information Extraction**\n",
"Key Information Extraction (KIE) is an important task in Document VQA. It mainly extracts the key information needed from images, such as extracting name and citizen ID number information from ID cards. The types of such information are often It is fixed under a specific task, but is different between different tasks.\n",
"<center>Figure 15: Schematic diagram of DocVQA tasks</center>\n",
"KIE is usually divided into two sub-tasks for research:\n",
"- SER: Semantic Entity Recognition, to classify each detected text, such as dividing it into name and ID. As shown in the black box and red box in the figure below.\n",
"- RE: Relation Extraction, which classifies each detected text, such as dividing it into questions and answers. Then find the corresponding answer to each question. As shown in the figure below, the red and black boxes represent the question and the answer, respectively, and the yellow line represents the correspondence between the question and the answer.\n",
"<center>Figure 16: ser and re tasks</center>\n",
"The general KIE method is researched based on Named Entity Recognition (NER) [4], but this type of method only uses the text information in the image and lacks the use of visual and structural information, so the accuracy is not high. On this basis, the methods in recent years have begun to merge visual and structural information with text information. According to the principles used when fusing multi-modal information, these methods can be divided into the following four types:\n",
"- Grid-based method\n",
"- Token-based method\n",
"- GCN-based method\n",
"- Based on End to End method\n",
"Document analysis related technologies will be explained and actual combat in detail in Chapter 6.\n",
"## 2.4 Other Related Technologies\n",
"The previous mainly introduced three key technologies in the OCR field: text detection, text recognition, document structured recognition, and more other cutting-edge technologies related to OCR, including end-to-end text recognition, image preprocessing technology in OCR, and OCR data synthesis Etc., please refer to Chapter 7 and Chapter 8 of the tutorial.\n"
"# 3. Industrial Practice of OCR Technology\n",
"> You are Xiao Wang, what should I do?\n",
"> 1. I won't, I can't, I won't do it 😭\n",
"> 2. It is recommended that the boss find an outsourcing company or commercialization plan, anyway, spend the boss's money 😊\n",
"> 3. Find similar projects online, programming for Github😏\n",
"OCR technology will eventually fall into industrial practice. Although there is a lot of academic research on OCR technology, and the commercial application of OCR technology is relatively mature compared with other AI technologies, there are still some difficulties and challenges in actual industrial applications. The following will analyze from two perspectives of technology and industrial practice.\n",
"## 3.1 Difficulties in Industrial Practice\n",
"In actual industrial practice, developers often need to rely on open source community resources to start or promote projects, and developers using open source models often face three major problems:\n",
"<center>Figure 17: Three major problems in the practice of OCR technology industry</center>\n",
"**1. Can't find & can't choose**\n",
"The open source community is rich in resources, but information asymmetry makes developers unable to solve pain points efficiently. On the one hand, the open source community resources are too rich. Faced with a requirement, developers cannot quickly find a project that matches the business requirement from the massive code repository, that is, there is a problem of \"can't find\". On the other hand, when selecting algorithms, the indicators on the English public dataset cannot provide a direct reference for the Chinese scenarios that developers often face. Algorithm-by-algorithm verification takes a lot of time and manpower, and there is no guarantee that the most suitable algorithm will be selected, that is, \"can't choose\".\n",
"**2. Not applicable to industry scenarios**\n",
"The work in the open source community tends to focus more on effect optimization, such as open source or reproduction of academic paper codes, and generally focus more on algorithm effects. Compared with the work that balances the size and speed of the model, it is much less, and the model size and prediction are time-consuming Two indicators that cannot be ignored in industrial practice are as important as the model effect. Whether it is on the mobile side or the server side, the number of images to be recognized is often very large, and it is hoped that the model will be smaller, more accurate, and faster in prediction. GPU is too expensive, it is better to use CPU to run more economically. On the premise of meeting business needs, the lighter the model, the less resources it takes.\n",
"**3. Difficult optimization and many training deployment problems**\n",
"The direct use of open source algorithms or models generally cannot directly meet business needs. In actual business scenarios, OCR faces a variety of problems. The personalization of business scenarios often requires retraining of custom data sets. On existing open source projects, various optimizations are experimented. The cost of the method is higher. In addition, OCR application scenarios are very rich. There are a wide range of application requirements on the server and various mobile devices. The diversification of the hardware environment needs to support rich deployment methods. The open source community’s projects focus more on algorithms and models, and predict deployment. This part is obviously under-supported. To apply OCR technology from the algorithm in the paper to the application of technology, it has high requirements for the algorithm and engineering ability of the developer.\n",
"## 3.2 Industrial OCR Development Kit PaddleOCR\n",
"OCR industry practice requires a complete set of full-process solutions to speed up the research and development progress and save valuable research and development time. In other words, the ultra-lightweight model and its full-process solutions, especially for mobile terminals and embedded devices with limited computing power and storage space, can be said to be a rigid demand.\n",
"In this context, the industrial-grade OCR development kit [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) came into being.\n",
"The construction idea of ​​PaddleOCR starts from user portraits and needs, and relies on the core framework of flying oars, selects and reproduces a wealth of cutting-edge algorithms, and develops PP characteristic models that are more suitable for industrial landing based on recurring algorithms, and integrates training and promotion to provide A variety of predictive deployment methods to meet different demand scenarios of actual applications.\n",
"<center>Figure 18: Panorama of PaddleOCR development kit</center>\n",
"It can be seen from the panorama that PaddleOCR relies on the core framework of the flying paddle, and provides a wealth of solutions in the model algorithm, pre-training model library, industrial-grade deployment, etc., and provides data synthesis and semi-automatic data annotation tools to meet the needs of development Data production needs of the author.\n",
"**At the model algorithm level**, PaddleOCR provides solutions for the two tasks of **text detection and recognition** and **document structure analysis** respectively. In terms of text detection and recognition, PaddleOCR has reproduced or open sourced 4 text detection algorithms, 8 text recognition algorithms, and 1 end-to-end text recognition algorithm. On this basis, a general text detection and recognition solution of the PP-OCR series is developed. In terms of document structure analysis, PaddleOCR provides algorithms such as layout analysis, table recognition, key information extraction, and named entity recognition, and based on this, it proposes a PP-Structure document analysis solution. A rich selection of algorithms can meet the needs of developers in different business scenarios. The unification of the code framework also facilitates the optimization and performance comparison of different algorithms for developers.\n",
"**At the level of pre-training model library**, based on PP-OCR and PP-Structure solutions, PaddleOCR has developed and open-sourced PP series characteristic models suitable for industrial practice, including general-purpose, ultra-lightweight and multi-language text detection and recognition Models, and complex document analysis models. The PP series characteristic models are deeply optimized on the original algorithm, so that they can reach the practical level of the industry in terms of effect and performance. Developers can either directly apply to business scenarios or use business data for simple finetune. Easily develop a \"practical model\" suitable for your business needs.\n",
"**At the industrial level of deployment**, PaddleOCR provides a server-side prediction solution based on Paddle Inference, a service-based deployment solution based on Paddle Serving, and an end-side deployment solution based on Paddle-Lite to meet the deployment needs of different hardware environments , At the same time, it provides a model compression scheme based on PaddleSlim, which can further compress the model size. The above deployment methods have completed the whole process of training and pushing to ensure that developers can deploy efficiently, stably and reliably.\n",
"**At the data tool level**, PaddleOCR provides a semi-automatic data annotation tool PPOCRLabel and a data synthesis tool Style-Text to help developers more conveniently produce the data sets and annotation information required for model training. PPOCRLabel, as the industry's first open source semi-automatic OCR data annotation tool, is aimed at the tedious and tedious process of labeling, high mechanicality, manual labeling of a large amount of training data, and expensive time and money. The built-in PP-OCR model realizes pre-labeling + manual verification. The labeling mode can greatly improve labeling efficiency and save labor costs. The data synthesis tool Style-Text mainly solves the serious shortage of real data in actual scenes. Traditional synthesis algorithms cannot synthesize text styles (fonts, colors, spacing, background). Only a few target scene images are needed to synthesize a large number of target scene styles in batches. Similar text images.\n",
"<center>Figure 19: Schematic diagram of PPOCRLabel usage</center>\n",
"<center>Figure 20: Example of Style-Text synthesis effect</center>\n",
"### 3.2.1 PP-OCR and PP-Structrue\n",
"The PP series characteristic model is a model that is deeply optimized for the practical needs of the industry by various visual development kits of the flying propeller, striving for a balance between speed and accuracy. The PP series featured models in PaddleOCR include PP-OCR series models for text detection and recognition tasks and PP-Structure series models for document analysis.\n",
"**(1) PP-OCR Chinese and English model**\n",
"<center>Figure 21: Example of PP-OCR model recognition results in Chinese and English</center>\n",
"The typical two-stage OCR algorithm adopted by the Chinese and English models of PP-OCR, that is, the composition method of detection model + recognition model, the specific algorithm framework is as follows:\n",
"<center>Figure 22: Schematic diagram of PP-OCR system pipeline</center>\n",
"It can be seen that in addition to input and output, the core framework of PP-OCR contains 3 modules, namely: text detection module, detection frame correction module, and text recognition module.\n",
"- Text detection module: The core is a text detection model trained on the [DB](https://arxiv.org/abs/1911.08947) detection algorithm to detect the text area in the image;\n",
"- Detection frame correction module: Input the detected text box into the detection frame correction module. At this stage, the text box represented by the four points is corrected into a rectangular frame, which is convenient for subsequent text recognition. On the other hand, the text direction will be judged and corrected. For example, if the text line is judged to be upside down, it will be corrected. This function is realized by training a text direction classifier;\n",
"- Text recognition module: Finally, the text recognition module performs text recognition on the corrected detection box to obtain the text content in each text box. The classic text recognition algorithm used in PP-OCR [CRNN](https://arxiv.org/abs/1507.05717).\n",
"PaddleOCR has successively introduced PP-OCR[23] and PP-OCRv2[24] models.\n",
"PP-OCR model is divided into mobile version (lightweight version) and server version (universal version). The mobile version model is mainly optimized based on the lightweight backbone network MobileNetV3. The optimized model (detection model + text direction classification model + recognition model) ) The size is only 8.1M, the average single image prediction on the CPU takes 350ms, and the T4 GPU is about 110ms. After cropping and quantization, it can be further compressed to 3.5M without changing the accuracy, which is convenient for end-side deployment. The previous test predicts that it will only take 260ms. For more PP-OCR evaluation data, please refer to [benchmark](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.2/doc/doc_ch/benchmark.md).\n",
"PP-OCRv2 maintains the overall framework of PP-OCR, mainly for further strategic optimization of effects. The improvement includes 3 aspects:\n",
"- Compared with the PP-OCR mobile version, the model effect is improved by over 7%;\n",
"- In terms of speed, compared to the PP-OCR server version, it has increased by more than 220%;\n",
"- In terms of model size, with a total size of 11.6M, both server and mobile terminals can be easily deployed.\n",
"The specific optimization strategies of PP-OCR and PP-OCRv2 will be explained in detail in Chapter 4.\n",
"In addition to the Chinese and English models, PaddleOCR also trained and open-sourced English digital models and multi-language recognition models based on different data sets. All of the above are ultra-lightweight models and are suitable for different language scenarios.\n",
"<center>Figure 23: Schematic diagram of the recognition effect of the English digital model and multilingual model of PP-OCR</center>\n",
"**(2) PP-Structure document analysis model**\n",
"PP-Structure supports three subtasks: layout analysis, table recognition, and DocVQA.\n",
"The core functions of PP-Structure are as follows:\n",
"- Support layout analysis of documents in the form of pictures, which can be divided into 5 types of areas: text, title, table, picture and list (used in conjunction with Layout-Parser)\n",
"- Support text, title, picture and list area to be extracted as text fields (used in conjunction with PP-OCR)\n",
"- Supports structured analysis in the table area, and the final result is output to an Excel file\n",
"- Support Python whl package and command line two ways, simple and easy to use\n",
"- Support custom training for two types of tasks: layout analysis and table structuring\n",
"- Support VQA tasks-SER and RE\n",
"<center>Figure 24: Schematic diagram of PP-Structure system (this figure only contains layout analysis + table identification)</center>\n",
"The specific plan of PP-Structure will be explained in detail in Chapter 6.\n",
"### 3.2.2 Industrial-grade Deployment Plan\n",
"The flying paddle supports full-process and full-scene inference deployment. There are three main sources of models. The first one uses PaddlePaddle API to build a network structure for training. The second is based on the flying paddle kit series. The flying paddle kit provides a wealth of models. Library, simple and easy-to-use API, with out-of-the-box use, including visual model library PaddleCV, intelligent speech library PaddleSpeech and natural language processing library PaddleNLP, etc. The third type uses X2Paddle tools from third-party frameworks (PyTorh, ONNX, TensorFlow, etc.) The output model.\n",
"The paddle model can be compressed, quantified, and distilled using PaddleSlim tools. It supports five deployment schemes, namely, servicing Paddle Serving, server/cloud Paddle Inference, mobile/edge Paddle Lite, web front end Paddle.js, and for Paddle. Unsupported hardware, such as MCU, Horizon, Kunyun and other domestic chips, can be converted into a third-party framework that supports ONNX with the help of Paddle2ONNX.\n",
"<center>Figure 25: Flying propeller support deployment method</center>\n",
"Paddle Inference supports server-side and cloud deployment, with high performance and versatility. It is deeply adapted and optimized for different platforms and different application scenarios. Paddle Inference is the native reasoning library for flying paddles, ensuring that the model can be trained on the server side. Use, rapid deployment, suitable for high-performance hardware using multiple application language environments to deploy models with complex algorithms. The hardware covers x86 CPUs, Nvidia GPUs, and AI accelerators such as Baidu Kunlun XPU and Huawei Shengteng.\n",
"Paddle Lite is an end-side inference engine with lightweight and high-performance features. It has been configured and optimized in-depth for end-side devices and various application scenarios. Currently supports multiple platforms such as Android, IOS, embedded Linux devices, macOS, etc. The hardware covers ARM CPU and GPU, X86 CPU and new hardware such as Baidu Kunlun, Huawei Ascend and Kirin, Rockchip, etc.\n",
"Paddle Serving is a high-performance service framework designed to help users quickly deploy models in cloud services in a few steps. At present, Paddle Serving supports functions such as custom pre-processing, model combination, model hot load update, multi-machine multi-card multi-model, distributed reasoning, K8S deployment, security gateway and model encryption deployment, and support for multi-language and multi-client access. Paddle Serving The official also provides deployment examples of more than 40 models, including PaddleOCR, to help users get started faster.\n",
"<center>Figure 26: Support deployment mode of flying propeller</center>\n",
"The above deployment plan will be explained in detail and actual combat based on the PP-OCRv2 model in Chapter 5."
"# 4. Summary\n",
"This section first introduces the application scenarios and cutting-edge algorithms of OCR technology, and then analyzes the difficulties and three major challenges of OCR technology in industrial practice.\n",
"The contents of the subsequent chapters of this tutorial are arranged as follows:\n",
"* The second and third chapters introduce detection and identification technology and practice respectively;\n",
"* Chapter 4 introduces PP-OCR optimization strategy;\n",
"* Chapter 5 Predicting and deploying actual combat;\n",
"* Chapter 6 introduces document structuring;\n",
"* Chapter 7 introduces other OCR-related algorithms such as end-to-end, data preprocessing, and data synthesis;\n",
"* Chapter 8 introduces OCR related data sets and data synthesis tools.\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text detection FAQ\n",
"This section lists some of the problems that developers often encounter when using PaddleOCR's text detection model, and gives corresponding solutions or suggestions.\n",
"The FAQ is introduced in two parts, namely:\n",
" -Text detection training related\n",
" -Text detection and prediction related"
"## 1. FAQ about Text Detection Training\n",
"**1.1 What are the text detection algorithms provided by PaddleOCR?**\n",
"**A**: PaddleOCR contains a variety of text detection models, including regression-based text detection methods EAST and SAST, and segmentation-based text detection methods DB, PSENet.\n",
"**1.2: What data sets are used in the Chinese ultra-lightweight and general models in the PaddleOCR project? How many samples were trained, what configuration of GPUs, how many epochs were run, and how long did they run?**\n",
"**A**: For the ultra-lightweight DB detection model, the training data includes open source data sets lsvt, rctw, CASIA, CCPD, MSRA, MLT, BornDigit, iflytek, SROIE and synthetic data sets, etc. The total data volume is 10W, The data set is divided into 5 parts. A random sampling strategy is used during training. The training takes about 500 epochs on a 4-card V100GPU, which takes 3 days.\n",
"**1.3 Does the text detection training label require specific text labeling? What does the \"###\" in the label mean?**\n",
"**A**: Text detection training only needs the coordinates of the text area. The label can be four or fourteen points, arranged in the order of upper left, upper right, lower right, and lower left. The label file provided by PaddleOCR contains text fields. For unclear text in the text area, ### will be used instead. When training the detection model, the text field in the label will not be used.\n",
" \n",
"**1.4 Is the effect of the text detection model trained when the text lines are tight?**\n",
"**A**: When using segmentation-based methods, such as DB, to detect dense text lines, it is best to collect a batch of data for training, and during training, a binary image will be generated [shrink_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/ppocr/data/imaug/make_shrink_map.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L37)Turn down the parameter. In addition, when forecasting, you can appropriately reduce [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L59) parameter, the larger the unclip_ratio parameter value, the larger the detection frame.\n",
"**1.5 For some large-sized document images, DB will have more missed inspections during inspection. How to avoid this kind of missed inspections?**\n",
"**A**: First of all, you need to determine whether the model is not well-trained or is the problem handled during prediction. If the model is not well trained, it is recommended to add more data for training, or add more data to enhance it during training.\n",
"If the problem is that the predicted image is too large, you can increase the longest side setting parameter [det_limit_side_len] entered during prediction [det_limit_side_len](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L47), which is 960 by default.\n",
"Secondly, you can observe whether the missed text has segmentation results by visualizing the post-processed segmentation map. If there is no segmentation result, the model is not well trained. If there is a complete segmentation area, it means that it is a problem of post-prediction processing. In this case, it is recommended to adjust [DB post-processing parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L51-L53)。\n",
"**1.6 The problem of missed detection of DB model bending text (such as a slightly deformed document image)?**\n",
"**A**: When calculating the average score of the text box in the DB post-processing, it is the average score of the rectangle area, which is easy to cause the missed detection of the curved text. The average score of the polygon area has been added, which will be more accurate, but the speed is somewhat different. Decrease, can be selected as needed, and you can view the [Visual Contrast Effect] (https://github.com/PaddlePaddle/PaddleOCR/pull/2604) in the relevant pr. This function is selected by the parameter [det_db_score_mode](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.1/tools/infer/utility.py#L51), the parameter value is optional [`fast` (default) , `slow`], `fast` corresponds to the original rectangle mode, and `slow` corresponds to the polygon mode. Thanks to the user [buptlihang](https://github.com/buptlihang) for mentioning [pr](https://github.com/PaddlePaddle/PaddleOCR/pull/2574) to help solve this problem.\n",
"**1.7 For simple OCR tasks with low accuracy requirements, how many data sets do I need to prepare?**\n",
"**A**: (1) The amount of training data is related to the complexity of the problem to be solved. The greater the difficulty and the higher the accuracy requirements, the greater the data set requirements, and in general, the more training data in practice, the better the effect.\n",
"(2) For scenes with low accuracy requirements, the amount of data required for detection tasks and recognition tasks is different. For inspection tasks, 500 images can guarantee the basic inspection results. For recognition tasks, it is necessary to ensure that the number of line text images in which each character in the recognition dictionary appears in different scenes needs to be greater than 200 (for example, if there are 5 words in the dictionary, each word needs to appear in more than 200 pictures, then The minimum required number of images should be between 200-1000), so that the basic recognition effect can be guaranteed.\n",
"**1.8 How to get more data when the amount of training data is small?**\n",
"**A**: When the amount of training data is small, you can try the following three ways to get more data: (1) Collect more training data manually, the most direct and effective way. (2) Basic image processing or transformation based on PIL and opencv. For example, the three modules of ImageFont, Image, ImageDraw in PIL write text into the background, opencv's rotating affine transformation, Gaussian filtering and so on. (3) Synthesize data using data generation algorithms, such as algorithms such as pix2pix.\n",
"**1.9 How to replace the backbone of text detection/recognition?**\n",
"A: Whether it is text detection or text recognition, the choice of backbone network is a trade-off between prediction effect and prediction efficiency. Generally, if you choose a larger-scale backbone network, such as ResNet101_vd, the detection or recognition will be more accurate, but the prediction time will increase accordingly. However, choosing a smaller-scale backbone network, such as MobileNetV3_small_x0_35, will predict faster, but the accuracy of detection or recognition will be greatly reduced. Fortunately, the detection or recognition effects of different backbone networks are positively correlated with the image 1000 classification task in the ImageNet dataset. PaddleClas, a flying paddle image classification suite, summarizes 23 series of classification network structures such as ResNet_vd, Res2Net, HRNet, MobileNetV3, GhostNet, etc. The top1 recognition accuracy rate of the above image classification task, GPU (V100 and T4) and CPU (Snapdragon 855) The prediction time-consuming and the corresponding 117 pre-training model download addresses.\n",
"(1) The replacement of the text detection backbone network is mainly to determine 4 stages similar to ResNet to facilitate the integration of subsequent detection heads similar to FPN. In addition, for the text detection problem, the classification pre-training model trained by ImageNet can accelerate the convergence and improve the effect.\n",
"(2) The replacement of the backbone network for text recognition requires attention to the drop position of the network width and height stride. Since text recognition generally has a large ratio of width to height, the frequency of height reduction is less, and the frequency of width reduction is more. You can refer to [Changes to the MobileNetV3 backbone network in PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/ppocr/modeling/backbones/rec_mobilenet_v3.py)。\n",
"**1.10 How to finetune the detection model, such as freezing the previous layer or learning with a small learning rate for some layers?**\n",
"**A**: If you freeze certain layers, you can set the stop_gradient property of the variable to True, so that all the parameters before calculating this variable will not be updated, refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/faq/train_cn.html#id4\n",
"If learning with a smaller learning rate for some layers is not very convenient in the static graph, one method is to set a fixed learning rate for the weight attribute when the parameters are initialized, refer to: https://www.paddlepaddle.org.cn/documentation/docs/en/develop/api/paddle/fluid/param_attr/ParamAttr_cn.html#paramattr\n",
"In fact, our experiment found that directly loading the model to fine-tune without setting different learning rates of certain layers, the effect is also good.\n",
"**1.11 In the preprocessing part of DB, why should the length and width of the picture be processed into multiples of 32?**\n",
"**A**: It is related to the stride of the network downsampling. Take the resnet backbone network under inspection as an example. After the image is input to the network, it needs to be downsampled by 2 times for 5 times, a total of 32 times. Therefore, it is recommended that the input image size be a multiple of 32.\n",
"**1.12 In the PP-OCR series models, why does the backbone network for text detection not use SEBlock?**\n",
"**A**: The SE module is an important module of the MobileNetV3 network. Its purpose is to estimate the importance of each feature channel of the feature map, assign weights to each feature of the feature map, and improve the expressive ability of the network. However, for text detection, the resolution of the input network is relatively large, generally 640\\*640. It is difficult to use the SE module to estimate the importance of each feature channel of the feature map. The network improvement ability is limited, but the module is relatively time-consuming. In the PP-OCR system, the backbone network for text detection does not use the SE module. Experiments also show that when the SE module is removed, the size of the ultra-lightweight model can be reduced by 40%, and the text detection effect is basically not affected. For details, please refer to the PP-OCR technical article, https://arxiv.org/abs/2009.09941.\n",
"**1.13 The PP-OCR detection effect is not good, how to optimize it?**\n",
"**A**: Specific analysis of specific issues:\n",
"- If the detection effect is not available on your scene, the first choice is to do finetune training on your data;\n",
"- If the image is too large and the text is too dense, it is recommended not to over-compress the image. You can try to modify the resize logic of the detection preprocessing to prevent the image from being over-compressed;\n",
"- The size of the detection frame is too close to the text or the detection frame is too large, you can adjust the db_unclip_ratio parameter, increasing the parameter can enlarge the detection frame, and reducing the parameter can reduce the size of the detection frame;\n",
"- There are many missed detection problems in the detection frame, which can reduce the threshold parameter det_db_box_thresh for DB detection to prevent some detection frames from being filtered out. You can also try to set det_db_score_mode to'slow';\n",
"- Other methods can choose use_dilation as True to expand the feature map of the detection output. In general, the effect will be improved.\n",
"## 2. FAQ about Text Detection and Prediction\n",
"**2.1 In DB, some boxes are too pasted with text, but some corners of the text are removed to affect the recognition. Is there any way to alleviate this problem?**\n",
"**A**: The post-processing parameter [unclip_ratio](https://github.com/PaddlePaddle/PaddleOCR/blob/d80afce9b51f09fd3d90e539c40eba8eb5e50dd6/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L52) can be appropriately increased. the larger the parameter, the larger the text box.\n",
"**2.2 Why does the PaddleOCR detection prediction only support one image test? That is, test_batch_size_per_card=1**\n",
"**A**: When predicting, the image is scaled in equal proportions, the longest side is 960, and the length and width of different images after scaling in equal proportions are inconsistent, and they cannot form a batch, so set test_batch_size to 1.\n",
"**2.3 Accelerate PaddleOCR's text detection model prediction on the CPU?**\n",
"**A**: x86 CPU can use mkldnn (OneDNN) for acceleration; enable [enable_mkldnn](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L105) Parameters. In addition, in conjunction with increasing the number of threads used for prediction on the CPU, [num_threads](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py#L106) can effectively speed up the prediction speed on the CPU.\n",
"**2.4 Accelerate PaddleOCR's text detection model prediction on GPU?**\n",
"**A**: TensorRT is recommended for GPU accelerated prediction.\n",
"- 1. Download the Paddle installation package or prediction library with TensorRT from [link](https://paddleinference.paddlepaddle.org.cn/master/user_guides/download_lib.html).\n",
"- 2. Download the [TensorRT](https://developer.nvidia.com/tensorrt) from the Nvidia official website. Note that the downloaded TensorRT version is consistent with the TensorRT version compiled in the paddle installation package.\n",
"- 3. Set the environment variable `LD_LIBRARY_PATH` to point to the lib folder of TensorRT\n",
"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<TensorRT-${version}/lib>\n",
"- 4. Enable [tensorrt option](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L38).\n",
"**2.5 How to deploy PaddleOCR model on the mobile terminal?**\n",
"**A**: Flying Oar Paddle has a special tool for mobile deployment [PaddleLite](https://github.com/PaddlePaddle/Paddle-Lite), and PaddleOCR provides DB+CRNN as the demo android arm deployment code , Refer to [link](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/deploy/lite/readme.md).\n",
"**2.6 How to use PaddleOCR multi-process prediction?**\n",
"**A**: PaddleOCR recently added [Multi-Process Predictive Control Parameters](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L111), `use_mp` indicates whether When using multiple processes, `total_process_num` indicates the number of processes when using multiple processes. For specific usage, please refer to [document](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_ch/inference.md#1-%E8%B6%85%E8%BD%BB%E9%87%8F%E4%B8%AD%E6%96%87ocr%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86).\n",
"**2.7 Video memory explosion and memory leak during prediction?**\n",
"**A**: If it is the prediction of the training model, the video memory is not enough because the model is too large or the input image is too large, you can refer to the code and add paddle.no_grad() before the main function runs to reduce the video memory usage. If the memory usage of the inference model is too high, you can add [config.enable_memory_optim()](https://github.com/PaddlePaddle/PaddleOCR/blob/8b656a3e13631dfb1ac21d2095d4d4a4993ef710/tools/infer/utility.py?_pjax=%23js-repo-pjax-container%2C%20div%5Bitemtype%3D%22http%3A%2F%2Fschema.org%2FSoftwareSourceCode%22%5D%20main%2C%20%5Bdata-pjax-container%5D#L267) to reduce the memory usage when configuring Config.\n",
"In addition, regarding the memory leak when using Paddle to predict, it is recommended to install the latest version of paddle. The memory leak has been fixed."
"# Text Detection Algorithm Theory\n"
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1 Text Detection\n",
"The task of text detection is to find out the position of text in an image or video. Different from the task of target detection, target detection must not only solve the positioning problem, but also solve the problem of target classification.\n",
"The manifestation of text in images can be regarded as a kind of 'target', and general target detection methods are also suitable for text detection. From the perspective of the task itself:\n",
"- Target detection: Given an image or video, find out the location (box) of the target, and give the target category;\n",
"- Text detection: Given an input image or video, find out the area of the text, which can be a single character position or a whole text line position;\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/af2d8eca913a4d5a968945ae6cac180b009c6cc94abc43bfbaf1ba6a3de98125\" width=\"400\" ></center>\n",
"<br><center>Figure 1: Schematic diagram of target detection</center>\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/400b9100573b4286b40b0a668358bcab9627f169ab934133a1280361505ddd33\" width=\"1000\" ></center>\n",
"<br><center>Figure 2: Schematic diagram of text detection</center>\n",
"Object detection and text detection are both \"location\" problems. But text detection does not need to classify the target, and the shape of the text is complex and diverse.\n",
"The current text detection is generally natural scene text detection. The difficulty lies in:\n",
"1. Text in natural scenes is diverse: text detection is affected by text color, size, font, shape, direction, language, and text length;\n",
"2. Complex background and interference; text detection is affected by image distortion, blur, low resolution, shadow, brightness and other factors;\n",
"3. Dense or overlapping text will affect text detection;\n",
"4. Local consistency of text: a small part of a text line can also be considered as independent text.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/072f208f2aff47e886cf2cf1378e23c648356686cf1349c799b42f662d8ced00\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 3: Text detection scene</center>\n",
"In response to the above problems, many text detection algorithms based on deep learning have been derived to solve the problem of text detection in natural scenes. These methods can be divided into regression-based and segmentation-based text detection methods.\n",
"The next section will briefly introduce the classic text detection algorithm based on deep learning technology."
"## 2 Introduction to Text Detection Methods\n",
"In recent years, deep learning-based text detection algorithms have emerged one after another. These methods can be roughly divided into two categories:\n",
"1. Regression-based text detection method\n",
"2. Text detection method based on segmentation\n",
"This section screens the commonly used text detection methods from 2017 to 2021, and is classified according to the above two types of methods as shown in the following table:\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/22314238b70b486f942701107ffddca48b87235a473c4d8db05b317f132daea0\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 4: Text detection algorithm</center>\n",
"### 2.1 Regression-based Text Detection\n",
"The method based on the regression text detection method is similar to the method of the target detection algorithm. The text detection method has only two categories. The text in the image is regarded as the target to be detected, and the rest is regarded as the background.\n",
"#### 2.1.1 Horizontal Text Detection\n",
"Early text detection algorithms based on deep learning are improved from the target detection method and support horizontal text detection. For example, the TextBoxes algorithm is improved based on the SSD algorithm, and the CTPN is improved based on the two-stage target detection Fast-RCNN algorithm.\n",
"In TextBoxes[1], the algorithm is adjusted according to the one-stage target detector SSD, and the default text box is changed to a quadrilateral that adapts to the specifications of the text direction and aspect ratio, providing an end-to-end training text detection method without complicated Post-processing.\n",
"-Use a pre-selection box with a larger aspect ratio\n",
"-The convolution kernel has been changed from 3x3 to 1x5, which is more suitable for long text detection\n",
"-Adopt multi-scale input\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/3864ccf9d009467cbc04225daef0eb562ac0c8c36f9b4f5eab036c319e5f05e7\" width=\"1000\" ></center>\n",
"<br><center>Figure 5: Textbox frame diagram</center>\n",
"CTPN[3] is based on the Fast-RCNN algorithm, expands the RPN module and designs a CRNN-based module to allow the entire network to detect text sequences from convolutional features. The two-stage method obtains more accurate feature positioning through ROI Pooling. But TextBoxes and CTPN only support the detection of horizontal text.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/452833c2016e4cf7b35291efd09740c13c4bfb8f7c56446b8f7a02fc7eb3e901\" width=\"1000\" ></center>\n",
"<br><center>Figure 6: CTPN frame diagram</center>\n",
"#### 2.1.2 Any Angle Text Detection\n",
"TextBoxes++[2] is improved on the basis of TextBoxes, and supports the detection of text at any angle. Structurally, unlike TextBoxes, TextBoxes++ detects multi-angle text. First, modify the aspect ratio of the preselection box and adjust the aspect ratio to 1, 2, 3, 5, 1/2, 1/3, 1/5. The second is to change the $1*5$ convolution kernel to $3*5$ to better learn the characteristics of the slanted text; finally, the representation information of the output rotating box of TextBoxes++.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/ae96e3acbac04be296b6d54a4d72e5881d592fcc91f44882b24bc7d38b9d2658\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 7: TextBoxes++ frame diagram</center>\n",
"EAST [4] proposed a two-stage text detection method for the location of slanted text, including FCN feature extraction and NMS part. EAST proposes a new text detection pipline structure, which can be trained end-to-end and supports the detection of text in any orientation, and has the characteristics of simple structure and high performance. FCN supports the output of inclined rectangular frame and horizontal frame, and the output format can be freely selected.\n",
"-If the output detection shape is RBox, output Box rotation angle and AABB text shape information, AABB represents the offset to the top, bottom, left, and right sides of the text box. RBox can rotate rectangular text.\n",
"-If the output detection box is a four-point box, the last dimension of the output is 8 numbers, which represents the position offset from the four corner vertices of the quadrilateral. This output method can predict irregular quadrilateral text.\n",
"Considering that the text box output by FCN is relatively redundant, for example, the box generated by the adjacent pixels of a text area has a high degree of coincidence. But it is not the detection frame generated by the same text, and the degree of coincidence is very small. Therefore, EAST proposes to merge the prediction boxes by row first. Finally, filter the remaining quads with the original NMS.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/d7411ada08714adab73fa0edf7555a679327b71e29184446a33d81cdd910e4fc\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 8: EAST frame diagram</center>\n",
"MOST [15] proposed that the TFAM module dynamically adjusts the receptive field of coarse-grained detection results, and also proposed that PA-NMS combines reliable detection and prediction results based on location information. In addition, the Instance-wise IoU loss function is also proposed during training, which is used to balance training to handle text instances of different scales. This method can be combined with the EAST method, and has better detection effect and performance in detecting texts with extreme aspect ratios and different scales.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/73052d9439714bba86ffe4a959d58c523b07baf3f1d74882b4517e71f5a645fe\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 9: MOST frame diagram</center>\n",
"#### 2.1.3 Curved Text Detection\n",
"Using regression to solve the problem of curved text detection, a simple idea is to describe the boundary polygon of the curved text with multi-point coordinates, and then directly predict the vertex coordinates of the polygon.\n",
"CTD [6] proposed to directly predict the boundary polygons of 14 vertices of curved text. The network uses the Bi-LSTM [13] layer to refine the prediction coordinates of the vertices, and realizes the detection of curved text based on the regression method.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e33d76ebb814cac9ebb2942b779054af160857125294cd69481680aca2fa98a\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 10: CTD frame diagram</center>\n",
"LOMO [19] proposes iterative optimization of text localization features to obtain finer text localization for long text and curved text problems. The method consists of three parts: the coordinate regression module DR, the iterative optimization module IRM and the arbitrary shape expression module SEM. They are used to generate approximate text regions, iteratively optimize text localization features, and predict text regions, text centerlines, and text boundaries. Iteratively optimized text features can better solve the problem of long text localization and obtain more accurate text area localization.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e90adf3ca25a45a0af0b84a181fbe2c4954be1fcca8f4049957128548b7131ef\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 11: LOMO frame diagram</center>\n",
"Contournet [18] is based on the proposed modeling of text contour points to obtain a curved text detection frame. This method first uses Adaptive-RPN to obtain the proposal features of the text area, and then designs a local orthogonal texture perception LOTM module to learn horizontal and vertical textures. The feature is represented by contour points. Finally, by considering the feature responses in two orthogonal directions at the same time, the Point Re-Scoring algorithm can effectively filter out the prediction of strong unidirectional or weak orthogonal activation, and the final text contour can be used as a group of high-quality contour points are shown.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1f59ab5db899412f8c70ba71e8dd31d4ea9480d6511f498ea492c97dd2152384\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 12: Contournet frame diagram</center>\n",
"PCR [14] proposed progressive coordinate regression to deal with curved text detection. The problem is divided into three stages. Firstly, the text area is roughly detected, and the text box is obtained. In addition, the corners of the smallest bounding box of the text are predicted by the designed Contour Localization Mechanism. Coordinates, and then the curved text is predicted by superimposing multiple CLM modules and RCLM modules. This method uses the text contour information aggregation to obtain a rich text contour feature representation, which can not only suppress the influence of redundant noise points on the coordinate regression, but also locate the text area more accurately.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/c677c4602cee44999ae4b38bd780b69795887f2ae10747968bb084db6209b6cc\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 13: PCR frame diagram</center>"
"### 2.2 Text Detection Based on Segmentation\n",
"Although the regression-based method has achieved good results in text detection, it is often difficult to obtain a smooth text surrounding curve for solving curved text, and the model is more complex and does not have performance advantages. Therefore, researchers proposed a text segmentation method based on image segmentation. First, classify at the pixel level, determine whether each pixel belongs to a text target, obtain the probability map of the text area, and obtain the enclosing curve of the text segmentation area through post-processing. .\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fb9e50c410984c339481869ba11c1f39f80a4d74920b44b084601f2f8a23099f\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 14: Schematic diagram of text segmentation algorithm</center>\n",
"Such methods are usually based on segmentation to achieve text detection, and segmentation-based methods have natural advantages for text detection with irregular shapes. The main idea of ​​the segmentation-based text detection method is to obtain the text area in the image through the segmentation method, and then use opencv, polygon and other post-processing to obtain the minimum enclosing curve of the text area.\n",
"Pixellink [7] uses a segmentation method to solve the text detection problem. The segmentation object is a text area. The pixels in the same text line (word) are linked together to segment the text, and the text bounding box is directly extracted from the segmentation result without a position. Regression can achieve the effect of text detection based on regression. However, there is a problem with the segmentation-based method. For texts with similar positions, the text segmentation area is prone to \"sticky\" problems. Wu, Yue et al. [8] proposed to separate the text while learning the boundary position of the text to better distinguish the text area. In addition, Tian et al. [9] proposed to map the pixels of the same text to the mapping space. In the mapping space, the distance of the mapping vector of the unified text is close, and the distance of the mapping vector of different texts becomes longer.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/462b5e1472824452a2c530939cda5e59ada226b2d0b745d19dd56068753a7f97\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 15: PixelLink frame diagram</center>\n",
"For the multi-scale problem of text detection, MSR [20] proposes to extract multiple scale features of the same image, then merge these features and up-sample to the original image size. The network finally predicts the text center area and each point of the text center area. The x-coordinate offset and y-coordinate offset of the nearest boundary point can finally get the contour coordinate set of the text area.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/9597efd68a224d60b74d7c51c99f7ff0ba9939e5cdb84fb79209b7e213f7d039\"\n",
"width=\"600\" ></center>\n",
"<br><center>Figure 16: MSR frame diagram</center>\n",
" \n",
"Aiming at the problem of segmentation-based text algorithms that are difficult to distinguish between adjacent texts, PSENet [10] proposed a progressive scale expansion network to learn text segmentation regions, predict text regions with different shrinkage ratios, and expand the detected text regions one by one. The essence of this method is The above is a variant of the boundary learning method, which can effectively solve the problem of detecting adjacent text of any shape.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fa870b69a2a5423cad7422f64c32e0645dfc31a4ecc94a52832cf8742cded5ba\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 17: PSENet frame diagram</center>\n",
"Assume that PSENet post-processing uses 3 kernels of different scales, as shown in the above figure s1, s2, and s3. First, start from the minimum kernel s1, calculate the connected domain of the text segmentation area, and get (b), and then expand the connected domain along the up, down, left, and right, and classify the pixels that belong to s2 but not to s1 in the expanded area. When encountering conflicts, the principle of \"first come first served\" is adopted, and the scale expansion operation is repeated, and finally independent segmented regions of different text lines can be obtained.\n",
"Seglink++ [17] proposed a characterization of the attraction and repulsion relationship between text block units for curved text and dense text, and then designed a minimum spanning tree algorithm to combine the units to obtain the final text detection box, and An instance-aware loss function is proposed so that the Seglink++ method can be trained end-to-end.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1a16568361c0468db537ac25882eed096bca83f9c1544a92aee5239890f9d8d9\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 18: Seglink++ frame diagram</center>\n",
"Although the segmentation method solves the problem of curved text detection, complex post-processing logic and prediction speed are also goals that need to be optimized.\n",
"PAN [11] aims at the problem of slow text detection and prediction speed, and improves the performance of the algorithm from the aspects of network design and post-processing. First, PAN uses the lightweight ResNet18 as the Backbone, and also designs the lightweight feature enhancement module FPEM and feature fusion module FFM to enhance the features extracted by the Backbone. In terms of post-processing, a pixel clustering method is used to merge pixels whose distance from the kernel is less than the threshold d along the predicted text center (kernel). PAN guarantees high accuracy while having faster prediction speed.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/a76771f91db246ee8be062f96fa2a8abc7598dd87e6d4755b63fac71a4ebc170\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 19: PAN frame diagram</center>\n",
"DBNet [12] aimed at the problem of time-consuming post-processing that requires the use of thresholds for binarization based on segmentation methods. It proposed a learnable threshold and cleverly designed a binarization function that approximates the step function to make the segmentation The network can learn the threshold of text segmentation end-to-end during training. The automatic adjustment of the threshold not only improves accuracy, but also simplifies post-processing and improves the performance of text detection.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/0d6423e3c79448f8b09090cf2dcf9d0c7baa0f6856c645808502678ae88d2917\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 20: DB frame diagram</center>\n",
"FCENet [16] proposed to express the text enclosing curve with Fourier transform parameters. Since the Fourier coefficient representation can theoretically fit any closed curve, by designing a suitable model to predict an arbitrary shape text enclosing box based on Fourier transform In this way, the detection accuracy of highly curved text instances in natural scene text detection is improved.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/45e9a374d97145689a961977f896c8f9f470a66655234c1498e1c8477e277954\"\n",
"width=\"1000\" ></center>\n",
"<br><center>Figure 21: FCENet frame diagram</center>\n",
"## 3 Summary\n",
"This section introduces the development of the field of text detection in recent years, including text detection methods based on regression and segmentation, and respectively enumerates and introduces the method ideas of some classic papers. The next section takes the PaddleOCR open source library as an example to introduce in detail the algorithm principles and core code implementation of DBNet."
"cell_type": "markdown",
"# Text Recognition Algorithm Theory\n",
"This chapter mainly introduces the theoretical knowledge of text recognition algorithms, including background introduction, algorithm classification and some classic paper ideas.\n",
"Through the study of this chapter, you can master:\n",
"1. The goal of text recognition\n",
"2. Classification of text recognition algorithms\n",
"3. Typical ideas of various algorithms\n",
"## 1 Background Introduction\n",
"Text recognition is a subtask of OCR (Optical Character Recognition), and its task is to recognize the text content of a fixed area. In the two-stage method of OCR, it is followed by text detection and converts image information into text information.\n",
"Specifically, the model inputs a positioned text line, and the model predicts the text content and confidence level in the picture. The visualization results are shown in the following figure:\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a7c3404f778b489db9c1f686c7d2ff4d63b67c429b454f98b91ade7b89f8e903 width=\"600\"></center>\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e72b1d6f80c342ac951d092bc8c325149cebb3763ec849ec8a2f54e7c8ad60ca width=\"600\"></center>\n",
"<br><center>Figure 1: Visualization results of model predicttion</center>\n",
"There are many application scenarios for text recognition, including document recognition, road sign recognition, license plate recognition, industrial number recognition, etc. According to actual scenarios, text recognition tasks can be divided into two categories: **Regular text recognition** and **Irregular Text recognition**.\n",
"* Regular text recognition: mainly refers to printed fonts, scanned text, etc., and the text is considered to be roughly in the horizontal position\n",
"* Irregular text recognition: It often appears in natural scenes, and due to the huge differences in text curvature, direction, deformation, etc., the text is often not in the horizontal position, and there are problems such as bending, occlusion, and blurring.\n",
"The figure below shows the data patterns of IC15 and IC13, which represent irregular text and regular text respectively. It can be seen that irregular text often has problems such as distortion, blurring, and large font differences. It is closer to the real scene and is also more challenging.\n",
"Therefore, the current major algorithms are trying to obtain higher indicators on irregular data sets.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/bae4fce1370b4751a3779542323d0765a02a44eace7b44d2a87a241c13c6f8cf width=\"400\">\n",
"<br><center>Figure 2: IC15 picture sample (irregular text)</center>\n",
"<img src=https://ai-studio-static-online.cdn.bcebos.com/b55800d3276f4f5fad170ea1b567eb770177fce226f945fba5d3247a48c15c34 width=\"400\"></center>\n",
"<br><center>Figure 3: IC13 picture sample (rule text)</center>\n",
"When comparing the capabilities of different recognition algorithms, they are often compared on these two types of public data sets. Comparing the effects on multiple dimensions, currently the more common English benchmark data sets are classified as follows:\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4d0aada261064031a16816b39a37f2ff6af70dbb57004cb7a106ae6485f14684 width=\"600\"></center>\n",
"<br><center>Figure 4: Common English benchmark data sets</center>\n",
"## 2 Text Recognition Algorithm Classification\n",
"In the traditional text recognition method, the task is divided into 3 steps, namely image preprocessing, character segmentation and character recognition. It is necessary to model a specific scene, and it will become invalid once the scene changes. In the face of complex text backgrounds and scene changes, methods based on deep learning have better performance.\n",
"Most existing recognition algorithms can be represented by the following unified framework, and the algorithm flow is divided into 4 stages:\n",
"We have sorted out the mainstream algorithm categories and main papers, refer to the following table:\n",
" \n",
"| Algorithm category | Main ideas | Main papers |\n",
"| -------- | --------------- | -------- |\n",
"| Traditional algorithm | Sliding window, character extraction, dynamic programming |-|\n",
"| ctc | Based on ctc method, sequence is not aligned, faster recognition | CRNN, Rosetta |\n",
"| Attention | Attention-based method, applied to unconventional text | RARE, DAN, PREN |\n",
"| Transformer | Transformer-based method | SRN, NRTR, Master, ABINet |\n",
"| Correction | The correction module learns the text boundary and corrects it to the horizontal direction | RARE, ASTER, SAR |\n",
"| Segmentation | Based on the method of segmentation, extract the character position and then do classification | Text Scanner, Mask TextSpotter |\n",
" \n",
"### 2.1 Regular Text Recognition\n",
"There are two mainstream algorithms for text recognition, namely the CTC (Conectionist Temporal Classification)-based algorithm and the Sequence2Sequence algorithm. The difference is mainly in the decoding stage.\n",
"The CTC-based algorithm connects the encoded sequence to the CTC for decoding; the Sequence2Sequence-based method connects the sequence to the Recurrent Neural Network (RNN) module for cyclic decoding. Both methods have been verified to be effective and mainstream. Two major practices.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f64eee66e4a6426f934c1befc3b138629324cf7360c74f72bd6cf3c0de9d49bd width=\"600\"></center>\n",
"<br><center>Figure 5: Left: CTC-based method, right: Sequece2Sequence-based method </center>\n",
"#### 2.1.1 Algorithm Based on CTC\n",
"The most typical algorithm based on CTC is CRNN (Convolutional Recurrent Neural Network) [1], and its feature extraction part uses mainstream convolutional structures, commonly used ResNet, MobileNet, VGG, etc. Due to the particularity of text recognition tasks, there is a large amount of contextual information in the input data. The convolution kernel characteristics of convolutional neural networks make it more focused on local information and lack long-dependent modeling capabilities, so it is difficult to use only convolutional networks. Dig into the contextual connections between texts. In order to solve this problem, the CRNN text recognition algorithm introduces the bidirectional LSTM (Long Short-Term Memory) to enhance the context modeling. Experiments prove that the bidirectional LSTM module can effectively extract the context information in the picture. Finally, the output feature sequence is input to the CTC module, and the sequence result is directly decoded. This structure has been verified to be effective and widely used in text recognition tasks. Rosetta [2] is a recognition network proposed by FaceBook, which consists of a fully convolutional model and CTC. Gao Y [3] et al. used CNN convolution instead of LSTM, with fewer parameters, and the performance improvement accuracy was the same.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/d3c96dd9e9794fddb12fa16f926abdd3485194f0a2b749e792e436037490899b width=\"600\"></center>\n",
"<center>Figure 6: CRNN structure diagram </center>\n",
"#### 2.1.2 Sequence2Sequence algorithm\n",
"In the Sequence2Sequence algorithm, the Encoder encodes all input sequences into a unified semantic vector, which is then decoded by the Decoder. In the decoding process of the decoder, the output of the previous moment is continuously used as the input of the next moment, and the decoding is performed in a loop until the stop character is output. The general encoder is an RNN. For each input word, the encoder outputs a vector and hidden state, and uses the hidden state for the next input word to get the semantic vector in a loop; the decoder is another RNN, which receives the encoder Output a vector and output a series of words to create a transformation. Inspired by Sequence2Sequence in the field of translation, Shi [4] proposed an attention-based codec framework to recognize text. In this way, rnn can learn character-level language models hidden in strings from training data.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/f575333696b7438d919975dc218e61ccda1305b638c5497f92b46a7ec3b85243 width=\"400\" hight=\"500\"></center>\n",
"<center>Figure 7: Sequence2Sequence structure diagram </center>\n",
"The above two algorithms have very good effects on regular text, but due to the limitations of network design, this type of method is difficult to solve the task of irregular text recognition of bending and rotation. In order to solve such problems, some algorithm researchers have proposed a series of improved algorithms on the basis of the above two types of algorithms.\n",
"### 2.2 Irregular Text Recognition\n",
"* Irregular text recognition algorithms can be divided into 4 categories: correction-based methods; Attention-based methods; segmentation-based methods; and Transformer-based methods.\n",
"#### 2.2.1 Correction-based Method\n",
"The correction-based method uses some visual transformation modules to convert irregular text into regular text as much as possible, and then uses conventional methods for recognition.\n",
"The RARE [4] model first proposed a correction scheme for irregular text. The entire network is divided into two main parts: a spatial transformation network STN (Spatial Transformer Network) and a recognition network based on Sequence2Squence. Among them, STN is the correction module. Irregular text images enter STN and are transformed into a horizontal image through TPS (Thin-Plate-Spline). This transformation can correct curved and transmissive text to a certain extent, and send it to sequence recognition after correction. Network for decoding.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/66406f89507245e8a57969b9bed26bfe0227a8cf17a84873902dd4a464b97bb5 width=\"600\"></center>\n",
"<center>Figure 8: RARE structure diagram </center>\n",
"The RARE paper pointed out that this method has greater advantages in irregular text data sets, especially comparing the two data sets CUTE80 and SVTP, which are more than 5 percentage points higher than CRNN, which proves the effectiveness of the correction module. Based on this [6] also combines a text recognition system with a spatial transformation network (STN) and an attention-based sequence recognition network.\n",
"Correction-based methods have better migration. In addition to Attention-based methods such as RARE, STAR-Net [5] applies correction modules to CTC-based algorithms, which is also a good improvement compared to traditional CRNN.\n",
"#### 2.2.2 Attention-based Method\n",
"The Attention-based method mainly focuses on the correlation between the parts of the sequence. This method was first proposed in the field of machine translation. It is believed that the result of the current word in the process of text translation is mainly affected by certain words, so it needs to be The decisive word has greater weight. The same is true in the field of text recognition. When decoding the encoded sequence, each step selects the appropriate context to generate the next state, which is conducive to obtaining more accurate results.\n",
"R^2AM [7] first introduced Attention into the field of text recognition. The model first extracts the encoded image features from the input image through a recursive convolutional layer, and then uses the implicitly learned character-level language statistics to decode the output through a recurrent neural network character. In the decoding process, the Attention mechanism is introduced to realize soft feature selection to make better use of image features. This selective processing method is more in line with human intuition.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/a64ef10d4082422c8ac81dcda4ab75bf1db285d6b5fd462a8f309240445654d5 width=\"600\"></center>\n",
"<center>Figure 9: R^2AM structure drawing </center>\n",
"A large number of algorithms will be explored and updated in the field of Attention in the future. For example, SAR[8] extends 1D attention to 2D attention. The RARE mentioned in the correction module is also a method based on Attention. Experiments prove that the Attention-based method has a good accuracy improvement compared with the CTC method.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/4e2507fb58d94ec7a9b4d17151a986c84c5053114e05440cb1e7df423d32cb02 width=\"600\"></center>\n",
"<center>Figure 10: Attention diagram</center>\n",
"#### 2.2.3 Method Based on Segmentation\n",
"The method based on segmentation is to treat each character of the text line as an independent individual, and it is easier to recognize the segmented individual characters than to recognize the entire text line after correction. It attempts to locate the position of each character in the input text image, and applies a character classifier to obtain these recognition results, simplifying the complex global problem into a local problem solving, and it has a relatively good effect in the irregular text scene. However, this method requires character-level labeling, and there is a certain degree of difficulty in data acquisition. Lyu [9] et al. proposed an instance word segmentation model for word recognition, which uses a method based on FCN (Fully Convolutional Network) in its recognition part. [10] Considering the problem of text recognition from a two-dimensional perspective, a character attention FCN is designed to solve the problem of text recognition. When the text is bent or severely distorted, this method has better positioning results for both regular and irregular text.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/fd3e8ef0d6ce4249b01c072de31297ca5d02fc84649846388f890163b624ff10 width=\"800\"></center>\n",
"<center>Figure 11: Mask TextSpotter structure diagram </center>\n",
"#### 2.2.4 Transformer-based Method\n",
"With the rapid development of Transformer, both classification and detection fields have verified the effectiveness of Transformer in visual tasks. As mentioned in the regular text recognition part, CNN has limitations in long-dependency modeling. The Transformer structure just solves this problem. It can focus on global information in the feature extractor and can replace additional context modeling modules (LSTM ).\n",
"Part of the text recognition algorithm uses Transformer's Encoder structure and convolution to extract sequence features. The Encoder is composed of multiple blocks stacked by MultiHeadAttentionLayer and Positionwise Feedforward Layer. The self-attention in MulitHeadAttention uses matrix multiplication to simulate the timing calculation of RNN, breaking the barrier of long-term dependence on timing in RNN. There are also some algorithms that use Transformer's Decoder module to decode, which can obtain stronger semantic information than traditional RNNs, and parallel computing has higher efficiency.\n",
"The SRN[11] algorithm connects the Encoder module of Transformer to ResNet50 to enhance the 2D visual features. A parallel attention module is proposed, which uses the reading order as a query, making the calculation independent of time, and finally outputs the aligned visual features of all time steps in parallel. In addition, SRN also uses Transformer's Eecoder as a semantic module to integrate the visual information and semantic information of the picture, which has greater benefits in irregular text such as occlusion and blur.\n",
"NRTR [12] uses a complete Transformer structure to encode and decode the input picture, and only uses a few simple convolutional layers for high-level feature extraction, and verifies the effectiveness of the Transformer structure in text recognition.\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/e7859f4469a842f0bd450e7e793a679d6e828007544241d09785c9b4ea2424a2 width=\"800\"></center>\n",
"<center>Figure 12: NRTR structure drawing </center>\n",
"SRACN [13] uses Transformer's decoder to replace LSTM, once again verifying the efficiency and accuracy advantages of parallel training.\n",
"## 3 Summary\n",
"This section mainly introduces the theoretical knowledge and mainstream algorithms related to text recognition, including CTC-based methods, Sequence2Sequence-based methods, and segmentation-based methods. The ideas and contributions of classic papers are listed respectively. The next section will explain the practical course based on the CRNN algorithm, from networking to optimization to complete the entire training process,\n",
"# Document Analysis Technology\n",
"This chapter mainly introduces the theoretical knowledge of document analysis technology, including background introduction, algorithm classification and corresponding ideas.\n",
"Through the study of this chapter, you can master:\n",
"1. Classification and typical ideas of layout analysis\n",
"2. Classification and typical ideas of table recognition\n",
"3. Classification and typical ideas of information extraction"
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify the content of document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layout, formats, poor document image quality, and the lack of large-scale annotated datasets make layout analysis still a challenging task. Document analysis often includes the following research directions:\n",
"1. Layout analysis module: Divide each document page into different content areas. This module can be used not only to delimit relevant and irrelevant areas, but also to classify the types of content it recognizes.\n",
"2. Optical Character Recognition (OCR) module: Locate and recognize all text present in the document.\n",
"3. Form recognition module: Recognize and convert the form information in the document into an excel file.\n",
"4. Information extraction module: Use OCR results and image information to understand and identify the specific information expressed in the document or the relationship between the information.\n",
"Since the OCR module has been introduced in detail in the previous chapters, the following three modules will be introduced separately for the above layout analysis, table recognition and information extraction. For each module, the classic or common methods and data sets of the module will be introduced."
"## 1. Layout Analysis\n",
"### 1.1 Background Introduction\n",
"Layout analysis is mainly used for document retrieval, key information extraction, content classification, etc. Its task is mainly to classify document images. Content categories can generally be divided into plain text, titles, tables, pictures, and lists. However, the diversity and complexity of document layouts, formats, poor document image quality, and the lack of large-scale annotated data sets make layout analysis still a challenging task.\n",
"The visualization of the layout analysis task is shown in the figure below:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2510dc76c66c49b8af079f25d08a9dcba726b2ce53d14c8ba5cd9bd57acecf19\" width=\"1000\"/></center>\n",
"<center>Figure 1: Layout analysis effect diagram</center>\n",
"The existing solutions are generally based on target detection or semantic segmentation methods, which are based on detecting or segmenting different patterns in the document as different targets.\n",
"Some representative papers are divided into the above two categories, as shown in the following table:\n",
"| Category | Main paper |\n",
"| ---------------- | -------- |\n",
"| Method based on target detection | [Visual Detection with Context](https://aclanthology.org/D19-1348.pdf),[Object Detection](https://arxiv.org/pdf/2003.13197v1.pdf),[VSR](https://arxiv.org/pdf/2105.06220v1.pdf)\n",
"| Semantic segmentation method |[Semantic Segmentation](https://arxiv.org/pdf/1911.12170v2.pdf) |\n",
"### 1.2 Method Based on Target Detection \n",
"Soto Carlos [1] is based on the target detection algorithm Faster R-CNN, combines context information and uses the inherent location information of the document content to improve the area detection performance. Li Kai [2] et al. also proposed a document analysis method based on object detection, which solved the cross-domain problem by introducing a feature pyramid alignment module, a region alignment module, and a rendering layer alignment module. These three modules complement each other. And adjust the domain from a general image perspective and a specific document image perspective, thus solving the problem of large labeled training datasets being different from the target domain. The following figure is a flow chart of layout analysis based on the target detection Faster R-CNN algorithm. \n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d396e0d6183243898c0961250ee7a49bc536677079fb4ba2ac87c653f5472f01\" width=\"800\"/></center>\n",
"<center>Figure 2: Flow chart of layout analysis based on Faster R-CNN</center>\n",
"### 1.3 Methods Based on Semantic Segmentation \n",
"Sarkar Mausoom [3] et al. proposed a priori-based segmentation mechanism to train a document segmentation model on very high-resolution images, which solves the problem of indistinguishable and merging of dense regions caused by excessively shrinking the original image. Zhang Peng [4] et al. proposed a unified framework VSR (Vision, Semantics and Relations) for document layout analysis in combination with the vision, semantics and relations in the document. The framework uses a two-stream network to extract the visual and Semantic features, and adaptively fusion of these features through the adaptive aggregation module, solves the limitations of the existing CV-based methods of low efficiency of different modal fusion and lack of relationship modeling between layout components.\n",
"### 1.4 Data Set\n",
"Although the existing methods can solve the layout analysis task to a certain extent, these methods rely on a large amount of labeled training data. Recently, many data sets have been proposed for document analysis tasks.\n",
"1. PubLayNet[5]: The data set contains 500,000 document images, of which 400,000 are used for training, 50,000 are used for verification, and 50,000 are used for testing. Five forms of table, text, image, title and list are marked.\n",
"2. HJDataset[6]: The data set contains 2271 document images. In addition to the bounding box and mask of the content area, it also includes the hierarchical structure and reading order of layout elements.\n",
"A sample of the PubLayNet data set is shown in the figure below:\n",
"<center class=\"two\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4b153117c9384f98a0ce5a6c6e7c205a4b1c57e95c894ccb9688cbfc94e68a1c\" width=\"400\"/><img src=\"https://ai-studio-static-online.cdn.bcebos.com/efb9faea39554760b280f9e0e70631d2915399fa97774eecaa44ee84411c4994\" width=\"400\"/>\n",
"<center>Figure 3: PubLayNet example</center>\n",
"## 2. Form Recognition\n",
"### 2.1 Background Introduction\n",
"Tables are common page elements in various types of documents. With the explosive growth of various types of documents, how to efficiently find tables from documents and obtain content and structure information, that is, table identification has become an urgent problem to be solved. The difficulties of form identification are summarized as follows:\n",
"1. The types and styles of tables are complex and diverse, such as *different rows and columns combined, different content text types*, etc.\n",
"2. The style of the document itself has various styles.\n",
"3. The lighting environment during shooting, etc.\n",
"The task of table recognition is to convert the table information in the document to an excel file. The task visualization is as follows:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/99faa017e28b4928a408573406870ecaa251b626e0e84ab685e4b6f06f601a5f\" width=\"1600\"/></center>\n",
"<center>Figure 4: Example image of table recognition, the left side is the original image, and the right side is the result image after table recognition, presented in Excel format</center>\n",
"Existing table recognition algorithms can be divided into the following four categories according to the principle of table structure reconstruction:\n",
"1. Method based on heuristic rules\n",
"2. CNN-based method\n",
"3. GCN-based method\n",
"4. Method based on End to End\n",
"Some representative papers are divided into the above four categories, as shown in the following table:\n",
"| Category | Idea | Main papers |\n",
"| ---------------- | ---- | -------- |\n",
"|Method based on heuristic rules|Manual design rules, connected domain detection analysis and processing|[T-Rect](https://www.researchgate.net/profile/Andreas-Dengel/publication/249657389_A_Paper-to-HTML_Table_Converting_System/links/0c9605322c9a67274d000000/A-Paper-to-HTML-Table-Converting-System.pdf),[pdf2table](https://citeseerx.ist.psu.edu/viewdoc/download?doi=|\n",
"| CNN-based method | target detection, semantic segmentation | [CascadeTabNet](https://arxiv.org/pdf/2004.12629v2.pdf), [Multi-Type-TD-TSR](https://arxiv.org/pdf/2105.11021v1.pdf), [LGPMA](https://arxiv.org/pdf/2105.06224v2.pdf), [tabstruct-net](https://arxiv.org/pdf/2010.04565v1.pdf), [CDeC-Net](https://arxiv.org/pdf/2008.10831v1.pdf), [TableNet](https://arxiv.org/pdf/2001.01469v1.pdf), [TableSense](https://arxiv.org/pdf/2106.13500v1.pdf), [Deepdesrt](https://www.dfki.de/fileadmin/user_upload/import/9672_PID4966073.pdf), [Deeptabstr](https://www.dfki.de/fileadmin/user_upload/import/10649_DeepTabStR.pdf), [GTE](https://arxiv.org/pdf/2005.00589v2.pdf), [Cycle-CenterNet](https://arxiv.org/pdf/2109.02199v1.pdf), [FCN](https://www.researchgate.net/publication/339027294_Rethinking_Semantic_Segmentation_for_Table_Structure_Recognition_in_Documents)|\n",
"| GCN-based method | Based on graph neural network, the table recognition is regarded as a graph reconstruction problem | [GNN](https://arxiv.org/pdf/1905.13391v2.pdf), [TGRNet](https://arxiv.org/pdf/2106.10598v3.pdf), [GraphTSR](https://arxiv.org/pdf/1908.04729v2.pdf)|\n",
"| Method based on End to End | Use attention mechanism | [Table-Master](https://arxiv.org/pdf/2105.01848v1.pdf)|\n",
"### 2.2 Traditional Algorithm Based on Heuristic Rules\n",
"Early research on table recognition was mainly based on heuristic rules. For example, the T-Rect system proposed by Kieninger [1] et al. used a bottom-up method to analyze the connected domain of document images, and then merge them according to defined rules to obtain logical text blocks. Then, pdf2table proposed by Yildiz[2] et al. is the first method for table recognition on PDF documents, which utilizes some unique information of PDF files (such as text, drawing paths and other information that are difficult to obtain in image documents) to assist with table recognition. In recent work, Koci[3] et al. expressed the layout area in the page as a graph, and then used the Remove and Conquer (RAC) algorithm to identify the table as a subgraph.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/66aeedb3f0924d80aee15f185e6799cc687b51fc20b74b98b338ca2ea25be3f3\" width=\"1000\"/></center>\n",
"<center>Figure 5: Schematic diagram of heuristic algorithm</center>\n",
"### 2.3 Method Based on Deep Learning CNN\n",
"With the rapid development of deep learning technology in computer vision, natural language processing, speech processing and other fields, researchers have applied deep learning technology to the field of table recognition and achieved good results.\n",
"In the DeepTabStR algorithm, Siddiqui Shoaib Ahmed [12] et al. described the table structure recognition problem as an object detection problem, and used deformable convolution to better detect table cells. Raja Sachin[6] et al. proposed that TabStruct-Net combines cell detection and structure recognition visually to perform table structure recognition, which solves the problem of recognition errors due to large changes in the table layout. However, this method cannot Deal with the problem of more empty cells in rows and columns.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/838be28836444bc1835ac30a25613d8b045a1b5aedd44b258499fe9f93dd298f\" width=\"1600\"/></center>\n",
"<center>Figure 6: Schematic diagram of algorithm based on deep learning CNN</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/4c40dda737bd44b09a533e1b1dd2e4c6a90ceea083bf4238b7f3c7b21087f409\" width=\"1600\"/></center>\n",
"<center>Figure 7: Example of algorithm error based on deep learning CNN</center>\n",
"The previous table structure recognition method generally starts from the elements of different granularities (row/column, text area), and it is easy to ignore the problem of merging empty cells. Qiao Liang [10] et al. proposed a new framework LGPMA, which makes full use of the information from local and global features through mask re-scoring strategy, and then can obtain more reliable alignment of the cell area, and finally introduces the inclusion of cell matching, empty The table structure restoration pipeline of cell search and empty cell merging handles the problem of table structure identification.\n",
"In addition to the above separate table recognition algorithms, there are also some methods that complete table detection and table recognition in one model. Schreiber Sebastian [11] et al. proposed DeepDeSRT, which uses Faster RCNN for table detection and FCN semantic segmentation model for Table structure row and column detection, but this method uses two independent models to solve these two problems. Prasad Devashish [4] et al. proposed an end-to-end deep learning method CascadeTabNet, which uses the Cascade Mask R-CNN HRNet model to perform table detection and structure recognition at the same time, which solves the problem of using two independent methods to process table recognition in the past. The insufficiency of the problem. Paliwal Shubham [8] et al. proposed a novel end-to-end deep multi-task architecture TableNet, which is used for table detection and structure recognition. At the same time, additional spatial semantic features are added to TableNet during training to further improve the performance of the model. Zheng Xinyi [13] et al. proposed a system framework GTE for table recognition, using a cell detection network to guide the training of the table detection network, and proposed a hierarchical network and a new clustering-based cell structure recognition algorithm, the framework can be connected to the back of any target detection model to facilitate the training of different table recognition algorithms. Previous research mainly focused on parsing from scanned PDF documents with a simple layout and well-aligned table images. However, the tables in real scenes are generally complex and may have serious deformation, bending or occlusion. Therefore, Long Rujiao [14] et al. 
also constructed a table recognition data set WTW in a realistic complex scene, and proposed a Cycle-CenterNet method, which uses the cyclic pairing module optimization and the proposed new pairing loss to accurately group discrete units into structured In the table, the performance of table recognition is improved.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a01f714cbe1f42fc9c45c6658317d9d7da2cec9726844f6b9fa75e30cadc9f76\" width=\"1600\"/></center>\n",
"<center>Figure 8: Schematic diagram of end-to-end algorithm</center>\n",
"The CNN-based method cannot handle the cross-row and column tables well, so in the follow-up method, two research methods are divided to solve the cross-row and column problems in the table.\n",
"### 2.4 Method Based on Deep Learning GCN\n",
"In recent years, with the rise of Graph Convolutional Networks (Graph Convolutional Network), some researchers have tried to apply graph neural networks to table structure recognition problems. Qasim Shah Rukh [20] et al. converted the table structure recognition problem into a graph problem compatible with graph neural networks, and designed a novel differentiable architecture that can not only take advantage of the advantages of convolutional neural networks to extract features, but also The advantages of the effective interaction between the vertices of the graph neural network can be used, but this method only uses the location features of the cells, and does not use the semantic features. Chi Zewen [19] et al. proposed a novel graph neural network, GraphTSR, for table structure recognition in PDF files. It takes cells in the table as input, and then uses the characteristics of the edges and nodes of the graph to be connected. Predicting the relationship between cells to identify the table structure solves the problem of cell identification across rows or columns to a certain extent. Xue Wenyuan [21] et al. reformulated the problem of table structure recognition as table map reconstruction, and proposed an end-to-end method for table structure recognition, TGRNet, which includes cell detection branch and cell logic location branch. , These two branches jointly predict the spatial and logical positions of different cells, which solves the problem that the previous method did not pay attention to the logical position of cells.\n",
"Diagram of GraphTSR table recognition algorithm:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/8ff89661142045a8aef54f8a7a2c69b1d243f8269034406a9e66bee2149f730f\" width=\"1600\"/></center>\n",
"<center>Figure 9: Diagram of GraphTSR table recognition algorithm</center>\n",
"### 2.5 Based on An End-to-End Approach\n",
"Different from other post-processing to complete the reconstruction of the table structure, based on the end-to-end method, directly use the network to complete the HTML representation output of the table structure\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/7865e58a83824facacfaa91bec12ccf834217cb706454dc5a0c165c203db79fb) | ![](https://ai-studio-static-online.cdn.bcebos.com/77d913b1b92f4a349b8f448e08ba78458d687eef4af142678a073830999f3edc))\n",
"Figure 10: Input and output of the end-to-end method | Figure 11: Image Caption example\n",
"Most end-to-end methods use Image Caption's Seq2Seq method to complete the prediction of the table structure, such as some methods based on Attention or Transformer.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/3571280a9c364d3499a062e3edc724294fb5eaef8b38440991941e87f0af0c3b\" width=\"800\"/></center>\n",
"<center>Figure 12: Schematic diagram of Seq2Seq</center>\n",
"Ye Jiaquan [22] obtained the table structure output model in TableMaster by improving the Master text algorithm based on Transformer. In addition, a branch is added for the coordinate regression of the box. The author did not split the model into two branches in the last layer, but decoupled the sequence prediction and the box regression into two after the first Transformer decoding layer. Branches. The comparison between its network structure and the original Master network is shown in the figure below:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f573709447a848b4ba7c73a2e297f0304caaca57c5c94588aada1f4cd893946c\" width=\"800\"/></center>\n",
"<center>Figure 13: Left: master network diagram, right: TableMaster network diagram</center>\n",
"### 2.6 Data Set\n",
"Since the deep learning method is data-driven, a large amount of labeled data is required to train the model, and the small size of the existing data set is also an important constraint, so some data sets have also been proposed.\n",
"1. PubTabNet[16]: Contains 568k table images and corresponding structured HTML representations.\n",
"2. PubMed Tables (PubTables-1M) [17]: Table structure recognition data set, containing highly detailed structural annotations, 460,589 pdf images used for table inspection tasks, and 947,642 table images used for table recognition tasks.\n",
"3. TableBank[18]: Table detection and recognition data set, using Word and Latex documents on the Internet to construct table data containing 417K high-quality annotations.\n",
"4. SciTSR[19]: Table structure recognition data set, most of the images are converted from the paper, which contains 15,000 tables from PDF files and their corresponding structure tags.\n",
"5. TabStructDB[12]: Includes 1081 table areas, which are marked with dense row and column information.\n",
"6. WTW[14]: Large-scale data set scene table detection and recognition data set, this data set contains table data in various deformation, bending and occlusion situations, and contains 14,581 images in total.\n",
"Data set example\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c9763df56e67434f97cd435100d50ded71ba66d9d4f04d7f8f896d613cdf02b0\" /></center>\n",
"<center>Figure 14: Sample diagram of PubTables-1M data set</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/64de203bbe584642a74f844ac4b61d1ec3c5a38cacb84443ac961fbcc54a66ce\" width=\"600\"/></center>\n",
"<center>Figure 15: Sample diagram of WTW data set</center>\n",
"## 3. Document VQA\n",
"The boss sent a task: develop an ID card recognition system\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/63bbe893465e4f98b3aec80a042758b520d43e1a993a47e39bce1123c2d29b3f\" width=\"1600\"/></center>\n",
"> How to choose a plan\n",
"> 1. Use rules to extract information after text detection\n",
 2. Use scale type">
"> 2. Use a model to extract information after text detection\n",
"> 3. Outsourcing\n",
"### 3.1 Background Introduction\n",
"In the VQA (Visual Question Answering) task, questions and answers are mainly aimed at the content of the image, but for text images, the content of interest is the text information in the image, so this type of method can be divided into Text-VQA and text-VQA in natural scenes. DocVQA of the scanned document scene, the relationship between the three is shown in the figure below.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a91cfd5152284152b020ca8a396db7a21fd982e3661540d5998cc19c17d84861\" width=\"600\"/></center>\n",
"<center>Figure 16: VQA level</center>\n",
"The sample pictures of VQA, Text-VQA and DocVQA are shown in the figure below.\n",
"|Task type|VQA | Text-VQA | DocVQA| \n",
"|Task description|Ask questions regarding **picture content**|Ask questions regarding **text content on pictures**|Ask questions regarding **text content of document images**|\n",
"|Sample picture|![vqa](https://ai-studio-static-online.cdn.bcebos.com/fc21b593276247249591231b3373608151ed8ae7787f4d6ba39e8779fdd12201)|![textvqa](https://ai-studio-static-online.cdn.bcebos.com/cd2404edf3bf430b89eb9b2509714499380cd02e4aa74ec39ca6d7aebcf9a559)|![docvqa](https://ai-studio-static-online.cdn.bcebos.com/0eec30a6f91b4f949c56729b856f7ff600d06abee0774642801c070303edfe83)|\n",
"Because DocVQA is closer to actual application scenarios, a large number of academic and industrial work has emerged. In common scenarios, the questions asked in DocVQA are fixed. For example, the questions in the ID card scenario are generally\n",
"1. What is the citizenship number?\n",
"2. What is your name?\n",
"3. What is a clan?\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/2d2b86468daf47c98be01f44b8d6efa64bc09e43cd764298afb127f19b07aede\" width=\"800\"/></center>\n",
"<center>Figure 17: Example of an ID card</center>\n",
"Based on this prior knowledge, DocVQA's research began to lean towards the Key Information Extraction (KIE) task. This time we also mainly discuss the KIE-related research. The KIE task mainly extracts the key information needed from the image, such as extracting from the ID card. Name and citizen identification number information.\n",
"KIE is usually divided into two sub-tasks for research\n",
"1. SER: Semantic Entity Recognition, to classify each detected text, such as dividing it into name and ID. As shown in the black box and red box in the figure below.\n",
"2. RE: Relation Extraction, which classifies each detected text, such as dividing it into questions and answers. Then find the corresponding answer to each question. As shown in the figure below, the red and black boxes represent the question and the answer, respectively, and the yellow line represents the correspondence between the question and the answer.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/899470ba601349fbbc402a4c83e6cdaee08aaa10b5004977b1f684f346ebe31f\" width=\"800\"/></center>\n",
"<center>Figure 18: Example of SER, RE task</center>\n",
"The general KIE method is researched based on Named Entity Recognition (NER) [4], but this type of method only uses the text information in the image and lacks the use of visual and structural information, so the accuracy is not high. On this basis, the methods in recent years have begun to integrate visual and structural information with text information. According to the principles used when fusing multimodal information, these methods can be divided into the following three types:\n",
"1. Grid-based approach\n",
"1. Token-based approach\n",
"2. GCN-based method\n",
"3. Based on the End to End method\n",
"Some representative papers are divided into the above three categories, as shown in the following table:\n",
"| Category | Ideas | Main Papers |\n",
"| ---------------- | ---- | -------- |\n",
"| Grid-based method | Fusion of multi-modal information on images (text, layout, image) | [Chargrid](https://arxiv.org/pdf/1809.08799) |\n",
"| Token-based method|Using methods such as Bert for multi-modal information fusion|[LayoutLM](https://arxiv.org/pdf/1912.13318), [LayoutLMv2](https://arxiv.org/pdf/2012.14740), [StrucText](https://arxiv.org/pdf/2108.02923), |\n",
"| GCN-based method|Using graph network structure for multi-modal information fusion|[GCN](https://arxiv.org/pdf/1903.11279), [PICK](https://arxiv.org/pdf/2004.07464), [SDMG-R](https://arxiv.org/pdf/2103.14470), [SERA](https://arxiv.org/pdf/2110.09915) |\n",
"| Based on End to End method | Unify OCR and key information extraction into one network | [Trie](https://arxiv.org/pdf/2005.13118) |\n",
"### 3.2 Grid-Based Method\n",
"The Grid-based method performs multimodal information fusion at the image level. Chargrid[5] firstly performs character-level text detection and recognition on the image, and then completes the construction of the network input by filling the one-hot encoding of the category into the corresponding character area (the non-black part in the right image below) , the input is finally passed through the CNN network of the encoder-decoder structure to perform coordinate detection and category classification of key information.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/f248841769ec4312a9015b4befda37bf29db66226431420ca1faad517783875e\" width=\"800\"/></center>\n",
"<center>Figure 19: Chargrid data example</center>\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0682e52e275b4187a0e74f54961a50091fd3a0cdff734e17bedcbc993f6e29f9\" width=\"800\"/></center>\n",
"<center>Figure 20: Chargrid network</center>\n",
"Compared with the traditional method based only on text, this method can use both text information and structural information, so it can achieve a certain accuracy improvement. It's good to combine the two.\n",
"### 3.3 Token-Based Method\n",
"LayoutLM[6] encodes 2D position information and text information together into the BERT model, and draws on the pre-training ideas of Bert in NLP, pre-training on large-scale data sets, and in downstream tasks, LayoutLM also adds image information To further improve the performance of the model. Although LayoutLM combines text, location and image information, the image information is fused in the training of downstream tasks, so the multi-modal fusion of the three types of information is not sufficient. Based on LayoutLM, LayoutLMv2 [7] integrates image information with text and layout information in the pre-training stage through transformers, and also adds a spatial perception self-attention mechanism to the Transformer to assist the model to better integrate visual and text features. Although LayoutLMv2 fuses text, location and image information in the pre-training stage, the visual features learned by the model are not fine enough due to the limitation of the pre-training task. StrucTexT [8] based on the previous multi-modal methods, proposed two new tasks, Sentence Length Prediction (SLP) and Paired Boxes Direction (PBD) in the pre-training task to help the network learn fine visual features. Among them, the SLP task makes the model Learn the length of the text segment, the PDB task allows the model to learn the matching relationship between Box directions. Through these two new pre-training tasks, the deep cross-modal fusion between text, visual and layout information can be accelerated.\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/17a26ade09ee4311b90e49a1c61d88a72a82104478434f9dabd99c27a65d789b) | ![](https://ai-studio-static-online.cdn.bcebos.com/d75addba67ef4b06a02ae40145e609d3692d613ff9b74cec85123335b465b3cc))\n",
"Figure 21: Transformer algorithm flow chart | Figure 22: LayoutLMv2 algorithm flow chart\n",
"### 3.4 GCN-Based Method\n",
"Although the existing GCN-based methods [10] use text and structure information, they do not make good use of image information. PICK [11] added image information to the GCN network and proposed a graph learning module to automatically learn edge types. SDMG-R [12] encodes the image as a bimodal graph. The nodes of the graph are the visual and textual information of the text area. The edges represent the direct spatial relationship between adjacent texts. By iteratively spreading information along the edges and inferring graph node categories, SDMG -R solves the problem that existing methods are incapable of unseen templates.\n",
"The PICK flow chart is shown in the figure below:\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/d3282959e6b2448c89b762b3b9bbf6197a0364b101214a1f83cf01a28623c01c\" width=\"800\"/></center>\n",
"<center>Figure 23: PICK algorithm flow chart</center>\n",
"SERA[10]The biaffine parser in dependency syntax analysis is introduced into document relation extraction, and GCN is used to fuse text and visual information.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a97b7647968a4fa59e7b14b384dd7ffe812f158db8f741459b6e6bb0e8b657c7\" width=\"800\"/></center>\n",
"<center>Figure 24: SERA algorithm flow chart</center>\n",
"### 3.5 Method Based on End to End\n",
"Existing methods divide KIE into two independent tasks: text reading and information extraction. However, they mainly focus on improving the task of information extraction, ignoring that text reading and information extraction are interrelated. Therefore, Trie [9] Proposed a unified end-to-end network that can learn these two tasks at the same time and reinforce each other in the learning process.\n",
"<center class=\"img\">\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e4a3b0f65254f6b9d40cea0875854d4f47e1dca6b1e408cad435b3629600608\" width=\"1300\"/></center>\n",
"<center>Figure 25: Trie algorithm flow chart</center>\n",
"### 3.6 Data Set\n",
"The data sets used for KIE mainly include the following two:\n",
"1. SROIE: Task 3 of the SROIE data set [2] aims to extract four predefined information from the scanned receipt: company, date, address or total. There are 626 samples in the data set for training and 347 samples for testing.\n",
"2. FUNSD: FUNSD data set [3] is a data set used to extract form information from scanned documents. It contains 199 marked real scan forms. Of the 199 samples, 149 are used for training and 50 are used for testing. The FUNSD data set assigns a semantic entity tag to each word: question, answer, title or other.\n",
"3. XFUN: The XFUN data set is a multilingual data set proposed by Microsoft. It contains 7 languages. Each language contains 149 training sets and 50 test sets.\n",
"![](https://ai-studio-static-online.cdn.bcebos.com/dfdf530d79504761919c1f093f9a86dac21e6db3304c4892998ea1823f3187c6) | ![](https://ai-studio-static-online.cdn.bcebos.com/3b2a9f9476be4e7f892b73bd7096ce8d88fe98a70bae47e6ab4c5fcc87e83861))\n",
"Figure 26: sroie example image | Figure 27: xfun example image\n",
"## 4. Summary\n",
"In this section, we mainly introduce the theoretical knowledge of three sub-modules related to document analysis technology: layout analysis, table recognition and information extraction. Below we will explain this form recognition and DOC-VQA practical tutorial based on the PaddleOCR framework."
"# 1. Course Prerequisites\n",
"The OCR model involved in this course is based on deep learning, so its related basic knowledge, environment configuration, project engineering and other materials will be introduced in this section, especially for readers who are not familiar with deep learning. content.\n",
"### 1.1 Preliminary Knowledge\n",
"The \"learning\" of deep learning has been developed from the content of neurons, perceptrons, and multilayer neural networks in machine learning. Therefore, understanding the basic machine learning algorithms is of great help to the understanding and application of deep learning. The \"deepness\" of deep learning is embodied in a series of vector-based mathematical operations such as convolution and pooling used in the process of processing a large amount of information. If you lack the theoretical foundation of the two, you can learn from teacher Li Hongyi's [Linear Algebra](https://aistudio.baidu.com/aistudio/course/introduce/2063) and [Machine Learning](https://aistudio.baidu.com/aistudio/course/introduce/1978) courses.\n",
"For the understanding of deep learning itself, you can refer to the zero-based course of Bai Ran, an outstanding architect of Baidu: [Baidu architects take you hands-on with zero-based practice deep learning](https://aistudio.baidu.com/aistudio/course/introduce/1297), which covers the development history of deep learning and introduces the complete components of deep learning through a classic case. It is a set of practice-oriented deep learning courses.\n",
"For the practice of theoretical knowledge, [Python basic knowledge](https://aistudio.baidu.com/aistudio/course/introduce/1224) is essential. At the same time, in order to quickly reproduce the deep learning model, the deep learning framework used in this course For: Flying PaddlePaddle. If you have used other frameworks, you can quickly learn how to use flying paddles through [Quick Start Document](https://www.paddlepaddle.org.cn/documentation/docs/zh/practices/quick_start/hello_paddle.html).\n",
"### 1.2 Basic Environment Preparation\n",
"If you want to run the code of this course in a local environment and have not built a Python environment before, you can follow the [zero-base operating environment preparation](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_ch/environment.md), install Anaconda or docker environment according to your operating system.\n",
"If you don't have local resources, you can run the code through the AI Studio training platform. Each item in it is presented in a notebook, which is convenient for developers to learn. If you are not familiar with the related operations of Notebook, you can refer to [AI Studio Project Description](https://ai.baidu.com/ai-doc/AISTUDIO/0k3e2tfzm).\n",
"### 1.3 Get and Run the Code\n",
"This course relies on the formation of PaddleOCR's code repository. First, clone the complete project of PaddleOCR:\n",
"# [recommend]\n",
"git clone https://github.com/PaddlePaddle/PaddleOCR\n",
"# If you cannot pull successfully due to network problems, you can also choose to use the hosting on Code Cloud:\n",
"git clone https://gitee.com/paddlepaddle/PaddleOCR\n",
"> Note: The code cloud hosted code may not be able to synchronize the update of this github project in real time, there is a delay of 3~5 days, please use the recommended method first.\n",
"> If you are not familiar with git operations, you can download the compressed package directly from the `Code` on the homepage of PaddleOCR\n",
"Then install third-party libraries:\n",
"cd PaddleOCR\n",
"pip3 install -r requirements.txt\n",
"### 1.4 Access to Information\n",
"[PaddleOCR Usage Document](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/README.md) describes in detail how to use PaddleOCR to complete model application, training and deployment. The document is rich in content, most of the user’s questions are described in the document or FAQ, especially in [FAQ](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.3/doc/doc_en/FAQ_en.md), in accordance with the application process of deep learning, has precipitated the user's common questions, it is recommended that you read it carefully.\n",
"### 1.5 Ask for Help\n",
"If you encounter BUG, ease of use or documentation related issues while using PaddleOCR, you can contact the official via [Github issue](https://github.com/PaddlePaddle/PaddleOCR/issues), please follow the issue template Provide as much information as possible so that official personnel can quickly locate the problem. At the same time, the WeChat group is the daily communication position for the majority of PaddleOCR users, and it is more suitable for asking some consulting questions. In addition to the PaddleOCR team members, there will also be enthusiastic developers answering your questions."
