text_detection_theory.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Detection Algorithm Theory\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1 Text Detection\n",
    "\n",
    "The task of text detection is to find out the position of text in an image or video. Different from the task of target detection, target detection must not only solve the positioning problem, but also solve the problem of target classification.\n",
    "\n",
    "The manifestation of text in images can be regarded as a kind of 'target', and general target detection methods are also suitable for text detection. From the perspective of the task itself:\n",
    "\n",
    "- Target detection: Given an image or video, find out the location (box) of the target, and give the target category;\n",
    "- Text detection: Given an input image or video, find out the area of the text, which can be a single character position or a whole text line position;\n",
    "\n",
    "\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/af2d8eca913a4d5a968945ae6cac180b009c6cc94abc43bfbaf1ba6a3de98125\" width=\"400\" ></center>\n",
    "\n",
    "<br><center>Figure 1: Schematic diagram of target detection</center>\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/400b9100573b4286b40b0a668358bcab9627f169ab934133a1280361505ddd33\" width=\"1000\" ></center>\n",
    "\n",
    "<br><center>Figure 2: Schematic diagram of text detection</center>\n",
    "\n",
    "Object detection and text detection are both \"location\" problems. But text detection does not need to classify the target, and the shape of the text is complex and diverse.\n",
    "\n",
    "The current text detection is generally natural scene text detection. The difficulty lies in:\n",
    "\n",
    "1. Text in natural scenes is diverse: text detection is affected by text color, size, font, shape, direction, language, and text length;\n",
    "2. Complex background and interference; text detection is affected by image distortion, blur, low resolution, shadow, brightness and other factors;\n",
    "3. Dense or overlapping text will affect text detection;\n",
    "4. Local consistency of text: a small part of a text line can also be considered as independent text.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/072f208f2aff47e886cf2cf1378e23c648356686cf1349c799b42f662d8ced00\"\n",
    "width=\"1000\" ></center>\n",
    "\n",
    "<br><center>Figure 3: Text detection scene</center>\n",
    "\n",
    "In response to the above problems, many text detection algorithms based on deep learning have been derived to solve the problem of text detection in natural scenes. These methods can be divided into regression-based and segmentation-based text detection methods.\n",
    "\n",
    "The next section will briefly introduce the classic text detection algorithm based on deep learning technology."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## 2 Introduction to Text Detection Methods\n",
    "\n",
    "\n",
    "In recent years, deep learning-based text detection algorithms have emerged one after another. These methods can be roughly divided into two categories:\n",
    "1. Regression-based text detection method\n",
    "2. Text detection method based on segmentation\n",
    "\n",
    "\n",
    "This section screens the commonly used text detection methods from 2017 to 2021, and is classified according to the above two types of methods as shown in the following table:\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/22314238b70b486f942701107ffddca48b87235a473c4d8db05b317f132daea0\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 4: Text detection algorithm</center>\n",
    "\n",
    "\n",
    "### 2.1 Regression-based Text Detection\n",
    "\n",
    "The method based on the regression text detection method is similar to the method of the target detection algorithm. The text detection method has only two categories. The text in the image is regarded as the target to be detected, and the rest is regarded as the background.\n",
    "\n",
    "#### 2.1.1 Horizontal Text Detection\n",
    "\n",
    "Early text detection algorithms based on deep learning are improved from the target detection method and support horizontal text detection. For example, the TextBoxes algorithm is improved based on the SSD algorithm, and the CTPN is improved based on the two-stage target detection Fast-RCNN algorithm.\n",
    "\n",
    "In TextBoxes[1], the algorithm is adjusted according to the one-stage target detector SSD, and the default text box is changed to a quadrilateral that adapts to the specifications of the text direction and aspect ratio, providing an end-to-end training text detection method without complicated Post-processing.\n",
    "-Use a pre-selection box with a larger aspect ratio\n",
    "-The convolution kernel has been changed from 3x3 to 1x5, which is more suitable for long text detection\n",
    "-Adopt multi-scale input\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/3864ccf9d009467cbc04225daef0eb562ac0c8c36f9b4f5eab036c319e5f05e7\" width=\"1000\" ></center>\n",
    "<br><center>Figure 5: Textbox frame diagram</center>\n",
    "\n",
    "CTPN[3] is based on the Fast-RCNN algorithm, expands the RPN module and designs a CRNN-based module to allow the entire network to detect text sequences from convolutional features. The two-stage method obtains more accurate feature positioning through ROI Pooling. But TextBoxes and CTPN only support the detection of horizontal text.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/452833c2016e4cf7b35291efd09740c13c4bfb8f7c56446b8f7a02fc7eb3e901\" width=\"1000\" ></center>\n",
    "<br><center>Figure 6: CTPN frame diagram</center>\n",
    "\n",
    "#### 2.1.2 Any Angle Text Detection\n",
    "\n",
    "TextBoxes++[2] is improved on the basis of TextBoxes, and supports the detection of text at any angle. Structurally, unlike TextBoxes, TextBoxes++ detects multi-angle text. First, modify the aspect ratio of the preselection box and adjust the aspect ratio to 1, 2, 3, 5, 1/2, 1/3, 1/5. The second is to change the $1*5$ convolution kernel to $3*5$ to better learn the characteristics of the slanted text; finally, the representation information of the output rotating box of TextBoxes++.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/ae96e3acbac04be296b6d54a4d72e5881d592fcc91f44882b24bc7d38b9d2658\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 7: TextBoxes++ frame diagram</center>\n",
    "\n",
    "\n",
    "EAST [4] proposed a two-stage text detection method for the location of slanted text, including FCN feature extraction and NMS part. EAST proposes a new text detection pipline structure, which can be trained end-to-end and supports the detection of text in any orientation, and has the characteristics of simple structure and high performance. FCN supports the output of inclined rectangular frame and horizontal frame, and the output format can be freely selected.\n",
    "-If the output detection shape is RBox, output Box rotation angle and AABB text shape information, AABB represents the offset to the top, bottom, left, and right sides of the text box. RBox can rotate rectangular text.\n",
    "-If the output detection box is a four-point box, the last dimension of the output is 8 numbers, which represents the position offset from the four corner vertices of the quadrilateral. This output method can predict irregular quadrilateral text.\n",
    "\n",
    "Considering that the text box output by FCN is relatively redundant, for example, the box generated by the adjacent pixels of a text area has a high degree of coincidence. But it is not the detection frame generated by the same text, and the degree of coincidence is very small. Therefore, EAST proposes to merge the prediction boxes by row first. Finally, filter the remaining quads with the original NMS.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/d7411ada08714adab73fa0edf7555a679327b71e29184446a33d81cdd910e4fc\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 8: EAST frame diagram</center>\n",
    "\n",
    "\n",
    "MOST [15] proposed that the TFAM module dynamically adjusts the receptive field of coarse-grained detection results, and also proposed that PA-NMS combines reliable detection and prediction results based on location information. In addition, the Instance-wise IoU loss function is also proposed during training, which is used to balance training to handle text instances of different scales. This method can be combined with the EAST method, and has better detection effect and performance in detecting texts with extreme aspect ratios and different scales.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/73052d9439714bba86ffe4a959d58c523b07baf3f1d74882b4517e71f5a645fe\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 9: MOST frame diagram</center>\n",
    "\n",
    "\n",
    "#### 2.1.3 Curved Text Detection\n",
    "\n",
    "Using regression to solve the problem of curved text detection, a simple idea is to describe the boundary polygon of the curved text with multi-point coordinates, and then directly predict the vertex coordinates of the polygon.\n",
    "\n",
    "CTD [6] proposed to directly predict the boundary polygons of 14 vertices of curved text. The network uses the Bi-LSTM [13] layer to refine the prediction coordinates of the vertices, and realizes the detection of curved text based on the regression method.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6e33d76ebb814cac9ebb2942b779054af160857125294cd69481680aca2fa98a\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 10: CTD frame diagram</center>\n",
    "\n",
    "\n",
    "\n",
    "LOMO [19] proposes iterative optimization of text localization features to obtain finer text localization for long text and curved text problems. The method consists of three parts: the coordinate regression module DR, the iterative optimization module IRM and the arbitrary shape expression module SEM. They are used to generate approximate text regions, iteratively optimize text localization features, and predict text regions, text centerlines, and text boundaries. Iteratively optimized text features can better solve the problem of long text localization and obtain more accurate text area localization.\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/e90adf3ca25a45a0af0b84a181fbe2c4954be1fcca8f4049957128548b7131ef\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 11: LOMO frame diagram</center>\n",
    "\n",
    "\n",
    "Contournet [18] is based on the proposed modeling of text contour points to obtain a curved text detection frame. This method first uses Adaptive-RPN to obtain the proposal features of the text area, and then designs a local orthogonal texture perception LOTM module to learn horizontal and vertical textures. The feature is represented by contour points. Finally, by considering the feature responses in two orthogonal directions at the same time, the Point Re-Scoring algorithm can effectively filter out the prediction of strong unidirectional or weak orthogonal activation, and the final text contour can be used as a group of high-quality contour points are shown.\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1f59ab5db899412f8c70ba71e8dd31d4ea9480d6511f498ea492c97dd2152384\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 12: Contournet frame diagram</center>\n",
    "\n",
    "\n",
    "PCR [14] proposed progressive coordinate regression to deal with curved text detection. The problem is divided into three stages. Firstly, the text area is roughly detected, and the text box is obtained. In addition, the corners of the smallest bounding box of the text are predicted by the designed Contour Localization Mechanism. Coordinates, and then the curved text is predicted by superimposing multiple CLM modules and RCLM modules. This method uses the text contour information aggregation to obtain a rich text contour feature representation, which can not only suppress the influence of redundant noise points on the coordinate regression, but also locate the text area more accurately.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/c677c4602cee44999ae4b38bd780b69795887f2ae10747968bb084db6209b6cc\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 13: PCR frame diagram</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "### 2.2 Text Detection Based on Segmentation\n",
    "\n",
    "Although the regression-based method has achieved good results in text detection, it is often difficult to obtain a smooth text surrounding curve for solving curved text, and the model is more complex and does not have performance advantages. Therefore, researchers proposed a text segmentation method based on image segmentation. First, classify at the pixel level, determine whether each pixel belongs to a text target, obtain the probability map of the text area, and obtain the enclosing curve of the text segmentation area through post-processing. .\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fb9e50c410984c339481869ba11c1f39f80a4d74920b44b084601f2f8a23099f\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 14: Schematic diagram of text segmentation algorithm</center>\n",
    "\n",
    "\n",
    "Such methods are usually based on segmentation to achieve text detection, and segmentation-based methods have natural advantages for text detection with irregular shapes. The main idea of the segmentation-based text detection method is to obtain the text area in the image through the segmentation method, and then use opencv, polygon and other post-processing to obtain the minimum enclosing curve of the text area.\n",
    "\n",
    "\n",
    "Pixellink [7] uses a segmentation method to solve the text detection problem. The segmentation object is a text area. The pixels in the same text line (word) are linked together to segment the text, and the text bounding box is directly extracted from the segmentation result without a position. Regression can achieve the effect of text detection based on regression. However, there is a problem with the segmentation-based method. For texts with similar positions, the text segmentation area is prone to \"sticky\" problems. Wu, Yue et al. [8] proposed to separate the text while learning the boundary position of the text to better distinguish the text area. In addition, Tian et al. [9] proposed to map the pixels of the same text to the mapping space. In the mapping space, the distance of the mapping vector of the unified text is close, and the distance of the mapping vector of different texts becomes longer.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/462b5e1472824452a2c530939cda5e59ada226b2d0b745d19dd56068753a7f97\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 15: PixelLink frame diagram</center>\n",
    "\n",
    "For the multi-scale problem of text detection, MSR [20] proposes to extract multiple scale features of the same image, then merge these features and up-sample to the original image size. The network finally predicts the text center area and each point of the text center area. The x-coordinate offset and y-coordinate offset of the nearest boundary point can finally get the contour coordinate set of the text area.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/9597efd68a224d60b74d7c51c99f7ff0ba9939e5cdb84fb79209b7e213f7d039\"\n",
    "width=\"600\" ></center>\n",
    "<br><center>Figure 16: MSR frame diagram</center>\n",
    "  \n",
    "Aiming at the problem of segmentation-based text algorithms that are difficult to distinguish between adjacent texts, PSENet [10] proposed a progressive scale expansion network to learn text segmentation regions, predict text regions with different shrinkage ratios, and expand the detected text regions one by one. The essence of this method is The above is a variant of the boundary learning method, which can effectively solve the problem of detecting adjacent text of any shape.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/fa870b69a2a5423cad7422f64c32e0645dfc31a4ecc94a52832cf8742cded5ba\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 17: PSENet frame diagram</center>\n",
    "\n",
    "Assume that PSENet post-processing uses 3 kernels of different scales, as shown in the above figure s1, s2, and s3. First, start from the minimum kernel s1, calculate the connected domain of the text segmentation area, and get (b), and then expand the connected domain along the up, down, left, and right, and classify the pixels that belong to s2 but not to s1 in the expanded area. When encountering conflicts, the principle of \"first come first served\" is adopted, and the scale expansion operation is repeated, and finally independent segmented regions of different text lines can be obtained.\n",
    "\n",
    "\n",
    "Seglink++ [17] proposed a characterization of the attraction and repulsion relationship between text block units for curved text and dense text, and then designed a minimum spanning tree algorithm to combine the units to obtain the final text detection box, and An instance-aware loss function is proposed so that the Seglink++ method can be trained end-to-end.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/1a16568361c0468db537ac25882eed096bca83f9c1544a92aee5239890f9d8d9\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 18: Seglink++ frame diagram</center>\n",
    "\n",
    "Although the segmentation method solves the problem of curved text detection, complex post-processing logic and prediction speed are also goals that need to be optimized.\n",
    "\n",
    "PAN [11] aims at the problem of slow text detection and prediction speed, and improves the performance of the algorithm from the aspects of network design and post-processing. First, PAN uses the lightweight ResNet18 as the Backbone, and also designs the lightweight feature enhancement module FPEM and feature fusion module FFM to enhance the features extracted by the Backbone. In terms of post-processing, a pixel clustering method is used to merge pixels whose distance from the kernel is less than the threshold d along the predicted text center (kernel). PAN guarantees high accuracy while having faster prediction speed.\n",
    "\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/a76771f91db246ee8be062f96fa2a8abc7598dd87e6d4755b63fac71a4ebc170\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 19: PAN frame diagram</center>\n",
    "\n",
    "DBNet [12] aimed at the problem of time-consuming post-processing that requires the use of thresholds for binarization based on segmentation methods. It proposed a learnable threshold and cleverly designed a binarization function that approximates the step function to make the segmentation The network can learn the threshold of text segmentation end-to-end during training. The automatic adjustment of the threshold not only improves accuracy, but also simplifies post-processing and improves the performance of text detection.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/0d6423e3c79448f8b09090cf2dcf9d0c7baa0f6856c645808502678ae88d2917\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 20: DB frame diagram</center>\n",
    "\n",
    "FCENet [16] proposed to express the text enclosing curve with Fourier transform parameters. Since the Fourier coefficient representation can theoretically fit any closed curve, by designing a suitable model to predict an arbitrary shape text enclosing box based on Fourier transform In this way, the detection accuracy of highly curved text instances in natural scene text detection is improved.\n",
    "\n",
    "<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/45e9a374d97145689a961977f896c8f9f470a66655234c1498e1c8477e277954\"\n",
    "width=\"1000\" ></center>\n",
    "<br><center>Figure 21: FCENet frame diagram</center>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 Summary\n",
    "\n",
    "This section introduces the development of the field of text detection in recent years, including text detection methods based on regression and segmentation, and respectively enumerates and introduces the method ideas of some classic papers. The next section takes the PaddleOCR open source library as an example to introduce in detail the algorithm principles and core code implementation of DBNet."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "1. Liao, Minghui, et al. \"Textboxes: A fast text detector with a single deep neural network.\" Thirty-first AAAI conference on artificial intelligence. 2017.\n",
    "2. Liao, Minghui, Baoguang Shi, and Xiang Bai. \"Textboxes++: A single-shot oriented scene text detector.\" IEEE transactions on image processing 27.8 (2018): 3676-3690.\n",
    "3. Tian, Zhi, et al. \"Detecting text in natural image with connectionist text proposal network.\" European conference on computer vision. Springer, Cham, 2016.\n",
    "4. Zhou, Xinyu, et al. \"East: an efficient and accurate scene text detector.\" Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.\n",
    "5. Wang, Fangfang, et al. \"Geometry-aware scene text detection with instance transformation network.\" Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.\n",
    "6. Yuliang, Liu, et al. \"Detecting curve text in the wild: New dataset and new solution.\" arXiv preprint arXiv:1712.02170 (2017).\n",
    "7. Deng, Dan, et al. \"Pixellink: Detecting scene text via instance segmentation.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. No. 1. 2018.\n",
    "8. Wu, Yue, and Prem Natarajan. \"Self-organized text detection with minimal post-processing via border learning.\" Proceedings of the IEEE International Conference on Computer Vision. 2017.\n",
    "9. Tian, Zhuotao, et al. \"Learning shape-aware embedding for scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
    "10. Wang, Wenhai, et al. \"Shape robust text detection with progressive scale expansion network.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
    "11. Wang, Wenhai, et al. \"Efficient and accurate arbitrary-shaped text detection with pixel aggregation network.\" Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.\n",
    "12. Liao, Minghui, et al. \"Real-time scene text detection with differentiable binarization.\" Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.\n",
    "13. Hochreiter, Sepp, and Jürgen Schmidhuber. \"Long short-term memory.\" Neural computation 9.8 (1997): 1735-1780.\n",
    "14. Dai, Pengwen, et al. \"Progressive Contour Regression for Arbitrary-Shape Scene Text Detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
    "15. He, Minghang, et al. \"MOST: A Multi-Oriented Scene Text Detector with Localization Refinement.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
    "16. Zhu, Yiqin, et al. \"Fourier contour embedding for arbitrary-shaped text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.\n",
    "17. Tang, Jun, et al. \"Seglink++: Detecting dense and arbitrary-shaped scene text by instance-aware component grouping.\" Pattern recognition 96 (2019): 106954.\n",
    "18. Wang, Yuxin, et al. \"Contournet: Taking a further step toward accurate arbitrary-shaped scene text detection.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.\n",
    "19. Zhang, Chengquan, et al. \"Look more than once: An accurate detector for text of arbitrary shapes.\" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.\n",
    "20. Xue C, Lu S, Zhang W. Msr: Multi-scale shape regression for scene text detection[J]. arXiv preprint arXiv:1901.02596, 2019. \n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "py35-paddle1.2.0"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}