introduction_en.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. PP-OCRv2 Introduction\n",
    "\n",
    "PP-OCR is a text detection and recognition system released by PaddleOCR for the OCR field. PP-OCRv2 has made some improvements for PP-OCR and constructed a new OCR system. The pipeline of PP-OCRv2 system is as follows (the pink box is the new policy of PP-OCRv2):\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200258931-771f5a1d-230c-4168-9130-0b79321558a9.png\"  width = \"80%\"  />\n",
    "</div>\n",
    "\n",
    "There are five enhancement strategies:\n",
    "1. Text detection enhancement strategies: collaborative mutual learning；\n",
    "2. Text detection enhancement strategies: CcopyPaste data augmentation;\n",
    "3. Text recognition enhancement strategies: PP-LCNet lightweight backbone network;\n",
    "4. Text recognition enhancement strategies: UDML knowledge distillation;\n",
    "5. Text recognition enhancement strategies: enhanced CTC loss;\n",
    "\n",
    "Compared with PP-OCR, the performance improvement of PP-OCRv2 is as follows:\n",
    "1. Compared with the PP-OCR mobile version, the accuracy of the model is improved by more than 7%;\n",
    "2. Compared with the PP-OCR server version, the speed is increased by more than 220%;\n",
    "3. The total model size is 11.6M, which can be easily deployed on both the server side and the mobile side;\n",
    "\n",
    "For more details, please refer to the technical report: https://arxiv.org/abs/2109.03144 .\n",
    "\n",
    "For more information about PaddleOCR, you can click https://github.com/PaddlePaddle/PaddleOCR to learn more.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Model Effects\n",
    "\n",
    "The results of PP-OCRv2 are as follows:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200239467-a082eef9-fee0-4587-be48-b276a95bf8d0.gif\"  width = \"80%\"  />\n",
    "</div>\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. How to Use the Model\n",
    "\n",
    "### 3.1 Inference\n",
    "* Install PaddleOCR whl package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "! pip install paddleocr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Quick experience"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# command line usage\n",
    "! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/dygraph/doc/imgs/11.jpg\n",
    "! paddleocr --image_dir 11.jpg --use_angle_cls true --ocr_version PP-OCRv2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the operation is complete, the following results will be output in the terminal:\n",
    "```log\n",
    "[[[24.0, 36.0], [304.0, 34.0], [304.0, 72.0], [24.0, 74.0]], ['纯臻营养护发素', 0.964739]]\n",
    "[[[24.0, 80.0], [172.0, 80.0], [172.0, 104.0], [24.0, 104.0]], ['产品信息/参数', 0.98069626]]\n",
    "[[[24.0, 109.0], [333.0, 109.0], [333.0, 136.0], [24.0, 136.0]], ['（45元/每公斤，100公斤起订）', 0.9676722]]\n",
    "......\n",
    "```\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Train the model\n",
    "The PP-OCR system consists of a text detection model, an angle classifier and a text recognition model. For the three model training tutorials, please refer to the following documents:\n",
    "1. text detection model: [text detection training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
    "1. angle classifier: [angle classifier training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/angle_class.md)\n",
    "1. text recognition model: [text recognition training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/recognition.md)\n",
    "\n",
    "After the model training is completed, it can be used in series by specifying the model path. The command reference is as follows:\n",
    "```python\n",
    "paddleocr --image_dir 11.jpg --use_angle_cls true --ocr_version PP-OCRv2 --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Model Principles\n",
    "\n",
    "The enhancement strategies are as follows\n",
    "\n",
    "1. Text detection enhancement strategies\n",
    "- Adopt CML (Collaborative Mutual Learning) collaborative mutual learning knowledge distillation strategy.\n",
    "     <div align=\"center\">\n",
    "   <img src=\"https://pic4.zhimg.com/80/v2-05f12bcd1784993edabdadbc89b5e9e7_720w.webp\"  width = \"60%\"  />\n",
    "   </div>\n",
    "\n",
    "As shown in the figure above, the core idea of CML combines ① the standard distillation of the traditional Teacher guiding Students and ② the direct DML mutual learning of the Students network, which allows the Students network to learn from each other and the Teacher network to guide. Correspondingly, the three key Loss loss functions are carefully designed: GT Loss, DML Loss and Distill Loss. Under the condition that the Teacher network Backbone is ResNet18, it has a good effect on the Student's MobileNetV3.\n",
    "\n",
    "   - CopyPaste data augmentation\n",
    "   <div align=\"center\">\n",
    "   <img src=\"https://pic1.zhimg.com/80/v2-90239608c554972ac307be07f487f254_720w.webp\"  width = \"60%\"  />\n",
    "   </div>\n",
    "\n",
    "Data augmentation is one of the important means to improve the generalization ability of the model. CopyPaste is a novel data augmentation technique, which has been proven effective in object detection and instance segmentation tasks. With CopyPaste, text instances can be synthesized to balance the ratio between positive and negative samples in training images. In contrast, this is impossible with traditional image rotation, random flipping, and random cropping.\n",
    "\n",
    "The main steps of CopyPaste include: ① randomly select two training images, ② randomly scale jitter and zoom, ③ randomly flip horizontally, ④ randomly select a target subset in one image, and ⑤ paste at a random position in another image. In this way, the sample richness can be greatly improved, and the robustness of the model to the environment can be increased.\n",
    "\n",
    "2. Text recognition enhancement strategies\n",
    "- PP-LCNet lightweight backbone network\n",
    "  \n",
    "The PP-LCNet with balanced speed and accuracy is adopted to effectively improve the accuracy and reduce the network inference time.\n",
    "\n",
    "- UDML Knowledge Distillation Strategy\n",
    "   <div align=\"center\">\n",
    "   <img src=\"https://pic1.zhimg.com/80/v2-642d94e092c7d5f90bedbd7c7511636c_720w.webp\"  width = \"60%\"  />\n",
    "   </div>\n",
    "On the basis of standard DML knowledge distillation, a supervision mechanism for Feature Map is introduced, Feature Loss is added, the number of iterations is increased, and an additional FC network is added to the Head part, which finally speeds up the distillation and improves the effect.\n",
    "\n",
    "- Enhanced CTC loss\n",
    "   <div align=\"center\">\n",
    "   <img src=\"https://pic3.zhimg.com/80/v2-864d255454b5196e0a2a916d81ff92c6_720w.webp\"  width = \"40%\"  />\n",
    "   </div>\n",
    "   In Chinese text recognition tasks, the problem of misrecognition of the number of similar characters is often encountered. Here, we draw on Metric Learning and introduce Center Loss to further increase the distance between classes to enhance the model's ability to distinguish similar characters. The core idea is shown in the formula above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Attention\n",
    "\n",
    "General data are used in the training process of PP-OCR series models. If the performance is not satisfactory in the actual scene, a small amount of data can be marked for finetune."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Related papers and citations\n",
    "```\n",
    "@article{du2021pp,\n",
    "  title={PP-OCRv2: bag of tricks for ultra lightweight OCR system},\n",
    "  author={Du, Yuning and Li, Chenxia and Guo, Ruoyu and Cui, Cheng and Liu, Weiwei and Zhou, Jun and Lu, Bin and Yang, Yehua and Liu, Qiwen and Hu, Xiaoguang and others},\n",
    "  journal={arXiv preprint arXiv:2109.03144},\n",
    "  year={2021}\n",
    "}\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.13 ('py38')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}