introduction_cn.ipynb 12.5 KB
Notebook
Newer Older
Z
zhoujun 已提交
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. PP-OCRv3模型简介\n",
    "\n",
    "PP-OCRv3在PP-OCRv2的基础上进一步升级。整体的框架图保持了与PP-OCRv2相同的pipeline,针对检测模型和识别模型进行了优化。其中,检测模块仍基于DB算法优化,而识别模块不再采用CRNN,换成了IJCAI 2022最新收录的文本识别算法SVTR,并对其进行产业适配。PP-OCRv3系统框图如下所示(粉色框中为PP-OCRv3新增策略):\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocrv3_framework.png\"  width = \"80%\"  />\n",
    "</div>\n",
    "\n",
    "\n",
    "从算法改进思路上看,分别针对检测和识别模型,进行了共9个方面的改进:\n",
    "\n",
    "- 检测模块:\n",
    "    - LK-PAN:大感受野的PAN结构;\n",
    "    - DML:教师模型互学习策略;\n",
    "    - RSE-FPN:残差注意力机制的FPN结构;\n",
    "\n",
    "\n",
    "- 识别模块:\n",
    "    - SVTR_LCNet:轻量级文本识别网络;\n",
    "    - GTC:Attention指导CTC训练策略;\n",
    "    - TextConAug:挖掘文字上下文信息的数据增广策略;\n",
    "    - TextRotNet:自监督的预训练模型;\n",
    "    - UDML:联合互学习策略;\n",
    "    - UIM:无标注数据挖掘方案。\n",
    "\n",
    "从效果上看,速度可比情况下,多种场景精度均有大幅提升:\n",
    "- 中文场景,相对于PP-OCRv2中文模型提升超5%;\n",
    "- 英文数字场景,相比于PP-OCRv2英文模型提升11%;\n",
    "- 多语言场景,优化80+语种识别效果,平均准确率提升超5%。\n",
    "\n",
    "\n",
    "更详细的优化细节可参考技术报告:https://arxiv.org/abs/2206.03001 。\n",
    "\n",
    "更多关于PaddleOCR的内容,可以点击 https://github.com/PaddlePaddle/PaddleOCR 进行了解。\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. 模型效果\n",
    "\n",
    "PP-OCRv3的效果如下:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200261622-1b928d93-93ab-4575-8c60-214bcc03eda1.png\"  width = \"80%\"  />\n",
    "</div>\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200261711-9f18bb04-3736-4f51-892c-de801db9ab9e.png\"  width = \"80%\"  />\n",
    "</div>\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. 模型如何使用\n",
    "\n",
文幕地方's avatar
文幕地方 已提交
70
    "### 3.1 模型推理\n",
Z
zhoujun 已提交
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
    "* 安装PaddleOCR whl包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
文幕地方's avatar
文幕地方 已提交
87
    "! pip install paddleocr --user"
Z
zhoujun 已提交
88 89 90 91 92 93 94 95 96 97 98
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* 快速体验"
   ]
  },
  {
   "cell_type": "code",
文幕地方's avatar
文幕地方 已提交
99
   "execution_count": 1,
Z
zhoujun 已提交
100 101 102 103 104 105 106
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 命令行使用\n",
107 108
    "! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/dygraph/doc/imgs/11.jpg\n",
    "! paddleocr --image_dir 11.jpg --use_angle_cls true"
Z
zhoujun 已提交
109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "运行完成后,会在终端输出如下结果:\n",
    "```log\n",
    "[[[28.0, 37.0], [302.0, 39.0], [302.0, 72.0], [27.0, 70.0]], ('纯臻营养护发素', 0.96588134765625)]\n",
    "[[[26.0, 81.0], [172.0, 83.0], [172.0, 104.0], [25.0, 101.0]], ('产品信息/参数', 0.9113278985023499)]\n",
    "[[[28.0, 115.0], [330.0, 115.0], [330.0, 132.0], [28.0, 132.0]], ('(45元/每公斤,100公斤起订)', 0.8843421936035156)]\n",
    "......\n",
    "```\n",
    "\n",
    "\n"
   ]
  },
126 127 128 129 130 131 132 133 134 135 136 137 138
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 模型训练\n",
    "PP-OCR系统由文本检测模型、方向分类器和文本识别模型构成,三个模型训练教程可参考如下文档:\n",
    "1. 文本检测模型:[文本检测训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
    "1. 方向分类器: [方向分类器训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/angle_class.md)\n",
    "1. 文本识别模型:[文本识别训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/recognition.md)\n",
    "\n",
    "模型训练完成后,可以通过指定模型路径的方式串联使用\n",
    "命令参考如下:\n",
    "```python\n",
文幕地方's avatar
文幕地方 已提交
139
    "paddleocr --image_dir 11.jpg --use_angle_cls true --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
140 141 142
    "```"
   ]
  },
Z
zhoujun 已提交
143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. 原理\n",
    "\n",
    "优化思路具体如下\n",
    "\n",
    "1. 检测模型优化\n",
    "- LK-PAN:大感受野的PAN结构。\n",
    "  \n",
    "LK-PAN (Large Kernel PAN) 是一个具有更大感受野的轻量级PAN结构,核心是将PAN结构的path augmentation中卷积核从3*3改为9*9。通过增大卷积核,提升特征图每个位置覆盖的感受野,更容易检测大字体的文字以及极端长宽比的文字。使用LK-PAN结构,可以将教师模型的hmean从83.2%提升到85.0%。\n",
    "     <div align=\"center\">\n",
    "   <img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/LKPAN.png\"  width = \"60%\"  />\n",
    "   </div>\n",
    "\n",
    "- DML:教师模型互学习策略\n",
    "\n",
    "[DML](https://arxiv.org/abs/1706.00384) (Deep Mutual Learning)互学习蒸馏方法,如下图所示,通过两个结构相同的模型互相学习,可以有效提升文本检测模型的精度。教师模型采用DML策略,hmean从85%提升到86%。将PP-OCRv2中CML的教师模型更新为上述更高精度的教师模型,学生模型的hmean可以进一步从83.2%提升到84.3%。\n",
    "   <div align=\"center\">\n",
    "   <img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/teacher_dml.png\"  width = \"60%\"  />\n",
    "   </div>\n",
    "\n",
    "- RSE-FPN:残差注意力机制的FPN结构\n",
    "\n",
    "RSE-FPN(Residual Squeeze-and-Excitation FPN)如下图所示,引入残差结构和通道注意力结构,将FPN中的卷积层更换为通道注意力结构的RSEConv层,进一步提升特征图的表征能力。考虑到PP-OCRv2的检测模型中FPN通道数非常小,仅为96,如果直接用SEblock代替FPN中卷积会导致某些通道的特征被抑制,精度会下降。RSEConv引入残差结构会缓解上述问题,提升文本检测效果。进一步将PP-OCRv2中CML的学生模型的FPN结构更新为RSE-FPN,学生模型的hmean可以进一步从84.3%提升到85.4%。\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/RSEFPN.png\"  width = \"60%\"  />\n",
    "</div>\n",
    "\n",
    "1. 识别模型优化\n",
    "- SVTR_LCNet:轻量级文本识别网络\n",
    "\n",
    "SVTR_LCNet是针对文本识别任务,将基于Transformer的SVTR网络和轻量级CNN网络PP-LCNet 融合的一种轻量级文本识别网络。使用该网络,预测速度优于PP-OCRv2的识别模型20%,但是由于没有采用蒸馏策略,该识别模型效果略差。此外,进一步将输入图片规范化高度从32提升到48,预测速度稍微变慢,但是模型效果大幅提升,识别准确率达到73.98%(+2.08%),接近PP-OCRv2采用蒸馏策略的识别模型效果。\n",
    "\n",
    "- GTC:Attention指导CTC训练策略\n",
    "  \n",
    "[GTC](https://arxiv.org/pdf/2002.01276.pdf)(Guided Training of CTC),利用Attention模块CTC训练,融合多种文本特征的表达,是一种有效的提升文本识别的策略。使用该策略,预测时完全去除 Attention 模块,在推理阶段不增加任何耗时,识别模型的准确率进一步提升到75.8%(+1.82%)。训练流程如下所示:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200265540-1bbb730f-35d4-4d72-8e00-70856bb932ee.png\"  width = \"60%\"  />\n",
    "</div>\n",
    "\n",
    "- TextConAug:挖掘文字上下文信息的数据增广策略\n",
    "\n",
    "TextConAug是一种挖掘文字上下文信息的数据增广策略,主要思想来源于论文[ConCLR](https://www.cse.cuhk.edu.hk/~byu/papers/C139-AAAI2022-ConCLR.pdf),作者提出ConAug数据增广,在一个batch内对2张不同的图像进行联结,组成新的图像并进行自监督对比学习。PP-OCRv3将此方法应用到有监督的学习任务中,设计了TextConAug数据增强方法,可以丰富训练数据上下文信息,提升训练数据多样性。使用该策略,识别模型的准确率进一步提升到76.3%(+0.5%)。TextConAug示意图如下所示:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://user-images.githubusercontent.com/12406017/200265540-1bbb730f-35d4-4d72-8e00-70856bb932ee.png\"  width = \"60%\"  />\n",
    "</div>\n",
    "\n",
    "- TextRotNet:自监督的预训练模型\n",
    "\n",
    "TextRotNet是使用大量无标注的文本行数据,通过自监督方式训练的预训练模型,参考于论文[STR-Fewer-Labels](https://github.com/ku21fan/STR-Fewer-Labels)。该模型可以初始化SVTR_LCNet的初始权重,从而帮助文本识别模型收敛到更佳位置。使用该策略,识别模型的准确率进一步提升到76.9%(+0.6%)。TextRotNet训练流程如下图所示:\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/SSL.png\"  width = \"60%\"  />\n",
    "</div>\n",
    "\n",
    "- UDML:联合互学习策略\n",
    "\n",
    "UDML(Unified-Deep Mutual Learning)联合互学习是PP-OCRv2中就采用的对于文本识别非常有效的提升模型效果的策略。在PP-OCRv3中,针对两个不同的SVTR_LCNet和Attention结构,对他们之间的PP-LCNet的特征图、SVTR模块的输出和Attention模块的输出同时进行监督训练。使用该策略,识别模型的准确率进一步提升到78.4%(+1.5%)。\n",
    "\n",
    "- UDML:联合互学习策略\n",
    "\n",
    "UIM(Unlabeled Images Mining)是一种非常简单的无标注数据挖掘方案。核心思想是利用高精度的文本识别大模型对无标注数据进行预测,获取伪标签,并且选择预测置信度高的样本作为训练数据,用于训练小模型。使用该策略,识别模型的准确率进一步提升到79.4%(+1%)。\n",
    "\n",
    "<div align=\"center\">\n",
    "<img src=\"https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/UIM.png\"  width = \"60%\"  />\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. 注意事项\n",
    "\n",
    "PP-OCR系列模型训练过程中均使用通用数据,如在实际场景中表现不满意,可标注少量数据进行finetune。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. 相关论文以及引用信息\n",
    "```\n",
文幕地方's avatar
文幕地方 已提交
231 232 233 234 235
    "@article{li2022pp,\n",
    "  title={PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System},\n",
    "  author={Li, Chenxia and Liu, Weiwei and Guo, Ruoyu and Yin, Xiaoting and Jiang, Kaitao and Du, Yongkun and Du, Yuning and Zhu, Lingfeng and Lai, Baohua and Hu, Xiaoguang and others},\n",
    "  journal={arXiv preprint arXiv:2206.03001},\n",
    "  year={2022}\n",
Z
zhoujun 已提交
236 237 238 239 240 241 242
    "}\n",
    "```\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
文幕地方's avatar
文幕地方 已提交
243
   "display_name": "Python 3.8.13 ('py38')",
Z
zhoujun 已提交
244 245 246 247 248 249 250 251 252 253 254 255 256
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
文幕地方's avatar
文幕地方 已提交
257 258 259 260 261 262
   "version": "3.8.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
   }
Z
zhoujun 已提交
263 264 265 266 267
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}