multi_languages_en.md 7.5 KB
Newer Older
T
tink2123 已提交
1 2 3 4 5 6 7
# Multi-language model

**Recent Update**

-2021.4.9 supports the detection and recognition of 80 languages
-2021.4.9 supports **lightweight high-precision** English model detection and recognition

M
MissPenguin 已提交
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
- [1 Installation](#Install)
    - [1.1 paddle installation](#paddleinstallation)
    - [1.2 paddleocr package installation](#paddleocr_package_install)

- [2 Quick Use](#Quick_Use)
    - [2.1 Command line operation](#Command_line_operation)
        - [2.1.1 Prediction of the whole image](#bash_detection+recognition)
        - [2.1.2 Recognition](#bash_Recognition)
        - [2.1.3 Detection](#bash_detection)
    - [2.2 python script running](#python_Script_running)
        - [2.2.1 Whole image prediction](#python_detection+recognition)
        - [2.2.2 Recognition](#python_Recognition)
        - [2.2.3 Detection](#python_detection)
- [3 Custom Training](#Custom_Training)
- [4 Supported languages and abbreviations](#language_abbreviations)
T
tink2123 已提交
23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

<a name="Install"></a>
## 1 Installation

<a name="paddle_install"></a>
### 1.1 paddle installation
```
# cpu
pip install paddlepaddle

# gpu
pip instll paddlepaddle-gpu
```

<a name="paddleocr_package_install"></a>
### 1.2 paddleocr package installation


pip install
```
pip install "paddleocr>=2.0.4" # 2.0.4 version is recommended
```
Build and install locally
```
python3 setup.py bdist_wheel
pip3 install dist/paddleocr-x.x.x-py3-none-any.whl # x.x.x is the version number of paddleocr
```

<a name="Quick_use"></a>
## 2 Quick use

<a name="Command_line_operation"></a>
### 2.1 Command line operation

View help information

```
paddleocr -h
```

* Whole image prediction (detection + recognition)

T
tink2123 已提交
65 66
Paddleocr currently supports 80 languages, which can be switched by modifying the --lang parameter.
The specific supported [language] (#language_abbreviations) can be viewed in the table.
T
tink2123 已提交
67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82

``` bash

paddleocr --image_dir doc/imgs/japan_2.jpg --lang=japan
```
![](https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.0/doc/imgs/japan_2.jpg)

The result is a list, each item contains a text box, text and recognition confidence
```text
[[[671.0, 60.0], [847.0, 63.0], [847.0, 104.0], [671.0, 102.0]], ('もちもち', 0.9993342)]
[[[394.0, 82.0], [536.0, 77.0], [538.0, 127.0], [396.0, 132.0]], ('自然の', 0.9919842)]
[[[880.0, 89.0], [1014.0, 93.0], [1013.0, 127.0], [879.0, 124.0]], ('とろっと', 0.9976762)]
[[[1067.0, 101.0], [1294.0, 101.0], [1294.0, 138.0], [1067.0, 138.0]], ('后味のよい', 0.9988712)]
......
```

T
tink2123 已提交
83
* Recognition
T
tink2123 已提交
84 85 86 87 88 89 90 91 92 93 94 95 96

```bash
paddleocr --image_dir doc/imgs_words/japan/1.jpg --det false --lang=japan
```

![](https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.0/doc/imgs_words/japan/1.jpg)

The result is a tuple, which returns the recognition result and recognition confidence

```text
('したがって', 0.99965394)
```

T
tink2123 已提交
97
* Detection
T
tink2123 已提交
98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144

```
paddleocr --image_dir PaddleOCR/doc/imgs/11.jpg --rec false
```

The result is a list, each item contains only text boxes

```
[[26.0, 457.0], [137.0, 457.0], [137.0, 477.0], [26.0, 477.0]]
[[25.0, 425.0], [372.0, 425.0], [372.0, 448.0], [25.0, 448.0]]
[[128.0, 397.0], [273.0, 397.0], [273.0, 414.0], [128.0, 414.0]]
......
```

<a name="python_script_running"></a>
### 2.2 python script running

ppocr also supports running in python scripts for easy embedding in your own code:

* Whole image prediction (detection + recognition)

```
from paddleocr import PaddleOCR, draw_ocr

# Also switch the language by modifying the lang parameter
ocr = PaddleOCR(lang="korean") # The model file will be downloaded automatically when executed for the first time
img_path ='doc/imgs/korean_1.jpg'
result = ocr.ocr(img_path)
# Print detection frame and recognition result
for line in result:
    print(line)

# Visualization
from PIL import Image
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='/path/to/PaddleOCR/doc/korean.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
```

Visualization of results:
![](https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.0/doc/imgs_results/korean.jpg)


T
tink2123 已提交
145
* Recognition
T
tink2123 已提交
146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163

```
from paddleocr import PaddleOCR
ocr = PaddleOCR(lang="german")
img_path ='PaddleOCR/doc/imgs_words/german/1.jpg'
result = ocr.ocr(img_path, det=False, cls=True)
for line in result:
    print(line)
```

![](https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.0/doc/imgs_words/german/1.jpg)

The result is a tuple, which only contains the recognition result and recognition confidence

```
('leider auch jetzt', 0.97538936)
```

T
tink2123 已提交
164
* Detection
T
tink2123 已提交
165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200

```python
from paddleocr import PaddleOCR, draw_ocr
ocr = PaddleOCR() # need to run only once to download and load model into memory
img_path ='PaddleOCR/doc/imgs_en/img_12.jpg'
result = ocr.ocr(img_path, rec=False)
for line in result:
    print(line)

# show result
from PIL import Image

image = Image.open(img_path).convert('RGB')
im_show = draw_ocr(image, result, txts=None, scores=None, font_path='/path/to/PaddleOCR/doc/fonts/simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')
```
The result is a list, each item contains only text boxes
```bash
[[26.0, 457.0], [137.0, 457.0], [137.0, 477.0], [26.0, 477.0]]
[[25.0, 425.0], [372.0, 425.0], [372.0, 448.0], [25.0, 448.0]]
[[128.0, 397.0], [273.0, 397.0], [273.0, 414.0], [128.0, 414.0]]
......
```

Visualization of results:
![](https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.0/doc/imgs_results/whl/12_det.jpg)

ppocr also supports direction classification. For more usage methods, please refer to: [whl package instructions](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.0/doc/doc_ch/whl.md).

<a name="Custom_training"></a>
## 3 Custom training

ppocr supports using your own data for custom training or finetune, where the recognition model can refer to [French configuration file](../../configs/rec/multi_language/rec_french_lite_train.yml)
Modify the training data path, dictionary and other parameters.

T
tink2123 已提交
201 202
For specific data preparation and training process, please refer to: [Text Detection](../doc_en/detection_en.md), [Text Recognition](../doc_en/recognition_en.md), more functions such as predictive deployment,
For functions such as data annotation, you can read the complete [Document Tutorial](../../README.md).
T
tink2123 已提交
203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285

<a name="language_abbreviation"></a>
## 4 Support languages and abbreviations

| Language  | Abbreviation |
| ---  | --- |
|chinese and english|ch|
|english|en|
|french|fr|
|german|german|
|japan|japan|
|korean|korean|
|chinese traditional |ch_tra|
| Italian |it|
|Spanish |es|
| Portuguese|pt|
|Russia|ru|
|Arabic|ar|
|Hindi|hi|
|Uyghur|ug|
|Persian|fa|
|Urdu|ur|
| Serbian(latin) |rs_latin|
|Occitan |oc|
|Marathi|mr|
|Nepali|ne|
|Serbian(cyrillic)|rs_cyrillic|
|Bulgarian |bg|
|Ukranian|uk|
|Belarusian|be|
|Telugu |te|
|Kannada |kn|
|Tamil |ta|
|Afrikaans |af|
|Azerbaijani    |az|
|Bosnian|bs|
|Czech|cs|
|Welsh |cy|
|Danish|da|
|Estonian |et|
|Irish |ga|
|Croatian |hr|
|Hungarian |hu|
|Indonesian|id|
|Icelandic|is|
|Kurdish|ku|
|Lithuanian |lt|
 |Latvian |lv|
|Maori|mi|
|Malay|ms|
|Maltese |mt|
|Dutch |nl|
|Norwegian |no|
|Polish |pl|
|Romanian |ro|
|Slovak |sk|
|Slovenian |sl|
|Albanian |sq|
|Swedish |sv|
|Swahili |sw|
|Tagalog |tl|
|Turkish |tr|
|Uzbek |uz|
|Vietnamese |vi|
|Mongolian |mn|
|Abaza |abq|
|Adyghe |ady|
|Kabardian |kbd|
|Avar |ava|
|Dargwa |dar|
|Ingush |inh|
|Lak |lbe|
|Lezghian |lez|
|Tabassaran |tab|
|Bihari |bh|
|Maithili |mai|
|Angika |ang|
|Bhojpuri |bho|
|Magahi |mah|
|Nagpur |sck|
|Newari |new|
|Goan Konkani|gom|
|Saudi Arabia|sa|