diff --git a/ppstructure/recovery/README.md b/ppstructure/recovery/README.md
index 41fb3e45b83329cd7bfa7021b19ffcee33dc947b..209c995f8efb097fd744094ea0e102db3212387e 100644
--- a/ppstructure/recovery/README.md
+++ b/ppstructure/recovery/README.md
@@ -86,7 +86,7 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR
 
 - **(2) Install recovery `requirements`**
 
-The layout restoration is exported as docx files, so python-docx API need to be installed, and PyMuPDF api([requires Python >= 3.7](https://pypi.org/project/PyMuPDF/)) need to be installed to process the input files in pdf format. And if using pdf parse method, we need to install pdf2docx api.
+The layout restoration is exported as docx files, so python-docx API need to be installed, and PyMuPDF api([requires Python >= 3.7](https://pypi.org/project/PyMuPDF/)) need to be installed to process the input files in pdf format.
 
 Install all the libraries by running the following command:
 
@@ -94,6 +94,13 @@ Install all the libraries by running the following command:
 python3 -m pip install -r ppstructure/recovery/requirements.txt
 ````
 
+ And if using pdf parse method, we need to install pdf2docx api.
+
+```bash
+wget https://paddleocr.bj.bcebos.com/whl/pdf2docx-0.0.0-py3-none-any.whl
+pip3 install pdf2docx-0.0.0-py3-none-any.whl
+```
+
 
 ## 3. Quick Start using PDF parse
diff --git a/ppstructure/recovery/README_ch.md b/ppstructure/recovery/README_ch.md
index eaa5260b57db81bc483ff4b32ec4334340002335..5ef823d43488f0c765f0757aa6745ae4cc49e52c 100644
--- a/ppstructure/recovery/README_ch.md
+++ b/ppstructure/recovery/README_ch.md
@@ -82,7 +82,7 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR
 
 - **(2)安装recovery的`requirements`**
 
-版面恢复导出为docx文件,所以需要安装Python处理word文档的python-docx API,同时处理pdf格式的输入文件,需要安装PyMuPDF API([要求Python >= 3.7](https://pypi.org/project/PyMuPDF/))。使用pdf2docx库解析的方式恢复文档需要安装pdf2docx等。
+版面恢复导出为docx文件,所以需要安装Python处理word文档的python-docx API,同时处理pdf格式的输入文件,需要安装PyMuPDF API([要求Python >= 3.7](https://pypi.org/project/PyMuPDF/))。
 
 通过如下命令安装全部库:
 
@@ -90,6 +90,13 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR
 python3 -m pip install -r ppstructure/recovery/requirements.txt
 ```
 
+使用pdf2docx库解析的方式恢复文档需要安装优化的pdf2docx。
+
+```bash
+wget https://paddleocr.bj.bcebos.com/whl/pdf2docx-0.0.0-py3-none-any.whl
+pip3 install pdf2docx-0.0.0-py3-none-any.whl
+```
+
 
 ## 3.使用 PDF解析进行版面恢复
 
diff --git a/ppstructure/recovery/requirements.txt b/ppstructure/recovery/requirements.txt
index d67e0a95aa929e767c0289d54056e8677bf83607..4e4239a14af9b6f95aca1171f25d50da5eac37cf 100644
--- a/ppstructure/recovery/requirements.txt
+++ b/ppstructure/recovery/requirements.txt
@@ -2,5 +2,4 @@ python-docx
 PyMuPDF==1.19.0
 beautifulsoup4
 fonttools>=4.24.0
-fire>=0.3.0
-pdf2docx==0.0.0
\ No newline at end of file
+fire>=0.3.0
\ No newline at end of file
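
Review note: with this patch, `pdf2docx` is no longer installed by `requirements.txt` and has to be installed separately from the wheel. A quick way to confirm which pieces are present in an environment might look like the sketch below. It is not part of the patch; the helper name `missing_modules` is hypothetical, and the import names (`docx` for python-docx, `fitz` for PyMuPDF, `bs4` for beautifulsoup4, `fontTools` for fonttools) are the usual ones for these packages rather than anything stated in the diff.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed for the packages listed in requirements.txt.
    required = ["docx", "fitz", "bs4", "fontTools", "fire"]
    # Installed separately from the wheel, only needed for the PDF parse method.
    optional = ["pdf2docx"]

    for label, names in (("required", required), ("optional", optional)):
        missing = missing_modules(names)
        status = "all present" if not missing else "missing: " + ", ".join(missing)
        print(f"{label}: {status}")
```

If `pdf2docx` shows up as missing, only the PDF parse path should be affected; the other dependencies are still covered by `requirements.txt`.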