From d58c70223e57a004d55d54cc10712905140f3b4b Mon Sep 17 00:00:00 2001 From: an1018 <614803115@qq.com> Date: Fri, 14 Oct 2022 18:45:39 +0800 Subject: [PATCH] add_pdf2docx_api --- ppstructure/recovery/README.md | 9 ++++++++- ppstructure/recovery/README_ch.md | 9 ++++++++- ppstructure/recovery/requirements.txt | 3 +-- 3 files changed, 17 insertions(+), 4 deletions(-) diff --git a/ppstructure/recovery/README.md b/ppstructure/recovery/README.md index 41fb3e45..209c995f 100644 --- a/ppstructure/recovery/README.md +++ b/ppstructure/recovery/README.md @@ -86,7 +86,7 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR - **(2) Install recovery `requirements`** -The layout restoration is exported as docx files, so python-docx API need to be installed, and PyMuPDF api([requires Python >= 3.7](https://pypi.org/project/PyMuPDF/)) need to be installed to process the input files in pdf format. And if using pdf parse method, we need to install pdf2docx api. +The layout restoration is exported as docx files, so python-docx API need to be installed, and PyMuPDF api([requires Python >= 3.7](https://pypi.org/project/PyMuPDF/)) need to be installed to process the input files in pdf format. Install all the libraries by running the following command: @@ -94,6 +94,13 @@ Install all the libraries by running the following command: python3 -m pip install -r ppstructure/recovery/requirements.txt ```` + And if using pdf parse method, we need to install pdf2docx api. + +```bash +wget https://paddleocr.bj.bcebos.com/whl/pdf2docx-0.0.0-py3-none-any.whl +pip3 install pdf2docx-0.0.0-py3-none-any.whl +``` + ## 3. Quick Start using PDF parse diff --git a/ppstructure/recovery/README_ch.md b/ppstructure/recovery/README_ch.md index eaa5260b..5ef823d4 100644 --- a/ppstructure/recovery/README_ch.md +++ b/ppstructure/recovery/README_ch.md @@ -82,7 +82,7 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR - **(2)安装recovery的`requirements`** -版面恢复导出为docx文件,所以需要安装Python处理word文档的python-docx API,同时处理pdf格式的输入文件,需要安装PyMuPDF API([要求Python >= 3.7](https://pypi.org/project/PyMuPDF/))。使用pdf2docx库解析的方式恢复文档需要安装pdf2docx等。 +版面恢复导出为docx文件,所以需要安装Python处理word文档的python-docx API,同时处理pdf格式的输入文件,需要安装PyMuPDF API([要求Python >= 3.7](https://pypi.org/project/PyMuPDF/))。 通过如下命令安装全部库: @@ -90,6 +90,13 @@ git clone https://gitee.com/paddlepaddle/PaddleOCR python3 -m pip install -r ppstructure/recovery/requirements.txt ``` +使用pdf2docx库解析的方式恢复文档需要安装优化的pdf2docx。 + +```bash +wget https://paddleocr.bj.bcebos.com/whl/pdf2docx-0.0.0-py3-none-any.whl +pip3 install pdf2docx-0.0.0-py3-none-any.whl +``` + ## 3.使用 PDF解析进行版面恢复 diff --git a/ppstructure/recovery/requirements.txt b/ppstructure/recovery/requirements.txt index d67e0a95..4e4239a1 100644 --- a/ppstructure/recovery/requirements.txt +++ b/ppstructure/recovery/requirements.txt @@ -2,5 +2,4 @@ python-docx PyMuPDF==1.19.0 beautifulsoup4 fonttools>=4.24.0 -fire>=0.3.0 -pdf2docx==0.0.0 \ No newline at end of file +fire>=0.3.0 \ No newline at end of file -- GitLab