Unverified commit 03d88168, authored by user1018, committed by GitHub

update code_doc (#7667)

* update code_doc

* update code_doc

Parent 4cec888b
ppstructure/docs/layout/layout.png (178.6 KB → 1.2 MB)
@@ -23,7 +23,7 @@ English | [简体中文](README_ch.md)
## 1. Introduction
-Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight PP-PicoDet model of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection).
+Layout analysis divides document images into regions and locates key areas such as text, titles, tables, and figures. The layout analysis algorithm is based on the lightweight PP-PicoDet model of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) and includes English, Chinese, and table layout analysis models. The English model detects text, title, table, figure, and list regions; the Chinese model detects text, title, figure, figure caption, table, table caption, header, footer, reference, and equation regions; the table layout analysis model detects table regions.
<div align="center">
    <img src="../docs/layout/layout.png" width="800">
@@ -152,7 +152,7 @@ We provide CDLA (Chinese layout analysis), TableBank (table layout analysis), etc. d
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACKA) and table recognition (TRACKB). Image types include historical datasets (beginning with cTDaR_t0, such as cTDaR_t00872.jpg) and modern datasets (beginning with cTDaR_t1, e.g. cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | A dataset built by manually annotating figures and pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature. |
| [TableBank](https://github.com/doc-analysis/TableBank) | A large dataset for table detection and recognition, covering Word and LaTeX document formats. |
-| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
+| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale dataset (500K document pages) constructed with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |
@@ -175,7 +175,7 @@ If the test image is Chinese, the pre-trained model of the Chinese CDLA dataset can
### 5.1. Train
Train: start training with the PaddleDetection [layout analysis configuration files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)
* Modify the configuration file
......
@@ -22,7 +22,7 @@
## 1. Introduction
-Layout analysis divides document images into regions and locates key areas such as text, titles, tables, and figures. The layout analysis algorithm is developed on the lightweight PP-PicoDet model of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection).
+Layout analysis divides document images into regions and locates key areas such as text, titles, tables, and figures. The layout analysis algorithm is developed on the lightweight PP-PicoDet model of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) and includes three kinds of models: English, Chinese, and table layout analysis. The English model detects five region types (Text, Title, Table, Figure, List); the Chinese model detects ten region types (Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation); the table layout analysis model detects Table regions. The layout analysis results are shown below:
<div align="center">
    <img src="../docs/layout/layout.png" width="800">
@@ -152,7 +152,7 @@ The json file contains the annotations of all images, stored as nested dictionaries,
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACKA) and table recognition (TRACKB). Image types include a historical dataset (prefixed with cTDaR_t0, e.g. cTDaR_t00872.jpg) and a modern dataset (prefixed with cTDaR_t1, e.g. cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | A dataset built by manually annotating figures and pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature |
-| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
+| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [TableBank](https://github.com/doc-analysis/TableBank) | A large dataset for table detection and recognition, covering Word and LaTeX document formats |
| [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale dataset (500K document pages) built with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |
@@ -161,7 +161,7 @@ The json file contains the annotations of all images, stored as nested dictionaries,
Training, evaluation, and prediction scripts are provided. This section uses the PubLayNet pre-trained model as an example.
-If you do not want to train and would rather go straight to model evaluation, prediction, dynamic-to-static export, and inference, you can download the provided pre-trained model (PubLayNet dataset) and skip this section
+If you do not want to train and would rather go straight to model evaluation, prediction, dynamic-to-static export, and inference, you can download the provided pre-trained model (PubLayNet dataset) and skip Sections 5.1 and 5.2
```
mkdir pretrained_model
@@ -176,7 +176,7 @@ wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_
### 5.1. Start training
Start training: use the PaddleDetection [layout analysis configuration files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis) to launch training
* Modify the configuration file
......
@@ -254,8 +254,7 @@ def main(args):
        if args.recovery and all_res != []:
            try:
-                convert_info_docx(img, all_res, save_folder, img_name,
-                                  args.save_pdf)
+                convert_info_docx(img, all_res, save_folder, img_name)
            except Exception as ex:
                logger.error("error in layout recovery image:{}, err msg: {}".
                             format(image_file, ex))
......
@@ -82,8 +82,11 @@ Through layout analysis, we divided the image/PDF documents into regions, locate
We can restore the test image using the layout information, OCR detection and recognition results, table information, and saved images.
-The whl package is also provided for quick use, see [quickstart](../docs/quickstart_en.md) for details.
+The whl package is also provided for quick use. Run the code below; for more information, see [quickstart](../docs/quickstart_en.md).
```bash
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
```
<a name="3.1"></a>
### 3.1 Download models
......
@@ -83,7 +83,16 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt
We can restore the test image using the layout information, OCR detection and recognition results, table information, and saved images.
-The following code implements layout recovery; a whl package is also provided for quick use, see [quickstart](../docs/quickstart.md)
+The following code implements layout recovery; a whl package is also provided for quick use. Run the code below; for more information, see [quickstart](../docs/quickstart.md)
```bash
# Chinese test image
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true
# English test image
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
# PDF test file
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
```
<a name="3.1"></a>
......
@@ -28,7 +28,7 @@ from ppocr.utils.logging import get_logger
logger = get_logger()

-def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
+def convert_info_docx(img, res, save_folder, img_name):
    doc = Document()
    doc.styles['Normal'].font.name = 'Times New Roman'
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
@@ -60,14 +60,9 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
        elif region['type'].lower() == 'title':
            doc.add_heading(region['res'][0]['text'])
        elif region['type'].lower() == 'table':
-            paragraph = doc.add_paragraph()
-            new_parser = HtmlToDocx()
-            new_parser.table_style = 'TableGrid'
-            table = new_parser.handle_table(html=region['res']['html'])
-            new_table = deepcopy(table)
-            new_table.alignment = WD_TABLE_ALIGNMENT.CENTER
-            paragraph.add_run().element.addnext(new_table._tbl)
+            parser = HtmlToDocx()
+            parser.table_style = 'TableGrid'
+            parser.handle_table(region['res']['html'], doc)
        else:
            paragraph = doc.add_paragraph()
            paragraph_format = paragraph.paragraph_format
@@ -82,13 +77,6 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
    doc.save(docx_path)
    logger.info('docx save to {}'.format(docx_path))
-    # save to pdf
-    if save_pdf:
-        pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name))
-        from docx2pdf import convert
-        convert(docx_path, pdf_path)
-        logger.info('pdf save to {}'.format(pdf_path))
def sorted_layout_boxes(res, w):
    """
......
python-docx
-docx2pdf
PyMuPDF
beautifulsoup4
\ No newline at end of file
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -13,62 +12,59 @@
# See the License for the specific language governing permissions and
# limitations under the License.
""" """
This code is refer from:https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py This code is refer from: https://github.com/weizwx/html2docx/blob/master/htmldocx/h2d.py
""" """
-import re, argparse
-import io, os
-import urllib.request
-from urllib.parse import urlparse
-from html.parser import HTMLParser
-import docx, docx.table
+import re
+
+import docx
from docx import Document
-from docx.shared import RGBColor, Pt, Inches
-from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH
-from docx.oxml import OxmlElement
-from docx.oxml.ns import qn
from bs4 import BeautifulSoup
+from html.parser import HTMLParser
-# values in inches
-INDENT = 0.25
-LIST_INDENT = 0.5
-MAX_INDENT = 5.5  # To stop indents going off the page
-
-# Style to use with tables. By default no style is used.
-DEFAULT_TABLE_STYLE = None
-
-# Style to use with paragraphs. By default no style is used.
-DEFAULT_PARAGRAPH_STYLE = None
-
-def get_filename_from_url(url):
-    return os.path.basename(urlparse(url).path)
-
-def is_url(url):
-    """
-    Not to be used for actually validating a url, but in our use case we only
-    care if it's a url or a file path, and they're pretty distinguishable
-    """
-    parts = urlparse(url)
-    return all([parts.scheme, parts.netloc, parts.path])
-
-def fetch_image(url):
-    """
-    Attempts to fetch an image from a url.
-    If successful returns a bytes object, else returns None
-    :return:
-    """
-    try:
-        with urllib.request.urlopen(url) as response:
-            # security flaw?
-            return io.BytesIO(response.read())
-    except urllib.error.URLError:
-        return None
-
-def remove_last_occurence(ls, x):
-    ls.pop(len(ls) - ls[::-1].index(x) - 1)
+def get_table_rows(table_soup):
+    table_row_selectors = [
+        'table > tr', 'table > thead > tr', 'table > tbody > tr',
+        'table > tfoot > tr'
+    ]
+    # If there's a header, body, footer or direct child tr tags, add row dimensions from there
+    return table_soup.select(', '.join(table_row_selectors), recursive=False)
+
+
+def get_table_columns(row):
+    # Get all columns for the specified row tag.
+    return row.find_all(['th', 'td'], recursive=False) if row else []
+
+
+def get_table_dimensions(table_soup):
+    # Get rows for the table
+    rows = get_table_rows(table_soup)
+    # Table is either empty or has non-direct children between table and tr tags
+    # Thus the row dimensions and column dimensions are assumed to be 0
+    cols = get_table_columns(rows[0]) if rows else []
+    # Add colspan calculation column number
+    col_count = 0
+    for col in cols:
+        colspan = col.attrs.get('colspan', 1)
+        col_count += int(colspan)
+
+    return rows, col_count
+
+
+def get_cell_html(soup):
+    # Returns string of td element with opening and closing <td> tags removed
+    # Cannot use find_all as it only finds element tags and does not find text which
+    # is not inside an element
+    return ' '.join([str(i) for i in soup.contents])
+
+
+def delete_paragraph(paragraph):
+    # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
+    p = paragraph._element
+    p.getparent().remove(p)
+    p._p = p._element = None
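The `get_table_dimensions` helper added above derives the column count by summing the `colspan` attributes of the first row rather than counting cell tags. A stdlib-only sketch of the same idea, using `html.parser` instead of BeautifulSoup; `_ColCounter` and `count_columns` are illustrative names, not part of the patch:

```python
from html.parser import HTMLParser


class _ColCounter(HTMLParser):
    """Counts the columns of the first table row, honoring colspan."""

    def __init__(self):
        super().__init__()
        self.in_first_row = False
        self.done = False
        self.col_count = 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == 'tr' and not self.in_first_row:
            self.in_first_row = True
        elif tag in ('td', 'th') and self.in_first_row:
            # A cell with colspan="2" contributes two columns
            self.col_count += int(dict(attrs).get('colspan', 1))

    def handle_endtag(self, tag):
        if tag == 'tr' and self.in_first_row:
            self.done = True
            self.in_first_row = False


def count_columns(html):
    parser = _ColCounter()
    parser.feed(html)
    return parser.col_count


print(count_columns('<table><tr><td colspan="2">a</td><td>b</td></tr></table>'))  # 3
```

As in the patched helper, a merged header cell still counts for all the grid columns it spans, so the docx table is allocated wide enough up front.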
def remove_whitespace(string, leading=False, trailing=False):
    """Remove white space from a string.
@@ -122,11 +118,6 @@ def remove_whitespace(string, leading=False, trailing=False):
    # TODO need some way to get rid of extra spaces in e.g. text <span>  </span> text
    return re.sub(r'\s+', ' ', string)
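The final step of `remove_whitespace` collapses any run of whitespace (spaces, tabs, newlines) into a single space. A minimal check of that regex behavior; `collapse_ws` is an illustrative name for just this last step:

```python
import re


def collapse_ws(s):
    # Same normalization as the final re.sub in remove_whitespace
    return re.sub(r'\s+', ' ', s)


print(collapse_ws('a \n\t b'))  # 'a b'
```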
-def delete_paragraph(paragraph):
-    # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
-    p = paragraph._element
-    p.getparent().remove(p)
-    p._p = p._element = None
font_styles = {
    'b': 'bold',
@@ -145,13 +136,8 @@ font_names = {
    'pre': 'Courier',
}
-styles = {
-    'LIST_BULLET': 'List Bullet',
-    'LIST_NUMBER': 'List Number',
-}
class HtmlToDocx(HTMLParser):
    def __init__(self):
        super().__init__()
        self.options = {
@@ -161,13 +147,11 @@ class HtmlToDocx(HTMLParser):
            'styles': True,
        }
        self.table_row_selectors = [
-            'table > tr',
-            'table > thead > tr',
-            'table > tbody > tr',
+            'table > tr', 'table > thead > tr', 'table > tbody > tr',
            'table > tfoot > tr'
        ]
-        self.table_style = DEFAULT_TABLE_STYLE
-        self.paragraph_style = DEFAULT_PARAGRAPH_STYLE
+        self.table_style = None
+        self.paragraph_style = None
    def set_initial_attrs(self, document=None):
        self.tags = {
@@ -178,7 +162,8 @@ class HtmlToDocx(HTMLParser):
            self.doc = document
        else:
            self.doc = Document()
-        self.bs = self.options['fix-html']  # whether or not to clean with BeautifulSoup
+        self.bs = self.options[
+            'fix-html']  # whether or not to clean with BeautifulSoup
        self.document = self.doc
        self.include_tables = True  #TODO add this option back in?
        self.include_images = self.options['images']
@@ -193,55 +178,52 @@ class HtmlToDocx(HTMLParser):
        self.table_style = other.table_style
        self.paragraph_style = other.paragraph_style
-    def get_cell_html(self, soup):
-        # Returns string of td element with opening and closing <td> tags removed
-        # Cannot use find_all as it only finds element tags and does not find text which
-        # is not inside an element
-        return ' '.join([str(i) for i in soup.contents])
-
-    def add_styles_to_paragraph(self, style):
-        if 'text-align' in style:
-            align = style['text-align']
-            if align == 'center':
-                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
-            elif align == 'right':
-                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
-            elif align == 'justify':
-                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
-        if 'margin-left' in style:
-            margin = style['margin-left']
-            units = re.sub(r'[0-9]+', '', margin)
-            margin = int(float(re.sub(r'[a-z]+', '', margin)))
-            if units == 'px':
-                self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
-            # TODO handle non px units
-
-    def add_styles_to_run(self, style):
-        if 'color' in style:
-            if 'rgb' in style['color']:
-                color = re.sub(r'[a-z()]+', '', style['color'])
-                colors = [int(x) for x in color.split(',')]
-            elif '#' in style['color']:
-                color = style['color'].lstrip('#')
-                colors = tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))
-            else:
-                colors = [0, 0, 0]
-                # TODO map colors to named colors (and extended colors...)
-                # For now set color to black to prevent crashing
-            self.run.font.color.rgb = RGBColor(*colors)
-        if 'background-color' in style:
-            if 'rgb' in style['background-color']:
-                color = re.sub(r'[a-z()]+', '', style['background-color'])
-                colors = [int(x) for x in color.split(',')]
-            elif '#' in style['background-color']:
-                color = style['background-color'].lstrip('#')
-                colors = tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))
-            else:
-                colors = [0, 0, 0]
-                # TODO map colors to named colors (and extended colors...)
-                # For now set color to black to prevent crashing
-            self.run.font.highlight_color = WD_COLOR.GRAY_25  #TODO: map colors
+    def ignore_nested_tables(self, tables_soup):
+        """
+        Returns array containing only the highest level tables
+        Operates on the assumption that bs4 returns child elements immediately after
+        the parent element in `find_all`. If this changes in the future, this method will need to be updated
+        :return:
+        """
+        new_tables = []
+        nest = 0
+        for table in tables_soup:
+            if nest:
+                nest -= 1
+                continue
+            new_tables.append(table)
+            nest = len(table.find_all('table'))
+        return new_tables
+
+    def get_tables(self):
+        if not hasattr(self, 'soup'):
+            self.include_tables = False
+            return
+            # find other way to do it, or require this dependency?
+        self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
+        self.table_no = 0
+
+    def run_process(self, html):
+        if self.bs and BeautifulSoup:
+            self.soup = BeautifulSoup(html, 'html.parser')
+            html = str(self.soup)
+        if self.include_tables:
+            self.get_tables()
+        self.feed(html)
+
+    def add_html_to_cell(self, html, cell):
+        if not isinstance(cell, docx.table._Cell):
+            raise ValueError('Second argument needs to be a %s' %
+                             docx.table._Cell)
+        unwanted_paragraph = cell.paragraphs[0]
+        if unwanted_paragraph.text == "":
+            delete_paragraph(unwanted_paragraph)
+        self.set_initial_attrs(cell)
+        self.run_process(html)
+        # cells must end with a paragraph or will get message about corrupt file
+        # https://stackoverflow.com/a/29287121
+        if not self.doc.paragraphs:
+            self.doc.add_paragraph('')
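The `ignore_nested_tables` logic relies on `find_all` returning each parent table immediately followed by its descendants, so after keeping a table it can simply skip the next `nest` entries. A dependency-free sketch of that skip counter, with `FakeTable` standing in for a bs4 tag (hypothetical, for illustration only):

```python
class FakeTable:
    """Mock of a bs4 <table> tag: only the descendant count matters here."""

    def __init__(self, name, descendants=0):
        self.name = name
        self._descendants = descendants

    def find_all(self, tag):
        return [None] * self._descendants


def ignore_nested(tables):
    # Same algorithm as HtmlToDocx.ignore_nested_tables
    top, nest = [], 0
    for t in tables:
        if nest:
            nest -= 1
            continue
        top.append(t)
        nest = len(t.find_all('table'))
    return top


# The outer table contains one nested table; find_all order is parent-first
outer, inner, other = FakeTable('outer', 1), FakeTable('inner'), FakeTable('other')
print([t.name for t in ignore_nested([outer, inner, other])])  # ['outer', 'other']
```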
    def apply_paragraph_style(self, style=None):
        try:
@@ -250,69 +232,10 @@ class HtmlToDocx(HTMLParser):
            elif self.paragraph_style:
                self.paragraph.style = self.paragraph_style
        except KeyError as e:
-            raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e
+            raise ValueError(
+                f"Unable to apply style {self.paragraph_style}.") from e
-    def parse_dict_string(self, string, separator=';'):
-        new_string = string.replace(" ", '').split(separator)
-        string_dict = dict([x.split(':') for x in new_string if ':' in x])
-        return string_dict
-
-    def handle_li(self):
-        # check list stack to determine style and depth
-        list_depth = len(self.tags['list'])
-        if list_depth:
-            list_type = self.tags['list'][-1]
-        else:
-            list_type = 'ul'  # assign unordered if no tag
-
-        if list_type == 'ol':
-            list_style = styles['LIST_NUMBER']
-        else:
-            list_style = styles['LIST_BULLET']
-
-        self.paragraph = self.doc.add_paragraph(style=list_style)
-        self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT))
-        self.paragraph.paragraph_format.line_spacing = 1
-
-    def add_image_to_cell(self, cell, image):
-        # python-docx doesn't have method yet for adding images to table cells. For now we use this
-        paragraph = cell.add_paragraph()
-        run = paragraph.add_run()
-        run.add_picture(image)
-
-    def handle_img(self, current_attrs):
-        if not self.include_images:
-            self.skip = True
-            self.skip_tag = 'img'
-            return
-        src = current_attrs['src']
-        # fetch image
-        src_is_url = is_url(src)
-        if src_is_url:
-            try:
-                image = fetch_image(src)
-            except urllib.error.URLError:
-                image = None
-        else:
-            image = src
-        # add image to doc
-        if image:
-            try:
-                if isinstance(self.doc, docx.document.Document):
-                    self.doc.add_picture(image)
-                else:
-                    self.add_image_to_cell(self.doc, image)
-            except FileNotFoundError:
-                image = None
-        if not image:
-            if src_is_url:
-                self.doc.add_paragraph("<image: %s>" % src)
-            else:
-                # avoid exposing filepaths in document
-                self.doc.add_paragraph("<image: %s>" % get_filename_from_url(src))
-    def handle_table(self, html):
+    def handle_table(self, html, doc):
        """
        To handle nested tables, we will parse tables manually as follows:
        Get table soup
@@ -320,194 +243,42 @@ class HtmlToDocx(HTMLParser):
        Iterate over soup and fill docx table with new instances of this parser
        Tell HTMLParser to ignore any tags until the corresponding closing table tag
        """
-        doc = Document()
        table_soup = BeautifulSoup(html, 'html.parser')
-        rows, cols_len = self.get_table_dimensions(table_soup)
+        rows, cols_len = get_table_dimensions(table_soup)
        table = doc.add_table(len(rows), cols_len)
        table.style = doc.styles['Table Grid']
        cell_row = 0
        for index, row in enumerate(rows):
-            cols = self.get_table_columns(row)
+            cols = get_table_columns(row)
            cell_col = 0
            for col in cols:
                colspan = int(col.attrs.get('colspan', 1))
                rowspan = int(col.attrs.get('rowspan', 1))
-                cell_html = self.get_cell_html(col)
+                cell_html = get_cell_html(col)
                if col.name == 'th':
                    cell_html = "<b>%s</b>" % cell_html
                docx_cell = table.cell(cell_row, cell_col)
                while docx_cell.text != '':  # Skip the merged cell
                    cell_col += 1
                    docx_cell = table.cell(cell_row, cell_col)
-                cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
+                cell_to_merge = table.cell(cell_row + rowspan - 1,
+                                           cell_col + colspan - 1)
                if docx_cell != cell_to_merge:
                    docx_cell.merge(cell_to_merge)
                child_parser = HtmlToDocx()
                child_parser.copy_settings_from(self)
-                child_parser.add_html_to_cell(cell_html or ' ', docx_cell)
+                child_parser.add_html_to_cell(cell_html or ' ',
+                                              docx_cell)  # occupy the position
                cell_col += colspan
            cell_row += 1
-        # skip all tags until corresponding closing tag
-        self.instances_to_skip = len(table_soup.find_all('table'))
-        self.skip_tag = 'table'
-        self.skip = True
-        self.table = None
-        return table
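The merged-cell walk in `handle_table` advances past grid positions already filled by an earlier `rowspan`/`colspan` merge before placing the next cell. A rough pure-Python model of that placement logic, on a plain list-of-lists grid instead of a docx table; `fill_grid` is an illustrative helper, not project code:

```python
def fill_grid(rows):
    """rows: list of rows, each a list of (text, colspan, rowspan) tuples.
    Returns a dense grid where merged cells repeat their text, mimicking
    how the docx merge walk occupies positions."""
    height = len(rows)
    width = sum(cell[1] for cell in rows[0]) if rows else 0
    grid = [[None] * width for _ in range(height)]
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already taken by an earlier merge
            while grid[r][c] is not None:
                c += 1
            # Occupy the full rowspan x colspan rectangle
            for rr in range(r, r + rowspan):
                for cc in range(c, c + colspan):
                    grid[rr][cc] = text
            c += colspan
    return grid


# 2x2 table whose first cell spans both rows
print(fill_grid([[('a', 1, 2), ('b', 1, 1)], [('c', 1, 1)]]))
# [['a', 'b'], ['a', 'c']]
```

The second row's single cell lands in column 1 because column 0 is already occupied by the rowspan merge, which is exactly what the `while docx_cell.text != ''` loop achieves in the patched method.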
-    def handle_link(self, href, text):
-        # Link requires a relationship
-        is_external = href.startswith('http')
-        rel_id = self.paragraph.part.relate_to(
-            href,
-            docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK,
-            is_external=True  # don't support anchor links for this library yet
-        )
-
-        # Create the w:hyperlink tag and add needed values
-        hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
-        hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id)
-
-        # Create sub-run
-        subrun = self.paragraph.add_run()
-        rPr = docx.oxml.shared.OxmlElement('w:rPr')
-
-        # add default color
-        c = docx.oxml.shared.OxmlElement('w:color')
-        c.set(docx.oxml.shared.qn('w:val'), "0000EE")
-        rPr.append(c)
-
-        # add underline
-        u = docx.oxml.shared.OxmlElement('w:u')
-        u.set(docx.oxml.shared.qn('w:val'), 'single')
-        rPr.append(u)
-
-        subrun._r.append(rPr)
-        subrun._r.text = text
-
-        # Add subrun to hyperlink
-        hyperlink.append(subrun._r)
-
-        # Add hyperlink to run
-        self.paragraph._p.append(hyperlink)
-    def handle_starttag(self, tag, attrs):
-        if self.skip:
-            return
-        if tag == 'head':
-            self.skip = True
-            self.skip_tag = tag
-            self.instances_to_skip = 0
-            return
-        elif tag == 'body':
-            return
-        current_attrs = dict(attrs)
-
-        if tag == 'span':
-            self.tags['span'].append(current_attrs)
-            return
-        elif tag == 'ol' or tag == 'ul':
-            self.tags['list'].append(tag)
-            return  # don't apply styles for now
-        elif tag == 'br':
-            self.run.add_break()
-            return
-
-        self.tags[tag] = current_attrs
-        if tag in ['p', 'pre']:
-            self.paragraph = self.doc.add_paragraph()
-            self.apply_paragraph_style()
-        elif tag == 'li':
-            self.handle_li()
-        elif tag == "hr":
-            # This implementation was taken from:
-            # https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373
-            self.paragraph = self.doc.add_paragraph()
-            pPr = self.paragraph._p.get_or_add_pPr()
-            pBdr = OxmlElement('w:pBdr')
-            pPr.insert_element_before(pBdr,
-                'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
-                'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
-                'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
-                'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
-                'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
-                'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
-                'w:pPrChange'
-            )
-            bottom = OxmlElement('w:bottom')
-            bottom.set(qn('w:val'), 'single')
-            bottom.set(qn('w:sz'), '6')
-            bottom.set(qn('w:space'), '1')
-            bottom.set(qn('w:color'), 'auto')
-            pBdr.append(bottom)
-        elif re.match('h[1-9]', tag):
-            if isinstance(self.doc, docx.document.Document):
-                h_size = int(tag[1])
-                self.paragraph = self.doc.add_heading(level=min(h_size, 9))
-            else:
-                self.paragraph = self.doc.add_paragraph()
-        elif tag == 'img':
-            self.handle_img(current_attrs)
-            return
-        elif tag == 'table':
-            self.handle_table()
-            return
-
-        # set new run reference point in case of leading line breaks
-        if tag in ['p', 'li', 'pre']:
-            self.run = self.paragraph.add_run()
-
-        # add style
-        if not self.include_styles:
-            return
-        if 'style' in current_attrs and self.paragraph:
-            style = self.parse_dict_string(current_attrs['style'])
-            self.add_styles_to_paragraph(style)
-
-    def handle_endtag(self, tag):
-        if self.skip:
-            if not tag == self.skip_tag:
-                return
-
-            if self.instances_to_skip > 0:
-                self.instances_to_skip -= 1
-                return
-
-            self.skip = False
-            self.skip_tag = None
-            self.paragraph = None
-
-        if tag == 'span':
-            if self.tags['span']:
-                self.tags['span'].pop()
-                return
-        elif tag == 'ol' or tag == 'ul':
-            remove_last_occurence(self.tags['list'], tag)
-            return
-        elif tag == 'table':
-            self.table_no += 1
-            self.table = None
-            self.doc = self.document
-            self.paragraph = None
-
-        if tag in self.tags:
-            self.tags.pop(tag)
-        # maybe set relevant reference to None?
    def handle_data(self, data):
        if self.skip:
@@ -546,87 +317,3 @@ class HtmlToDocx(HTMLParser):
            if tag in font_names:
                font_name = font_names[tag]
                self.run.font.name = font_name
def ignore_nested_tables(self, tables_soup):
"""
Returns array containing only the highest level tables
Operates on the assumption that bs4 returns child elements immediately after
the parent element in `find_all`. If this changes in the future, this method will need to be updated
:return:
"""
new_tables = []
nest = 0
for table in tables_soup:
if nest:
nest -= 1
continue
new_tables.append(table)
nest = len(table.find_all('table'))
return new_tables
def get_table_rows(self, table_soup):
# If there's a header, body, footer or direct child tr tags, add row dimensions from there
return table_soup.select(', '.join(self.table_row_selectors), recursive=False)
def get_table_columns(self, row):
# Get all columns for the specified row tag.
return row.find_all(['th', 'td'], recursive=False) if row else []
def get_table_dimensions(self, table_soup):
# Get rows for the table
rows = self.get_table_rows(table_soup)
# Table is either empty or has non-direct children between table and tr tags
# Thus the row dimensions and column dimensions are assumed to be 0
cols = self.get_table_columns(rows[0]) if rows else []
        # Account for colspan when computing the column count
col_count = 0
for col in cols:
colspan = col.attrs.get('colspan', 1)
col_count += int(colspan)
# return len(rows), col_count
return rows, col_count
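The colspan arithmetic above can be checked in isolation; here plain dicts stand in for bs4 `<td>`/`<th>` tags (whose attributes actually live in `col.attrs`):

```python
# Plain dicts stand in for bs4 cell tags: a cell with colspan=3
# contributes three columns to the count, as in get_table_dimensions.
cols = [{'colspan': '2'}, {}, {'colspan': '3'}]

col_count = 0
for col in cols:
    colspan = col.get('colspan', 1)  # real bs4 tags would use col.attrs.get
    col_count += int(colspan)

print(col_count)  # 6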
def get_tables(self):
if not hasattr(self, 'soup'):
self.include_tables = False
return
# find other way to do it, or require this dependency?
self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
self.table_no = 0
def run_process(self, html):
if self.bs and BeautifulSoup:
self.soup = BeautifulSoup(html, 'html.parser')
html = str(self.soup)
if self.include_tables:
self.get_tables()
self.feed(html)
def add_html_to_document(self, html, document):
if not isinstance(html, str):
raise ValueError('First argument needs to be a %s' % str)
elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell):
raise ValueError('Second argument needs to be a %s' % docx.document.Document)
self.set_initial_attrs(document)
self.run_process(html)
def add_html_to_cell(self, html, cell):
self.set_initial_attrs(cell)
self.run_process(html)
def parse_html_file(self, filename_html, filename_docx=None):
with open(filename_html, 'r') as infile:
html = infile.read()
self.set_initial_attrs()
self.run_process(html)
if not filename_docx:
path, filename = os.path.split(filename_html)
filename_docx = '%s/new_docx_file_%s' % (path, filename)
self.doc.save('%s.docx' % filename_docx)
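When `filename_docx` is not supplied, the default output path built above keeps the original `.html` suffix inside the new name. A quick check of that string handling, with a hypothetical input path:

```python
import os

# Reproduce the default output-path logic of parse_html_file: the new
# file lands next to the input, and the .html suffix survives in the name.
filename_html = '/tmp/report.html'  # hypothetical input path
path, filename = os.path.split(filename_html)
filename_docx = '%s/new_docx_file_%s' % (path, filename)
print('%s.docx' % filename_docx)  # /tmp/new_docx_file_report.html.docx
```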
def parse_html_string(self, html):
self.set_initial_attrs()
self.run_process(html)
return self.doc
@@ -90,11 +90,6 @@ def init_args():
         type=str2bool,
         default=False,
         help='Whether to enable layout of recovery')
-    parser.add_argument(
-        "--save_pdf",
-        type=str2bool,
-        default=False,
-        help='Whether to save pdf file')
     return parser
@@ -108,7 +103,38 @@ def draw_structure_result(image, result, font_path):
     if isinstance(image, np.ndarray):
         image = Image.fromarray(image)
     boxes, txts, scores = [], [], []
+    img_layout = image.copy()
+    draw_layout = ImageDraw.Draw(img_layout)
+    text_color = (255, 255, 255)
+    text_background_color = (80, 127, 255)
+    catid2color = {}
+    font_size = 15
+    font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
     for region in result:
+        if region['type'] not in catid2color:
+            box_color = (random.randint(0, 255), random.randint(0, 255),
+                         random.randint(0, 255))
+            catid2color[region['type']] = box_color
+        else:
+            box_color = catid2color[region['type']]
+        box_layout = region['bbox']
+        draw_layout.rectangle(
+            [(box_layout[0], box_layout[1]), (box_layout[2], box_layout[3])],
+            outline=box_color,
+            width=3)
+        text_w, text_h = font.getsize(region['type'])
+        draw_layout.rectangle(
+            [(box_layout[0], box_layout[1]),
+             (box_layout[0] + text_w, box_layout[1] + text_h)],
+            fill=text_background_color)
+        draw_layout.text(
+            (box_layout[0], box_layout[1]),
+            region['type'],
+            fill=text_color,
+            font=font)
         if region['type'] == 'table':
             pass
         else:
@@ -116,6 +142,7 @@ def draw_structure_result(image, result, font_path):
             boxes.append(np.array(text_result['text_region']))
             txts.append(text_result['text'])
             scores.append(text_result['confidence'])
     im_show = draw_ocr_box_txt(
-        image, boxes, txts, scores, font_path=font_path, drop_score=0)
+        img_layout, boxes, txts, scores, font_path=font_path, drop_score=0)
     return im_show