@@ -152,7 +152,7 @@ We provide CDLA(Chinese layout analysis), TableBank(Table layout analysis)etc. d
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACK A) and table recognition (TRACK B). Image types include historical datasets (beginning with cTDaR_t0, e.g. cTDaR_t00872.jpg) and modern datasets (beginning with cTDaR_t1, e.g. cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | A dataset constructed by manually annotating figures and pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature. |
| [TableBank](https://github.com/doc-analysis/TableBank) | A large dataset for table detection and recognition, covering both Word and LaTeX document formats. |
-| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, including 10 categories:Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
+| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis dataset for Chinese literature (paper) scenarios, including 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale dataset (500K document pages) constructed with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |
@@ -175,7 +175,7 @@ If the test image is Chinese, the pre-trained model of Chinese CDLA dataset can
### 5.1. Train
-Train:
+Start training with the PaddleDetection [layout analysis configuration files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis):
* Modify Profile
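The "Modify Profile" step usually means pointing the config at your own dataset. The fragment below is a sketch only: the keys follow PaddleDetection's COCO-format dataset configs, and the paths and class count are placeholders to be replaced with your actual layout-analysis yml values.

```yaml
# Sketch of the fields commonly edited before training (placeholder values):
num_classes: 5            # e.g. 5 for PubLayNet, 10 for CDLA

TrainDataset:
  !COCODataSet
    image_dir: train
    anno_path: train.json
    dataset_dir: /path/to/publaynet/
```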
diff --git a/ppstructure/layout/README_ch.md b/ppstructure/layout/README_ch.md
index 46d2ba74b2d5c579d4b25cf0cadac22ebc32e5b2..adef46d47389a50bf34500eee1aaf52ff5dfe449 100644
--- a/ppstructure/layout/README_ch.md
+++ b/ppstructure/layout/README_ch.md
@@ -22,7 +22,7 @@
## 1. 简介
-版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发。
+版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发,包含英文、中文、表格版面分析3类模型。其中,英文模型支持Text、Title、Table、Figure、List 5类区域的检测,中文模型支持Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation 10类区域的检测,表格版面分析支持Table区域的检测,版面分析效果如下图所示:
@@ -152,7 +152,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放,
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | 用于表格检测(TRACKA)和表格识别(TRACKB)。图片类型包含历史数据集(以cTDaR_t0开头,如cTDaR_t00872.jpg)和现代数据集(以cTDaR_t1开头,cTDaR_t10482.jpg)。 |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | 手动注释公开的年度报告中的图形或页面而构建的数据集,包含5类:table, figure, natural image, logo, and signature |
-| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Table、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation |
+| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation |
| [TableBank](https://github.com/doc-analysis/TableBank) | 用于表格检测和识别大型数据集,包含Word和Latex2种文档格式 |
| [DocBank](https://github.com/doc-analysis/DocBank) | 使用弱监督方法构建的大规模数据集(500K文档页面),用于文档布局分析,包含12类:Author、Caption、Date、Equation、Figure、Footer、List、Paragraph、Reference、Section、Table、Title |
@@ -161,7 +161,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放,
提供了训练脚本、评估脚本和预测脚本,本节将以PubLayNet预训练模型为例进行讲解。
-如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过本部分。
+如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过5.1和5.2。
```
mkdir pretrained_model
@@ -176,7 +176,7 @@ wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_
### 5.1. 启动训练
-开始训练:
+使用PaddleDetection[版面分析配置文件](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)启动训练:
* 修改配置文件
diff --git a/ppstructure/predict_system.py b/ppstructure/predict_system.py
index 71147d3af8ec666d368234270dcb0d16aaf91938..6e68abd1b2a4e204a96ce91b8d7d291c378bda76 100644
--- a/ppstructure/predict_system.py
+++ b/ppstructure/predict_system.py
@@ -254,8 +254,7 @@ def main(args):
if args.recovery and all_res != []:
try:
- convert_info_docx(img, all_res, save_folder, img_name,
- args.save_pdf)
+ convert_info_docx(img, all_res, save_folder, img_name)
except Exception as ex:
logger.error("error in layout recovery image:{}, err msg: {}".
format(image_file, ex))
diff --git a/ppstructure/recovery/README.md b/ppstructure/recovery/README.md
index 011d6e12fda1b09c7a87367fb887a5c99a4ae00a..0e06c65475b67bcdfc119069fa6f6076322c0e99 100644
--- a/ppstructure/recovery/README.md
+++ b/ppstructure/recovery/README.md
@@ -82,8 +82,11 @@ Through layout analysis, we divided the image/PDF documents into regions, locate
We can restore the test picture through the layout information, OCR detection and recognition structure, table information, and saved pictures.
-The whl package is also provided for quick use, see [quickstart](../docs/quickstart_en.md) for details.
+The whl package is also provided for quick use. Run the command below; for more information please refer to [quickstart](../docs/quickstart_en.md).
+```bash
+paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
+```
### 3.1 Download models
diff --git a/ppstructure/recovery/README_ch.md b/ppstructure/recovery/README_ch.md
index fd2e649024ec88e2ea5c88536ccac2e259538886..bc8913adca3385a88cb2decc87fa9acffc707257 100644
--- a/ppstructure/recovery/README_ch.md
+++ b/ppstructure/recovery/README_ch.md
@@ -83,7 +83,16 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt
我们通过版面信息、OCR检测和识别结构、表格信息、保存的图片,对测试图片进行恢复即可。
-提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,详见 [quickstart](../docs/quickstart.md)。
+提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,更多信息详见 [quickstart](../docs/quickstart.md)。
+
+```bash
+# 中文测试图
+paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true
+# 英文测试图
+paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
+# pdf测试文件
+paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
+```
diff --git a/ppstructure/recovery/recovery_to_doc.py b/ppstructure/recovery/recovery_to_doc.py
index 73b497d49d0961b253738eddad49c88c12c13601..1d8f8d9d4babca7410d6625dbeac4c41668f58a7 100644
--- a/ppstructure/recovery/recovery_to_doc.py
+++ b/ppstructure/recovery/recovery_to_doc.py
@@ -28,7 +28,7 @@ from ppocr.utils.logging import get_logger
logger = get_logger()
-def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
+def convert_info_docx(img, res, save_folder, img_name):
doc = Document()
doc.styles['Normal'].font.name = 'Times New Roman'
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
@@ -60,14 +60,9 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
elif region['type'].lower() == 'title':
doc.add_heading(region['res'][0]['text'])
elif region['type'].lower() == 'table':
- paragraph = doc.add_paragraph()
- new_parser = HtmlToDocx()
- new_parser.table_style = 'TableGrid'
- table = new_parser.handle_table(html=region['res']['html'])
- new_table = deepcopy(table)
- new_table.alignment = WD_TABLE_ALIGNMENT.CENTER
- paragraph.add_run().element.addnext(new_table._tbl)
-
+ parser = HtmlToDocx()
+ parser.table_style = 'TableGrid'
+ parser.handle_table(region['res']['html'], doc)
else:
paragraph = doc.add_paragraph()
paragraph_format = paragraph.paragraph_format
@@ -82,13 +77,6 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
doc.save(docx_path)
logger.info('docx save to {}'.format(docx_path))
- # save to pdf
- if save_pdf:
- pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name))
- from docx2pdf import convert
- convert(docx_path, pdf_path)
- logger.info('pdf save to {}'.format(pdf_path))
-
def sorted_layout_boxes(res, w):
"""
diff --git a/ppstructure/recovery/requirements.txt b/ppstructure/recovery/requirements.txt
index 25e8cdbb0d58b0a243b176f563c66717d6f4c112..7ddc3391338e5a2a87f9cea9fca006dc03da58fb 100644
--- a/ppstructure/recovery/requirements.txt
+++ b/ppstructure/recovery/requirements.txt
@@ -1,4 +1,3 @@
python-docx
-docx2pdf
PyMuPDF
beautifulsoup4
\ No newline at end of file
diff --git a/ppstructure/recovery/table_process.py b/ppstructure/recovery/table_process.py
index 243aaf8933791bf4704964d9665173fe70982f95..982e6b760f9291628d0514728dc8f684f183aa2c 100644
--- a/ppstructure/recovery/table_process.py
+++ b/ppstructure/recovery/table_process.py
@@ -1,4 +1,3 @@
-
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -13,62 +12,59 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
-This code is refer from:https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py
-
+This code is adapted from: https://github.com/weizwx/html2docx/blob/master/htmldocx/h2d.py
"""
-import re, argparse
-import io, os
-import urllib.request
-from urllib.parse import urlparse
-from html.parser import HTMLParser
-import docx, docx.table
+import re
+import docx
from docx import Document
-from docx.shared import RGBColor, Pt, Inches
-from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH
-from docx.oxml import OxmlElement
-from docx.oxml.ns import qn
-
from bs4 import BeautifulSoup
+from html.parser import HTMLParser
-# values in inches
-INDENT = 0.25
-LIST_INDENT = 0.5
-MAX_INDENT = 5.5 # To stop indents going off the page
-# Style to use with tables. By default no style is used.
-DEFAULT_TABLE_STYLE = None
+def get_table_rows(table_soup):
+ table_row_selectors = [
+ 'table > tr', 'table > thead > tr', 'table > tbody > tr',
+ 'table > tfoot > tr'
+ ]
+ # If there's a header, body, footer or direct child tr tags, add row dimensions from there
+ return table_soup.select(', '.join(table_row_selectors), recursive=False)
-# Style to use with paragraphs. By default no style is used.
-DEFAULT_PARAGRAPH_STYLE = None
+def get_table_columns(row):
+ # Get all columns for the specified row tag.
+ return row.find_all(['th', 'td'], recursive=False) if row else []
-def get_filename_from_url(url):
- return os.path.basename(urlparse(url).path)
-def is_url(url):
- """
- Not to be used for actually validating a url, but in our use case we only
- care if it's a url or a file path, and they're pretty distinguishable
- """
- parts = urlparse(url)
- return all([parts.scheme, parts.netloc, parts.path])
+def get_table_dimensions(table_soup):
+ # Get rows for the table
+ rows = get_table_rows(table_soup)
+ # Table is either empty or has non-direct children between table and tr tags
+ # Thus the row dimensions and column dimensions are assumed to be 0
-def fetch_image(url):
- """
- Attempts to fetch an image from a url.
- If successful returns a bytes object, else returns None
- :return:
- """
- try:
- with urllib.request.urlopen(url) as response:
- # security flaw?
- return io.BytesIO(response.read())
- except urllib.error.URLError:
- return None
+ cols = get_table_columns(rows[0]) if rows else []
+ # Add colspan calculation column number
+ col_count = 0
+ for col in cols:
+ colspan = col.attrs.get('colspan', 1)
+ col_count += int(colspan)
+
+ return rows, col_count
+
+
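The column count returned by `get_table_dimensions` folds `colspan` into the table width. That arithmetic can be checked in isolation; `FakeCell` below is a hypothetical stand-in that only mimics a bs4 tag's `.attrs` mapping:

```python
# Stand-in for a bs4 tag: only the .attrs mapping is read by the counting logic.
class FakeCell:
    def __init__(self, colspan=None):
        self.attrs = {} if colspan is None else {'colspan': str(colspan)}

def count_columns(cells):
    # Mirrors the loop in get_table_dimensions: a cell with colspan=N
    # occupies N grid columns, so the docx table must be that many wide.
    col_count = 0
    for cell in cells:
        col_count += int(cell.attrs.get('colspan', 1))
    return col_count

# A header row <th colspan="2"><th><th colspan="3"> spans 6 columns.
print(count_columns([FakeCell(2), FakeCell(), FakeCell(3)]))  # 6
```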
+def get_cell_html(soup):
+ # Returns string of td element with opening and closing <td> tags removed
+ # Cannot use find_all as it only finds element tags and does not find text which
+ # is not inside an element
+ return ' '.join([str(i) for i in soup.contents])
+
+
+def delete_paragraph(paragraph):
+ # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
+ p = paragraph._element
+ p.getparent().remove(p)
+ p._p = p._element = None
-def remove_last_occurence(ls, x):
- ls.pop(len(ls) - ls[::-1].index(x) - 1)
def remove_whitespace(string, leading=False, trailing=False):
"""Remove white space from a string.
@@ -122,11 +118,6 @@ def remove_whitespace(string, leading=False, trailing=False):
# TODO need some way to get rid of extra spaces in e.g. text text
return re.sub(r'\s+', ' ', string)
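`remove_whitespace` ends by collapsing runs of whitespace with a single regex substitution; a minimal standalone equivalent of that final step:

```python
import re

def collapse_spaces(string):
    # Same substitution as remove_whitespace's final step: any run of
    # whitespace characters (spaces, tabs, newlines) becomes one space.
    return re.sub(r'\s+', ' ', string)

print(collapse_spaces('text \t\n  text'))  # 'text text'
```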
-def delete_paragraph(paragraph):
- # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
- p = paragraph._element
- p.getparent().remove(p)
- p._p = p._element = None
font_styles = {
'b': 'bold',
@@ -145,13 +136,8 @@ font_names = {
'pre': 'Courier',
}
-styles = {
- 'LIST_BULLET': 'List Bullet',
- 'LIST_NUMBER': 'List Number',
-}
class HtmlToDocx(HTMLParser):
-
def __init__(self):
super().__init__()
self.options = {
@@ -161,13 +147,11 @@ class HtmlToDocx(HTMLParser):
'styles': True,
}
self.table_row_selectors = [
- 'table > tr',
- 'table > thead > tr',
- 'table > tbody > tr',
+ 'table > tr', 'table > thead > tr', 'table > tbody > tr',
'table > tfoot > tr'
]
- self.table_style = DEFAULT_TABLE_STYLE
- self.paragraph_style = DEFAULT_PARAGRAPH_STYLE
+ self.table_style = None
+ self.paragraph_style = None
def set_initial_attrs(self, document=None):
self.tags = {
@@ -178,9 +162,10 @@ class HtmlToDocx(HTMLParser):
self.doc = document
else:
self.doc = Document()
- self.bs = self.options['fix-html'] # whether or not to clean with BeautifulSoup
+ self.bs = self.options[
+ 'fix-html'] # whether or not to clean with BeautifulSoup
self.document = self.doc
- self.include_tables = True #TODO add this option back in?
+ self.include_tables = True #TODO add this option back in?
self.include_images = self.options['images']
self.include_styles = self.options['styles']
self.paragraph = None
@@ -193,55 +178,52 @@ class HtmlToDocx(HTMLParser):
self.table_style = other.table_style
self.paragraph_style = other.paragraph_style
- def get_cell_html(self, soup):
- # Returns string of td element with opening and closing <td> tags removed
- # Cannot use find_all as it only finds element tags and does not find text which
- # is not inside an element
- return ' '.join([str(i) for i in soup.contents])
-
- def add_styles_to_paragraph(self, style):
- if 'text-align' in style:
- align = style['text-align']
- if align == 'center':
- self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
- elif align == 'right':
- self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
- elif align == 'justify':
- self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
- if 'margin-left' in style:
- margin = style['margin-left']
- units = re.sub(r'[0-9]+', '', margin)
- margin = int(float(re.sub(r'[a-z]+', '', margin)))
- if units == 'px':
- self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
- # TODO handle non px units
-
- def add_styles_to_run(self, style):
- if 'color' in style:
- if 'rgb' in style['color']:
- color = re.sub(r'[a-z()]+', '', style['color'])
- colors = [int(x) for x in color.split(',')]
- elif '#' in style['color']:
- color = style['color'].lstrip('#')
- colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4))
- else:
- colors = [0, 0, 0]
- # TODO map colors to named colors (and extended colors...)
- # For now set color to black to prevent crashing
- self.run.font.color.rgb = RGBColor(*colors)
-
- if 'background-color' in style:
- if 'rgb' in style['background-color']:
- color = color = re.sub(r'[a-z()]+', '', style['background-color'])
- colors = [int(x) for x in color.split(',')]
- elif '#' in style['background-color']:
- color = style['background-color'].lstrip('#')
- colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4))
- else:
- colors = [0, 0, 0]
- # TODO map colors to named colors (and extended colors...)
- # For now set color to black to prevent crashing
- self.run.font.highlight_color = WD_COLOR.GRAY_25 #TODO: map colors
+ def ignore_nested_tables(self, tables_soup):
+ """
+ Returns array containing only the highest level tables
+ Operates on the assumption that bs4 returns child elements immediately after
+ the parent element in `find_all`. If this changes in the future, this method will need to be updated
+ :return:
+ """
+ new_tables = []
+ nest = 0
+ for table in tables_soup:
+ if nest:
+ nest -= 1
+ continue
+ new_tables.append(table)
+ nest = len(table.find_all('table'))
+ return new_tables
+
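The skip-counting in `ignore_nested_tables` relies on bs4's `find_all` returning descendants immediately after their parent. That invariant is easy to exercise with a hypothetical `FakeTable` whose `find_all` flattens nested tables in document order:

```python
# Stand-in for a bs4 <table> tag: find_all('table') returns nested tables
# in document order, children immediately after their parent.
class FakeTable:
    def __init__(self, name, nested=None):
        self.name = name
        self._nested = nested or []
    def find_all(self, tag):
        out = []
        for t in self._nested:
            out.append(t)
            out.extend(t.find_all(tag))
        return out

def ignore_nested_tables(tables_soup):
    # Same logic as the method above: after keeping a top-level table,
    # skip as many following entries as it has nested tables.
    new_tables, nest = [], 0
    for table in tables_soup:
        if nest:
            nest -= 1
            continue
        new_tables.append(table)
        nest = len(table.find_all('table'))
    return new_tables

inner = FakeTable('inner')
outer = FakeTable('outer', [inner])
flat = [outer] + outer.find_all('table') + [FakeTable('sibling')]
print([t.name for t in ignore_nested_tables(flat)])  # ['outer', 'sibling']
```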
+ def get_tables(self):
+ if not hasattr(self, 'soup'):
+ self.include_tables = False
+ return
+ # find other way to do it, or require this dependency?
+ self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
+ self.table_no = 0
+
+ def run_process(self, html):
+ if self.bs and BeautifulSoup:
+ self.soup = BeautifulSoup(html, 'html.parser')
+ html = str(self.soup)
+ if self.include_tables:
+ self.get_tables()
+ self.feed(html)
+
+ def add_html_to_cell(self, html, cell):
+ if not isinstance(cell, docx.table._Cell):
+ raise ValueError('Second argument needs to be a %s' %
+ docx.table._Cell)
+ unwanted_paragraph = cell.paragraphs[0]
+ if unwanted_paragraph.text == "":
+ delete_paragraph(unwanted_paragraph)
+ self.set_initial_attrs(cell)
+ self.run_process(html)
+ # cells must end with a paragraph or will get message about corrupt file
+ # https://stackoverflow.com/a/29287121
+ if not self.doc.paragraphs:
+ self.doc.add_paragraph('')
def apply_paragraph_style(self, style=None):
try:
@@ -250,69 +232,10 @@ class HtmlToDocx(HTMLParser):
elif self.paragraph_style:
self.paragraph.style = self.paragraph_style
except KeyError as e:
- raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e
-
- def parse_dict_string(self, string, separator=';'):
- new_string = string.replace(" ", '').split(separator)
- string_dict = dict([x.split(':') for x in new_string if ':' in x])
- return string_dict
-
- def handle_li(self):
- # check list stack to determine style and depth
- list_depth = len(self.tags['list'])
- if list_depth:
- list_type = self.tags['list'][-1]
- else:
- list_type = 'ul' # assign unordered if no tag
+ raise ValueError(
+ f"Unable to apply style {self.paragraph_style}.") from e
- if list_type == 'ol':
- list_style = styles['LIST_NUMBER']
- else:
- list_style = styles['LIST_BULLET']
-
- self.paragraph = self.doc.add_paragraph(style=list_style)
- self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT))
- self.paragraph.paragraph_format.line_spacing = 1
-
- def add_image_to_cell(self, cell, image):
- # python-docx doesn't have method yet for adding images to table cells. For now we use this
- paragraph = cell.add_paragraph()
- run = paragraph.add_run()
- run.add_picture(image)
-
- def handle_img(self, current_attrs):
- if not self.include_images:
- self.skip = True
- self.skip_tag = 'img'
- return
- src = current_attrs['src']
- # fetch image
- src_is_url = is_url(src)
- if src_is_url:
- try:
- image = fetch_image(src)
- except urllib.error.URLError:
- image = None
- else:
- image = src
- # add image to doc
- if image:
- try:
- if isinstance(self.doc, docx.document.Document):
- self.doc.add_picture(image)
- else:
- self.add_image_to_cell(self.doc, image)
- except FileNotFoundError:
- image = None
- if not image:
- if src_is_url:
- self.doc.add_paragraph("<image: %s>" % src)
- else:
- # avoid exposing filepaths in document
- self.doc.add_paragraph("<image: %s>" % get_filename_from_url(src))
-
-
- def handle_table(self, html):
+ def handle_table(self, html, doc):
"""
To handle nested tables, we will parse tables manually as follows:
Get table soup
@@ -320,194 +243,42 @@ class HtmlToDocx(HTMLParser):
Iterate over soup and fill docx table with new instances of this parser
Tell HTMLParser to ignore any tags until the corresponding closing table tag
"""
- doc = Document()
table_soup = BeautifulSoup(html, 'html.parser')
- rows, cols_len = self.get_table_dimensions(table_soup)
+ rows, cols_len = get_table_dimensions(table_soup)
table = doc.add_table(len(rows), cols_len)
table.style = doc.styles['Table Grid']
+
cell_row = 0
for index, row in enumerate(rows):
- cols = self.get_table_columns(row)
+ cols = get_table_columns(row)
cell_col = 0
for col in cols:
colspan = int(col.attrs.get('colspan', 1))
rowspan = int(col.attrs.get('rowspan', 1))
- cell_html = self.get_cell_html(col)
-
+ cell_html = get_cell_html(col)
if col.name == 'th':
cell_html = "<b>%s</b>" % cell_html
+
docx_cell = table.cell(cell_row, cell_col)
+
while docx_cell.text != '': # Skip the merged cell
cell_col += 1
docx_cell = table.cell(cell_row, cell_col)
- cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
+ cell_to_merge = table.cell(cell_row + rowspan - 1,
+ cell_col + colspan - 1)
if docx_cell != cell_to_merge:
docx_cell.merge(cell_to_merge)
child_parser = HtmlToDocx()
child_parser.copy_settings_from(self)
-
- child_parser.add_html_to_cell(cell_html or ' ', docx_cell) # occupy the position
+ child_parser.add_html_to_cell(cell_html or ' ', docx_cell)
cell_col += colspan
cell_row += 1
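The merged-cell bookkeeping in this loop (skip grid cells already occupied by an earlier rowspan, then merge a rowspan-by-colspan rectangle) can be modeled on a plain 2D grid. A sketch under the assumption that the first row's colspans determine the table width; `place_cells` is a hypothetical helper, not part of the module:

```python
def place_cells(rows):
    # rows: list of lists of (colspan, rowspan) tuples, as parsed from HTML.
    # Mirrors handle_table's placement: advance past grid cells already
    # claimed by an earlier rowspan, then mark the merged rectangle.
    width = sum(c for c, _ in rows[0]) if rows else 0
    grid = [[None] * width for _ in rows]
    for r, cols in enumerate(rows):
        c = 0
        for idx, (colspan, rowspan) in enumerate(cols):
            while grid[r][c] is not None:  # skip the merged cell
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[r + dr][c + dc] = (r, idx)
            c += colspan
    return grid

# A 2x2 table whose first cell spans both rows:
# <tr><td rowspan=2><td></tr><tr><td></tr>
print(place_cells([[(1, 2), (1, 1)], [(1, 1)]]))
# [[(0, 0), (0, 1)], [(0, 0), (1, 0)]]
```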
-
- # skip all tags until corresponding closing tag
- self.instances_to_skip = len(table_soup.find_all('table'))
- self.skip_tag = 'table'
- self.skip = True
- self.table = None
- return table
-
- def handle_link(self, href, text):
- # Link requires a relationship
- is_external = href.startswith('http')
- rel_id = self.paragraph.part.relate_to(
- href,
- docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK,
- is_external=True # don't support anchor links for this library yet
- )
-
- # Create the w:hyperlink tag and add needed values
- hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
- hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id)
-
-
- # Create sub-run
- subrun = self.paragraph.add_run()
- rPr = docx.oxml.shared.OxmlElement('w:rPr')
-
- # add default color
- c = docx.oxml.shared.OxmlElement('w:color')
- c.set(docx.oxml.shared.qn('w:val'), "0000EE")
- rPr.append(c)
-
- # add underline
- u = docx.oxml.shared.OxmlElement('w:u')
- u.set(docx.oxml.shared.qn('w:val'), 'single')
- rPr.append(u)
-
- subrun._r.append(rPr)
- subrun._r.text = text
-
- # Add subrun to hyperlink
- hyperlink.append(subrun._r)
-
- # Add hyperlink to run
- self.paragraph._p.append(hyperlink)
-
- def handle_starttag(self, tag, attrs):
- if self.skip:
- return
- if tag == 'head':
- self.skip = True
- self.skip_tag = tag
- self.instances_to_skip = 0
- return
- elif tag == 'body':
- return
-
- current_attrs = dict(attrs)
-
- if tag == 'span':
- self.tags['span'].append(current_attrs)
- return
- elif tag == 'ol' or tag == 'ul':
- self.tags['list'].append(tag)
- return # don't apply styles for now
- elif tag == 'br':
- self.run.add_break()
- return
-
- self.tags[tag] = current_attrs
- if tag in ['p', 'pre']:
- self.paragraph = self.doc.add_paragraph()
- self.apply_paragraph_style()
-
- elif tag == 'li':
- self.handle_li()
-
- elif tag == "hr":
-
- # This implementation was taken from:
- # https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373
-
- self.paragraph = self.doc.add_paragraph()
- pPr = self.paragraph._p.get_or_add_pPr()
- pBdr = OxmlElement('w:pBdr')
- pPr.insert_element_before(pBdr,
- 'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
- 'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
- 'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
- 'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
- 'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
- 'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
- 'w:pPrChange'
- )
- bottom = OxmlElement('w:bottom')
- bottom.set(qn('w:val'), 'single')
- bottom.set(qn('w:sz'), '6')
- bottom.set(qn('w:space'), '1')
- bottom.set(qn('w:color'), 'auto')
- pBdr.append(bottom)
-
- elif re.match('h[1-9]', tag):
- if isinstance(self.doc, docx.document.Document):
- h_size = int(tag[1])
- self.paragraph = self.doc.add_heading(level=min(h_size, 9))
- else:
- self.paragraph = self.doc.add_paragraph()
-
- elif tag == 'img':
- self.handle_img(current_attrs)
- return
-
- elif tag == 'table':
- self.handle_table()
- return
- # set new run reference point in case of leading line breaks
- if tag in ['p', 'li', 'pre']:
- self.run = self.paragraph.add_run()
-
- # add style
- if not self.include_styles:
- return
- if 'style' in current_attrs and self.paragraph:
- style = self.parse_dict_string(current_attrs['style'])
- self.add_styles_to_paragraph(style)
-
- def handle_endtag(self, tag):
- if self.skip:
- if not tag == self.skip_tag:
- return
-
- if self.instances_to_skip > 0:
- self.instances_to_skip -= 1
- return
-
- self.skip = False
- self.skip_tag = None
- self.paragraph = None
-
- if tag == 'span':
- if self.tags['span']:
- self.tags['span'].pop()
- return
- elif tag == 'ol' or tag == 'ul':
- remove_last_occurence(self.tags['list'], tag)
- return
- elif tag == 'table':
- self.table_no += 1
- self.table = None
- self.doc = self.document
- self.paragraph = None
-
- if tag in self.tags:
- self.tags.pop(tag)
- # maybe set relevant reference to None?
def handle_data(self, data):
if self.skip:
@@ -546,87 +317,3 @@ class HtmlToDocx(HTMLParser):
if tag in font_names:
font_name = font_names[tag]
self.run.font.name = font_name
-
- def ignore_nested_tables(self, tables_soup):
- """
- Returns array containing only the highest level tables
- Operates on the assumption that bs4 returns child elements immediately after
- the parent element in `find_all`. If this changes in the future, this method will need to be updated
- :return:
- """
- new_tables = []
- nest = 0
- for table in tables_soup:
- if nest:
- nest -= 1
- continue
- new_tables.append(table)
- nest = len(table.find_all('table'))
- return new_tables
-
- def get_table_rows(self, table_soup):
- # If there's a header, body, footer or direct child tr tags, add row dimensions from there
- return table_soup.select(', '.join(self.table_row_selectors), recursive=False)
-
- def get_table_columns(self, row):
- # Get all columns for the specified row tag.
- return row.find_all(['th', 'td'], recursive=False) if row else []
-
- def get_table_dimensions(self, table_soup):
- # Get rows for the table
- rows = self.get_table_rows(table_soup)
- # Table is either empty or has non-direct children between table and tr tags
- # Thus the row dimensions and column dimensions are assumed to be 0
-
- cols = self.get_table_columns(rows[0]) if rows else []
- # Add colspan calculation column number
- col_count = 0
- for col in cols:
- colspan = col.attrs.get('colspan', 1)
- col_count += int(colspan)
-
- # return len(rows), col_count
- return rows, col_count
-
- def get_tables(self):
- if not hasattr(self, 'soup'):
- self.include_tables = False
- return
- # find other way to do it, or require this dependency?
- self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
- self.table_no = 0
-
- def run_process(self, html):
- if self.bs and BeautifulSoup:
- self.soup = BeautifulSoup(html, 'html.parser')
- html = str(self.soup)
- if self.include_tables:
- self.get_tables()
- self.feed(html)
-
- def add_html_to_document(self, html, document):
- if not isinstance(html, str):
- raise ValueError('First argument needs to be a %s' % str)
- elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell):
- raise ValueError('Second argument needs to be a %s' % docx.document.Document)
- self.set_initial_attrs(document)
- self.run_process(html)
-
- def add_html_to_cell(self, html, cell):
- self.set_initial_attrs(cell)
- self.run_process(html)
-
- def parse_html_file(self, filename_html, filename_docx=None):
- with open(filename_html, 'r') as infile:
- html = infile.read()
- self.set_initial_attrs()
- self.run_process(html)
- if not filename_docx:
- path, filename = os.path.split(filename_html)
- filename_docx = '%s/new_docx_file_%s' % (path, filename)
- self.doc.save('%s.docx' % filename_docx)
-
- def parse_html_string(self, html):
- self.set_initial_attrs()
- self.run_process(html)
- return self.doc
\ No newline at end of file
diff --git a/ppstructure/utility.py b/ppstructure/utility.py
index 97b6d6fec0d70fe3014b0b2105dbbef6a292e4d7..c0aa8091273ad0fd0f71d36843322ac289417059 100644
--- a/ppstructure/utility.py
+++ b/ppstructure/utility.py
@@ -90,11 +90,6 @@ def init_args():
type=str2bool,
default=False,
help='Whether to enable layout of recovery')
- parser.add_argument(
- "--save_pdf",
- type=str2bool,
- default=False,
- help='Whether to save pdf file')
return parser
@@ -108,7 +103,38 @@ def draw_structure_result(image, result, font_path):
if isinstance(image, np.ndarray):
image = Image.fromarray(image)
boxes, txts, scores = [], [], []
+
+ img_layout = image.copy()
+ draw_layout = ImageDraw.Draw(img_layout)
+ text_color = (255, 255, 255)
+ text_background_color = (80, 127, 255)
+ catid2color = {}
+ font_size = 15
+ font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
+
for region in result:
+ if region['type'] not in catid2color:
+ box_color = (random.randint(0, 255), random.randint(0, 255),
+ random.randint(0, 255))
+ catid2color[region['type']] = box_color
+ else:
+ box_color = catid2color[region['type']]
+ box_layout = region['bbox']
+ draw_layout.rectangle(
+ [(box_layout[0], box_layout[1]), (box_layout[2], box_layout[3])],
+ outline=box_color,
+ width=3)
+ text_w, text_h = font.getsize(region['type'])
+ draw_layout.rectangle(
+ [(box_layout[0], box_layout[1]),
+ (box_layout[0] + text_w, box_layout[1] + text_h)],
+ fill=text_background_color)
+ draw_layout.text(
+ (box_layout[0], box_layout[1]),
+ region['type'],
+ fill=text_color,
+ font=font)
+
if region['type'] == 'table':
pass
else:
@@ -116,6 +142,7 @@ def draw_structure_result(image, result, font_path):
boxes.append(np.array(text_result['text_region']))
txts.append(text_result['text'])
scores.append(text_result['confidence'])
+
im_show = draw_ocr_box_txt(
- image, boxes, txts, scores, font_path=font_path, drop_score=0)
+ img_layout, boxes, txts, scores, font_path=font_path, drop_score=0)
return im_show
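The `catid2color` cache above gives each region type one random color, reused for every box of that type within a run. A standalone sketch of the same idea (`category_colors` is a hypothetical helper; seeding is an addition here so the palette is repeatable):

```python
import random

def category_colors(categories, seed=0):
    # Mirrors the catid2color cache in draw_structure_result: each region
    # type gets one random RGB color, reused for every box of that type.
    rng = random.Random(seed)
    colors = {}
    for cat in categories:
        if cat not in colors:
            colors[cat] = tuple(rng.randint(0, 255) for _ in range(3))
    return colors

palette = category_colors(['text', 'table', 'text', 'figure'])
print(len(palette))  # 3
```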
|