diff --git a/ppstructure/docs/layout/layout.png b/ppstructure/docs/layout/layout.png index da9640e245e34659771353e328bf97da129bd622..66b95486955b5f45f3f0c16e1ed6577914cc2c7c 100644 Binary files a/ppstructure/docs/layout/layout.png and b/ppstructure/docs/layout/layout.png differ diff --git a/ppstructure/layout/README.md b/ppstructure/layout/README.md index 84b977fdd760e6de43d355b802731b5d43eb2cf5..6830f8e82153f8ae7d2e798cda6782bc5518da4c 100644 --- a/ppstructure/layout/README.md +++ b/ppstructure/layout/README.md @@ -23,7 +23,7 @@ English | [简体中文](README_ch.md) ## 1. Introduction -Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight model PP-picodet of [PaddleDetection]( https://github.com/PaddlePaddle/PaddleDetection ) +Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight model PP-PicoDet of [PaddleDetection]( https://github.com/PaddlePaddle/PaddleDetection ), and includes English, Chinese, and table layout analysis models. The English layout analysis model can detect document layout elements such as text, title, table, figure, and list. The Chinese layout analysis model can detect document layout elements such as text, figure, figure caption, table, table caption, header, footer, reference, and equation. The table layout analysis model can detect table regions.
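The layout-only capability described in the README change above can be tried directly from the whl package. The sketch below follows the PP-Structure quick start and is illustrative only; the `--table=false --ocr=false` switches and the sample image path are assumptions taken from the quickstart docs and may need adjusting for your installation.

```bash
# Layout analysis only: skip table recognition and OCR (sketch, see quickstart for the authoritative usage)
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false

# Add --lang='en' to select the English layout model instead of the default Chinese one
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false --lang='en'
```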
@@ -152,7 +152,7 @@ We provide CDLA(Chinese layout analysis), TableBank(Table layout analysis)etc. d | [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For form detection (TRACKA) and form identification (TRACKB).Image types include historical data sets (beginning with cTDaR_t0, such as CTDAR_T00872.jpg) and modern data sets (beginning with cTDaR_t1, CTDAR_T10482.jpg). | | [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | Data sets constructed by manually annotating figures or pages from publicly available annual reports, containing 5 categories:table, figure, natural image, logo, and signature. | | [TableBank](https://github.com/doc-analysis/TableBank) | For table detection and recognition of large datasets, including Word and Latex document formats | -| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, including 10 categories:Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation | +| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, including 10 categories:Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation | | [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale dataset (500K document pages) constructed using weakly supervised methods for document layout analysis, containing 12 categories:Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title | @@ -175,7 +175,7 @@ If the test image is Chinese, the pre-trained model of Chinese CDLA dataset can ### 5.1. Train -Train: +Start training with the PaddleDetection [layout analysis configuration file](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis): * Modify Profile diff --git a/ppstructure/layout/README_ch.md b/ppstructure/layout/README_ch.md index 46d2ba74b2d5c579d4b25cf0cadac22ebc32e5b2..adef46d47389a50bf34500eee1aaf52ff5dfe449 100644 --- a/ppstructure/layout/README_ch.md +++ b/ppstructure/layout/README_ch.md @@ -22,7 +22,7 @@ ## 1. 简介 -版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发。 +版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发,包含英文、中文、表格版面分析3类模型。其中,英文模型支持Text、Title、Table、Figure、List 5类区域的检测,中文模型支持Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation 10类区域的检测,表格版面分析支持Table区域的检测,版面分析效果如下图所示:
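As a companion to the README change above, training is launched from the PaddleDetection repository with one of the configuration files in the linked layout_analysis folder. The command below is a minimal sketch; `picodet_lcnet_x1_0_layout.yml` is used as an example config name and should be replaced by the file matching your scenario.

```bash
# Run from the root of the PaddleDetection repository (example config name, single GPU)
export CUDA_VISIBLE_DEVICES=0
python3 tools/train.py \
    -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \
    --eval
```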
@@ -152,7 +152,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放, | ------------------------------------------------------------ | ------------------------------------------------------------ | | [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | 用于表格检测(TRACKA)和表格识别(TRACKB)。图片类型包含历史数据集(以cTDaR_t0开头,如cTDaR_t00872.jpg)和现代数据集(以cTDaR_t1开头,cTDaR_t10482.jpg)。 | | [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | 手动注释公开的年度报告中的图形或页面而构建的数据集,包含5类:table, figure, natural image, logo, and signature | -| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Table、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation | +| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation | | [TableBank](https://github.com/doc-analysis/TableBank) | 用于表格检测和识别大型数据集,包含Word和Latex2种文档格式 | | [DocBank](https://github.com/doc-analysis/DocBank) | 使用弱监督方法构建的大规模数据集(500K文档页面),用于文档布局分析,包含12类:Author、Caption、Date、Equation、Figure、Footer、List、Paragraph、Reference、Section、Table、Title | @@ -161,7 +161,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放, 提供了训练脚本、评估脚本和预测脚本,本节将以PubLayNet预训练模型为例进行讲解。 -如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过本部分。 +如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过5.1和5.2。 ``` mkdir pretrained_model wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_ @@ -176,7 +176,7 @@ ### 5.1. 启动训练 -开始训练: +使用PaddleDetection[版面分析配置文件](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)启动训练: * 修改配置文件 diff --git a/ppstructure/predict_system.py b/ppstructure/predict_system.py index 71147d3af8ec666d368234270dcb0d16aaf91938..6e68abd1b2a4e204a96ce91b8d7d291c378bda76 100644 --- a/ppstructure/predict_system.py +++ b/ppstructure/predict_system.py @@ -254,8 +254,7 @@ def main(args): if args.recovery and all_res != []: try: - convert_info_docx(img, all_res, save_folder, img_name, - args.save_pdf) + convert_info_docx(img, all_res, save_folder, img_name) except Exception as ex: logger.error("error in layout recovery image:{}, err msg: {}". format(image_file, ex)) diff --git a/ppstructure/recovery/README.md b/ppstructure/recovery/README.md index 011d6e12fda1b09c7a87367fb887a5c99a4ae00a..0e06c65475b67bcdfc119069fa6f6076322c0e99 100644 --- a/ppstructure/recovery/README.md +++ b/ppstructure/recovery/README.md @@ -82,8 +82,11 @@ Through layout analysis, we divided the image/PDF documents into regions, locate We can restore the test picture through the layout information, OCR detection and recognition structure, table information, and saved pictures. -The whl package is also provided for quick use, see [quickstart](../docs/quickstart_en.md) for details. +The whl package is also provided for quick use. Run the command below, and refer to [quickstart](../docs/quickstart_en.md) for more information.
+```bash +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en' +``` ### 3.1 Download models diff --git a/ppstructure/recovery/README_ch.md b/ppstructure/recovery/README_ch.md index fd2e649024ec88e2ea5c88536ccac2e259538886..bc8913adca3385a88cb2decc87fa9acffc707257 100644 --- a/ppstructure/recovery/README_ch.md +++ b/ppstructure/recovery/README_ch.md @@ -83,7 +83,16 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt 我们通过版面信息、OCR检测和识别结构、表格信息、保存的图片,对测试图片进行恢复即可。 -提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,详见 [quickstart](../docs/quickstart.md)。 +提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,代码如下,更多信息详见 [quickstart](../docs/quickstart.md)。 + +```bash +# 中文测试图 +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true +# 英文测试图 +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en' +# pdf测试文件 +paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en' +``` diff --git a/ppstructure/recovery/recovery_to_doc.py b/ppstructure/recovery/recovery_to_doc.py index 73b497d49d0961b253738eddad49c88c12c13601..1d8f8d9d4babca7410d6625dbeac4c41668f58a7 100644 --- a/ppstructure/recovery/recovery_to_doc.py +++ b/ppstructure/recovery/recovery_to_doc.py @@ -28,7 +28,7 @@ from ppocr.utils.logging import get_logger logger = get_logger() -def convert_info_docx(img, res, save_folder, img_name, save_pdf=False): +def convert_info_docx(img, res, save_folder, img_name): doc = Document() doc.styles['Normal'].font.name = 'Times New Roman' doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体') @@ -60,14 +60,9 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False): elif region['type'].lower() == 'title': doc.add_heading(region['res'][0]['text']) elif region['type'].lower() == 'table': - paragraph = doc.add_paragraph() - new_parser = HtmlToDocx() - new_parser.table_style = 'TableGrid' - table = new_parser.handle_table(html=region['res']['html']) - new_table = deepcopy(table) - new_table.alignment = WD_TABLE_ALIGNMENT.CENTER - paragraph.add_run().element.addnext(new_table._tbl) - + parser = HtmlToDocx() + parser.table_style = 'TableGrid' + parser.handle_table(region['res']['html'], doc) else: paragraph = doc.add_paragraph() paragraph_format = paragraph.paragraph_format @@ -82,13 +77,6 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False): doc.save(docx_path) logger.info('docx save to {}'.format(docx_path)) - # save to pdf - if save_pdf: - pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name)) - from docx2pdf import convert - convert(docx_path, pdf_path) - logger.info('pdf save to {}'.format(pdf_path)) - def sorted_layout_boxes(res, w): """ diff --git a/ppstructure/recovery/requirements.txt b/ppstructure/recovery/requirements.txt index 25e8cdbb0d58b0a243b176f563c66717d6f4c112..7ddc3391338e5a2a87f9cea9fca006dc03da58fb 100644 --- a/ppstructure/recovery/requirements.txt +++ b/ppstructure/recovery/requirements.txt @@ -1,4 +1,3 @@ python-docx -docx2pdf PyMuPDF beautifulsoup4 \ No newline at end of file diff --git a/ppstructure/recovery/table_process.py b/ppstructure/recovery/table_process.py index 243aaf8933791bf4704964d9665173fe70982f95..982e6b760f9291628d0514728dc8f684f183aa2c 100644 --- a/ppstructure/recovery/table_process.py +++ b/ppstructure/recovery/table_process.py @@ -1,4 +1,3 @@ - # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
# # Licensed under the Apache License, Version 2.0 (the "License"); @@ -13,62 +12,59 @@ # See the License for the specific language governing permissions and # limitations under the License. """ -This code is refer from:https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py - +This code is refer from: https://github.com/weizwx/html2docx/blob/master/htmldocx/h2d.py """ -import re, argparse -import io, os -import urllib.request -from urllib.parse import urlparse -from html.parser import HTMLParser -import docx, docx.table +import re +import docx from docx import Document -from docx.shared import RGBColor, Pt, Inches -from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH -from docx.oxml import OxmlElement -from docx.oxml.ns import qn - from bs4 import BeautifulSoup +from html.parser import HTMLParser -# values in inches -INDENT = 0.25 -LIST_INDENT = 0.5 -MAX_INDENT = 5.5 # To stop indents going off the page -# Style to use with tables. By default no style is used. -DEFAULT_TABLE_STYLE = None +def get_table_rows(table_soup): + table_row_selectors = [ + 'table > tr', 'table > thead > tr', 'table > tbody > tr', + 'table > tfoot > tr' + ] + # If there's a header, body, footer or direct child tr tags, add row dimensions from there + return table_soup.select(', '.join(table_row_selectors), recursive=False) -# Style to use with paragraphs. By default no style is used. -DEFAULT_PARAGRAPH_STYLE = None +def get_table_columns(row): + # Get all columns for the specified row tag. + return row.find_all(['th', 'td'], recursive=False) if row else [] -def get_filename_from_url(url): - return os.path.basename(urlparse(url).path) -def is_url(url): - """ - Not to be used for actually validating a url, but in our use case we only - care if it's a url or a file path, and they're pretty distinguishable - """ - parts = urlparse(url) - return all([parts.scheme, parts.netloc, parts.path]) +def get_table_dimensions(table_soup): + # Get rows for the table + rows = get_table_rows(table_soup) + # Table is either empty or has non-direct children between table and tr tags + # Thus the row dimensions and column dimensions are assumed to be 0 -def fetch_image(url): - """ - Attempts to fetch an image from a url. - If successful returns a bytes object, else returns None - :return: - """ - try: - with urllib.request.urlopen(url) as response: - # security flaw? - return io.BytesIO(response.read()) - except urllib.error.URLError: - return None + cols = get_table_columns(rows[0]) if rows else [] + # Add colspan calculation column number + col_count = 0 + for col in cols: + colspan = col.attrs.get('colspan', 1) + col_count += int(colspan) + + return rows, col_count + + +def get_cell_html(soup): + # Returns string of td element with opening and closing tags removed + # Cannot use find_all as it only finds element tags and does not find text which + # is not inside an element + return ' '.join([str(i) for i in soup.contents]) + + +def delete_paragraph(paragraph): + # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907 + p = paragraph._element + p.getparent().remove(p) + p._p = p._element = None -def remove_last_occurence(ls, x): - ls.pop(len(ls) - ls[::-1].index(x) - 1) def remove_whitespace(string, leading=False, trailing=False): """Remove white space from a string. @@ -122,11 +118,6 @@ def remove_whitespace(string, leading=False, trailing=False): # TODO need some way to get rid of extra spaces in e.g. 
text text return re.sub(r'\s+', ' ', string) -def delete_paragraph(paragraph): - # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907 - p = paragraph._element - p.getparent().remove(p) - p._p = p._element = None font_styles = { 'b': 'bold', @@ -145,13 +136,8 @@ font_names = { 'pre': 'Courier', } -styles = { - 'LIST_BULLET': 'List Bullet', - 'LIST_NUMBER': 'List Number', -} class HtmlToDocx(HTMLParser): - def __init__(self): super().__init__() self.options = { @@ -161,13 +147,11 @@ class HtmlToDocx(HTMLParser): 'styles': True, } self.table_row_selectors = [ - 'table > tr', - 'table > thead > tr', - 'table > tbody > tr', + 'table > tr', 'table > thead > tr', 'table > tbody > tr', 'table > tfoot > tr' ] - self.table_style = DEFAULT_TABLE_STYLE - self.paragraph_style = DEFAULT_PARAGRAPH_STYLE + self.table_style = None + self.paragraph_style = None def set_initial_attrs(self, document=None): self.tags = { @@ -178,9 +162,10 @@ class HtmlToDocx(HTMLParser): self.doc = document else: self.doc = Document() - self.bs = self.options['fix-html'] # whether or not to clean with BeautifulSoup + self.bs = self.options[ + 'fix-html'] # whether or not to clean with BeautifulSoup self.document = self.doc - self.include_tables = True #TODO add this option back in? + self.include_tables = True #TODO add this option back in? self.include_images = self.options['images'] self.include_styles = self.options['styles'] self.paragraph = None @@ -193,55 +178,52 @@ class HtmlToDocx(HTMLParser): self.table_style = other.table_style self.paragraph_style = other.paragraph_style - def get_cell_html(self, soup): - # Returns string of td element with opening and closing tags removed - # Cannot use find_all as it only finds element tags and does not find text which - # is not inside an element - return ' '.join([str(i) for i in soup.contents]) - - def add_styles_to_paragraph(self, style): - if 'text-align' in style: - align = style['text-align'] - if align == 'center': - self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER - elif align == 'right': - self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT - elif align == 'justify': - self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY - if 'margin-left' in style: - margin = style['margin-left'] - units = re.sub(r'[0-9]+', '', margin) - margin = int(float(re.sub(r'[a-z]+', '', margin))) - if units == 'px': - self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT)) - # TODO handle non px units - - def add_styles_to_run(self, style): - if 'color' in style: - if 'rgb' in style['color']: - color = re.sub(r'[a-z()]+', '', style['color']) - colors = [int(x) for x in color.split(',')] - elif '#' in style['color']: - color = style['color'].lstrip('#') - colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4)) - else: - colors = [0, 0, 0] - # TODO map colors to named colors (and extended colors...) - # For now set color to black to prevent crashing - self.run.font.color.rgb = RGBColor(*colors) - - if 'background-color' in style: - if 'rgb' in style['background-color']: - color = color = re.sub(r'[a-z()]+', '', style['background-color']) - colors = [int(x) for x in color.split(',')] - elif '#' in style['background-color']: - color = style['background-color'].lstrip('#') - colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4)) - else: - colors = [0, 0, 0] - # TODO map colors to named colors (and extended colors...) 
- # For now set color to black to prevent crashing - self.run.font.highlight_color = WD_COLOR.GRAY_25 #TODO: map colors + def ignore_nested_tables(self, tables_soup): + """ + Returns array containing only the highest level tables + Operates on the assumption that bs4 returns child elements immediately after + the parent element in `find_all`. If this changes in the future, this method will need to be updated + :return: + """ + new_tables = [] + nest = 0 + for table in tables_soup: + if nest: + nest -= 1 + continue + new_tables.append(table) + nest = len(table.find_all('table')) + return new_tables + + def get_tables(self): + if not hasattr(self, 'soup'): + self.include_tables = False + return + # find other way to do it, or require this dependency? + self.tables = self.ignore_nested_tables(self.soup.find_all('table')) + self.table_no = 0 + + def run_process(self, html): + if self.bs and BeautifulSoup: + self.soup = BeautifulSoup(html, 'html.parser') + html = str(self.soup) + if self.include_tables: + self.get_tables() + self.feed(html) + + def add_html_to_cell(self, html, cell): + if not isinstance(cell, docx.table._Cell): + raise ValueError('Second argument needs to be a %s' % + docx.table._Cell) + unwanted_paragraph = cell.paragraphs[0] + if unwanted_paragraph.text == "": + delete_paragraph(unwanted_paragraph) + self.set_initial_attrs(cell) + self.run_process(html) + # cells must end with a paragraph or will get message about corrupt file + # https://stackoverflow.com/a/29287121 + if not self.doc.paragraphs: + self.doc.add_paragraph('') def apply_paragraph_style(self, style=None): try: @@ -250,69 +232,10 @@ class HtmlToDocx(HTMLParser): elif self.paragraph_style: self.paragraph.style = self.paragraph_style except KeyError as e: - raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e - - def parse_dict_string(self, string, separator=';'): - new_string = string.replace(" ", '').split(separator) - string_dict = dict([x.split(':') for x in new_string if ':' in x]) - return string_dict - - def handle_li(self): - # check list stack to determine style and depth - list_depth = len(self.tags['list']) - if list_depth: - list_type = self.tags['list'][-1] - else: - list_type = 'ul' # assign unordered if no tag + raise ValueError( + f"Unable to apply style {self.paragraph_style}.") from e - if list_type == 'ol': - list_style = styles['LIST_NUMBER'] - else: - list_style = styles['LIST_BULLET'] - - self.paragraph = self.doc.add_paragraph(style=list_style) - self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT)) - self.paragraph.paragraph_format.line_spacing = 1 - - def add_image_to_cell(self, cell, image): - # python-docx doesn't have method yet for adding images to table cells. 
For now we use this - paragraph = cell.add_paragraph() - run = paragraph.add_run() - run.add_picture(image) - - def handle_img(self, current_attrs): - if not self.include_images: - self.skip = True - self.skip_tag = 'img' - return - src = current_attrs['src'] - # fetch image - src_is_url = is_url(src) - if src_is_url: - try: - image = fetch_image(src) - except urllib.error.URLError: - image = None - else: - image = src - # add image to doc - if image: - try: - if isinstance(self.doc, docx.document.Document): - self.doc.add_picture(image) - else: - self.add_image_to_cell(self.doc, image) - except FileNotFoundError: - image = None - if not image: - if src_is_url: - self.doc.add_paragraph("" % src) - else: - # avoid exposing filepaths in document - self.doc.add_paragraph("" % get_filename_from_url(src)) - - - def handle_table(self, html): + def handle_table(self, html, doc): """ To handle nested tables, we will parse tables manually as follows: Get table soup @@ -320,194 +243,42 @@ class HtmlToDocx(HTMLParser): Iterate over soup and fill docx table with new instances of this parser Tell HTMLParser to ignore any tags until the corresponding closing table tag """ - doc = Document() table_soup = BeautifulSoup(html, 'html.parser') - rows, cols_len = self.get_table_dimensions(table_soup) + rows, cols_len = get_table_dimensions(table_soup) table = doc.add_table(len(rows), cols_len) table.style = doc.styles['Table Grid'] + cell_row = 0 for index, row in enumerate(rows): - cols = self.get_table_columns(row) + cols = get_table_columns(row) cell_col = 0 for col in cols: colspan = int(col.attrs.get('colspan', 1)) rowspan = int(col.attrs.get('rowspan', 1)) - cell_html = self.get_cell_html(col) - + cell_html = get_cell_html(col) if col.name == 'th': cell_html = "%s" % cell_html + docx_cell = table.cell(cell_row, cell_col) + while docx_cell.text != '': # Skip the merged cell cell_col += 1 docx_cell = table.cell(cell_row, cell_col) - cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1) + cell_to_merge = table.cell(cell_row + rowspan - 1, + cell_col + colspan - 1) if docx_cell != cell_to_merge: docx_cell.merge(cell_to_merge) child_parser = HtmlToDocx() child_parser.copy_settings_from(self) - - child_parser.add_html_to_cell(cell_html or ' ', docx_cell) # occupy the position + child_parser.add_html_to_cell(cell_html or ' ', docx_cell) cell_col += colspan cell_row += 1 - - # skip all tags until corresponding closing tag - self.instances_to_skip = len(table_soup.find_all('table')) - self.skip_tag = 'table' - self.skip = True - self.table = None - return table - - def handle_link(self, href, text): - # Link requires a relationship - is_external = href.startswith('http') - rel_id = self.paragraph.part.relate_to( - href, - docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, - is_external=True # don't support anchor links for this library yet - ) - - # Create the w:hyperlink tag and add needed values - hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink') - hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id) - - - # Create sub-run - subrun = self.paragraph.add_run() - rPr = docx.oxml.shared.OxmlElement('w:rPr') - - # add default color - c = docx.oxml.shared.OxmlElement('w:color') - c.set(docx.oxml.shared.qn('w:val'), "0000EE") - rPr.append(c) - - # add underline - u = docx.oxml.shared.OxmlElement('w:u') - u.set(docx.oxml.shared.qn('w:val'), 'single') - rPr.append(u) - - subrun._r.append(rPr) - subrun._r.text = text - - # Add subrun to hyperlink - hyperlink.append(subrun._r) - - # Add hyperlink to 
run - self.paragraph._p.append(hyperlink) - - def handle_starttag(self, tag, attrs): - if self.skip: - return - if tag == 'head': - self.skip = True - self.skip_tag = tag - self.instances_to_skip = 0 - return - elif tag == 'body': - return - - current_attrs = dict(attrs) - - if tag == 'span': - self.tags['span'].append(current_attrs) - return - elif tag == 'ol' or tag == 'ul': - self.tags['list'].append(tag) - return # don't apply styles for now - elif tag == 'br': - self.run.add_break() - return - - self.tags[tag] = current_attrs - if tag in ['p', 'pre']: - self.paragraph = self.doc.add_paragraph() - self.apply_paragraph_style() - - elif tag == 'li': - self.handle_li() - - elif tag == "hr": - - # This implementation was taken from: - # https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373 - - self.paragraph = self.doc.add_paragraph() - pPr = self.paragraph._p.get_or_add_pPr() - pBdr = OxmlElement('w:pBdr') - pPr.insert_element_before(pBdr, - 'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap', - 'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN', - 'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind', - 'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc', - 'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap', - 'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr', - 'w:pPrChange' - ) - bottom = OxmlElement('w:bottom') - bottom.set(qn('w:val'), 'single') - bottom.set(qn('w:sz'), '6') - bottom.set(qn('w:space'), '1') - bottom.set(qn('w:color'), 'auto') - pBdr.append(bottom) - - elif re.match('h[1-9]', tag): - if isinstance(self.doc, docx.document.Document): - h_size = int(tag[1]) - self.paragraph = self.doc.add_heading(level=min(h_size, 9)) - else: - self.paragraph = self.doc.add_paragraph() - - elif tag == 'img': - self.handle_img(current_attrs) - return - - elif tag == 'table': - self.handle_table() - return - # set new run reference point in case of leading line breaks - if tag in ['p', 'li', 'pre']: - self.run = self.paragraph.add_run() - - # add style - if not self.include_styles: - return - if 'style' in current_attrs and self.paragraph: - style = self.parse_dict_string(current_attrs['style']) - self.add_styles_to_paragraph(style) - - def handle_endtag(self, tag): - if self.skip: - if not tag == self.skip_tag: - return - - if self.instances_to_skip > 0: - self.instances_to_skip -= 1 - return - - self.skip = False - self.skip_tag = None - self.paragraph = None - - if tag == 'span': - if self.tags['span']: - self.tags['span'].pop() - return - elif tag == 'ol' or tag == 'ul': - remove_last_occurence(self.tags['list'], tag) - return - elif tag == 'table': - self.table_no += 1 - self.table = None - self.doc = self.document - self.paragraph = None - - if tag in self.tags: - self.tags.pop(tag) - # maybe set relevant reference to None? + doc.save('1.docx') def handle_data(self, data): if self.skip: @@ -546,87 +317,3 @@ class HtmlToDocx(HTMLParser): if tag in font_names: font_name = font_names[tag] self.run.font.name = font_name - - def ignore_nested_tables(self, tables_soup): - """ - Returns array containing only the highest level tables - Operates on the assumption that bs4 returns child elements immediately after - the parent element in `find_all`. 
If this changes in the future, this method will need to be updated - :return: - """ - new_tables = [] - nest = 0 - for table in tables_soup: - if nest: - nest -= 1 - continue - new_tables.append(table) - nest = len(table.find_all('table')) - return new_tables - - def get_table_rows(self, table_soup): - # If there's a header, body, footer or direct child tr tags, add row dimensions from there - return table_soup.select(', '.join(self.table_row_selectors), recursive=False) - - def get_table_columns(self, row): - # Get all columns for the specified row tag. - return row.find_all(['th', 'td'], recursive=False) if row else [] - - def get_table_dimensions(self, table_soup): - # Get rows for the table - rows = self.get_table_rows(table_soup) - # Table is either empty or has non-direct children between table and tr tags - # Thus the row dimensions and column dimensions are assumed to be 0 - - cols = self.get_table_columns(rows[0]) if rows else [] - # Add colspan calculation column number - col_count = 0 - for col in cols: - colspan = col.attrs.get('colspan', 1) - col_count += int(colspan) - - # return len(rows), col_count - return rows, col_count - - def get_tables(self): - if not hasattr(self, 'soup'): - self.include_tables = False - return - # find other way to do it, or require this dependency? - self.tables = self.ignore_nested_tables(self.soup.find_all('table')) - self.table_no = 0 - - def run_process(self, html): - if self.bs and BeautifulSoup: - self.soup = BeautifulSoup(html, 'html.parser') - html = str(self.soup) - if self.include_tables: - self.get_tables() - self.feed(html) - - def add_html_to_document(self, html, document): - if not isinstance(html, str): - raise ValueError('First argument needs to be a %s' % str) - elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell): - raise ValueError('Second argument needs to be a %s' % docx.document.Document) - self.set_initial_attrs(document) - self.run_process(html) - - def add_html_to_cell(self, html, cell): - self.set_initial_attrs(cell) - self.run_process(html) - - def parse_html_file(self, filename_html, filename_docx=None): - with open(filename_html, 'r') as infile: - html = infile.read() - self.set_initial_attrs() - self.run_process(html) - if not filename_docx: - path, filename = os.path.split(filename_html) - filename_docx = '%s/new_docx_file_%s' % (path, filename) - self.doc.save('%s.docx' % filename_docx) - - def parse_html_string(self, html): - self.set_initial_attrs() - self.run_process(html) - return self.doc \ No newline at end of file diff --git a/ppstructure/utility.py b/ppstructure/utility.py index 97b6d6fec0d70fe3014b0b2105dbbef6a292e4d7..c0aa8091273ad0fd0f71d36843322ac289417059 100644 --- a/ppstructure/utility.py +++ b/ppstructure/utility.py @@ -90,11 +90,6 @@ def init_args(): type=str2bool, default=False, help='Whether to enable layout of recovery') - parser.add_argument( - "--save_pdf", - type=str2bool, - default=False, - help='Whether to save pdf file') return parser @@ -108,7 +103,38 @@ def draw_structure_result(image, result, font_path): if isinstance(image, np.ndarray): image = Image.fromarray(image) boxes, txts, scores = [], [], [] + + img_layout = image.copy() + draw_layout = ImageDraw.Draw(img_layout) + text_color = (255, 255, 255) + text_background_color = (80, 127, 255) + catid2color = {} + font_size = 15 + font = ImageFont.truetype(font_path, font_size, encoding="utf-8") + for region in result: + if region['type'] not in catid2color: + box_color = (random.randint(0, 
255), random.randint(0, 255), + random.randint(0, 255)) + catid2color[region['type']] = box_color + else: + box_color = catid2color[region['type']] + box_layout = region['bbox'] + draw_layout.rectangle( + [(box_layout[0], box_layout[1]), (box_layout[2], box_layout[3])], + outline=box_color, + width=3) + text_w, text_h = font.getsize(region['type']) + draw_layout.rectangle( + [(box_layout[0], box_layout[1]), + (box_layout[0] + text_w, box_layout[1] + text_h)], + fill=text_background_color) + draw_layout.text( + (box_layout[0], box_layout[1]), + region['type'], + fill=text_color, + font=font) + if region['type'] == 'table': pass else: @@ -116,6 +142,7 @@ def draw_structure_result(image, result, font_path): boxes.append(np.array(text_result['text_region'])) txts.append(text_result['text']) scores.append(text_result['confidence']) + im_show = draw_ocr_box_txt( - image, boxes, txts, scores, font_path=font_path, drop_score=0) + img_layout, boxes, txts, scores, font_path=font_path, drop_score=0) return im_show
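A quick way to exercise the modified recovery path (convert_info_docx without save_pdf, the new HtmlToDocx.handle_table signature, and the layout boxes now drawn in draw_structure_result) is an end-to-end run of predict_system.py. The command below is a sketch only; the inference model directories and dictionary paths are placeholders and should point to the models actually downloaded for your setup.

```bash
# Run from the ppstructure/ directory; model dirs and dict paths below are placeholders
python3 predict_system.py \
    --image_dir=./docs/table/1.png \
    --det_model_dir=inference/en_PP-OCRv3_det_infer \
    --rec_model_dir=inference/en_PP-OCRv3_rec_infer \
    --rec_char_dict_path=../ppocr/utils/en_dict.txt \
    --table_model_dir=inference/en_ppstructure_mobile_v2.0_SLANet_infer \
    --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \
    --layout_model_dir=inference/picodet_lcnet_x1_0_fgd_layout_infer \
    --layout_dict_path=../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt \
    --vis_font_path=../doc/fonts/simfang.ttf \
    --recovery=True \
    --output=../output/
```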