Unverified · Commit 03d88168 · Author: user1018 · Committer: GitHub

update code_doc (#7667)

* update code_doc

* update code_doc
Parent 4cec888b
ppstructure/docs/layout/layout.png (binary image replaced: 178.6 KB → 1.2 MB)
......@@ -23,7 +23,7 @@ English | [简体中文](README_ch.md)
## 1. Introduction
Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight model PP-picodet of [PaddleDetection]( https://github.com/PaddlePaddle/PaddleDetection )
Layout analysis divides a document image into regions and locates key areas such as text, titles, tables, and pictures. The layout analysis algorithm is based on PP-PicoDet, the lightweight model from [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection), and includes English, Chinese, and table layout analysis models. The English models detect text, title, table, figure, and list regions. The Chinese models detect text, title, figure, figure caption, table, table caption, header, footer, reference, and equation regions. The table layout analysis models detect table regions.
<div align="center">
<img src="../docs/layout/layout.png" width="800">
......@@ -152,7 +152,7 @@ We provide CDLA(Chinese layout analysis), TableBank(Table layout analysis)etc. d
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACKA) and table recognition (TRACKB). Image types include historical data sets (beginning with cTDaR_t0, such as cTDaR_t00872.jpg) and modern data sets (beginning with cTDaR_t1, such as cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | Data set built by manually annotating figures or pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature. |
| [TableBank](https://github.com/doc-analysis/TableBank) | Large data set for table detection and recognition, covering Word and LaTeX document formats |
| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, including 10 categories:Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, including 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale data set (500K document pages) built with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |
......@@ -175,7 +175,7 @@ If the test image is Chinese, the pre-trained model of Chinese CDLA dataset can
### 5.1. Train
Train:
Start training with the PaddleDetection [layout analysis configuration files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)
* Modify the configuration file
......
......@@ -22,7 +22,7 @@
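## 1. Introduction
Layout analysis divides a document image into regions and locates key areas such as text, titles, tables, and pictures. The layout analysis algorithm is developed on PP-PicoDet, the lightweight model from [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
Layout analysis divides a document image into regions and locates key areas such as text, titles, tables, and pictures. The layout analysis algorithm is developed on PP-PicoDet, the lightweight model from [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection), and includes three kinds of models: English, Chinese, and table layout analysis. The English model detects 5 region types: Text, Title, Table, Figure, and List. The Chinese model detects 10 region types: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, and Equation. The table layout analysis model detects Table regions. The layout analysis results are shown in the figure below: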
<div align="center">
<img src="../docs/layout/layout.png" width="800">
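......@@ -152,7 +152,7 @@ The JSON file contains the annotations for all images; the data is stored as nested dictionaries,
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACKA) and table recognition (TRACKB). Image types include historical data sets (beginning with cTDaR_t0, such as cTDaR_t00872.jpg) and modern data sets (beginning with cTDaR_t1, such as cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | Data set built by manually annotating figures or pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature |
| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, containing 10 categories: Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [CDLA](https://github.com/buptlihang/CDLA) | Chinese document layout analysis data set, for Chinese literature (paper) scenarios, containing 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [TableBank](https://github.com/doc-analysis/TableBank) | Large data set for table detection and recognition, containing 2 document formats: Word and LaTeX |
| [DocBank](https://github.com/doc-analysis/DocBank) | Large-scale data set (500K document pages) built with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |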
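......@@ -161,7 +161,7 @@ The JSON file contains the annotations for all images; the data is stored as nested dictionaries,
Training, evaluation, and prediction scripts are provided. This section uses the PubLayNet pre-trained model as an example.
If you do not want to train and prefer to go straight to the model evaluation, prediction, dynamic-to-static export, and inference steps below, you can download the provided pre-trained model (PubLayNet dataset) and skip this section
If you do not want to train and prefer to go straight to the model evaluation, prediction, dynamic-to-static export, and inference steps below, you can download the provided pre-trained model (PubLayNet dataset) and skip sections 5.1 and 5.2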
```
mkdir pretrained_model
......@@ -176,7 +176,7 @@ wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_
### 5.1. Start training
Start training:
Start training with the PaddleDetection [layout analysis configuration files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)
* Modify the configuration file
......
......@@ -254,8 +254,7 @@ def main(args):
if args.recovery and all_res != []:
try:
convert_info_docx(img, all_res, save_folder, img_name,
args.save_pdf)
convert_info_docx(img, all_res, save_folder, img_name)
except Exception as ex:
logger.error("error in layout recovery image:{}, err msg: {}".
format(image_file, ex))
......
......@@ -82,8 +82,11 @@ Through layout analysis, we divided the image/PDF documents into regions, locate
We can restore the test picture through the layout information, OCR detection and recognition structure, table information, and saved pictures.
The whl package is also provided for quick use, see [quickstart](../docs/quickstart_en.md) for details.
The whl package is also provided for quick use. Run the command below; for more information, see [quickstart](../docs/quickstart_en.md).
```bash
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
```
<a name="3.1"></a>
### 3.1 Download models
......
......@@ -83,7 +83,16 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt
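We can restore the test picture through the layout information, OCR detection and recognition structure, table information, and saved pictures.
The following code implements layout recovery. A whl package is also provided for quick use; see [quickstart](../docs/quickstart.md) for details
The following code implements layout recovery. A whl package is also provided for quick use, with the commands shown below; for more information, see [quickstart](../docs/quickstart.md)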
```bash
# Chinese test image
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true
# English test image
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
# PDF test file
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
```
<a name="3.1"></a>
......
......@@ -28,7 +28,7 @@ from ppocr.utils.logging import get_logger
logger = get_logger()
def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
def convert_info_docx(img, res, save_folder, img_name):
doc = Document()
doc.styles['Normal'].font.name = 'Times New Roman'
doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
......@@ -60,14 +60,9 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
elif region['type'].lower() == 'title':
doc.add_heading(region['res'][0]['text'])
elif region['type'].lower() == 'table':
paragraph = doc.add_paragraph()
new_parser = HtmlToDocx()
new_parser.table_style = 'TableGrid'
table = new_parser.handle_table(html=region['res']['html'])
new_table = deepcopy(table)
new_table.alignment = WD_TABLE_ALIGNMENT.CENTER
paragraph.add_run().element.addnext(new_table._tbl)
parser = HtmlToDocx()
parser.table_style = 'TableGrid'
parser.handle_table(region['res']['html'], doc)
else:
paragraph = doc.add_paragraph()
paragraph_format = paragraph.paragraph_format
......@@ -82,13 +77,6 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
doc.save(docx_path)
logger.info('docx save to {}'.format(docx_path))
# save to pdf
if save_pdf:
pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name))
from docx2pdf import convert
convert(docx_path, pdf_path)
logger.info('pdf save to {}'.format(pdf_path))
def sorted_layout_boxes(res, w):
"""
......
python-docx
docx2pdf
PyMuPDF
beautifulsoup4
\ No newline at end of file
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
......@@ -13,62 +12,59 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This code is refer from:https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py
This code is adapted from: https://github.com/weizwx/html2docx/blob/master/htmldocx/h2d.py
"""
import re, argparse
import io, os
import urllib.request
from urllib.parse import urlparse
from html.parser import HTMLParser
import docx, docx.table
import re
import docx
from docx import Document
from docx.shared import RGBColor, Pt, Inches
from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from bs4 import BeautifulSoup
from html.parser import HTMLParser
# values in inches
INDENT = 0.25
LIST_INDENT = 0.5
MAX_INDENT = 5.5 # To stop indents going off the page
# Style to use with tables. By default no style is used.
DEFAULT_TABLE_STYLE = None
def get_table_rows(table_soup):
table_row_selectors = [
'table > tr', 'table > thead > tr', 'table > tbody > tr',
'table > tfoot > tr'
]
# If there's a header, body, footer or direct child tr tags, add row dimensions from there
return table_soup.select(', '.join(table_row_selectors), recursive=False)
# Style to use with paragraphs. By default no style is used.
DEFAULT_PARAGRAPH_STYLE = None
def get_table_columns(row):
# Get all columns for the specified row tag.
return row.find_all(['th', 'td'], recursive=False) if row else []
def get_filename_from_url(url):
return os.path.basename(urlparse(url).path)
def is_url(url):
"""
Not to be used for actually validating a url, but in our use case we only
care if it's a url or a file path, and they're pretty distinguishable
"""
parts = urlparse(url)
return all([parts.scheme, parts.netloc, parts.path])
def get_table_dimensions(table_soup):
# Get rows for the table
rows = get_table_rows(table_soup)
# Table is either empty or has non-direct children between table and tr tags
# Thus the row dimensions and column dimensions are assumed to be 0
def fetch_image(url):
"""
Attempts to fetch an image from a url.
If successful returns a bytes object, else returns None
:return:
"""
try:
with urllib.request.urlopen(url) as response:
# security flaw?
return io.BytesIO(response.read())
except urllib.error.URLError:
return None
cols = get_table_columns(rows[0]) if rows else []
# Add colspan calculation column number
col_count = 0
for col in cols:
colspan = col.attrs.get('colspan', 1)
col_count += int(colspan)
return rows, col_count
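The column count computed by `get_table_dimensions` is the sum of the `colspan` values in the first row, not the number of `<td>`/`<th>` tags. The sketch below illustrates that rule with the standard-library `HTMLParser` instead of BeautifulSoup. The names `FirstRowSpanCounter` and `table_dimensions` are hypothetical, the sketch returns a row count rather than the list of row tags, and it assumes a flat table with no nested tables.

```python
from html.parser import HTMLParser


class FirstRowSpanCounter(HTMLParser):
    """Count rows and the colspan-weighted width of the first row (flat tables only)."""

    def __init__(self):
        super().__init__()
        self.row_count = 0
        self.col_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_count += 1
        elif tag in ("td", "th") and self.row_count == 1:
            # A cell with colspan="3" occupies three grid columns;
            # a missing colspan attribute counts as 1.
            self.col_count += int(dict(attrs).get("colspan") or 1)


def table_dimensions(html):
    """Return (number of rows, number of grid columns) for a flat HTML table."""
    counter = FirstRowSpanCounter()
    counter.feed(html)
    counter.close()
    return counter.row_count, counter.col_count


html = (
    "<table>"
    "<tr><th colspan='2'>Name</th><th>Score</th></tr>"
    "<tr><td>first</td><td>last</td><td>9</td></tr>"
    "</table>"
)
print(table_dimensions(html))  # (2, 3)
```

The first row has two header tags but spans three grid columns, which is why the docx table must be created with three columns.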
def get_cell_html(soup):
# Returns string of td element with opening and closing <td> tags removed
# Cannot use find_all as it only finds element tags and does not find text which
# is not inside an element
return ' '.join([str(i) for i in soup.contents])
def delete_paragraph(paragraph):
# https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
p = paragraph._element
p.getparent().remove(p)
p._p = p._element = None
def remove_last_occurence(ls, x):
ls.pop(len(ls) - ls[::-1].index(x) - 1)
def remove_whitespace(string, leading=False, trailing=False):
"""Remove white space from a string.
......@@ -122,11 +118,6 @@ def remove_whitespace(string, leading=False, trailing=False):
# TODO need some way to get rid of extra spaces in e.g. text <span> </span> text
return re.sub(r'\s+', ' ', string)
def delete_paragraph(paragraph):
# https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
p = paragraph._element
p.getparent().remove(p)
p._p = p._element = None
font_styles = {
'b': 'bold',
......@@ -145,13 +136,8 @@ font_names = {
'pre': 'Courier',
}
styles = {
'LIST_BULLET': 'List Bullet',
'LIST_NUMBER': 'List Number',
}
class HtmlToDocx(HTMLParser):
def __init__(self):
super().__init__()
self.options = {
......@@ -161,13 +147,11 @@ class HtmlToDocx(HTMLParser):
'styles': True,
}
self.table_row_selectors = [
'table > tr',
'table > thead > tr',
'table > tbody > tr',
'table > tr', 'table > thead > tr', 'table > tbody > tr',
'table > tfoot > tr'
]
self.table_style = DEFAULT_TABLE_STYLE
self.paragraph_style = DEFAULT_PARAGRAPH_STYLE
self.table_style = None
self.paragraph_style = None
def set_initial_attrs(self, document=None):
self.tags = {
......@@ -178,9 +162,10 @@ class HtmlToDocx(HTMLParser):
self.doc = document
else:
self.doc = Document()
self.bs = self.options['fix-html'] # whether or not to clean with BeautifulSoup
self.bs = self.options[
'fix-html'] # whether or not to clean with BeautifulSoup
self.document = self.doc
self.include_tables = True #TODO add this option back in?
self.include_tables = True #TODO add this option back in?
self.include_images = self.options['images']
self.include_styles = self.options['styles']
self.paragraph = None
......@@ -193,55 +178,52 @@ class HtmlToDocx(HTMLParser):
self.table_style = other.table_style
self.paragraph_style = other.paragraph_style
def get_cell_html(self, soup):
# Returns string of td element with opening and closing <td> tags removed
# Cannot use find_all as it only finds element tags and does not find text which
# is not inside an element
return ' '.join([str(i) for i in soup.contents])
def add_styles_to_paragraph(self, style):
if 'text-align' in style:
align = style['text-align']
if align == 'center':
self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
elif align == 'right':
self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
elif align == 'justify':
self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
if 'margin-left' in style:
margin = style['margin-left']
units = re.sub(r'[0-9]+', '', margin)
margin = int(float(re.sub(r'[a-z]+', '', margin)))
if units == 'px':
self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
# TODO handle non px units
def add_styles_to_run(self, style):
if 'color' in style:
if 'rgb' in style['color']:
color = re.sub(r'[a-z()]+', '', style['color'])
colors = [int(x) for x in color.split(',')]
elif '#' in style['color']:
color = style['color'].lstrip('#')
colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4))
else:
colors = [0, 0, 0]
# TODO map colors to named colors (and extended colors...)
# For now set color to black to prevent crashing
self.run.font.color.rgb = RGBColor(*colors)
if 'background-color' in style:
if 'rgb' in style['background-color']:
color = color = re.sub(r'[a-z()]+', '', style['background-color'])
colors = [int(x) for x in color.split(',')]
elif '#' in style['background-color']:
color = style['background-color'].lstrip('#')
colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4))
else:
colors = [0, 0, 0]
# TODO map colors to named colors (and extended colors...)
# For now set color to black to prevent crashing
self.run.font.highlight_color = WD_COLOR.GRAY_25 #TODO: map colors
def ignore_nested_tables(self, tables_soup):
"""
Returns array containing only the highest level tables
Operates on the assumption that bs4 returns child elements immediately after
the parent element in `find_all`. If this changes in the future, this method will need to be updated
:return:
"""
new_tables = []
nest = 0
for table in tables_soup:
if nest:
nest -= 1
continue
new_tables.append(table)
nest = len(table.find_all('table'))
return new_tables
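The skip-counter logic in `ignore_nested_tables` relies on tables arriving in document order, with nested tables directly after their parent. The sketch below replays that logic on plain tuples so it runs without bs4. The function name `top_level_only` and the `(name, descendant_table_count)` representation are assumptions made for illustration.

```python
def top_level_only(tables):
    """Keep only top-level tables.

    tables: list of (name, descendant_table_count) in document order,
    where nested tables directly follow their parent.
    """
    kept = []
    to_skip = 0
    for name, descendant_count in tables:
        if to_skip:
            # This entry is nested inside the last kept table.
            to_skip -= 1
            continue
        kept.append(name)
        to_skip = descendant_count
    return kept


tables = [("A", 2), ("A.1", 1), ("A.1.1", 0), ("B", 0), ("C", 1), ("C.1", 0)]
print(top_level_only(tables))  # ['A', 'B', 'C']
```

Because a kept table's count covers all of its descendants, the counts of skipped entries are never consulted.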
def get_tables(self):
if not hasattr(self, 'soup'):
self.include_tables = False
return
# find other way to do it, or require this dependency?
self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
self.table_no = 0
def run_process(self, html):
if self.bs and BeautifulSoup:
self.soup = BeautifulSoup(html, 'html.parser')
html = str(self.soup)
if self.include_tables:
self.get_tables()
self.feed(html)
def add_html_to_cell(self, html, cell):
if not isinstance(cell, docx.table._Cell):
raise ValueError('Second argument needs to be a %s' %
docx.table._Cell)
unwanted_paragraph = cell.paragraphs[0]
if unwanted_paragraph.text == "":
delete_paragraph(unwanted_paragraph)
self.set_initial_attrs(cell)
self.run_process(html)
# cells must end with a paragraph or will get message about corrupt file
# https://stackoverflow.com/a/29287121
if not self.doc.paragraphs:
self.doc.add_paragraph('')
def apply_paragraph_style(self, style=None):
try:
......@@ -250,69 +232,10 @@ class HtmlToDocx(HTMLParser):
elif self.paragraph_style:
self.paragraph.style = self.paragraph_style
except KeyError as e:
raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e
def parse_dict_string(self, string, separator=';'):
new_string = string.replace(" ", '').split(separator)
string_dict = dict([x.split(':') for x in new_string if ':' in x])
return string_dict
def handle_li(self):
# check list stack to determine style and depth
list_depth = len(self.tags['list'])
if list_depth:
list_type = self.tags['list'][-1]
else:
list_type = 'ul' # assign unordered if no tag
raise ValueError(
f"Unable to apply style {self.paragraph_style}.") from e
if list_type == 'ol':
list_style = styles['LIST_NUMBER']
else:
list_style = styles['LIST_BULLET']
self.paragraph = self.doc.add_paragraph(style=list_style)
self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT))
self.paragraph.paragraph_format.line_spacing = 1
def add_image_to_cell(self, cell, image):
# python-docx doesn't have method yet for adding images to table cells. For now we use this
paragraph = cell.add_paragraph()
run = paragraph.add_run()
run.add_picture(image)
def handle_img(self, current_attrs):
if not self.include_images:
self.skip = True
self.skip_tag = 'img'
return
src = current_attrs['src']
# fetch image
src_is_url = is_url(src)
if src_is_url:
try:
image = fetch_image(src)
except urllib.error.URLError:
image = None
else:
image = src
# add image to doc
if image:
try:
if isinstance(self.doc, docx.document.Document):
self.doc.add_picture(image)
else:
self.add_image_to_cell(self.doc, image)
except FileNotFoundError:
image = None
if not image:
if src_is_url:
self.doc.add_paragraph("<image: %s>" % src)
else:
# avoid exposing filepaths in document
self.doc.add_paragraph("<image: %s>" % get_filename_from_url(src))
def handle_table(self, html):
def handle_table(self, html, doc):
"""
To handle nested tables, we will parse tables manually as follows:
Get table soup
......@@ -320,194 +243,42 @@ class HtmlToDocx(HTMLParser):
Iterate over soup and fill docx table with new instances of this parser
Tell HTMLParser to ignore any tags until the corresponding closing table tag
"""
doc = Document()
table_soup = BeautifulSoup(html, 'html.parser')
rows, cols_len = self.get_table_dimensions(table_soup)
rows, cols_len = get_table_dimensions(table_soup)
table = doc.add_table(len(rows), cols_len)
table.style = doc.styles['Table Grid']
cell_row = 0
for index, row in enumerate(rows):
cols = self.get_table_columns(row)
cols = get_table_columns(row)
cell_col = 0
for col in cols:
colspan = int(col.attrs.get('colspan', 1))
rowspan = int(col.attrs.get('rowspan', 1))
cell_html = self.get_cell_html(col)
cell_html = get_cell_html(col)
if col.name == 'th':
cell_html = "<b>%s</b>" % cell_html
docx_cell = table.cell(cell_row, cell_col)
while docx_cell.text != '': # Skip the merged cell
cell_col += 1
docx_cell = table.cell(cell_row, cell_col)
cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
cell_to_merge = table.cell(cell_row + rowspan - 1,
cell_col + colspan - 1)
if docx_cell != cell_to_merge:
docx_cell.merge(cell_to_merge)
child_parser = HtmlToDocx()
child_parser.copy_settings_from(self)
child_parser.add_html_to_cell(cell_html or ' ', docx_cell) # occupy the position
child_parser.add_html_to_cell(cell_html or ' ', docx_cell)
cell_col += colspan
cell_row += 1
# skip all tags until corresponding closing tag
self.instances_to_skip = len(table_soup.find_all('table'))
self.skip_tag = 'table'
self.skip = True
self.table = None
return table
def handle_link(self, href, text):
# Link requires a relationship
is_external = href.startswith('http')
rel_id = self.paragraph.part.relate_to(
href,
docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK,
is_external=True # don't support anchor links for this library yet
)
# Create the w:hyperlink tag and add needed values
hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id)
# Create sub-run
subrun = self.paragraph.add_run()
rPr = docx.oxml.shared.OxmlElement('w:rPr')
# add default color
c = docx.oxml.shared.OxmlElement('w:color')
c.set(docx.oxml.shared.qn('w:val'), "0000EE")
rPr.append(c)
# add underline
u = docx.oxml.shared.OxmlElement('w:u')
u.set(docx.oxml.shared.qn('w:val'), 'single')
rPr.append(u)
subrun._r.append(rPr)
subrun._r.text = text
# Add subrun to hyperlink
hyperlink.append(subrun._r)
# Add hyperlink to run
self.paragraph._p.append(hyperlink)
def handle_starttag(self, tag, attrs):
if self.skip:
return
if tag == 'head':
self.skip = True
self.skip_tag = tag
self.instances_to_skip = 0
return
elif tag == 'body':
return
current_attrs = dict(attrs)
if tag == 'span':
self.tags['span'].append(current_attrs)
return
elif tag == 'ol' or tag == 'ul':
self.tags['list'].append(tag)
return # don't apply styles for now
elif tag == 'br':
self.run.add_break()
return
self.tags[tag] = current_attrs
if tag in ['p', 'pre']:
self.paragraph = self.doc.add_paragraph()
self.apply_paragraph_style()
elif tag == 'li':
self.handle_li()
elif tag == "hr":
# This implementation was taken from:
# https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373
self.paragraph = self.doc.add_paragraph()
pPr = self.paragraph._p.get_or_add_pPr()
pBdr = OxmlElement('w:pBdr')
pPr.insert_element_before(pBdr,
'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
'w:pPrChange'
)
bottom = OxmlElement('w:bottom')
bottom.set(qn('w:val'), 'single')
bottom.set(qn('w:sz'), '6')
bottom.set(qn('w:space'), '1')
bottom.set(qn('w:color'), 'auto')
pBdr.append(bottom)
elif re.match('h[1-9]', tag):
if isinstance(self.doc, docx.document.Document):
h_size = int(tag[1])
self.paragraph = self.doc.add_heading(level=min(h_size, 9))
else:
self.paragraph = self.doc.add_paragraph()
elif tag == 'img':
self.handle_img(current_attrs)
return
elif tag == 'table':
self.handle_table()
return
# set new run reference point in case of leading line breaks
if tag in ['p', 'li', 'pre']:
self.run = self.paragraph.add_run()
# add style
if not self.include_styles:
return
if 'style' in current_attrs and self.paragraph:
style = self.parse_dict_string(current_attrs['style'])
self.add_styles_to_paragraph(style)
def handle_endtag(self, tag):
if self.skip:
if not tag == self.skip_tag:
return
if self.instances_to_skip > 0:
self.instances_to_skip -= 1
return
self.skip = False
self.skip_tag = None
self.paragraph = None
if tag == 'span':
if self.tags['span']:
self.tags['span'].pop()
return
elif tag == 'ol' or tag == 'ul':
remove_last_occurence(self.tags['list'], tag)
return
elif tag == 'table':
self.table_no += 1
self.table = None
self.doc = self.document
self.paragraph = None
if tag in self.tags:
self.tags.pop(tag)
# maybe set relevant reference to None?
doc.save('1.docx')
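The cell-placement loop in `handle_table` can be hard to follow because a cell with `rowspan > 1` claims positions in later rows, which those rows must then skip. The sketch below reproduces the same walk on a plain 2D list instead of a python-docx table. The name `place_cells` and the `(text, rowspan, colspan)` tuples are hypothetical. It marks free slots with `None`, whereas the original detects occupied cells by non-empty text, and it assumes well-formed spans that fit inside the grid.

```python
def place_cells(rows, n_cols):
    """Lay out cells on a grid, honoring rowspan and colspan.

    rows: list of rows; each row is a list of (text, rowspan, colspan).
    """
    grid = [[None] * n_cols for _ in rows]
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while grid[r][c] is not None:
                c += 1
            # Claim the whole rectangle, as merging the two corner cells does.
            for rr in range(r, r + rowspan):
                for cc in range(c, c + colspan):
                    grid[rr][cc] = text
            c += colspan
    return grid


rows = [
    [("A", 2, 1), ("B", 1, 2)],
    [("C", 1, 1), ("D", 1, 1)],
]
for line in place_cells(rows, 3):
    print(line)
# ['A', 'B', 'B']
# ['A', 'C', 'D']
```

In the second row, "C" lands in column 1 because column 0 is still covered by the two-row cell "A".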
def handle_data(self, data):
if self.skip:
......@@ -546,87 +317,3 @@ class HtmlToDocx(HTMLParser):
if tag in font_names:
font_name = font_names[tag]
self.run.font.name = font_name
def ignore_nested_tables(self, tables_soup):
"""
Returns array containing only the highest level tables
Operates on the assumption that bs4 returns child elements immediately after
the parent element in `find_all`. If this changes in the future, this method will need to be updated
:return:
"""
new_tables = []
nest = 0
for table in tables_soup:
if nest:
nest -= 1
continue
new_tables.append(table)
nest = len(table.find_all('table'))
return new_tables
def get_table_rows(self, table_soup):
# If there's a header, body, footer or direct child tr tags, add row dimensions from there
return table_soup.select(', '.join(self.table_row_selectors), recursive=False)
def get_table_columns(self, row):
# Get all columns for the specified row tag.
return row.find_all(['th', 'td'], recursive=False) if row else []
def get_table_dimensions(self, table_soup):
# Get rows for the table
rows = self.get_table_rows(table_soup)
# Table is either empty or has non-direct children between table and tr tags
# Thus the row dimensions and column dimensions are assumed to be 0
cols = self.get_table_columns(rows[0]) if rows else []
# Add colspan calculation column number
col_count = 0
for col in cols:
colspan = col.attrs.get('colspan', 1)
col_count += int(colspan)
# return len(rows), col_count
return rows, col_count
def get_tables(self):
if not hasattr(self, 'soup'):
self.include_tables = False
return
# find other way to do it, or require this dependency?
self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
self.table_no = 0
def run_process(self, html):
if self.bs and BeautifulSoup:
self.soup = BeautifulSoup(html, 'html.parser')
html = str(self.soup)
if self.include_tables:
self.get_tables()
self.feed(html)
def add_html_to_document(self, html, document):
if not isinstance(html, str):
raise ValueError('First argument needs to be a %s' % str)
elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell):
raise ValueError('Second argument needs to be a %s' % docx.document.Document)
self.set_initial_attrs(document)
self.run_process(html)
def add_html_to_cell(self, html, cell):
self.set_initial_attrs(cell)
self.run_process(html)
def parse_html_file(self, filename_html, filename_docx=None):
with open(filename_html, 'r') as infile:
html = infile.read()
self.set_initial_attrs()
self.run_process(html)
if not filename_docx:
path, filename = os.path.split(filename_html)
filename_docx = '%s/new_docx_file_%s' % (path, filename)
self.doc.save('%s.docx' % filename_docx)
def parse_html_string(self, html):
self.set_initial_attrs()
self.run_process(html)
return self.doc
\ No newline at end of file
......@@ -90,11 +90,6 @@ def init_args():
type=str2bool,
default=False,
help='Whether to enable layout of recovery')
parser.add_argument(
"--save_pdf",
type=str2bool,
default=False,
help='Whether to save pdf file')
return parser
......@@ -108,7 +103,38 @@ def draw_structure_result(image, result, font_path):
if isinstance(image, np.ndarray):
image = Image.fromarray(image)
boxes, txts, scores = [], [], []
img_layout = image.copy()
draw_layout = ImageDraw.Draw(img_layout)
text_color = (255, 255, 255)
text_background_color = (80, 127, 255)
catid2color = {}
font_size = 15
font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
for region in result:
if region['type'] not in catid2color:
box_color = (random.randint(0, 255), random.randint(0, 255),
random.randint(0, 255))
catid2color[region['type']] = box_color
else:
box_color = catid2color[region['type']]
box_layout = region['bbox']
draw_layout.rectangle(
[(box_layout[0], box_layout[1]), (box_layout[2], box_layout[3])],
outline=box_color,
width=3)
text_w, text_h = font.getsize(region['type'])
draw_layout.rectangle(
[(box_layout[0], box_layout[1]),
(box_layout[0] + text_w, box_layout[1] + text_h)],
fill=text_background_color)
draw_layout.text(
(box_layout[0], box_layout[1]),
region['type'],
fill=text_color,
font=font)
if region['type'] == 'table':
pass
else:
......@@ -116,6 +142,7 @@ def draw_structure_result(image, result, font_path):
boxes.append(np.array(text_result['text_region']))
txts.append(text_result['text'])
scores.append(text_result['confidence'])
im_show = draw_ocr_box_txt(
image, boxes, txts, scores, font_path=font_path, drop_score=0)
img_layout, boxes, txts, scores, font_path=font_path, drop_score=0)
return im_show
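The added drawing code gives each region type a random color on first sight and reuses it afterwards, so all boxes of one type share a color within an image. A minimal sketch of that caching pattern, without PIL, is shown below. The helper name `color_for` and the `palette` argument are hypothetical and stand in for the `catid2color` dictionary.

```python
import random


def color_for(category, palette, rng=random):
    """Return a stable RGB color for a category, creating one on first use."""
    if category not in palette:
        palette[category] = tuple(rng.randint(0, 255) for _ in range(3))
    return palette[category]


palette = {}
regions = [{"type": "text"}, {"type": "table"}, {"type": "text"}]
colors = [color_for(region["type"], palette) for region in regions]

# Both "text" regions share one color; two distinct types were registered.
print(colors[0] == colors[2], len(palette))  # True 2
```

Colors are random per call to the drawing function, so the same type may get a different color in another image unless the random generator is seeded.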