PaddlePaddle / PaddleOCR
Commit 83c2bc5d
Authored on Sep 29, 2021
Author: MrCuiHao
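After annotating multiple image folders with PPOCRLabel, this script aggregates them and splits them by ratio into training and validation sets for text detection and text recognition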
Parent 49382181
Showing 1 changed file with 113 additions and 0 deletions
gen_ocr_train_val.py
0 → 100644
# coding:utf8
import os
import shutil
import random
import argparse


# Delete the existing train/val split folder, then recreate it as an empty folder
def isCreateOrDeleteFolder(path, flag):
    flagPath = os.path.join(path, flag)

    if os.path.exists(flagPath):
        shutil.rmtree(flagPath)

    os.makedirs(flagPath)
    flagAbsPath = os.path.abspath(flagPath)
    return flagAbsPath


def splitTrainVal(root, dir, absTrainRootPath, absValRootPath, trainTxt, valTxt, flag):
    # Split the data into training and validation sets at the given ratio
    labelPath = os.path.join(root, dir)
    labelAbsPath = os.path.abspath(labelPath)

    if flag == "det":
        labelFilePath = os.path.join(labelAbsPath, args.detLabelFileName)
    elif flag == "rec":
        labelFilePath = os.path.join(labelAbsPath, args.recLabelFileName)

    # Read all label records; the with-block closes the file afterwards
    with open(labelFilePath, "r", encoding="UTF-8") as labelFileRead:
        labelFileContent = labelFileRead.readlines()
    random.shuffle(labelFileContent)
    labelRecordLen = len(labelFileContent)

    for index, labelRecordInfo in enumerate(labelFileContent):
        imageRelativePath = labelRecordInfo.split('\t')[0]
        imageLabel = labelRecordInfo.split('\t')[1]
        imageName = os.path.basename(imageRelativePath)

        if flag == "det":
            imagePath = os.path.join(labelAbsPath, imageName)
        elif flag == "rec":
            # Join the components portably instead of hard-coding a "\\" separator
            imagePath = os.path.join(labelAbsPath, args.recImageDirName, imageName)

        # Records whose position is below trainValRatio go to the training set,
        # the rest go to the validation set
        if index / labelRecordLen < args.trainValRatio:
            imageCopyPath = os.path.join(absTrainRootPath, imageName)
            shutil.copy(imagePath, imageCopyPath)
            trainTxt.write("{}\t{}".format(imageCopyPath, imageLabel))
        else:
            imageCopyPath = os.path.join(absValRootPath, imageName)
            shutil.copy(imagePath, imageCopyPath)
            valTxt.write("{}\t{}".format(imageCopyPath, imageLabel))


def genDetRecTrainVal(args):
    detAbsTrainRootPath = isCreateOrDeleteFolder(args.detRootPath, "train")
    detAbsValRootPath = isCreateOrDeleteFolder(args.detRootPath, "val")
    recAbsTrainRootPath = isCreateOrDeleteFolder(args.recRootPath, "train")
    recAbsValRootPath = isCreateOrDeleteFolder(args.recRootPath, "val")

    # Remove list files left over from a previous run. Missing files are skipped,
    # because os.remove raises FileNotFoundError when the file does not exist.
    for rootPath in (args.detRootPath, args.recRootPath):
        for txtName in ("train.txt", "val.txt"):
            txtPath = os.path.join(rootPath, txtName)
            if os.path.exists(txtPath):
                os.remove(txtPath)

    detTrainTxt = open(os.path.join(args.detRootPath, "train.txt"), "a", encoding="UTF-8")
    detValTxt = open(os.path.join(args.detRootPath, "val.txt"), "a", encoding="UTF-8")
    recTrainTxt = open(os.path.join(args.recRootPath, "train.txt"), "a", encoding="UTF-8")
    recValTxt = open(os.path.join(args.recRootPath, "val.txt"), "a", encoding="UTF-8")

    try:
        # Only the folders directly under labelRootPath are processed (note the break)
        for root, dirs, files in os.walk(args.labelRootPath):
            for dir in dirs:
                splitTrainVal(root, dir, detAbsTrainRootPath, detAbsValRootPath, detTrainTxt, detValTxt, "det")
                splitTrainVal(root, dir, recAbsTrainRootPath, recAbsValRootPath, recTrainTxt, recValTxt, "rec")
            break
    finally:
        # Close the list files so that all records are flushed to disk
        for txtFile in (detTrainTxt, detValTxt, recTrainTxt, recValTxt):
            txtFile.close()


if __name__ == "__main__":
    # Purpose: split the detection data and the recognition data into training and validation sets
    # Note: adjust the arguments to your own paths and needs. Image data is often annotated in batches by
    # several people, with each batch placed in its own folder and annotated with PPOCRLabel, so the
    # annotated folders need to be aggregated and then split into training and validation sets
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--trainValRatio",
        type=float,
        default=0.8,
        help="fraction of the samples assigned to the training set; the rest go to the validation set")
    parser.add_argument(
        "--labelRootPath",
        type=str,
        default="./train_data/label",
        help="path to the datasets annotated with PPOCRLabel, e.g. dataset folders named 1, 2, 3...")
    parser.add_argument(
        "--detRootPath",
        type=str,
        default="./train_data/det/demPanel",
        help="the path where the divided detection dataset is placed")
    parser.add_argument(
        "--recRootPath",
        type=str,
        default="./train_data/rec/demPanel",
        help="the path where the divided recognition dataset is placed")
    parser.add_argument(
        "--detLabelFileName",
        type=str,
        default="Label.txt",
        help="the name of the detection annotation file")
    parser.add_argument(
        "--recLabelFileName",
        type=str,
        default="rec_gt.txt",
        help="the name of the recognition annotation file")
    parser.add_argument(
        "--recImageDirName",
        type=str,
        default="crop_img",
        help="the name of the folder where the cropped recognition dataset is located")
    args = parser.parse_args()
    genDetRecTrainVal(args)
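
The split rule in `splitTrainVal` (shuffle the label records, then send every record with `index / total < trainValRatio` to the training set) can be tried out in isolation. The following is a minimal sketch and not part of the commit: `split_by_ratio` and its `seed` parameter are hypothetical names introduced here, and the sample records are made up for illustration. Seeding a local `random.Random` is an added assumption that makes the split reproducible; the script itself shuffles with the global, unseeded generator.

```python
import random


def split_by_ratio(records, ratio, seed=None):
    """Shuffle a copy of records and split it into (train, val).

    Hypothetical helper that mirrors the index / total < ratio rule.
    """
    shuffled = list(records)
    # A local generator keeps the global random state untouched
    random.Random(seed).shuffle(shuffled)
    total = len(shuffled)
    train, val = [], []
    for index, record in enumerate(shuffled):
        if index / total < ratio:
            train.append(record)
        else:
            val.append(record)
    return train, val


# Made-up records in the "path<TAB>label" shape that the script parses
records = ["img_{}.jpg\tlabel_{}\n".format(i, i) for i in range(10)]
train, val = split_by_ratio(records, 0.8, seed=0)
print(len(train), len(val))  # 8 2
```

With 10 records and a ratio of 0.8, indices 0 through 7 satisfy `index / 10 < 0.8`, so 8 records land in the training set and 2 in the validation set. Which records end up on each side depends on the shuffle.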