提交 d9549ce6 编写于 作者: – MrCuiHao

数据集划分

上级 006d84bf
...@@ -204,6 +204,24 @@ For some data that are difficult to recognize, the recognition results will not ...@@ -204,6 +204,24 @@ For some data that are difficult to recognize, the recognition results will not
pip install opencv-contrib-python-headless==4.2.0.32 pip install opencv-contrib-python-headless==4.2.0.32
``` ```
### Dataset division
- Enter the following command in the terminal to execute the dataset division script:
```
cd ./PPOCRLabel # Change the directory to the PPOCRLabel folder
python gen_ocr_train_val_test.py --trainValTestRatio 6:2:2 --labelRootPath ../train_data/label --detRootPath ../train_data/det --recRootPath ../train_data/rec
```
- Parameter Description:
trainValTestRatio is the division ratio of the number of images in the training set, validation set, and test set, set according to your actual situation, the default is 6:2:2
labelRootPath is the storage path of the dataset labeled by PPOCRLabel, the default is ../train_data/label
detRootPath is the path where the text detection dataset is divided according to the dataset marked by PPOCRLabel. The default is ../train_data/det
recRootPath is the path where the character recognition dataset is divided according to the dataset marked by PPOCRLabel. The default is ../train_data/rec
### Related ### Related
1.[Tzutalin. LabelImg. Git code (2015)](https://github.com/tzutalin/labelImg) 1.[Tzutalin. LabelImg. Git code (2015)](https://github.com/tzutalin/labelImg)
\ No newline at end of file
...@@ -193,7 +193,22 @@ PPOCRLabel支持三种导出方式: ...@@ -193,7 +193,22 @@ PPOCRLabel支持三种导出方式:
``` ```
pip install opencv-contrib-python-headless==4.2.0.32 pip install opencv-contrib-python-headless==4.2.0.32
``` ```
### 数据集划分
- 在终端中输入以下命令执行数据集划分脚本:
```
cd ./PPOCRLabel # 将目录切换到PPOCRLabel文件夹下
python gen_ocr_train_val_test.py --trainValTestRatio 6:2:2 --labelRootPath ../train_data/label --detRootPath ../train_data/det --recRootPath ../train_data/rec
```
- 参数说明:
trainValTestRatio是训练集、验证集、测试集的图像数量划分比例,根据你的实际情况设定,默认是6:2:2
labelRootPath是PPOCRLabel标注的数据集存放路径,默认是../train_data/label
detRootPath是根据PPOCRLabel标注的数据集划分后的文本检测数据集存放的路径,默认是../train_data/det
recRootPath是根据PPOCRLabel标注的数据集划分后的字符识别数据集存放的路径,默认是../train_data/rec
### 参考资料 ### 参考资料
1.[Tzutalin. LabelImg. Git code (2015)](https://github.com/tzutalin/labelImg) 1.[Tzutalin. LabelImg. Git code (2015)](https://github.com/tzutalin/labelImg)
...@@ -5,45 +5,61 @@ import random ...@@ -5,45 +5,61 @@ import random
import argparse import argparse
# 删除划分的训练集和验证集文件夹,重新创建一个空的文件夹 # 删除划分的训练集、验证集、测试集文件夹,重新创建一个空的文件夹
def isCreateOrDeleteFolder(path, flag): def isCreateOrDeleteFolder(path, flag):
flagPath = os.path.join(path, flag) flagPath = os.path.join(path, flag)
if os.path.exists(flagPath): if os.path.exists(flagPath):
shutil.rmtree(flagPath) shutil.rmtree(flagPath)
os.makedirs(flagPath) os.makedirs(flagPath)
flagAbsPath = os.path.abspath(flagPath) flagAbsPath = os.path.abspath(flagPath)
return flagAbsPath return flagAbsPath
def splitTrainVal(root, dir, absTrainRootPath, absValRootPath, trainTxt, valTxt, flag): def splitTrainVal(root, dir, absTrainRootPath, absValRootPath, absTestRootPath, trainTxt, valTxt, testTxt, flag):
# 按照指定的比例划分训练集和验证 # 按照指定的比例划分训练集、验证集、测试
labelPath = os.path.join(root, dir) labelPath = os.path.join(root, dir)
labelAbsPath = os.path.abspath(labelPath) labelAbsPath = os.path.abspath(labelPath)
if flag == "det": if flag == "det":
labelFilePath = os.path.join(labelAbsPath, args.detLabelFileName) labelFilePath = os.path.join(labelAbsPath, args.detLabelFileName)
elif flag == "rec": elif flag == "rec":
labelFilePath = os.path.join(labelAbsPath, args.recLabelFileName) labelFilePath = os.path.join(labelAbsPath, args.recLabelFileName)
labelFileRead = open(labelFilePath, "r", encoding="UTF-8") labelFileRead = open(labelFilePath, "r", encoding="UTF-8")
labelFileContent = labelFileRead.readlines() labelFileContent = labelFileRead.readlines()
random.shuffle(labelFileContent) random.shuffle(labelFileContent)
labelRecordLen = len(labelFileContent) labelRecordLen = len(labelFileContent)
for index, labelRecordInfo in enumerate(labelFileContent): for index, labelRecordInfo in enumerate(labelFileContent):
imageRelativePath = labelRecordInfo.split('\t')[0] imageRelativePath = labelRecordInfo.split('\t')[0]
imageLabel = labelRecordInfo.split('\t')[1] imageLabel = labelRecordInfo.split('\t')[1]
imageName = os.path.basename(imageRelativePath) imageName = os.path.basename(imageRelativePath)
if flag == "det": if flag == "det":
imagePath = os.path.join(labelAbsPath, imageName) imagePath = os.path.join(labelAbsPath, imageName)
elif flag == "rec": elif flag == "rec":
imagePath = os.path.join(labelAbsPath, "{}\\{}".format(args.recImageDirName, imageName)) imagePath = os.path.join(labelAbsPath, "{}\\{}".format(args.recImageDirName, imageName))
# 小于划分比例trainValRatio时,数据集划分到训练集,否则测试集
if index / labelRecordLen < args.trainValRatio: # 按预设的比例划分训练集、验证集、测试集
trainValTestRatio = args.trainValTestRatio.split(":")
trainRatio = eval(trainValTestRatio[0]) / 10
valRatio = trainRatio + eval(trainValTestRatio[1]) / 10
curRatio = index / labelRecordLen
if curRatio < trainRatio:
imageCopyPath = os.path.join(absTrainRootPath, imageName) imageCopyPath = os.path.join(absTrainRootPath, imageName)
shutil.copy(imagePath, imageCopyPath) shutil.copy(imagePath, imageCopyPath)
trainTxt.write("{}\t{}".format(imageCopyPath, imageLabel)) trainTxt.write("{}\t{}".format(imageCopyPath, imageLabel))
else: elif curRatio >= trainRatio and curRatio < valRatio:
imageCopyPath = os.path.join(absValRootPath, imageName) imageCopyPath = os.path.join(absValRootPath, imageName)
shutil.copy(imagePath, imageCopyPath) shutil.copy(imagePath, imageCopyPath)
valTxt.write("{}\t{}".format(imageCopyPath, imageLabel)) valTxt.write("{}\t{}".format(imageCopyPath, imageLabel))
else:
imageCopyPath = os.path.join(absTestRootPath, imageName)
shutil.copy(imagePath, imageCopyPath)
testTxt.write("{}\t{}".format(imageCopyPath, imageLabel))
# 删掉存在的文件 # 删掉存在的文件
...@@ -55,48 +71,59 @@ def removeFile(path): ...@@ -55,48 +71,59 @@ def removeFile(path):
def genDetRecTrainVal(args): def genDetRecTrainVal(args):
detAbsTrainRootPath = isCreateOrDeleteFolder(args.detRootPath, "train") detAbsTrainRootPath = isCreateOrDeleteFolder(args.detRootPath, "train")
detAbsValRootPath = isCreateOrDeleteFolder(args.detRootPath, "val") detAbsValRootPath = isCreateOrDeleteFolder(args.detRootPath, "val")
detAbsTestRootPath = isCreateOrDeleteFolder(args.detRootPath, "test")
recAbsTrainRootPath = isCreateOrDeleteFolder(args.recRootPath, "train") recAbsTrainRootPath = isCreateOrDeleteFolder(args.recRootPath, "train")
recAbsValRootPath = isCreateOrDeleteFolder(args.recRootPath, "val") recAbsValRootPath = isCreateOrDeleteFolder(args.recRootPath, "val")
recAbsTestRootPath = isCreateOrDeleteFolder(args.recRootPath, "test")
removeFile(os.path.join(args.detRootPath, "train.txt")) removeFile(os.path.join(args.detRootPath, "train.txt"))
removeFile(os.path.join(args.detRootPath, "val.txt")) removeFile(os.path.join(args.detRootPath, "val.txt"))
removeFile(os.path.join(args.detRootPath, "test.txt"))
removeFile(os.path.join(args.recRootPath, "train.txt")) removeFile(os.path.join(args.recRootPath, "train.txt"))
removeFile(os.path.join(args.recRootPath, "val.txt")) removeFile(os.path.join(args.recRootPath, "val.txt"))
removeFile(os.path.join(args.recRootPath, "test.txt"))
detTrainTxt = open(os.path.join(args.detRootPath, "train.txt"), "a", encoding="UTF-8") detTrainTxt = open(os.path.join(args.detRootPath, "train.txt"), "a", encoding="UTF-8")
detValTxt = open(os.path.join(args.detRootPath, "val.txt"), "a", encoding="UTF-8") detValTxt = open(os.path.join(args.detRootPath, "val.txt"), "a", encoding="UTF-8")
detTestTxt = open(os.path.join(args.detRootPath, "test.txt"), "a", encoding="UTF-8")
recTrainTxt = open(os.path.join(args.recRootPath, "train.txt"), "a", encoding="UTF-8") recTrainTxt = open(os.path.join(args.recRootPath, "train.txt"), "a", encoding="UTF-8")
recValTxt = open(os.path.join(args.recRootPath, "val.txt"), "a", encoding="UTF-8") recValTxt = open(os.path.join(args.recRootPath, "val.txt"), "a", encoding="UTF-8")
recTestTxt = open(os.path.join(args.recRootPath, "test.txt"), "a", encoding="UTF-8")
for root, dirs, files in os.walk(args.labelRootPath): for root, dirs, files in os.walk(args.labelRootPath):
for dir in dirs: for dir in dirs:
splitTrainVal(root, dir, detAbsTrainRootPath, detAbsValRootPath, detTrainTxt, detValTxt, "det") splitTrainVal(root, dir, detAbsTrainRootPath, detAbsValRootPath, detAbsTestRootPath, detTrainTxt, detValTxt,
splitTrainVal(root, dir, recAbsTrainRootPath, recAbsValRootPath, recTrainTxt, recValTxt, "rec") detTestTxt, "det")
splitTrainVal(root, dir, recAbsTrainRootPath, recAbsValRootPath, recAbsTestRootPath, recTrainTxt, recValTxt,
recTestTxt, "rec")
break break
if __name__ == "__main__": if __name__ == "__main__":
# 功能描述:分别划分检测和识别的训练集和验证 # 功能描述:分别划分检测和识别的训练集、验证集、测试
# 说明:可以根据自己的路径和需求调整参数,图像数据往往多人合作分批标注,每一批图像数据放在一个文件夹内用PPOCRLabel进行标注, # 说明:可以根据自己的路径和需求调整参数,图像数据往往多人合作分批标注,每一批图像数据放在一个文件夹内用PPOCRLabel进行标注,
# 如此会有多个标注好的图像文件夹汇总并划分训练集和验证集的需求 # 如此会有多个标注好的图像文件夹汇总并划分训练集、验证集、测试集的需求
parser = argparse.ArgumentParser() parser = argparse.ArgumentParser()
parser.add_argument( parser.add_argument(
"--trainValRatio", "--trainValTestRatio",
type=float, type=str,
default=0.8, default="6:2:2",
help="ratio of training set to validation set") help="ratio of trainset:valset:testset")
parser.add_argument( parser.add_argument(
"--labelRootPath", "--labelRootPath",
type=str, type=str,
default="./train_data/label", default="../train_data/label",
help="path to the dataset marked by ppocrlabel, E.g, dataset folder named 1,2,3..." help="path to the dataset marked by ppocrlabel, E.g, dataset folder named 1,2,3..."
) )
parser.add_argument( parser.add_argument(
"--detRootPath", "--detRootPath",
type=str, type=str,
default="./train_data/det", default="../train_data/det",
help="the path where the divided detection dataset is placed") help="the path where the divided detection dataset is placed")
parser.add_argument( parser.add_argument(
"--recRootPath", "--recRootPath",
type=str, type=str,
default="./train_data/rec", default="../train_data/rec",
help="the path where the divided recognition dataset is placed" help="the path where the divided recognition dataset is placed"
) )
parser.add_argument( parser.add_argument(
......
python gen_ocr_train_val.py --trainValRatio 0.8 --labelRootPath ./train_data/label --detRootPath ./train_data/det --recRootPath ./train_data/rec
\ No newline at end of file
1、功能描述:分别划分检测和识别的训练集和验证集
2、说明:可以根据自己的路径和需求调整参数,图像数据往往多人合作分批标注,每一批图像数据放在一个文件夹内用PPOCRLabel进行标注,如此会有多个标注好的图像文件夹汇总并划分训练集和验证集的需求。
3、使用方法:
3.1 首先使用PPOCRLabel标注好图像,一般是分批次标注,多个标注好的图像文件夹存放在train_data目录下的label文件夹里,文件夹没有自己创建,label同级路径下创建det文件夹存放划分好的文本检测数据集,label同级路径下创建rec文件夹存放划分好的字符识别数据集,目录结构如下图所示:
![20211008154929](20211008154929.png)
![20211008155029](20211008155029.png)
3.2 gen_ocr_train_val.py参数说明
trainValRatio 训练集和验证集的图像数量划分比例,根据你的实际情况设定,默认是0.8
labelRootPath PPOCRLabel标注的数据集存放路径,默认是./train_data/label
detRootPath 根据PPOCRLabel标注的数据集划分后的文本检测数据集存放的路径
recRootPath 根据PPOCRLabel标注的数据集划分后的字符识别数据集存放的路径
detLabelFileName 使用PPOCRLabel标注图像时,人工确认过的标注结果会存放在Label.txt内
recLabelFileName 使用PPOCRLabel标注图像时,点击导出识别结果后,会对人工确认过的字符标注结果进行字符裁剪,生成裁剪后的字符图像路径以及字符图像对应的字符标签保存到rec_gt.txt中
recImageDirName 使用PPOCRLabel标注图像时,点击导出识别结果后,会把裁剪后的字符图像保存到crop_img文件夹内
3.3 执行gen_ocr_train_val.py方法
如果目录结构和文件夹名称是严格按照以上说明创建的,可以直接在windows环境下执行gen_ocr_train_val.bat,在linux环境下需要执行gen_ocr_train_val.sh,默认划分比例是0.8
也可以在终端中输入以下命令执行:
python gen_ocr_train_val.py --trainValRatio 0.8 --labelRootPath ./train_data/label --detRootPath ./train_data/det --recRootPath ./train_data/rec
如果想创建自己的目录结构和文件夹名称,需要手动修改命令里的路径
#!/bin/bash
python gen_ocr_train_val.py --trainValRatio 0.8 --labelRootPath ./train_data/label --detRootPath ./train_data/det --recRootPath ./train_data/rec
\ No newline at end of file
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册