Unverified commit 15388cb6, authored by ruri, committed by GitHub

polish code and add multi-card validate and infer in image classification (#4042)

Parent 775741f3
......@@ -14,6 +14,7 @@
- [Advanced Usage](#进阶使用)
- [Mixup Training](#mixup训练)
- [Mixed-Precision Training](#混合精度训练)
- [Profiling](#性能分析)
- [DALI Preprocessing](#DALI预处理)
- [Custom Dataset](#自定义数据集)
- [Released Models and Their Performance](#已发布模型及其性能)
......@@ -129,7 +130,7 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \
* **model**: model name. Default: "ResNet50"
* **total_images**: total number of images (ImageNet 2012). Default: 1281167
* **class_dim**: number of classes. Default: 1000
* **image_shape**: image shape. Default: 3 224 224
* **image_shape**: image shape. Default: [3,224,224]
* **num_epochs**: number of training epochs. Default: 120
* **batch_size**: batch size (across all devices). Default: 8
* **test_batch_size**: test batch size. Default: 16
......@@ -159,25 +160,37 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m paddle.distributed.launch train.py \
Switches:
* **validate**: whether to run validation during training. Default: True
* **use_gpu**: whether to run on GPU. Default: True
* **use_label_smoothing**: whether to apply label smoothing to the labels. Default: False
* **label_smoothing_epsilon**: epsilon used for label smoothing. Default: 0.1
* **random_seed**: random seed. Default: 1000
* **padding_type**: padding type for convolutions in EfficientNet. Default: "SAME"
* **use_se**: whether to use the Squeeze-and-Excitation module in EfficientNet. Default: True
* **use_ema**: whether to apply ExponentialMovingAverage when updating model parameters. Default: False
* **ema_decay**: decay rate of ExponentialMovingAverage. Default: 0.9999
Profiling:
* **enable_ce**: whether to enable CE. Default: False
* **random_seed**: random seed; once set, all randomness is fixed. Default: None
* **is_profiler**: whether to enable the profiler. Default: False
* **profiler_path**: directory for saving profiler output. Default: './profilier_files'
* **max_iter**: maximum number of training batches. Default: 0
* **same_feed**: whether to feed the same data into the network; set a number to specify how many samples are repeated. Default: 0
**Data reader notes:** The data reader is defined in ```reader.py```; the default reader is now based on cv2. In the [training](#模型训练) stage, the default augmentations are random crop and horizontal flip, while [evaluation](#模型评估) and [inference](#模型预测) use center crop by default (a minimal sketch of the two default training augmentations follows the list below). The currently supported augmentations are:
* rotation
* color jitter (not yet implemented)
* color jitter
* random crop
* center crop
* resize
* horizontal flip
* AutoAugment
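The following is a minimal, illustrative sketch (not the actual implementation in ```reader.py```) of how a cv2-based reader might apply the two default training augmentations, random crop and horizontal flip:

```python
import random

import cv2
import numpy as np


def random_crop_and_flip(img, crop_size=224):
    """Illustrative augmentation: random crop plus random horizontal flip on an HWC image."""
    h, w = img.shape[:2]
    # upscale first if the image is smaller than the crop (simplified handling)
    if min(h, w) < crop_size:
        scale = float(crop_size) / min(h, w)
        img = cv2.resize(img, (int(w * scale) + 1, int(h * scale) + 1))
        h, w = img.shape[:2]
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    img = img[top:top + crop_size, left:left + crop_size]
    if random.random() < 0.5:  # horizontal flip with probability 0.5
        img = img[:, ::-1, :]
    return np.ascontiguousarray(img)
```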
### Fine-tuning
Fine-tuning means further training the parameters of a pretrained model on a specific task. You can download a model from [Released Models and Their Performance](#已发布模型及其性能), set ```path_to_pretrain_model``` to its path, and fine-tune it with a command like:
......@@ -193,6 +206,10 @@ python train.py \
Model evaluation (eval) measures the performance metrics of a trained model. You can download a model from [Released Models and Their Performance](#已发布模型及其性能) and set ```path_to_pretrain_model``` to its path. Running the following command reports the model's top-1/top-5 accuracy:
**Parameters**
* **save_json_path**: path to a JSON file for saving the eval results. Default: None
```bash
python eval.py \
--model=model_name \
......@@ -220,18 +237,21 @@ python eval.py \
**Parameters:**
* **save_inference**: whether to save the model. Default: False
* **save_inference**: whether to save the binary inference model. Default: False
* **topk**: number of predicted labels to return, sorted by confidence from high to low. Default: 1
* **label_path**: path to the readable label file. Default: "./utils/tools/readable_label.txt"
* **class_map_path**: path to the readable label file. Default: "./utils/tools/readable_label.txt"
* **image_path**: path to a single image file to predict. Default: None
* **save_json_path**: path to a JSON file for saving prediction results. Default: None
```bash
python infer.py \
--model=model_name \
--pretrained_model=${path_to_pretrain_model}
--image_path=${path_to_single_image}
```
Note: add and adjust other arguments according to the specific model and task.
By default, inference assumes the 1000 ImageNet classes; the label file is stored in /utils/tools/readable_label.txt. When using custom data, specify the --label_path argument.
By default, inference assumes the 1000 ImageNet classes; the map between numerical predictions and readable labels is stored in /utils/tools/readable_label.txt. When using custom data, specify the --class_map_path argument.
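The class map file is expected to contain one class per line: a numerical label followed by its readable label. A minimal sketch of how such a file can be parsed (mirroring the logic in ```infer.py```):

```python
def load_class_map(class_map_path):
    """Parse a readable-label file whose lines look like '<numeric_label> <readable label>'."""
    label_dict = {}
    with open(class_map_path) as f:
        for line in f:
            fields = line.rstrip("\n").split(" ")
            # key: the numerical label (as a string); value: the readable-label tokens
            label_dict[fields[0]] = fields[1:]
    return label_dict
```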
## Advanced Usage
......@@ -246,6 +266,34 @@ Mixup相关介绍参考[mixup: Beyond Empirical Risk Minimization](https://arxiv
FP16-related functionality has been migrated to PaddlePaddle/Fleet.
### Profiling
Note: this section mainly describes internal testing features.
They include enabling CE to monitor the stability of model runs, enabling the profiler for benchmarking, and enabling same_feed for quick debugging.
Enabling CE fixes all random initialization, including the shuffle in the data reader and the program's [random_seed](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api_cn/fluid_cn/Program_cn.html#random_seed).
``` bash
python train.py \
--enable_ce=True \
--data_dir=${path_to_a_smaller_dataset}
```
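As a rough sketch of what fixing the seeds involves (illustrative only; in ```train.py``` the value comes from --random_seed and is applied automatically when enable_ce is set):

```python
import numpy as np
import paddle.fluid as fluid

seed = 1000  # assumed value for illustration
# fix the program seeds so parameter initialization and in-graph randomness repeat
fluid.default_startup_program().random_seed = seed
fluid.default_main_program().random_seed = seed
# fix NumPy's seed so the reader's shuffle order repeats
np.random.seed(seed)
```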
Enable the profiler for performance analysis:
``` bash
python train.py \
--is_profiler=True
```
Set the same_feed argument for quick debugging: the same image, repeated same_feed times, will be fed into the network.
```bash
python train.py \
--same_feed=8 \
--batch_size=4 \
--print_step=1
```
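Conceptually, the reader then feeds copies of a single sample instead of iterating over the dataset. A minimal sketch of that behavior (assumed helper name, mirroring the same_feed branch in ```reader.py```):

```python
def apply_same_feed(full_lines, same_feed):
    """If same_feed > 0, repeat the first sample same_feed times; otherwise keep the list unchanged."""
    if same_feed:
        return [full_lines[0]] * same_feed
    return full_lines
```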
### DALI Preprocessing
Using the [Nvidia DALI](https://github.com/NVIDIA/DALI) preprocessing library can speed up training and improve GPU utilization.
......@@ -275,7 +323,7 @@ python -m paddle.distributed.launch train.py \
#### Notes
1. PaddlePaddle 1.6 or later is required, and it must be compiled with GCC 5.4 or later.
1. PaddlePaddle 1.6 or later is required and must be [compiled](https://www.paddlepaddle.org.cn/install/doc/source/ubuntu) with GCC 5.4 or later; in addition, specify -DWITH_DISTRIBUTE=ON during compilation to enable the multi-process training mode.
2. Nvidia DALI must be built from a git revision that includes [#1371](https://github.com/NVIDIA/DALI/pull/1371) or later. See [this document](https://docs.nvidia.com/deeplearning/sdk/dali-master-branch-user-guide/docs/installation.html) to install a nightly build or build from source.
3. Because DALI uses the GPU for image preprocessing and therefore consumes some GPU memory, adjust the `FLAGS_fraction_of_gpu_memory_to_use` environment variable (e.g. `0.8`) appropriately to reserve memory for DALI.
......@@ -491,7 +539,7 @@ PaddlePaddle/Models ImageClassification 支持自定义数据
|[ResNeXt50_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_vd_64x4d_pretrained.tar) | 80.12% | 94.86% | 20.888 | 15.938 |
|[ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_32x4d_pretrained.tar) | 78.65% | 94.19% | 24.154 | 17.661 |
|[ResNeXt101_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_32x4d_pretrained.tar) | 80.33% | 95.12% | 24.701 | 17.249 |
|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.43% | 94.13% | 41.073 | 31.288 |
|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.35% | 94.52% | 41.073 | 31.288 |
|[ResNeXt101_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_64x4d_pretrained.tar) | 80.78% | 95.20% | 42.277 | 32.620 |
|[ResNeXt152_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_32x4d_pretrained.tar) | 78.98% | 94.33% | 37.007 | 26.981 |
|[ResNeXt152_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_32x4d_pretrained.tar) | 80.72% | 95.20% | 35.783 | 26.081 |
......
......@@ -15,6 +15,7 @@ English | [中文](README.md)
- [Advanced Usage](#advanced-usage)
- [Mixup Training](#mixup-training)
- [Using Mixed-Precision Training](#using-mixed-precision-training)
- [Profiling](#profiling)
- [Preprocessing with Nvidia DALI](#preprocessing-with-nvidia-dali)
- [Custom Dataset](#custom-dataset)
- [Supported Models and Performances](#supported-models-and-performances)
......@@ -152,19 +153,29 @@ Reader and preprocess:
Switch:
* **validate**: whether to validate when training. Default: True.
* **use_gpu**: whether to use GPU or not. Default: True.
* **use_label_smoothing**: whether to use label_smoothing or not. Default:False.
* **label_smoothing_epsilon**: the label_smoothing_epsilon. Default:0.1.
* **random_seed**: random seed for debugging, Default: 1000.
* **padding_type**: padding type of convolution for efficientNet, Default: "SAME".
* **use_se**: whether to use Squeeze-and-Excitation module in efficientNet, Default: True.
* **use_ema**: whether to use ExponentialMovingAverage or not. Default: False.
* **ema_decay**: the value of ExponentialMovingAverage decay rate. Default: 0.9999.
Profiling:
* **enable_ce**: whether to start CE, Default: False
* **random_seed**: random seed, Default: None
* **is_profiler**: whether to start the profiler, Default: False
* **profiler_path**: path to save the profiler output, Default: './profilier_files'
* **max_iter**: maximum number of training batches, Default: 0
* **same_feed**: whether to feed the same data into the net; set a number to specify how many samples, Default: 0
**data reader introduction:** Data reader is defined in ```reader.py```, default reader is implemented by opencv. In the [Training](#training) Stage, random crop and flipping are applied, while center crop is applied in the [Evaluation](#evaluation) and [Inference](#inference) stages. Supported data augmentation includes:
* rotation
* color jitter (haven't implemented in cv2_reader)
* color jitter
* random crop
* center crop
* resize
......@@ -187,6 +198,10 @@ Note: Add and adjust other parameters accroding to specific models and tasks.
Evaluation is to evaluate the performance of a trained model. One can download [pretrained models](#supported-models-and-performances) and set its path to ```path_to_pretrain_model```. Then top1/top5 accuracy can be obtained by running the following command:
**parameters**
* **save_json_path**: path to save the eval results as a JSON file, default: None
```
python eval.py \
--model=model_name \
......@@ -215,7 +230,9 @@ python eval.py \
* **save_inference**: whether to save binary model, Default: False
* **topk**: the number of sorted predicted labels to show, Default: 1
* **label_path**: readable label filepath, Default: "/utils/tools/readable_label.txt"
* **class_map_path**: readable label filepath, Default: "/utils/tools/readable_label.txt"
* **save_json_path**: path to save the prediction results as a JSON file, Default: None
* **image_path**: path to a single image to predict, Default: None
Inference is used to get prediction scores or image features based on trained models. One can download [pretrained models](#supported-models-and-performances) and set its path to ```path_to_pretrain_model```. Run the following command to obtain the prediction score.
......@@ -480,7 +497,7 @@ Pretrained models can be downloaded by clicking related model names.
|[ResNeXt50_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_vd_64x4d_pretrained.tar) | 80.12% | 94.86% | 20.888 | 15.938 |
|[ResNeXt101_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_32x4d_pretrained.tar) | 78.65% | 94.19% | 24.154 | 17.661 |
|[ResNeXt101_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_32x4d_pretrained.tar) | 80.33% | 95.12% | 24.701 | 17.249 |
|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 78.43% | 94.13% | 41.073 | 31.288 |
|[ResNeXt101_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt50_64x4d_pretrained.tar) | 79.35% | 94.52% | 41.073 | 31.288 |
|[ResNeXt101_vd_64x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt101_vd_64x4d_pretrained.tar) | 80.78% | 95.20% | 42.277 | 32.620 |
|[ResNeXt152_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_32x4d_pretrained.tar) | 78.98% | 94.33% | 37.007 | 26.981 |
|[ResNeXt152_vd_32x4d](https://paddle-imagenet-models-name.bj.bcebos.com/ResNeXt152_vd_32x4d_pretrained.tar) | 80.72% | 95.20% | 35.783 | 26.081 |
......
......@@ -48,7 +48,8 @@ def _basic_model(data, model, args, is_train):
avg_cost = fluid.layers.mean(cost)
acc_top1 = fluid.layers.accuracy(input=softmax_out, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=softmax_out, label=label, k=5)
acc_top5 = fluid.layers.accuracy(
input=softmax_out, label=label, k=min(5, args.class_dim))
return [avg_cost, acc_top1, acc_top5]
......@@ -73,7 +74,8 @@ def _googlenet_model(data, model, args, is_train):
avg_cost = avg_cost0 + 0.3 * avg_cost1 + 0.3 * avg_cost2
acc_top1 = fluid.layers.accuracy(input=out0, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=out0, label=label, k=5)
acc_top5 = fluid.layers.accuracy(
input=out0, label=label, k=min(5, args.class_dim))
return [avg_cost, acc_top1, acc_top5]
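# Illustrative note (not from this file): capping k at args.class_dim keeps the "top-5"
# metric well defined when a dataset has fewer than five classes. For a hypothetical
# class_dim of 3:
#     k = min(5, args.class_dim)                                  # k == 3
#     acc = fluid.layers.accuracy(input=out0, label=label, k=k)   # reports top-3 accuracy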
......
......@@ -34,7 +34,7 @@ parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('data_dir', str, "./data/ILSVRC2012/", "The ImageNet datset")
add_arg('batch_size', int, 256, "Minibatch size.")
add_arg('batch_size', int, 256, "batch size on the all devices.")
add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
add_arg('class_dim', int, 1000, "Class number.")
parser.add_argument("--pretrained_model", default=None, required=True, type=str, help="The path to load pretrained model")
......@@ -48,6 +48,8 @@ parser.add_argument('--image_shape', nargs="+", type=int, default=[3,224,224],
add_arg('interpolation', int, None, "The interpolation mode")
add_arg('padding_type', str, "SAME", "Padding type of convolution")
add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.")
add_arg('save_json_path', str, None, "Whether to save output in json file.")
add_arg('same_feed', int, 0, "Whether to feed same images")
# yapf: enable
......@@ -96,27 +98,37 @@ def eval(args):
acc_top1 = fluid.layers.accuracy(input=pred, label=label, k=1)
acc_top5 = fluid.layers.accuracy(input=pred, label=label, k=5)
#startup_prog = fluid.Program()
test_program = fluid.default_main_program().clone(for_test=True)
fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
if args.use_gpu:
places = fluid.framework.cuda_places()
compiled_program = fluid.compiler.CompiledProgram(
test_program).with_data_parallel(places=places)
fluid.io.load_persistables(exe, args.pretrained_model)
imagenet_reader = reader.ImageNetReader()
val_reader = imagenet_reader.val(settings=args)
feeder = fluid.DataFeeder(place=place, feed_list=[image, label])
val_reader = feeder.decorate_reader(val_reader, multi_devices=True)
test_info = [[], [], []]
cnt = 0
for batch_id, data in enumerate(val_reader()):
t1 = time.time()
loss, acc1, acc5 = exe.run(test_program,
loss, acc1, acc5 = exe.run(compiled_program,
fetch_list=fetch_list,
feed=feeder.feed(data))
feed=data)
t2 = time.time()
period = t2 - t1
loss = np.mean(loss)
......@@ -127,10 +139,12 @@ def eval(args):
test_info[2].append(acc5 * len(data))
cnt += len(data)
if batch_id % 10 == 0:
print("Testbatch {0},loss {1}, "
"acc1 {2},acc5 {3},time {4}".format(batch_id, \
info = "Testbatch {0},loss {1}, acc1 {2},acc5 {3},time {4}".format(batch_id, \
"%.5f"%loss,"%.5f"%acc1, "%.5f"%acc5, \
"%2.2f sec" % period))
"%2.2f sec" % period)
print(info)
if args.save_json_path:
save_json(info, args.save_json_path)
sys.stdout.flush()
test_loss = np.sum(test_info[0]) / cnt
......
......@@ -37,7 +37,7 @@ add_arg('data_dir', str, "./data/ILSVRC2012/", "The ImageNet data")
add_arg('use_gpu', bool, True, "Whether to use GPU or not.")
add_arg('class_dim', int, 1000, "Class number.")
parser.add_argument("--pretrained_model", default=None, required=True, type=str, help="The path to load pretrained model")
add_arg('model', str, "ResNet50", "Set the network to use.")
add_arg('model', str, "ResNet50", "Set the network to use.")
add_arg('save_inference', bool, False, "Whether to save inference model or not")
add_arg('resize_short_size',int, 256, "Set resize short size")
add_arg('reader_thread', int, 1, "The number of multi thread reader")
......@@ -46,10 +46,13 @@ parser.add_argument('--image_mean', nargs='+', type=float, default=[0.485, 0.456
parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data")
parser.add_argument('--image_shape', nargs='+', type=int, default=[3, 224, 224], help="the shape of image")
add_arg('topk', int, 1, "topk")
add_arg('label_path', str, "./utils/tools/readable_label.txt", "readable label filepath")
add_arg('class_map_path', str, "./utils/tools/readable_label.txt", "readable label filepath")
add_arg('interpolation', int, None, "The interpolation mode")
add_arg('padding_type', str, "SAME", "Padding type of convolution")
add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.")
add_arg('image_path', str, None, "single image path")
add_arg('batch_size', int, 8, "batch_size on all devices")
add_arg('save_json_path', str, None, "save output to a json file")
# yapf: enable
......@@ -63,6 +66,17 @@ def infer(args):
assert args.image_shape[
1] <= args.resize_short_size, "Please check the args:image_shape and args:resize_short_size, the cropped size (image_shape[1]) must be smaller than or equal to the resized length (resize_short_size)"
if args.image_path:
assert os.path.isfile(
args.image_path
), "Please check the args:image_path, it should be a path to single image."
if args.use_gpu:
assert fluid.core.get_cuda_device_count(
) == 1, "please set \"export CUDA_VISIBLE_DEVICES=\" available single card"
else:
assert int(os.environ.get('CPU_NUM',
1)) == 1, "please set CPU_NUM as 1"
image = fluid.data(
name='image', shape=[None] + args.image_shape, dtype='float32')
......@@ -87,6 +101,9 @@ def infer(args):
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
compiled_program = fluid.compiler.CompiledProgram(
test_program).with_data_parallel()
fluid.io.load_persistables(exe, args.pretrained_model)
if args.save_inference:
fluid.io.save_inference_model(
......@@ -100,32 +117,44 @@ def infer(args):
print("model: ", args.model, " is already saved")
exit(0)
args.test_batch_size = 1
imagenet_reader = reader.ImageNetReader()
test_reader = imagenet_reader.test(settings=args)
feeder = fluid.DataFeeder(place=place, feed_list=[image])
test_reader = feeder.decorate_reader(test_reader, multi_devices=True)
TOPK = args.topk
assert os.path.exists(args.label_path), "Index file doesn't exist!"
f = open(args.label_path)
label_dict = {}
for item in f.readlines():
key = item.split(" ")[0]
value = [l.replace("\n", "") for l in item.split(" ")[1:]]
label_dict[key] = value
if os.path.exists(args.class_map_path):
print("The map of readable label and numerical label has been found!")
f = open(args.class_map_path)
label_dict = {}
for item in f.readlines():
key = item.split(" ")[0]
value = [l.replace("\n", "") for l in item.split(" ")[1:]]
label_dict[key] = value
for batch_id, data in enumerate(test_reader()):
result = exe.run(test_program,
fetch_list=fetch_list,
feed=feeder.feed(data))
result = exe.run(compiled_program, fetch_list=fetch_list, feed=data)
result = result[0][0]
pred_label = np.argsort(result)[::-1][:TOPK]
readable_pred_label = []
for label in pred_label:
readable_pred_label.append(label_dict[str(label)])
print("Test-{0}-score: {1}, class{2} {3}".format(batch_id, result[
pred_label], pred_label, readable_pred_label))
if os.path.exists(args.class_map_path):
readable_pred_label = []
for label in pred_label:
readable_pred_label.append(label_dict[str(label)])
print(readable_pred_label)
info = "Test-{0}-score: {1}, class{2} {3}".format(
batch_id, result[pred_label], pred_label, readable_pred_label)
else:
info = "Test-{0}-score: {1}, class{2}".format(
batch_id, result[pred_label], pred_label)
print(info)
if args.save_json_path:
save_json(info, args.save_json_path)
sys.stdout.flush()
if args.image_path:
os.remove(".tmp.txt")
def main():
......
......@@ -271,15 +271,13 @@ class ImageNetReader:
rotate=False,
data_dir=None):
num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
if mode == 'test':
batch_size = 1
if settings.use_gpu:
batch_size = settings.batch_size // paddle.fluid.core.get_cuda_device_count(
)
else:
if settings.use_gpu:
batch_size = settings.batch_size // paddle.fluid.core.get_cuda_device_count(
)
else:
batch_size = settings.batch_size // int(
os.environ.get('CPU_NUM', 1))
batch_size = settings.batch_size // int(
os.environ.get('CPU_NUM', 1))
def reader():
def read_file_list():
......@@ -295,11 +293,11 @@ class ImageNetReader:
np.random.RandomState(self.shuffle_seed).shuffle(
full_lines)
elif shuffle:
if not settings.enable_ce or settings.same_feed:
if not settings.enable_ce or not settings.same_feed:
np.random.shuffle(full_lines)
batch_data = []
if settings.same_feed:
if (mode == "train" or mode == "val") and settings.same_feed:
temp_file = full_lines[0]
print("Same images({},nums:{}) will feed in the net".format(
str(temp_file), settings.same_feed))
......@@ -319,6 +317,7 @@ class ImageNetReader:
return read_file_list
data_reader = reader()
if mode == 'train' and num_trainers > 1:
assert self.shuffle_seed is not None, \
"If num_trainers > 1, the shuffle_seed must be set, because " \
......@@ -391,7 +390,6 @@ class ImageNetReader:
assert os.path.isfile(
file_list), "{} doesn't exist, please check data list path".format(
file_list)
return self._reader_creator(
settings,
file_list,
......@@ -408,7 +406,13 @@ class ImageNetReader:
Returns:
test reader
"""
file_list = os.path.join(settings.data_dir, 'val_list.txt')
if settings.image_path:
tmp = open(".tmp.txt", "w")
tmp.write(settings.image_path + " 0")
file_list = ".tmp.txt"
settings.batch_size = 1
else:
file_list = os.path.join(settings.data_dir, 'val_list.txt')
assert os.path.isfile(
file_list), "{} doesn't exist, please check data list path".format(
file_list)
......
......@@ -31,7 +31,7 @@ from build_model import create_model
def build_program(is_train, main_prog, startup_prog, args):
"""build program, and add grad op in program accroding to different mode
"""build program, and add backward op in program accroding to different mode
Parameters:
is_train: indicate train mode or test mode
......@@ -86,13 +86,21 @@ def validate(args,
test_fetch_list,
pass_id,
train_batch_metrics_record,
train_batch_time_record=None):
train_batch_time_record=None,
train_prog=None):
test_batch_time_record = []
test_batch_metrics_record = []
test_batch_id = 0
compiled_program = best_strategy_compiled(
args,
test_prog,
test_fetch_list[0],
exe,
mode="val",
share_prog=train_prog)
for batch in test_iter:
t1 = time.time()
test_batch_metrics = exe.run(program=test_prog,
test_batch_metrics = exe.run(program=compiled_program,
feed=batch,
fetch_list=test_fetch_list)
t2 = time.time()
......@@ -103,7 +111,7 @@ def validate(args,
test_batch_metrics_record.append(test_batch_metrics_avg)
print_info("batch", test_batch_metrics_avg, test_batch_elapse, pass_id,
test_batch_id, args.print_step)
test_batch_id, args.print_step, args.class_dim)
sys.stdout.flush()
test_batch_id += 1
......@@ -118,7 +126,8 @@ def validate(args,
"epoch",
list(train_epoch_metrics_avg) + list(test_epoch_metrics_avg),
test_epoch_time_avg,
pass_id=pass_id)
pass_id=pass_id,
class_dim=args.class_dim)
if args.enable_ce:
device_num = fluid.core.get_cuda_device_count() if args.use_gpu else 1
print_info(
......@@ -136,8 +145,6 @@ def train(args):
"""
startup_prog = fluid.Program()
train_prog = fluid.Program()
test_prog = fluid.Program()
train_out = build_program(
is_train=True,
main_prog=train_prog,
......@@ -152,18 +159,20 @@ def train(args):
train_fetch_list = [var.name for var in train_fetch_vars]
test_out = build_program(
is_train=False,
main_prog=test_prog,
startup_prog=startup_prog,
args=args)
test_data_loader = test_out[-1]
test_fetch_vars = test_out[:-1]
if args.validate:
test_prog = fluid.Program()
test_out = build_program(
is_train=False,
main_prog=test_prog,
startup_prog=startup_prog,
args=args)
test_data_loader = test_out[-1]
test_fetch_vars = test_out[:-1]
test_fetch_list = [var.name for var in test_fetch_vars]
test_fetch_list = [var.name for var in test_fetch_vars]
#Create test_prog and set layers' is_test params to True
test_prog = test_prog.clone(for_test=True)
#Create test_prog and set layers' is_test params to True
test_prog = test_prog.clone(for_test=True)
gpu_id = int(os.environ.get('FLAGS_selected_gpus', 0))
place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace()
......@@ -183,12 +192,14 @@ def train(args):
else:
imagenet_reader = reader.ImageNetReader(0 if num_trainers > 1 else None)
train_reader = imagenet_reader.train(settings=args)
test_reader = imagenet_reader.val(settings=args)
places = place
if num_trainers <= 1 and args.use_gpu:
places = fluid.framework.cuda_places()
train_data_loader.set_sample_list_generator(train_reader, places)
test_data_loader.set_sample_list_generator(test_reader, place)
if args.validate:
test_reader = imagenet_reader.val(settings=args)
test_data_loader.set_sample_list_generator(test_reader, places)
compiled_train_prog = best_strategy_compiled(args, train_prog,
train_fetch_vars[0], exe)
......@@ -204,7 +215,8 @@ def train(args):
if not args.use_dali:
train_iter = train_data_loader()
test_iter = test_data_loader()
if args.validate:
test_iter = test_data_loader()
t1 = time.time()
for batch in train_iter:
......@@ -213,13 +225,16 @@ def train(args):
return
train_batch_metrics = exe.run(compiled_train_prog,
feed=batch,
fetch_list=train_fetch_list)
fetch_list=train_fetch_list
if pass_id % args.print_step == 0 else
[])
t2 = time.time()
train_batch_elapse = t2 - t1
train_batch_time_record.append(train_batch_elapse)
train_batch_metrics_avg = np.mean(
np.array(train_batch_metrics), axis=1)
train_batch_metrics_record.append(train_batch_metrics_avg)
if pass_id % args.print_step == 0:
train_batch_metrics_avg = np.mean(
np.array(train_batch_metrics), axis=1)
train_batch_metrics_record.append(train_batch_metrics_avg)
if trainer_id == 0:
print_info("batch", train_batch_metrics_avg, train_batch_elapse,
pass_id, train_batch_id, args.print_step)
......@@ -242,18 +257,20 @@ def train(args):
print('ExponentialMovingAverage validate start...')
with ema.apply(exe):
validate(args, test_iter, exe, test_prog, test_fetch_list,
pass_id, train_batch_metrics_record)
pass_id, train_batch_metrics_record,
compiled_train_prog)
print('ExponentialMovingAverage validate over!')
validate(args, test_iter, exe, test_prog, test_fetch_list, pass_id,
train_batch_metrics_record, train_batch_time_record)
#For now, save model per epoch.
if pass_id % args.save_step == 0:
save_model(args, exe, train_prog, pass_id)
train_batch_metrics_record, train_batch_time_record,
compiled_train_prog)
if args.use_dali:
test_iter.reset()
if pass_id % args.save_step == 0:
save_model(args, exe, train_prog, pass_id)
def main():
args = parse_args()
......
......@@ -12,4 +12,4 @@
#See the License for the specific language governing permissions and
#limitations under the License.
from .optimizer import cosine_decay, lr_warmup, cosine_decay_with_warmup, exponential_decay_with_warmup, Optimizer, create_optimizer
from .utility import add_arguments, print_arguments, parse_args, check_gpu, check_args, check_version, init_model, save_model, create_data_loader, print_info, best_strategy_compiled, init_model, save_model, ExponentialMovingAverage
from .utility import add_arguments, print_arguments, parse_args, check_gpu, check_args, check_version, init_model, save_model, create_data_loader, print_info, best_strategy_compiled, init_model, save_model, ExponentialMovingAverage, save_json
......@@ -26,6 +26,7 @@ import sys
import os
import warnings
import signal
import json
import paddle
import paddle.fluid as fluid
......@@ -101,8 +102,8 @@ def parse_args():
parser.add_argument('--image_shape', nargs='+', type=int, default=[3, 224, 224], help="The shape of image")
add_arg('num_epochs', int, 120, "The number of total epochs.")
add_arg('class_dim', int, 1000, "The number of total classes.")
add_arg('batch_size', int, 8, "Minibatch size on a device.")
add_arg('test_batch_size', int, 16, "Test batch size on a deveice.")
add_arg('batch_size', int, 8, "Minibatch size on all devices.")
add_arg('test_batch_size', int, 16, "Test batch size on all devices.")
add_arg('lr', float, 0.1, "The learning rate.")
add_arg('lr_strategy', str, "piecewise_decay", "The learning rate decay strategy.")
add_arg('l2_decay', float, 1e-4, "The l2_decay parameter.")
......@@ -129,6 +130,7 @@ def parse_args():
parser.add_argument('--image_std', nargs='+', type=float, default=[0.229, 0.224, 0.225], help="The std of input image data")
# SWITCH
add_arg('validate', bool, True, "whether to validate when training.")
#NOTE: (2019/08/08) FP16 is moving to PaddlePaddle/Fleet now
#add_arg('use_fp16', bool, False, "Whether to enable half precision training with fp16." )
#add_arg('scale_loss', float, 1.0, "The value of scale_loss for fp16." )
......@@ -136,18 +138,17 @@ def parse_args():
add_arg('label_smoothing_epsilon', float, 0.1, "The value of label_smoothing_epsilon parameter")
#NOTE: (2019/08/08) temporary disable use_distill
#add_arg('use_distill', bool, False, "Whether to use distill")
add_arg("enable_ce", bool, False, "Whether to enable ce")
add_arg('random_seed', int, None, "random seed")
add_arg('use_ema', bool, False, "Whether to use ExponentialMovingAverage.")
add_arg('ema_decay', float, 0.9999, "The value of ema decay rate")
add_arg('padding_type', str, "SAME", "Padding type of convolution")
add_arg('use_se', bool, True, "Whether to use Squeeze-and-Excitation module for EfficientNet.")
#NOTE: args for profiler
add_arg('is_profiler', int, 0, "the profiler switch.(used for benchmark)")
add_arg('profiler_path', str, './', "the profiler output file path.(used for benchmark)")
add_arg('max_iter', int, 0, "the max train batch num.(used for benchmark)")
add_arg('validate', int, 1, "whether validate.(used for benchmark)")
add_arg("enable_ce", bool, False, "Whether to enable ce")
add_arg('random_seed', int, None, "random seed")
add_arg('is_profiler', bool, False, "Whether to start the profiler")
add_arg('profiler_path', str, './profilier_files', "the profiler output file path")
add_arg('max_iter', int, 0, "the max train batch num")
add_arg('same_feed', int, 0, "whether to feed same images")
......@@ -270,30 +271,39 @@ def check_args(args):
args.random_seed = 0
print("CE is running now!")
#check gpu
assert args.class_dim > 1, "class_dim must be greater than 1"
#check gpu
check_gpu()
check_version()
def init_model(exe, args, program):
"""load model from checkpoint or pretrained model
"""
if args.checkpoint:
fluid.io.load_persistables(exe, args.checkpoint, main_program=program)
print("Finish initing model from %s" % (args.checkpoint))
if args.pretrained_model:
def if_exist(var):
return os.path.exists(os.path.join(args.pretrained_model, var.name))
def is_parameter(var):
return isinstance(var, fluid.framework.Parameter) and (
not ("fc_0" in var.name)) and os.path.exists(
os.path.join(args.pretrained_model, var.name))
print("Load pretrain weights from {}, exclude fc layer.".format(
args.pretrained_model))
vars = filter(is_parameter, program.list_vars())
fluid.io.load_vars(
exe,
args.pretrained_model,
main_program=program,
predicate=if_exist)
exe, args.pretrained_model, vars=vars, main_program=program)
def save_model(args, exe, train_prog, info):
"""save model in model_path
"""
model_path = os.path.join(args.model_save_dir, args.model, str(info))
if not os.path.isdir(model_path):
os.makedirs(model_path)
......@@ -301,6 +311,13 @@ def save_model(args, exe, train_prog, info):
print("Already save model in %s" % (model_path))
def save_json(info, path):
""" save eval result or infer result to file as json format.
"""
with open(path, 'a') as f:
json.dump(info, f)
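# Usage sketch (illustrative): each call appends one JSON-encoded value to the file, e.g.
#     save_json(info, "eval_output.json")
# Because the file is opened in append mode, repeated calls produce a sequence of
# concatenated JSON values rather than a single JSON array.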
def create_data_loader(is_train, args):
"""create data_loader
......@@ -357,7 +374,8 @@ def print_info(info_mode,
pass_id=0,
batch_id=0,
print_step=1,
device_num=1):
device_num=1,
class_dim=5):
"""print function
Args:
......@@ -383,16 +401,18 @@ def print_info(info_mode,
elif len(metrics) == 4:
loss, acc1, acc5, lr = metrics
print(
"[Pass {0}, train batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, lr {5}, elapse {6}".
"[Pass {0}, train batch {1}] \tloss {2}, acc1 {3}, acc{7} {4}, lr {5}, elapse {6}".
format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1,
"%.5f" % acc5, "%.5f" % lr, "%2.4f sec" % time_info))
"%.5f" % acc5, "%.5f" % lr, "%2.4f sec" % time_info,
min(class_dim, 5)))
# test output
elif len(metrics) == 3:
loss, acc1, acc5 = metrics
print(
"[Pass {0}, test batch {1}] \tloss {2}, acc1 {3}, acc5 {4}, elapse {5}".
"[Pass {0}, test batch {1}] \tloss {2}, acc1 {3}, acc{6} {4}, elapse {5}".
format(pass_id, batch_id, "%.5f" % loss, "%.5f" % acc1,
"%.5f" % acc5, "%2.4f sec" % time_info))
"%.5f" % acc5, "%2.4f sec" % time_info,
min(class_dim, 5)))
else:
raise Exception(
"length of metrics {} is not implemented, It maybe caused by wrong format of build_program_output".
......@@ -404,16 +424,16 @@ def print_info(info_mode,
if len(metrics) == 5:
train_loss, _, test_loss, test_acc1, test_acc5 = metrics
print(
"[End pass {0}]\ttrain_loss {1}, test_loss {2}, test_acc1 {3}, test_acc5 {4}".
"[End pass {0}]\ttrain_loss {1}, test_loss {2}, test_acc1 {3}, test_acc{5} {4}".
format(pass_id, "%.5f" % train_loss, "%.5f" % test_loss, "%.5f"
% test_acc1, "%.5f" % test_acc5))
% test_acc1, "%.5f" % test_acc5, min(class_dim, 5)))
elif len(metrics) == 7:
train_loss, train_acc1, train_acc5, _, test_loss, test_acc1, test_acc5 = metrics
print(
"[End pass {0}]\ttrain_loss {1}, train_acc1 {2}, train_acc5 {3},test_loss {4}, test_acc1 {5}, test_acc5 {6}".
"[End pass {0}]\ttrain_loss {1}, train_acc1 {2}, train_acc{7} {3},test_loss {4}, test_acc1 {5}, test_acc{7} {6}".
format(pass_id, "%.5f" % train_loss, "%.5f" % train_acc1, "%.5f"
% train_acc5, "%.5f" % test_loss, "%.5f" % test_acc1,
"%.5f" % test_acc5))
"%.5f" % test_acc5, min(class_dim, 5)))
sys.stdout.flush()
elif info_mode == "ce":
assert len(
......@@ -444,7 +464,12 @@ def print_ce(device_num, metrics, time_info):
print("kpis\ttrain_speed_card{}\t{}".format(device_num, train_speed))
def best_strategy_compiled(args, program, loss, exe):
def best_strategy_compiled(args,
program,
loss,
exe,
mode="train",
share_prog=None):
"""make a program which wrapped by a compiled program
"""
......@@ -468,7 +493,8 @@ def best_strategy_compiled(args, program, loss, exe):
exec_strategy.num_threads = 1
compiled_program = fluid.CompiledProgram(program).with_data_parallel(
loss_name=loss.name,
loss_name=loss.name if mode == "train" else loss,
share_vars_from=share_prog if mode == "val" else None,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
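# Usage sketch (mirroring validate() above): compiling the test program with
# share_vars_from pointing at the compiled train program lets both programs reuse one
# copy of the parameters on each device instead of allocating them twice:
#     compiled_train = best_strategy_compiled(args, train_prog, avg_cost, exe)
#     compiled_test = best_strategy_compiled(args, test_prog, test_fetch_list[0], exe,
#                                            mode="val", share_prog=compiled_train)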
......