diff --git a/experiment_3/3-Computer_Vision.md b/resnet50/README.md
similarity index 52%
rename from experiment_3/3-Computer_Vision.md
rename to resnet50/README.md
index edf45160abd8a09061797486032f4f41711af7cc..1c53e9198d443e0b8d1caa72c74f2c5e0cf1e1b6 100644
--- a/experiment_3/3-Computer_Vision.md
+++ b/resnet50/README.md
@@ -2,7 +2,7 @@
 
 ## Experiment Introduction
 
-This experiment walks through training ResNet50 on the CIFAR-10 dataset with MindSpore. It uses the ResNet50 model definition provided in the MindSpore model_zoo, together with the training script from the MindSpore tutorial [Using MindSpore on the Cloud](https://www.mindspore.cn/tutorial/zh-CN/0.2.0-alpha/advanced_use/use_on_the_cloud.html).
+This experiment walks through training ResNet50 on the CIFAR-10 dataset with MindSpore. It uses the ResNet50 model definition provided in the MindSpore model_zoo, together with the training script from the MindSpore tutorial [Using MindSpore on the Cloud](https://www.mindspore.cn/tutorial/zh-CN/r0.5/advanced_use/use_on_the_cloud.html).
 
 ## Experiment Objectives
 
@@ -19,7 +19,7 @@
 
 ## Experiment Environment
 
-- MindSpore 0.2.0 (MindSpore versions are updated regularly, and this guide is refreshed in step to match the release);
+- MindSpore 0.5.0 (MindSpore versions are updated regularly, and this guide is refreshed in step to match the release);
 - Huawei Cloud ModelArts: a one-stop AI development platform for developers, integrating an Ascend AI processor resource pool on which users can try out MindSpore.
 
 ## Experiment Preparation
 
@@ -42,13 +42,15 @@
 
 ### Dataset Preparation
 
-CIFAR-10 is an image classification dataset containing 60000 32x32 color images of objects: 50000 training images and 10000 test images in 10 classes, 6000 images per class. Official CIFAR-10 site: [The CIFAR-10 and CIFAR-100 datasets](http://www.cs.toronto.edu/~kriz/cifar.html).
+CIFAR-10 is an image classification dataset containing 60000 32x32 color images of objects: 50000 training images and 10000 test images in 10 classes, 6000 images per class.
 
-Download "CIFAR-10 binary version (suitable for C programs)" from the CIFAR-10 website and extract it locally.
+- Option 1: download "CIFAR-10 binary version (suitable for C programs)" from the [CIFAR-10 website](http://www.cs.toronto.edu/~kriz/cifar.html) and extract it locally.
+
+- Option 2: download the [CIFAR-10 dataset](https://share-course.obs.cn-north-4.myhuaweicloud.com/dataset/cifar10.zip) from Huawei Cloud OBS and extract it.
 
 ### Script Preparation
 
-Download the relevant scripts from the [MindSpore tutorial repository](https://gitee.com/mindspore/docs/tree/r0.2/tutorials/tutorial_code/sample_for_cloud/).
+Download the relevant scripts from the [MindSpore tutorial repository](https://gitee.com/mindspore/docs/tree/r0.5/tutorials/tutorial_code/sample_for_cloud).
 
 ### Uploading Files
 
@@ -57,6 +59,7 @@
 ```
 experiment_3
 ├── dataset.py
+├── resnet.py
 ├── resnet50_train.py
 └── cifar10
     ├── batches.meta.txt
@@ -74,8 +77,11 @@
 
 ### Code Walkthrough
 
-- resnet50_train.py: the main script, containing the performance-profiling callback `PerformanceCallback`, the dynamic learning rate function `get_lr`, the driver function `resnet50_train`, and other functions;
+- resnet50_train.py: the main script, containing the performance-profiling callback `PerformanceCallback`, the dynamic learning rate function `get_lr`, the driver function `resnet50_train`, and the main entry point;
 - dataset.py: the data processing script.
+- resnet.py: the ResNet model definition script, containing the `ResidualBlock` building-block class, the `ResNet` class, the `resnet50` and `resnet101` network builders, and so on.
+
+#### resnet50_train.py walkthrough
 
 `PerformanceCallback` inherits from the MindSpore `Callback` class and measures the latency of every training step:
 
@@ -153,14 +159,15 @@ def get_lr(global_step,
     return learning_rate
 ```
 
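+The body of `get_lr` falls between diff hunks, but its job is to precompute one learning rate per training step. The snippet below is a hedged sketch, not the tutorial's exact schedule: it shows how such a per-step schedule is typically built and handed to the `Momentum` optimizer, with `total_steps`, `warmup_steps`, and `lr_max` as illustrative values.
+
+```python
+import numpy as np
+from mindspore import Tensor
+from mindspore.nn import Momentum
+
+# Illustrative linear-warmup + linear-decay schedule, one value per step.
+total_steps = 1562 * 90   # steps per epoch x epochs, matching the logs below
+warmup_steps = 1562 * 5
+lr_max = 0.8              # illustrative peak learning rate
+lr_each_step = [lr_max * (i + 1) / warmup_steps if i < warmup_steps
+                else lr_max * (total_steps - i) / (total_steps - warmup_steps)
+                for i in range(total_steps)]
+
+# The whole schedule is passed to the optimizer as a 1-D Tensor; MindSpore
+# then looks up the entry matching the current global step.
+lr = Tensor(np.array(lr_each_step).astype(np.float32))
+# opt = Momentum(net.trainable_params(), lr, momentum=0.9)  # net as in resnet50_train.py
+```
+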
+#### dataset.py walkthrough
+
 MindSpore can read the CIFAR-10 dataset directly:
 
 ```python
 if device_num == 1 or not do_train:
     ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle)
 else:
-    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle,
-                           num_shards=device_num, shard_id=device_id)
+    ds = de.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=do_shuffle, num_shards=device_num, shard_id=device_id)
 ```
 
 Apply data augmentation, such as random crop and random horizontal flip:
 
@@ -169,55 +176,27 @@ else:
 # define map operations
 random_crop_op = C.RandomCrop((32, 32), (4, 4, 4, 4))
 random_horizontal_flip_op = C.RandomHorizontalFlip(device_id / (device_id + 1))
-```
-Import and use the resnet50 model from model_zoo:
+resize_op = C.Resize((resize_height, resize_width))
+rescale_op = C.Rescale(rescale, shift)
+normalize_op = C.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])
 
-```python
-from mindspore.model_zoo.resnet import resnet50
-# create model
-net = resnet50(class_num = class_num)
-```
+change_swap_op = C.HWC2CHW()
 
-resnet50 is defined in `model_zoo.resnet` as follows:
+trans = []
+if do_train:
+    trans += [random_crop_op, random_horizontal_flip_op]
 
-```python
-def resnet50(class_num=10):
-    return ResNet(ResidualBlock,
-                  [3, 4, 6, 3],
-                  [64, 256, 512, 1024],
-                  [256, 512, 1024, 2048],
-                  [1, 2, 2, 2],
-                  class_num)
-```
+trans += [resize_op, rescale_op, normalize_op, change_swap_op]
 
-The ResNet class is defined as follows:
+type_cast_op = C2.TypeCast(mstype.int32)
 
-```python
-class ResNet(nn.Cell):
-    """
-    ResNet architecture.
-
-    Args:
-        block (Cell): Block for network.
-        layer_nums (list): Numbers of block in different layers.
-        in_channels (list): Input channel in each layer.
-        out_channels (list): Output channel in each layer.
-        strides (list): Stride size in each layer.
-        num_classes (int): The number of classes that the training images are belonging to.
-    Returns:
-        Tensor, output tensor.
-
-    Examples:
-        >>> ResNet(ResidualBlock,
-        >>>        [3, 4, 6, 3],
-        >>>        [64, 256, 512, 1024],
-        >>>        [256, 512, 1024, 2048],
-        >>>        [1, 2, 2, 2],
-        >>>        10)
-    """
+ds = ds.map(input_columns="label", num_parallel_workers=8, operations=type_cast_op)
+ds = ds.map(input_columns="image", num_parallel_workers=8, operations=trans)
 ```
 
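+After the map operations, dataset.py goes on to batch and repeat the dataset (see the full script). A minimal sketch of those remaining pipeline steps, assuming the script's `batch_size` and `repeat_num` variables:
+
+```python
+# Group samples into NCHW batches (after HWC2CHW); drop the last incomplete
+# batch so every training step sees a full batch.
+ds = ds.batch(batch_size, drop_remainder=True)
+# Repeat the dataset so that one pipeline pass can feed repeat_num epochs.
+ds = ds.repeat(repeat_num)
+```
+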
+#### resnet.py walkthrough
+
 Every ResNet variant is built from 5 stages; the ResNet50 structure is Conv x1 -> ResidualBlock x3 -> ResidualBlock x4 -> ResidualBlock x6 -> ResidualBlock x3 -> Pooling + FC.
 
 ![ResNet Architectures](images/resnet_archs.png)
 
@@ -230,8 +209,24 @@ Every ResNet variant is built from 5 stages
 [2] Figure from https://arxiv.org/pdf/1512.03385.pdf
 
+The ResidualBlock (residual module) is defined below; it is the basic building block of a ResNet network.
+
 ```python
 class ResidualBlock(nn.Cell):
+    """
+    ResNet V1 residual block definition.
+
+    Args:
+        in_channel (int): Input channel.
+        out_channel (int): Output channel.
+        stride (int): Stride size for the first convolutional layer. Default: 1.
+
+    Returns:
+        Tensor, output tensor.
+
+    Examples:
+        >>> ResidualBlock(3, 256, stride=2)
+    """
     expansion = 4
 
     def __init__(self,
@@ -253,9 +248,11 @@ class ResidualBlock(nn.Cell):
         self.relu = nn.ReLU()
 
         self.down_sample = False
+
         if stride != 1 or in_channel != out_channel:
             self.down_sample = True
         self.down_sample_layer = None
+
         if self.down_sample:
             self.down_sample_layer = nn.SequentialCell([_conv1x1(in_channel, out_channel, stride),
                                                         _bn(out_channel)])
@@ -278,7 +275,7 @@ class ResidualBlock(nn.Cell):
         # Down-sample the short-cut branch when the stride or channel count changes
         if self.down_sample:
             identity = self.down_sample_layer(identity)
-
+        # out is the residual branch; identity is the short-cut branch
         out = self.add(out, identity)
         out = self.relu(out)
 
@@ -286,6 +283,158 @@ class ResidualBlock(nn.Cell):
         return out
 ```
 
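+To make the channel bookkeeping concrete, here is a small worked example. It assumes the full `ResidualBlock` definition in resnet.py, where the internal bottleneck width is `out_channel // expansion`:
+
+```python
+from resnet import ResidualBlock  # resnet.py from this experiment
+
+# First block of stage 1: 64 input channels, 256 output channels.
+# With expansion = 4 the bottleneck width is 256 // 4 = 64, giving
+# 1x1 conv 64->64, 3x3 conv 64->64, then 1x1 conv 64->256.
+# Because in_channel != out_channel, down_sample is True and the short-cut
+# branch gets a 1x1 projection 64->256 so the Add sees matching shapes.
+block = ResidualBlock(64, 256, stride=1)
+```
+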
+The ResNet class is defined as follows. Its constructor parameters are:
+
+- layer_nums: number of ResidualBlock repetitions in each stage (list)
+- in_channels: input channel count of each stage (list)
+- out_channels: output channel count of each stage (list)
+- strides: stride of the convolution in each stage (list)
+- num_classes: number of image classes (int)
+
+>**Note:**
+>
+>- A stage here is not ResNet's actual layer count; it merely divides ResNet into stages, each containing several ResidualBlocks.
+>- layer_nums, in_channels, out_channels, and strides must be lists of the same length.
+>- Different parameters produce different networks, ResNet50 and ResNet101 being the typical ones; their definitions can be found in resnet.py. You can try designing a new network with custom parameters.
+
+```python
+class ResNet(nn.Cell):
+    """
+    ResNet architecture.
+
+    Args:
+        block (Cell): Block for network.
+        layer_nums (list): Numbers of block in different layers.
+        in_channels (list): Input channel in each layer.
+        out_channels (list): Output channel in each layer.
+        strides (list): Stride size in each layer.
+        num_classes (int): The number of classes that the training images belong to.
+
+    Returns:
+        Tensor, output tensor.
+
+    Examples:
+        >>> ResNet(ResidualBlock,
+        >>>        [3, 4, 6, 3],
+        >>>        [64, 256, 512, 1024],
+        >>>        [256, 512, 1024, 2048],
+        >>>        [1, 2, 2, 2],
+        >>>        10)
+    """
+
+    def __init__(self,
+                 block,
+                 layer_nums,
+                 in_channels,
+                 out_channels,
+                 strides,
+                 num_classes):
+        super(ResNet, self).__init__()
+
+        if not len(layer_nums) == len(in_channels) == len(out_channels) == 4:
+            raise ValueError("the length of layer_nums, in_channels, out_channels list must be 4!")
+
+        self.conv1 = _conv7x7(3, 64, stride=2)
+        self.bn1 = _bn(64)
+        self.relu = P.ReLU()
+        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode="same")
+
+        self.layer1 = self._make_layer(block,
+                                       layer_nums[0],
+                                       in_channel=in_channels[0],
+                                       out_channel=out_channels[0],
+                                       stride=strides[0])
+        self.layer2 = self._make_layer(block,
+                                       layer_nums[1],
+                                       in_channel=in_channels[1],
+                                       out_channel=out_channels[1],
+                                       stride=strides[1])
+        self.layer3 = self._make_layer(block,
+                                       layer_nums[2],
+                                       in_channel=in_channels[2],
+                                       out_channel=out_channels[2],
+                                       stride=strides[2])
+        self.layer4 = self._make_layer(block,
+                                       layer_nums[3],
+                                       in_channel=in_channels[3],
+                                       out_channel=out_channels[3],
+                                       stride=strides[3])
+
+        self.mean = P.ReduceMean(keep_dims=True)
+        self.flatten = nn.Flatten()
+        self.end_point = _fc(out_channels[3], num_classes)
+
+    def _make_layer(self, block, layer_num, in_channel, out_channel, stride):
+        """
+        Make stage network of ResNet.
+
+        Args:
+            block (Cell): Resnet block.
+            layer_num (int): Layer number.
+            in_channel (int): Input channel.
+            out_channel (int): Output channel.
+            stride (int): Stride size for the first convolutional layer.
+
+        Returns:
+            SequentialCell, the output layer.
+
+        Examples:
+            >>> _make_layer(ResidualBlock, 3, 128, 256, 2)
+        """
+        layers = []
+
+        resnet_block = block(in_channel, out_channel, stride=stride)
+        layers.append(resnet_block)
+
+        for _ in range(1, layer_num):
+            resnet_block = block(out_channel, out_channel, stride=1)
+            layers.append(resnet_block)
+
+        return nn.SequentialCell(layers)
+
+    def construct(self, x):
+        x = self.conv1(x)
+        x = self.bn1(x)
+        x = self.relu(x)
+        c1 = self.maxpool(x)
+
+        c2 = self.layer1(c1)
+        c3 = self.layer2(c2)
+        c4 = self.layer3(c3)
+        c5 = self.layer4(c4)
+
+        out = self.mean(c5, (2, 3))
+        out = self.flatten(out)
+        out = self.end_point(out)
+
+        return out
+```
+
+The `resnet50` builder (a function, not a class) is defined as follows:
+
+```python
+def resnet50(class_num=10):
+    """
+    Get ResNet50 neural network.
+
+    Args:
+        class_num (int): Class number.
+
+    Returns:
+        Cell, cell instance of ResNet50 neural network.
+
+    Examples:
+        >>> net = resnet50(10)
+    """
+    return ResNet(ResidualBlock,
+                  [3, 4, 6, 3],
+                  [64, 256, 512, 1024],
+                  [256, 512, 1024, 2048],
+                  [1, 2, 2, 2],
+                  class_num)
+```
+
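+Before adapting the script to ModelArts, here is a hedged sketch of how `resnet50` is typically wired into training in resnet50_train.py-style code. The loss and optimizer signatures follow the r0.5-era tutorial and may differ in other MindSpore versions; the learning rate value is illustrative.
+
+```python
+from mindspore.train.model import Model
+from mindspore.nn import Momentum, SoftmaxCrossEntropyWithLogits
+from resnet import resnet50
+
+# Build the network for the 10 CIFAR-10 classes.
+net = resnet50(class_num=10)
+# Labels are sparse class indices; average the loss over the batch.
+loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
+opt = Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)
+model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})
+# ds_train = create_dataset(...)   # from dataset.py
+# model.train(args.num_epochs, ds_train, callbacks=[PerformanceCallback(...)])
+```
+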
+### Adapting the Script to Training Jobs
+
 When a training job is created, its runtime arguments are passed to the script as command-line parameters, so the script must parse them before the corresponding values can be used in code. For example, data_url and train_url are the data storage path (an OBS path) and the training output path (an OBS path), respectively. The script parses the arguments into the `args` variable, which subsequent code can read.
 
 ```python
@@ -293,23 +442,51 @@ import argparse
 parser = argparse.ArgumentParser()
 parser.add_argument('--data_url', required=True, default=None, help='Location of data.')
 parser.add_argument('--train_url', required=True, default=None, help='Location of training outputs.')
-parser.add_argument('--num_epochs', type=int, default=1, help='Number of training epochs.')
+parser.add_argument('--num_epochs', type=int, default=90, help='Number of training epochs.')
 args, unknown = parser.parse_known_args()
 ```
 
-MindSpore does not yet provide interfaces for accessing OBS data directly, so the script has to interact with OBS through the APIs offered by MoXing. Copy the data stored in OBS into the execution container:
+MindSpore does not yet provide interfaces for accessing OBS data directly, so the script has to interact with OBS through the APIs offered by MoXing.
 
-```python
-import moxing as mox
-mox.file.copy_parallel(src_url=args.data_url, dst_url='cifar10/')
-```
+**Option 1**
 
-To copy training outputs (such as model checkpoints) from the execution container back to OBS, refer to:
+- Copy the dataset from an OBS bucket in your own account into the execution container:
 
-```python
-import moxing as mox
-mox.file.copy_parallel(src_url='output', dst_url='s3://OBS/PATH')
-```
+    ```python
+    import moxing as mox
+    mox.file.copy_parallel(src_url=args.data_url, dst_url='cifar10/')
+    ```
+
+- To copy training outputs (such as model checkpoints) from the execution container back to your own OBS bucket, refer to:
+
+    ```python
+    import moxing as mox
+    mox.file.copy_parallel(src_url='output', dst_url='s3://OBS/PATH')
+    ```
+
+**Option 2**
+
+- Copy the dataset from an OBS bucket in another account. This requires that the bucket has been set to public read (or public read/write), and that you have that account's Access Key ID, Secret Access Key, and the bucket endpoint (OBS bucket -> Overview -> Basic Information -> Endpoint).
+
+    ```python
+    import moxing as mox
+    # set moxing/obs auth info, ak: Access Key Id, sk: Secret Access Key, server: endpoint of the obs bucket
+    mox.file.set_auth(ak='VCT2GKI3GJOZBQYJG5WM', sk='t1y8M4Z6bHLSAEGK2bCeRYMjo2S2u0QBqToYbxzB',
+                      server="obs.cn-north-4.myhuaweicloud.com")
+    # copy dataset from the obs bucket to the container/cache
+    mox.file.copy_parallel(src_url="s3://share-course/dataset/cifar10/", dst_url='cifar10/')
+    ```
+
+- After set_auth() has been called with the other account's keys, call set_auth() again with your own account's keys before copying training outputs back:
+
+    ```python
+    import os
+    import moxing as mox
+    mox.file.set_auth(ak='Your own Access Key', sk='Your own Secret Access Key',
+                      server="obs.cn-north-4.myhuaweicloud.com")
+    mox.file.copy_parallel(src_url='ckpt', dst_url=os.path.join(args.train_url, 'ckpt'))
+    ```
+
+    If you do not set your own account's keys, the checkpoints can only be copied to an OBS bucket under the other account.
 
 ### Creating a Training Job
 
@@ -331,9 +508,53 @@
 1. Click Submit to start training;
 2. The newly created job appears in the training job list, and version management is available on the job page;
 3. Click the running job to expand its window, where you can view the job configuration and the training logs; the logs refresh continuously, and after the job finishes you can also download them for local inspection;
-4. The training log contains fields such as `epoch 90 cost time = 27.963477849960327, train step num: 1562, one step time: 17.90235457743939 ms, train samples per second of cluster: 1787.5`, i.e., the performance data of the training process;
-5. The training log contains fields such as `epoch: 90 step: 1562, loss is 0.00250402`, i.e., the loss values during training;
-6. The training log contains the field `Evaluation result: {'acc': 0.9182692307692307}.`, i.e., the validation accuracy after training completes.
+4. The training log contains fields such as `epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0`, i.e., the performance data of the training process;
+5. The training log contains fields such as `epoch: 90 step 1562, loss is 0.0002547435578890145`, i.e., the loss values during training;
+6. The training log contains the field `Evaluation result: {'acc': 0.9467147435897436}.`, i.e., the validation accuracy after training completes.
+
+```
+epoch 1 cost time = 156.34279108047485, train step num: 1562, one step time: 100.09141554447814 ms, train samples per second of cluster: 319.7
+epoch: 1 step 1562, loss is 1.5020508766174316
+Epoch time: 156343.661, per step time: 100.092, avg loss: 1.502
+************************************************************
+epoch 2 cost time = 27.33933186531067, train step num: 1562, one step time: 17.502773281248828 ms, train samples per second of cluster: 1828.3
+epoch: 2 step 1562, loss is 1.612194299697876
+Epoch time: 27339.779, per step time: 17.503, avg loss: 1.612
+************************************************************
+epoch 3 cost time = 27.33275270462036, train step num: 1562, one step time: 17.498561270563613 ms, train samples per second of cluster: 1828.7
+epoch: 3 step 1562, loss is 1.0880045890808105
+Epoch time: 27333.157, per step time: 17.499, avg loss: 1.088
+************************************************************
+...
+...
+...
+epoch 50 cost time = 27.318379402160645, train step num: 1562, one step time: 17.48935941239478 ms, train samples per second of cluster: 1829.7
+epoch: 50 step 1562, loss is 0.028316421434283257
+Epoch time: 27318.783, per step time: 17.490, avg loss: 0.028
+************************************************************
+epoch 51 cost time = 27.317234992980957, train step num: 1562, one step time: 17.488626756069756 ms, train samples per second of cluster: 1829.8
+epoch: 51 step 1562, loss is 0.09725271165370941
+Epoch time: 27317.556, per step time: 17.489, avg loss: 0.097
+************************************************************
+...
+...
+...
+************************************************************
+epoch 88 cost time = 27.33049988746643, train step num: 1562, one step time: 17.497119006060455 ms, train samples per second of cluster: 1828.9
+epoch: 88 step 1562, loss is 0.0008127370965667069
+Epoch time: 27330.821, per step time: 17.497, avg loss: 0.001
+************************************************************
+epoch 89 cost time = 27.33343005180359, train step num: 1562, one step time: 17.498994911525987 ms, train samples per second of cluster: 1828.7
+epoch: 89 step 1562, loss is 0.00029994442593306303
+Epoch time: 27333.826, per step time: 17.499, avg loss: 0.000
+************************************************************
+epoch 90 cost time = 27.328994035720825, train step num: 1562, one step time: 17.496154952446112 ms, train samples per second of cluster: 1829.0
+epoch: 90 step 1562, loss is 0.0002547435578890145
+Epoch time: 27329.307, per step time: 17.496, avg loss: 0.000
+************************************************************
+Start run evaluation.
+Evaluation result: {'acc': 0.9467147435897436}.
+```
 
 ## Experiment Conclusion
 
diff --git a/experiment_3/images/resnet_archs.png b/resnet50/images/resnet_archs.png
similarity index 100%
rename from experiment_3/images/resnet_archs.png
rename to resnet50/images/resnet_archs.png
diff --git a/experiment_3/images/resnet_block.png b/resnet50/images/resnet_block.png
similarity index 100%
rename from experiment_3/images/resnet_block.png
rename to resnet50/images/resnet_block.png