diff --git a/tutorials/notebook/README.md b/tutorials/notebook/README.md
index fb999501e7ac9982e36a878b76981277fc0dca92..f7eac5dd4b49ceb37ad71ce574f11d0165e53954 100644
--- a/tutorials/notebook/README.md
+++ b/tutorials/notebook/README.md
@@ -57,6 +57,7 @@
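| Data processing and augmentation | [data_loading_enhancement.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/data_loading_enhance/data_loading_enhancement.ipynb) | User guide | - Learn the data processing and augmentation methods in MindSpore<br/> - Demonstrate data processing and augmentation operations in practice<br/> - Compare the data before and after processing<br/> - Explain what data processing and augmentation achieve
| Natural language processing application | [nlp_application.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/nlp_application.ipynb) | Application practice | - Show how MindSpore is applied to natural language processing<br/> - Show dataset-specific preprocessing methods for natural language processing<br/> - Show how to define the LSTM-based SentimentNet network
| Computer vision application | [computer_vision_application.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/computer_vision_application.ipynb) | Application practice | - Learn how MindSpore convolutional neural networks are applied to computer vision<br/> - Learn to download the CIFAR-10 dataset and set up the runtime environment<br/> - Learn to build a convolutional neural network with ResNet-50<br/> - Learn to build the optimizer and loss function with Momentum and SoftmaxCrossEntropyWithLogits<br/> - Learn to tune parameters, train the model, and assess model accuracy
+| Synchronizing model training and validation | [synchronization_training_and_evaluation.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/synchronization_training_and_evaluation.ipynb) | Application practice | - Understand how to run model training and validation together<br/> - Learn how to set the parameters for synchronized training and validation<br/> - Use a plotting function to pick the best model among the saved checkpoints
| Debugging neural network training in PyNative mode | [debugging_in_pynative_mode.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/debugging_in_pynative_mode.ipynb) | Model tuning | - Walk through how the data changes during a single training step on one sample taken from the dataset, on the GPU platform<br/> - Understand the debugging methods available in PyNative mode<br/> - Visualize how image data changes during training<br/> - Understand how to build a function that computes weight gradients<br/> - Show how the weights change within one step, together with the data
| Customized debugging information | [customized_debugging_information.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/customized_debugging_information.ipynb) | Model tuning | - Understand MindSpore's custom debugging operators<br/> - Learn to use the custom debugging operator Callback to set up timed training<br/> - Learn to set the metrics operator to output the corresponding model accuracy<br/> - Learn to set the logging environment variables that control glog output
| Model lineage and data lineage with MindInsight | [mindinsight_model_lineage_and_data_lineage.ipynb](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/mindinsight/mindinsight_model_lineage_and_data_lineage.ipynb) | Model tuning | - Understand how training data is collected and displayed in MindSpore<br/> - Learn to record data with SummaryRecord<br/> - Learn to collect data with the SummaryCollector callback<br/> - Visualize data with MindInsight<br/> - Understand how to use data lineage and model lineage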
diff --git a/tutorials/notebook/synchronization_training_and_evaluation.ipynb b/tutorials/notebook/synchronization_training_and_evaluation.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..486fd4cf2f762b193054db536c9576b5bdc5512f
--- /dev/null
+++ b/tutorials/notebook/synchronization_training_and_evaluation.ipynb
@@ -0,0 +1,510 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Training and Validating a Model Synchronously"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Overview"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
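+ "Training a complex network often takes dozens or even hundreds of epochs. Before training starts, it is hard to know at which epoch the model's accuracy will meet the requirement. A common approach is therefore to validate the model's accuracy at fixed epoch intervals during training and to save the corresponding checkpoint. After training finishes, the change in accuracy across the saved checkpoints quickly shows which one is relatively the best. This tutorial demonstrates the approach using the LeNet network as an example."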
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
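+ "The overall workflow is as follows:\n",
+ "1. Prepare the dataset.\n",
+ "2. Build the neural network.\n",
+ "3. Define the callback EvalCallBack.\n",
+ "4. Define the training network and run it.\n",
+ "5. Define a plotting function and draw a line chart of model accuracy at different epochs."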
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Preparation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Downloading the Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Training dataset download addresses: {\"http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz \", \"http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz \"}.\n",
+ "\n",
+ "Test dataset: {\"\", \"\"}\n",
+ "\n",
+ "Place the dataset under *Jupyter working directory + \\MNIST_Data\\*, with the following structure:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "MNIST_Data\n",
+ "├── test\n",
+ "│   ├── t10k-images-idx3-ubyte\n",
+ "│   └── t10k-labels-idx1-ubyte\n",
+ "└── train\n",
+ "    ├── train-images-idx3-ubyte\n",
+ "    └── train-labels-idx1-ubyte\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Dataset Augmentation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
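+ "After downloading, the dataset must be converted through `mindspore.dataset` into data that the MindSpore framework can consume, and then augmented with the tools the framework provides so that it meets the input requirements of the LeNet network."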
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import mindspore.dataset as ds\n",
+ "import mindspore.dataset.transforms.vision.c_transforms as CV\n",
+ "import mindspore.dataset.transforms.c_transforms as C\n",
+ "from mindspore.dataset.transforms.vision import Inter\n",
+ "from mindspore.common import dtype as mstype\n",
+ "\n",
+ "def create_dataset(data_path, batch_size=32, repeat_size=1,\n",
+ "                   num_parallel_workers=1):\n",
+ "    # define dataset\n",
+ "    mnist_ds = ds.MnistDataset(data_path)\n",
+ "\n",
+ "    # define map operations\n",
+ "    resize_op = CV.Resize((32, 32), interpolation=Inter.LINEAR)  # resize images to 32x32\n",
+ "    rescale_nml_op = CV.Rescale(1 / 0.3081, -1 * 0.1307 / 0.3081)  # normalize with mean 0.1307, std 0.3081\n",
+ "    rescale_op = CV.Rescale(1/255.0, 0.0)  # scale pixel values to [0, 1]\n",
+ "    hwc2chw_op = CV.HWC2CHW()  # change layout from HWC to CHW\n",
+ "    type_cast_op = C.TypeCast(mstype.int32)  # cast labels to int32\n",
+ "\n",
+ "    # apply map operations on labels and images\n",
+ "    mnist_ds = mnist_ds.map(input_columns=\"label\", operations=type_cast_op, num_parallel_workers=num_parallel_workers)\n",
+ "    mnist_ds = mnist_ds.map(input_columns=\"image\", operations=[resize_op, rescale_op, rescale_nml_op, hwc2chw_op],\n",
+ "                            num_parallel_workers=num_parallel_workers)\n",
+ "\n",
+ "    # apply DatasetOps\n",
+ "    buffer_size = 10000\n",
+ "    mnist_ds = mnist_ds.shuffle(buffer_size=buffer_size)\n",
+ "    mnist_ds = mnist_ds.batch(batch_size, drop_remainder=True)\n",
+ "    mnist_ds = mnist_ds.repeat(repeat_size)\n",
+ "\n",
+ "    return mnist_ds"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Building the Neural Network"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
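+ "LeNet is a 7-layer neural network that involves convolutional layers, fully connected layers, and activation functions. MindSpore already provides operators for all of these, so they only need to be imported. The code below first defines helpers that initialize the convolution layers, the fully connected layers, and the weights, then declares the layers in `LeNet5` and assembles the network in `construct`."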
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import mindspore.nn as nn\n",
+ "from mindspore.common.initializer import TruncatedNormal\n",
+ "\n",
+ "\n",
+ "def conv(in_channels, out_channels, kernel_size, stride=1, padding=0):\n",
+ "    \"\"\"Conv layer weight initial.\"\"\"\n",
+ "    weight = weight_variable()\n",
+ "    return nn.Conv2d(in_channels, out_channels,\n",
+ "                     kernel_size=kernel_size, stride=stride, padding=padding,\n",
+ "                     weight_init=weight, has_bias=False, pad_mode=\"valid\")\n",
+ "\n",
+ "def fc_with_initialize(input_channels, out_channels):\n",
+ "    \"\"\"Fc layer weight initial.\"\"\"\n",
+ "    weight = weight_variable()\n",
+ "    bias = weight_variable()\n",
+ "    return nn.Dense(input_channels, out_channels, weight, bias)\n",
+ "\n",
+ "def weight_variable():\n",
+ "    \"\"\"Weight initial.\"\"\"\n",
+ "    return TruncatedNormal(0.02)\n",
+ "\n",
+ "class LeNet5(nn.Cell):\n",
+ "    \"\"\"Lenet network structure.\"\"\"\n",
+ "    # define the operator required\n",
+ "    def __init__(self):\n",
+ "        super(LeNet5, self).__init__()\n",
+ "        self.conv1 = conv(1, 6, 5)\n",
+ "        self.conv2 = conv(6, 16, 5)\n",
+ "        self.fc1 = fc_with_initialize(16 * 5 * 5, 120)\n",
+ "        self.fc2 = fc_with_initialize(120, 84)\n",
+ "        self.fc3 = fc_with_initialize(84, 10)\n",
+ "        self.relu = nn.ReLU()\n",
+ "        self.max_pool2d = nn.MaxPool2d(kernel_size=2, stride=2)\n",
+ "        self.flatten = nn.Flatten()\n",
+ "\n",
+ "    # use the preceding operators to construct networks\n",
+ "    def construct(self, x):\n",
+ "        x = self.conv1(x)\n",
+ "        x = self.relu(x)\n",
+ "        x = self.max_pool2d(x)\n",
+ "        x = self.conv2(x)\n",
+ "        x = self.relu(x)\n",
+ "        x = self.max_pool2d(x)\n",
+ "        x = self.flatten(x)\n",
+ "        x = self.fc1(x)\n",
+ "        x = self.relu(x)\n",
+ "        x = self.fc2(x)\n",
+ "        x = self.relu(x)\n",
+ "        x = self.fc3(x)\n",
+ "        return x"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Defining the Callback EvalCallBack"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
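+ "Idea: validate the model's accuracy once every n epochs. The validation is implemented in a custom callback; for details on writing custom callbacks, see the [API description](https://www.mindspore.cn/api/zh-CN/master/api/python/mindspore/mindspore.train.html?highlight=callback#mindspore.train.callback.Callback).\n",
+ "\n",
+ "Core implementation: set the validation point inside the callback's `epoch_end` method, as follows:\n",
+ "\n",
+ "`cur_epoch % eval_per_epoch == 0`: validate the model's accuracy at the end of every `eval_per_epoch` epochs.\n",
+ "\n",
+ "- `cur_epoch`: the number of the current training epoch.\n",
+ "- `eval_per_epoch`: a user-defined value that sets the validation frequency.\n",
+ "\n",
+ "Other parameters:\n",
+ "\n",
+ "- `model`: the MindSpore `Model` object.\n",
+ "- `eval_dataset`: the validation dataset.\n",
+ "- `epoch_per_eval`: records the validation accuracy and the corresponding epoch number, in the form `{\"epoch\":[],\"acc\":[]}`."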
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "from mindspore.train.callback import Callback\n",
+ "\n",
+ "class EvalCallBack(Callback):\n",
+ "    def __init__(self, model, eval_dataset, eval_per_epoch):\n",
+ "        self.model = model\n",
+ "        self.eval_dataset = eval_dataset\n",
+ "        self.eval_per_epoch = eval_per_epoch\n",
+ "\n",
+ "    def epoch_end(self, run_context):\n",
+ "        cb_param = run_context.original_args()\n",
+ "        cur_epoch = cb_param.cur_epoch_num\n",
+ "        # validate only at the end of every eval_per_epoch-th epoch\n",
+ "        if cur_epoch % self.eval_per_epoch == 0:\n",
+ "            acc = self.model.eval(self.eval_dataset, dataset_sink_mode=True)\n",
+ "            # epoch_per_eval is the module-level dict created in the training cell\n",
+ "            epoch_per_eval[\"epoch\"].append(cur_epoch)\n",
+ "            epoch_per_eval[\"acc\"].append(acc[\"Accuracy\"])\n",
+ "            print(acc)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Defining and Running the Training Network"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
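+ "In `CheckpointConfig`, which controls how checkpoints are saved, first work out the number of steps in a single epoch, then match the save interval to the desired validation frequency.\n",
+ "This example has 1875 steps per epoch (60,000 training images at a batch size of 32). To validate once every two epochs, set `save_checkpoint_steps=eval_per_epoch*1875`,\n",
+ "where the variable `eval_per_epoch` equals 2.\n",
+ "\n",
+ "Parameters:\n",
+ "\n",
+ "- `train_data_path`: path to the training dataset.\n",
+ "- `eval_data_path`: path to the validation dataset.\n",
+ "- `train_data`: the training dataset.\n",
+ "- `eval_data`: the validation dataset.\n",
+ "- `net_loss`: the loss function.\n",
+ "- `net_opt`: the optimizer.\n",
+ "- `config_ck`: the checkpoint-saving configuration.\n",
+ "    - `save_checkpoint_steps`: the number of steps between checkpoint saves.\n",
+ "    - `keep_checkpoint_max`: the maximum number of checkpoints to keep.\n",
+ "- `ckpoint_cb`: the checkpoint file name prefix and save directory.\n",
+ "- `model`: the model.\n",
+ "- `model.train`: the function that trains the model.\n",
+ "- `epoch_per_eval`: the dictionary that collects `epoch` numbers and the corresponding model accuracies."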
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "epoch: 1 step: 375, loss is 2.3058078\n",
+ "epoch: 1 step: 750, loss is 2.3073978\n",
+ "epoch: 1 step: 1125, loss is 2.3103657\n",
+ "epoch: 1 step: 1500, loss is 0.65018296\n",
+ "epoch: 1 step: 1875, loss is 0.07800862\n",
+ "epoch: 2 step: 375, loss is 0.010344766\n",
+ "epoch: 2 step: 750, loss is 0.052723818\n",
+ "epoch: 2 step: 1125, loss is 0.08183526\n",
+ "epoch: 2 step: 1500, loss is 0.007430988\n",
+ "epoch: 2 step: 1875, loss is 0.0076965275\n",
+ "{'Accuracy': 0.9753605769230769}\n",
+ "epoch: 3 step: 375, loss is 0.11964749\n",
+ "epoch: 3 step: 750, loss is 0.04522314\n",
+ "epoch: 3 step: 1125, loss is 0.018271001\n",
+ "epoch: 3 step: 1500, loss is 0.006928641\n",
+ "epoch: 3 step: 1875, loss is 0.15374172\n",
+ "epoch: 4 step: 375, loss is 0.12120275\n",
+ "epoch: 4 step: 750, loss is 0.122824214\n",
+ "epoch: 4 step: 1125, loss is 0.0023852547\n",
+ "epoch: 4 step: 1500, loss is 0.018273383\n",
+ "epoch: 4 step: 1875, loss is 0.08102103\n",
+ "{'Accuracy': 0.9821714743589743}\n",
+ "epoch: 5 step: 375, loss is 0.12944886\n",
+ "epoch: 5 step: 750, loss is 0.0010141768\n",
+ "epoch: 5 step: 1125, loss is 0.0054096584\n",
+ "epoch: 5 step: 1500, loss is 0.0022614016\n",
+ "epoch: 5 step: 1875, loss is 0.07229582\n",
+ "epoch: 6 step: 375, loss is 0.0025749032\n",
+ "epoch: 6 step: 750, loss is 0.06261393\n",
+ "epoch: 6 step: 1125, loss is 0.021273317\n",
+ "epoch: 6 step: 1500, loss is 0.011360342\n",
+ "epoch: 6 step: 1875, loss is 0.12855275\n",
+ "{'Accuracy': 0.9853766025641025}\n",
+ "epoch: 7 step: 375, loss is 0.09330422\n",
+ "epoch: 7 step: 750, loss is 0.002063415\n",
+ "epoch: 7 step: 1125, loss is 0.0047940286\n",
+ "epoch: 7 step: 1500, loss is 0.0052507296\n",
+ "epoch: 7 step: 1875, loss is 0.018066114\n",
+ "epoch: 8 step: 375, loss is 0.08678668\n",
+ "epoch: 8 step: 750, loss is 0.02440551\n",
+ "epoch: 8 step: 1125, loss is 0.0017507032\n",
+ "epoch: 8 step: 1500, loss is 0.02957578\n",
+ "epoch: 8 step: 1875, loss is 0.0023948685\n",
+ "{'Accuracy': 0.9863782051282052}\n",
+ "epoch: 9 step: 375, loss is 0.012376097\n",
+ "epoch: 9 step: 750, loss is 0.029711302\n",
+ "epoch: 9 step: 1125, loss is 0.017438065\n",
+ "epoch: 9 step: 1500, loss is 0.015443239\n",
+ "epoch: 9 step: 1875, loss is 0.0031764025\n",
+ "epoch: 10 step: 375, loss is 0.0005294987\n",
+ "epoch: 10 step: 750, loss is 0.0015696918\n",
+ "epoch: 10 step: 1125, loss is 0.019949459\n",
+ "epoch: 10 step: 1500, loss is 0.004248183\n",
+ "epoch: 10 step: 1875, loss is 0.07389321\n",
+ "{'Accuracy': 0.9824719551282052}\n"
+ ]
+ }
+ ],
+ "source": [
+ "from mindspore.train.serialization import load_checkpoint, load_param_into_net\n",
+ "from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor\n",
+ "from mindspore.train import Model\n",
+ "from mindspore import context\n",
+ "from mindspore.nn.metrics import Accuracy\n",
+ "from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits\n",
+ "\n",
+ "if __name__ == \"__main__\":\n",
+ "    context.set_context(mode=context.GRAPH_MODE, device_target=\"GPU\")\n",
+ "    train_data_path = \"./MNIST_Data/train\"\n",
+ "    eval_data_path = \"./MNIST_Data/test\"\n",
+ "    ckpt_save_dir = \"./lenet_ckpt\"\n",
+ "    epoch_size = 10\n",
+ "    eval_per_epoch = 2\n",
+ "    repeat_size = 1\n",
+ "    network = LeNet5()\n",
+ "\n",
+ "    train_data = create_dataset(train_data_path, repeat_size=repeat_size)\n",
+ "    eval_data = create_dataset(eval_data_path, repeat_size=repeat_size)\n",
+ "\n",
+ "    # define the loss function\n",
+ "    net_loss = SoftmaxCrossEntropyWithLogits(is_grad=False, sparse=True, reduction='mean')\n",
+ "    # define the optimizer\n",
+ "    net_opt = nn.Momentum(network.trainable_params(), learning_rate=0.01, momentum=0.9)\n",
+ "    config_ck = CheckpointConfig(save_checkpoint_steps=eval_per_epoch*1875, keep_checkpoint_max=15)\n",
+ "    ckpoint_cb = ModelCheckpoint(prefix=\"checkpoint_lenet\", directory=ckpt_save_dir, config=config_ck)\n",
+ "    model = Model(network, net_loss, net_opt, metrics={\"Accuracy\": Accuracy()})\n",
+ "\n",
+ "    epoch_per_eval = {\"epoch\": [], \"acc\": []}\n",
+ "    eval_cb = EvalCallBack(model, eval_data, eval_per_epoch)\n",
+ "\n",
+ "    model.train(epoch_size, train_data, callbacks=[ckpoint_cb, LossMonitor(375), eval_cb],\n",
+ "                dataset_sink_mode=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The `lenet_ckpt` folder in the same directory now holds five saved checkpoints and one file of computational-graph data, structured as follows:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "```\n",
+ "lenet_ckpt\n",
+ "├── checkpoint_lenet-10_1875.ckpt\n",
+ "├── checkpoint_lenet-2_1875.ckpt\n",
+ "├── checkpoint_lenet-4_1875.ckpt\n",
+ "├── checkpoint_lenet-6_1875.ckpt\n",
+ "├── checkpoint_lenet-8_1875.ckpt\n",
+ "└── checkpoint_lenet-graph.meta\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Plotting Model Accuracy at Different Epochs"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Define the plotting function `eval_show` and pass `epoch_per_eval` to it to draw a line chart of validation accuracy at different `epoch` values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "def eval_show(epoch_per_eval):\n",
+ "    plt.xlabel(\"epoch number\")\n",
+ "    plt.ylabel(\"Model accuracy\")\n",
+ "    plt.title(\"Model accuracy variation chart\")\n",
+ "    plt.plot(epoch_per_eval[\"epoch\"], epoch_per_eval[\"acc\"], \"red\")\n",
+ "    plt.show()\n",
+ "\n",
+ "eval_show(epoch_per_eval)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The chart above makes it easy to pick out the best model at a glance."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Summary"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
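+ "This example trains the LeNet5 convolutional neural network on the MNIST dataset. It focuses on using a callback to validate the model during training, saving the checkpoint for the corresponding `epoch`, and choosing the best model among the saved checkpoints."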
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.6"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {},
+ "toc_section_display": true,
+ "toc_window_display": true
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/tutorials/source_zh_cn/advanced_use/images/synchronization_training_and_evaluation.png b/tutorials/source_zh_cn/advanced_use/images/synchronization_training_and_evaluation.png
new file mode 100644
index 0000000000000000000000000000000000000000..cbecb6c9739eaf047c89ea79f9d596a2793e6283
Binary files /dev/null and b/tutorials/source_zh_cn/advanced_use/images/synchronization_training_and_evaluation.png differ
diff --git a/tutorials/source_zh_cn/advanced_use/synchronization_training_and_evaluation.md b/tutorials/source_zh_cn/advanced_use/synchronization_training_and_evaluation.md
new file mode 100644
index 0000000000000000000000000000000000000000..6e6932f1894f5a6caa018e5a7684e738a323f294
--- /dev/null
+++ b/tutorials/source_zh_cn/advanced_use/synchronization_training_and_evaluation.md
@@ -0,0 +1,174 @@
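+# Training and Validating a Model Synchronously
+
+- [Training and Validating a Model Synchronously](#training-and-validating-a-model-synchronously)
+    - [Overview](#overview)
+    - [Defining the Callback EvalCallBack](#defining-the-callback-evalcallback)
+    - [Defining and Running the Training Network](#defining-and-running-the-training-network)
+    - [Defining a Function to Plot Model Accuracy at Different Epochs](#defining-a-function-to-plot-model-accuracy-at-different-epochs)
+    - [Summary](#summary)
+
+## Overview
+
+Training a complex network often takes dozens or even hundreds of epochs. Before training starts, it is hard to know at which epoch the model's accuracy will meet the requirement. A common approach is therefore to validate the model's accuracy at fixed epoch intervals during training and to save the corresponding checkpoint. After training finishes, the change in accuracy across the saved checkpoints quickly shows which one is relatively the best. This tutorial demonstrates the approach using the LeNet network as an example.
+
+The workflow is as follows:
+1. Define the callback EvalCallBack so that training and validation run together.
+2. Define the training network and run it.
+3. Plot the model accuracy at different epochs as a line chart and pick the best model.
+
+For the complete example, see the [notebook](https://gitee.com/mindspore/docs/blob/master/tutorials/notebook/synchronization_training_and_evaluation.ipynb).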
+
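+## Defining the Callback EvalCallBack
+
+Idea: validate the model's accuracy once every n epochs. The validation is implemented in a custom callback; for details on writing custom callbacks, see the [API description](https://www.mindspore.cn/api/zh-CN/master/api/python/mindspore/mindspore.train.html?highlight=callback#mindspore.train.callback.Callback).
+
+Core implementation: set the validation point inside the callback's `epoch_end` method, as follows:
+
+`cur_epoch % eval_per_epoch == 0`: validate the model's accuracy at the end of every `eval_per_epoch` epochs.
+
+- `cur_epoch`: the number of the current training epoch.
+- `eval_per_epoch`: a user-defined value that sets the validation frequency.
+
+Other parameters:
+
+- `model`: the MindSpore `Model` object.
+- `eval_dataset`: the validation dataset.
+- `epoch_per_eval`: records the validation accuracy and the corresponding epoch number, in the form `{"epoch":[],"acc":[]}`.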
+
+```python
+import matplotlib.pyplot as plt
+from mindspore.train.callback import Callback
+
+class EvalCallBack(Callback):
+    def __init__(self, model, eval_dataset, eval_per_epoch):
+        self.model = model
+        self.eval_dataset = eval_dataset
+        self.eval_per_epoch = eval_per_epoch
+
+    def epoch_end(self, run_context):
+        cb_param = run_context.original_args()
+        cur_epoch = cb_param.cur_epoch_num
+        # validate only at the end of every eval_per_epoch-th epoch
+        if cur_epoch % self.eval_per_epoch == 0:
+            acc = self.model.eval(self.eval_dataset, dataset_sink_mode=True)
+            # epoch_per_eval is the module-level dict created in the training script
+            epoch_per_eval["epoch"].append(cur_epoch)
+            epoch_per_eval["acc"].append(acc["Accuracy"])
+            print(acc)
+
+```
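+
+To see which epochs the condition `cur_epoch % eval_per_epoch == 0` selects, here is a minimal, framework-free sketch. The helper `evaluation_epochs` is hypothetical and is not part of MindSpore; it only reproduces the modulo check from `epoch_end`.
+
+```python
+def evaluation_epochs(epoch_size, eval_per_epoch):
+    """Return the epochs at which `cur_epoch % eval_per_epoch == 0` holds."""
+    return [epoch for epoch in range(1, epoch_size + 1)
+            if epoch % eval_per_epoch == 0]
+
+# With the settings used in this tutorial (10 epochs, validate every 2 epochs):
+print(evaluation_epochs(10, 2))  # [2, 4, 6, 8, 10]
+```
+
+Epoch numbers start at 1, so the first validation runs after the second epoch rather than before training.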
+
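+## Defining and Running the Training Network
+
+In `CheckpointConfig`, which controls how checkpoints are saved, first work out the number of steps in a single epoch, then match the save interval to the desired validation frequency. This example has 1875 steps per epoch (60,000 training images at a batch size of 32). To validate once every two epochs, set `save_checkpoint_steps=eval_per_epoch*1875`, where the variable `eval_per_epoch` equals 2.
+
+Parameters:
+
+- `config_ck`: the checkpoint-saving configuration.
+    - `save_checkpoint_steps`: the number of steps between checkpoint saves.
+    - `keep_checkpoint_max`: the maximum number of checkpoints to keep.
+- `ckpoint_cb`: the checkpoint file name prefix and save directory.
+- `model`: the model.
+- `model.train`: the function that trains the model.
+- `epoch_per_eval`: the dictionary that collects `epoch` numbers and the corresponding model accuracies.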
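+
+The step arithmetic behind `save_checkpoint_steps` can be checked with a few lines of plain Python. This is only a sketch: the figures are assumptions taken from the complete notebook example, where the MNIST training set has 60,000 images and `create_dataset` batches them with `batch_size=32` and `drop_remainder=True`.
+
+```python
+# Assumed figures: 60,000 MNIST training images, batch size 32.
+train_images = 60000
+batch_size = 32
+eval_per_epoch = 2
+
+# Incomplete batches are dropped, so integer division gives the step count.
+steps_per_epoch = train_images // batch_size
+save_checkpoint_steps = eval_per_epoch * steps_per_epoch
+
+print(steps_per_epoch)        # 1875
+print(save_checkpoint_steps)  # 3750
+```
+
+With a different dataset size or batch size, the constant 1875 in the training code has to be recomputed in the same way.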
+
+```python
+from mindspore.train.serialization import load_checkpoint, load_param_into_net
+from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor
+from mindspore.train import Model
+from mindspore import context
+from mindspore.nn.metrics import Accuracy
+from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
+
+if __name__ == "__main__":
+    context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
+    ckpt_save_dir = "./lenet_ckpt"
+    eval_per_epoch = 2
+
+    ... ...
+
+    # the number of steps per epoch must be known in advance; in this example it is 1875
+    config_ck = CheckpointConfig(save_checkpoint_steps=eval_per_epoch*1875, keep_checkpoint_max=15)
+    ckpoint_cb = ModelCheckpoint(prefix="checkpoint_lenet", directory=ckpt_save_dir, config=config_ck)
+    model = Model(network, net_loss, net_opt, metrics={"Accuracy": Accuracy()})
+
+    epoch_per_eval = {"epoch": [], "acc": []}
+    eval_cb = EvalCallBack(model, eval_data, eval_per_epoch)
+
+    model.train(epoch_size, train_data, callbacks=[ckpoint_cb, LossMonitor(375), eval_cb],
+                dataset_sink_mode=True)
+```
+
+Output:
+
+    epoch: 1 step: 375, loss is 2.298612
+    epoch: 1 step: 750, loss is 2.075152
+    epoch: 1 step: 1125, loss is 0.39205977
+    epoch: 1 step: 1500, loss is 0.12368304
+    epoch: 1 step: 1875, loss is 0.20988345
+    epoch: 2 step: 375, loss is 0.20582482
+    epoch: 2 step: 750, loss is 0.029070046
+    epoch: 2 step: 1125, loss is 0.041760832
+    epoch: 2 step: 1500, loss is 0.067035824
+    epoch: 2 step: 1875, loss is 0.0050643035
+    {'Accuracy': 0.9763621794871795}
+
+    ... ...
+
+    epoch: 9 step: 375, loss is 0.021227183
+    epoch: 9 step: 750, loss is 0.005586236
+    epoch: 9 step: 1125, loss is 0.029125651
+    epoch: 9 step: 1500, loss is 0.00045874066
+    epoch: 9 step: 1875, loss is 0.023556218
+    epoch: 10 step: 375, loss is 0.0005807788
+    epoch: 10 step: 750, loss is 0.02574059
+    epoch: 10 step: 1125, loss is 0.108463734
+    epoch: 10 step: 1500, loss is 0.01950589
+    epoch: 10 step: 1875, loss is 0.10563098
+    {'Accuracy': 0.979667467948718}
+
+
+The `lenet_ckpt` folder in the same directory now holds five saved checkpoints and one file of computational-graph data, structured as follows:
+
+```
+lenet_ckpt
+├── checkpoint_lenet-10_1875.ckpt
+├── checkpoint_lenet-2_1875.ckpt
+├── checkpoint_lenet-4_1875.ckpt
+├── checkpoint_lenet-6_1875.ckpt
+├── checkpoint_lenet-8_1875.ckpt
+└── checkpoint_lenet-graph.meta
+```
+
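+## Defining a Function to Plot Model Accuracy at Different Epochs
+
+Define the plotting function `eval_show` and pass `epoch_per_eval` to it to draw a line chart of validation accuracy at different `epoch` values.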
+
+
+```python
+def eval_show(epoch_per_eval):
+    plt.xlabel("epoch number")
+    plt.ylabel("Model accuracy")
+    plt.title("Model accuracy variation chart")
+    plt.plot(epoch_per_eval["epoch"], epoch_per_eval["acc"], "red")
+    plt.show()
+
+eval_show(epoch_per_eval)
+```
+
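+Output:
+
+![Model accuracy variation chart](./images/synchronization_training_and_evaluation.png)
+
+The chart above makes it easy to pick out the best model at a glance.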
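+
+The same choice can also be made programmatically. The sketch below is illustrative only: the helper `best_checkpoint` is hypothetical, the accuracies are rounded from one run of the accompanying notebook (other runs will differ), and the file name pattern is assumed from the `lenet_ckpt` listing above.
+
+```python
+# Accuracies rounded from one notebook run; treat them as sample data.
+epoch_per_eval = {
+    "epoch": [2, 4, 6, 8, 10],
+    "acc": [0.9754, 0.9822, 0.9854, 0.9864, 0.9825],
+}
+
+def best_checkpoint(epoch_per_eval, prefix="checkpoint_lenet", steps_per_epoch=1875):
+    """Return (epoch, accuracy, file name) for the highest recorded accuracy."""
+    accuracies = epoch_per_eval["acc"]
+    best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
+    epoch = epoch_per_eval["epoch"][best_index]
+    # File name pattern assumed from the lenet_ckpt directory listing.
+    file_name = f"{prefix}-{epoch}_{steps_per_epoch}.ckpt"
+    return epoch, accuracies[best_index], file_name
+
+print(best_checkpoint(epoch_per_eval))
+# (8, 0.9864, 'checkpoint_lenet-8_1875.ckpt')
+```
+
+If two epochs tie on accuracy, `max` returns the earlier one, which is a reasonable default when a shorter training run is preferred.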
+
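+## Summary
+
+This example trains the LeNet5 convolutional neural network on the MNIST dataset. It focuses on validating the model during training, saving the checkpoint for the corresponding `epoch`, and choosing the best model among the saved checkpoints.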
diff --git a/tutorials/source_zh_cn/index.rst b/tutorials/source_zh_cn/index.rst
index 33171ea18ab9b7a45a08c40f6aeea694c4cc7efa..bb6bdfdedde13c651bd3abbfd4a147572aa853fc 100644
--- a/tutorials/source_zh_cn/index.rst
+++ b/tutorials/source_zh_cn/index.rst
@@ -33,6 +33,7 @@ MindSpore教程
advanced_use/computer_vision_application
advanced_use/nlp_application
advanced_use/second_order_optimizer_for_resnet50_application
+ advanced_use/synchronization_training_and_evaluation
.. toctree::
:glob: