diff --git a/tutorials/source_en/advanced_use/checkpoint_for_hybrid_parallel.md b/tutorials/source_en/advanced_use/checkpoint_for_hybrid_parallel.md
index 51cdb37a801109a46985d2e1b677af25cc879c38..d9bd74fdc4335914999a283cc9638b7bbd8298cc 100644
--- a/tutorials/source_en/advanced_use/checkpoint_for_hybrid_parallel.md
+++ b/tutorials/source_en/advanced_use/checkpoint_for_hybrid_parallel.md
@@ -313,7 +313,7 @@ User process:

 3. Execute stage 2 training: There are two devices in stage 2 training environment. The weight shape of the MatMul operator on each device is \[4, 8]. Load the initialized model parameter data from the integrated checkpoint file and then perform training.

-> For details about the distributed environment configuration and training code, see [Distributed Training](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
+> For details about the distributed environment configuration and training code, see [Distributed Training](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training_ascend.html).
 >
 > This document provides the example code for integrating checkpoint files and loading checkpoint files before distributed training. The code is for reference only.
diff --git a/tutorials/source_en/advanced_use/host_device_training.md b/tutorials/source_en/advanced_use/host_device_training.md
index d52d54d0b58a20c5e9ebbc1281f5a00336e56211..c63d735b0cd7b49d5a36026e0f6ed8a8589036e1 100644
--- a/tutorials/source_en/advanced_use/host_device_training.md
+++ b/tutorials/source_en/advanced_use/host_device_training.md
@@ -14,7 +14,7 @@

 ## Overview

-In deep learning, one usually has to deal with the huge model problem, in which the total size of parameters in the model is beyond the device memory capacity. To efficiently train a huge model, one solution is to employ homogenous accelerators (*e.g.*, Ascend 910 AI Accelerator and GPU) for [distributed training](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html). When the size of a model is hundreds of GBs or several TBs,
+In deep learning, one usually has to deal with the huge model problem, in which the total size of parameters in the model is beyond the device memory capacity. To efficiently train a huge model, one solution is to employ homogenous accelerators (*e.g.*, Ascend 910 AI Accelerator and GPU) for distributed training. When the size of a model is hundreds of GBs or several TBs,
 the number of required accelerators is too overwhelming for people to access, resulting in this solution inapplicable. One alternative is Host+Device hybrid training. This solution simultaneously leveraging the huge memory in hosts and fast computation in accelerators, is a promisingly efficient method for addressing huge model problem.
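The checkpoint hunk above describes loading the initialized model parameter data from the integrated checkpoint file before stage-2 distributed training, but not the calls involved. Below is a minimal sketch of that step, not taken from the patch: it assumes MindSpore's `load_checkpoint`/`load_param_into_net` APIs, and the toy network, parameter name, and checkpoint path are hypothetical placeholders.

```python
import numpy as np

import mindspore.nn as nn
import mindspore.ops.operations as P
from mindspore import Parameter, Tensor
from mindspore.communication.management import init, get_rank
from mindspore.train.serialization import load_checkpoint, load_param_into_net


class MatMulNet(nn.Cell):
    """Toy stand-in for the tutorial's model: one MatMul weight of shape [4, 8] per device."""
    def __init__(self):
        super(MatMulNet, self).__init__()
        self.weight = Parameter(Tensor(np.zeros([4, 8], np.float32)), name="weight")
        self.matmul = P.MatMul()

    def construct(self, x):
        return self.matmul(x, self.weight)


init()                   # initialize the distributed environment (assumes a launched multi-device job)
rank_id = get_rank()     # 0 or 1 in the two-device stage-2 setup

net = MatMulNet()
param_dict = load_checkpoint("integrated.ckpt")   # hypothetical path to the integrated checkpoint
load_param_into_net(net, param_dict)              # restore the initial weights before training starts
```

Each of the two stage-2 devices would run the same loading step under its own rank before training begins.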
diff --git a/tutorials/source_en/advanced_use/mixed_precision.md b/tutorials/source_en/advanced_use/mixed_precision.md
index c4b942be8930294a630da7e2a866eefd15c9aa2a..7d25e7fac5c2be62238f83a836ca4741029df0fa 100644
--- a/tutorials/source_en/advanced_use/mixed_precision.md
+++ b/tutorials/source_en/advanced_use/mixed_precision.md
@@ -10,7 +10,7 @@

-
+

 ## Overview
diff --git a/tutorials/source_en/advanced_use/quantization_aware.md b/tutorials/source_en/advanced_use/quantization_aware.md
index c3839f7a5e648bcca19a615c80899656bf2eefb7..2d3098b85dd1b659e89331feca691591bfad84e1 100644
--- a/tutorials/source_en/advanced_use/quantization_aware.md
+++ b/tutorials/source_en/advanced_use/quantization_aware.md
@@ -74,7 +74,7 @@ Compared with common training, the quantization aware training requires addition

 Next, the LeNet network is used as an example to describe steps 3 and 6.

-> You can obtain the complete executable sample code at .
+> You can obtain the complete executable sample code at .

 ### Defining a Fusion Network
@@ -196,4 +196,4 @@ For details, see
diff --git a/tutorials/source_zh_cn/advanced_use/checkpoint_for_hybrid_parallel.md b/tutorials/source_zh_cn/advanced_use/checkpoint_for_hybrid_parallel.md
--- a/tutorials/source_zh_cn/advanced_use/checkpoint_for_hybrid_parallel.md
+++ b/tutorials/source_zh_cn/advanced_use/checkpoint_for_hybrid_parallel.md
-> 具体分布式环境配置和训练部分代码,此处不做详细说明,可以参考[分布式并行训练](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training.html)
+> 具体分布式环境配置和训练部分代码,此处不做详细说明,可以参考[分布式并行训练](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training_ascend.html)
 章节。
 >
 > 本文档附上对CheckPoint文件做合并处理以及分布式训练前加载CheckPoint文件的示例代码,仅作为参考,实际请参考具体情况实现。
diff --git a/tutorials/source_zh_cn/advanced_use/host_device_training.md b/tutorials/source_zh_cn/advanced_use/host_device_training.md
index a64ad4101e96535215cbffb1381f7061800fc069..300b21ec8e19d7e9c5ab67df45b734bd294c12f4 100644
--- a/tutorials/source_zh_cn/advanced_use/host_device_training.md
+++ b/tutorials/source_zh_cn/advanced_use/host_device_training.md
@@ -14,7 +14,7 @@

 ## 概述

-在深度学习中,工作人员时常会遇到超大模型的训练问题,即模型参数所占内存超过了设备内存上限。为高效地训练超大模型,一种方案便是[分布式并行训练](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/distributed_training.html),也就是将工作交由同构的多个加速器(如Ascend 910 AI处理器,GPU等)共同完成。但是这种方式在面对几百GB甚至几TB级别的模型时,所需的加速器过多。而当从业者实际难以获取大规模集群时,这种方式难以应用。另一种可行的方案是使用主机端(Host)和加速器(Device)的混合训练模式。此方案同时发挥了主机端内存大和加速器端计算快的优势,是一种解决超大模型训练较有效的方式。
+在深度学习中,工作人员时常会遇到超大模型的训练问题,即模型参数所占内存超过了设备内存上限。为高效地训练超大模型,一种方案便是分布式并行训练,也就是将工作交由同构的多个加速器(如Ascend 910 AI处理器,GPU等)共同完成。但是这种方式在面对几百GB甚至几TB级别的模型时,所需的加速器过多。而当从业者实际难以获取大规模集群时,这种方式难以应用。另一种可行的方案是使用主机端(Host)和加速器(Device)的混合训练模式。此方案同时发挥了主机端内存大和加速器端计算快的优势,是一种解决超大模型训练较有效的方式。

 在MindSpore中,用户可以将待训练的参数放在主机,同时将必要算子的执行位置配置为主机,其余算子的执行位置配置为加速器,从而方便地实现混合训练。此教程以推荐模型[Wide&Deep](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/wide_and_deep)为例,讲解MindSpore在主机和Ascend 910 AI处理器的混合训练。
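The host-device overview above states the mechanism only in words: keep the large parameter on the host and configure selected operators to execute on the host, while the rest of the network runs on the accelerator. Below is a minimal sketch of that idea, not taken from the patch; the `primitive_target` attribute name follows the Wide&Deep host-device example the tutorial references and is an assumption here, as are the toy network and its sizes.

```python
import numpy as np

import mindspore.nn as nn
import mindspore.ops.operations as P
from mindspore import Parameter, Tensor


class HostEmbeddingNet(nn.Cell):
    """Toy model: the large embedding table stays in host memory, the dense layer runs on the accelerator."""
    def __init__(self, vocab_size=100000, embedding_size=16):
        super(HostEmbeddingNet, self).__init__()
        # Large parameter intended to live on the host side.
        table = np.random.normal(0.0, 0.01, [vocab_size, embedding_size]).astype(np.float32)
        self.embedding_table = Parameter(Tensor(table), name="embedding_table")
        # Pin the lookup operator to the host (attribute name is an assumption, see lead-in).
        self.gather = P.GatherV2()
        self.gather.add_prim_attr("primitive_target", "CPU")
        self.dense = nn.Dense(embedding_size, 1)   # executed on the accelerator

    def construct(self, indices):
        embeddings = self.gather(self.embedding_table, indices, 0)
        return self.dense(embeddings)
```

The design intent is that only memory-heavy, compute-light operators such as the embedding lookup are pinned to the host, since host execution is slower than the accelerator but host memory is far larger.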