Commit 3beec229 authored by Xiaoda Zhang

fix some errors in host+device training

Parent a11aaa9a
@@ -25,7 +25,7 @@ This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/mi
1. Prepare the model. The Wide&Deep code can be found at: <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/wide_and_deep>, in which `train_and_eval_auto_parallel.py` is the main function for training, the `src/` directory contains the model definition, data processing, and configuration files, and the `script/` directory contains the launch scripts for different modes.
2. Prepare the dataset. The dataset can be found at: <https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz>. Use the script `src/preprocess_data.py` to transform the dataset into MindRecord format (a minimal MindRecord-writing sketch appears after this preparation section).
3. Configure the device information. When performing training in a bare-metal environment, the network information file needs to be configured. This example employs only one accelerator, so `rank_table_1p_0.json` containing accelerator #0 is configured as follows (you need to check the server's IP first):
@@ -47,32 +47,20 @@ This tutorial introduces how to train [Wide&Deep](https://gitee.com/mindspore/mi
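Relating to step 2 of the preparation above: the actual conversion is done by `src/preprocess_data.py`. The following is only a minimal sketch of how a MindRecord file is written with MindSpore's `mindspore.mindrecord.FileWriter` API; the field names and shapes are illustrative assumptions, not necessarily the ones the script uses.

```python
import numpy as np
from mindspore.mindrecord import FileWriter

# Minimal MindRecord-writing sketch; field names/shapes are assumptions,
# not those produced by src/preprocess_data.py.
writer = FileWriter(file_name="criteo_sample.mindrecord", shard_num=1)
schema = {
    "feat_ids": {"type": "int32", "shape": [39]},
    "feat_vals": {"type": "float32", "shape": [39]},
    "label": {"type": "float32", "shape": [1]},
}
writer.add_schema(schema, "criteo_sample_schema")
sample = {
    "feat_ids": np.arange(39, dtype=np.int32),
    "feat_vals": np.ones(39, dtype=np.float32),
    "label": np.array([0.0], dtype=np.float32),
}
writer.write_raw_data([sample])
writer.commit()
```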
## Configuring for Hybrid Training
1. Configure the flag of hybrid training. In the function `argparse_init` of the file `src/config.py`, change the default value of `host_device_mix` to `1`; in the `__init__` function of `class WideDeepConfig`, change `self.host_device_mix` to `1` (a sketch of the corresponding `argparse_init` change follows this list):

    ```python
    self.host_device_mix = 1
    ```

2. Check the placement of the necessary operators and optimizers. In `class WideDeepModel` of the file `src/wide_and_deep.py`, check that `EmbeddingLookup` runs on the host:

    ```python
    self.deep_embeddinglookup = nn.EmbeddingLookup()
    self.wide_embeddinglookup = nn.EmbeddingLookup()
    ```

    In `class TrainStepWrap(nn.Cell)` of the file `src/wide_and_deep.py`, check that the two optimizers also run on the host:

    ```python
    self.optimizer_w.sparse_opt.add_prim_attr("primitive_target", "CPU")
    self.optimizer_d.sparse_opt.add_prim_attr("primitive_target", "CPU")
    ```
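As referenced in step 1, the same flag is exposed on the command line. Below is a minimal sketch of what the `argparse_init` change in `src/config.py` might look like; only the argument name `host_device_mix` and the default of `1` come from the tutorial, while the argument's type and help text are assumptions.

```python
import argparse

def argparse_init():
    """Sketch of the argument parser; only host_device_mix is shown here."""
    parser = argparse.ArgumentParser(description="WideDeep")
    # Per the tutorial, the default is changed to 1 to enable host+device hybrid training.
    parser.add_argument("--host_device_mix", type=int, default=1,
                        help="Enable host+device hybrid training (1: on, 0: off)")
    return parser
```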
## Training the Model
@@ -22,7 +22,7 @@
1. Prepare the model code. The Wide&Deep code can be found at: <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/wide_and_deep>, where `train_and_eval_auto_parallel.py` contains the main training function, the `src/` directory contains the Wide&Deep model definition, data processing, and configuration files, and the `script/` directory contains the training scripts for different configurations.
2. Prepare the dataset. The dataset can be downloaded from: <https://s3-eu-west-1.amazonaws.com/kaggle-display-advertising-challenge-dataset/dac.tar.gz>. Use the script `src/preprocess_data.py` to convert the dataset into MindRecord format.
3. Configure the device information. When performing distributed training in a bare-metal environment (that is, with local Ascend 910 AI processors), the accelerator information file needs to be configured. This example uses only one accelerator, so only the `rank_table_1p_0.json` file containing card #0 needs to be configured (the IP of each machine differs and must be set according to your network configuration; this is an example), as follows:
@@ -44,32 +44,20 @@
## Configuring for Hybrid Training
1. Configure the hybrid training flag. In the file `src/config.py`, set the default value of `host_device_mix` in the `argparse_init` function to `1`, and set `self.host_device_mix` in the `__init__` function of the `WideDeepConfig` class to `1`:

    ```python
    self.host_device_mix = 1
    ```

2. Check the execution location of the necessary operators and optimizers. In the `WideDeepModel` class of the file `src/wide_and_deep.py`, check that `EmbeddingLookup` is executed on the host (a brief sketch of the `EmbeddingLookup` host placement follows this list):

    ```python
    self.deep_embeddinglookup = nn.EmbeddingLookup()
    self.wide_embeddinglookup = nn.EmbeddingLookup()
    ```

    In `class TrainStepWrap(nn.Cell)` of the file `src/wide_and_deep.py`, check that the two optimizers are also executed on the host:

    ```python
    self.optimizer_w.sparse_opt.add_prim_attr("primitive_target", "CPU")
    self.optimizer_d.sparse_opt.add_prim_attr("primitive_target", "CPU")
    ```
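As referenced in step 2, the host placement of the embedding lookup comes from the `target` argument of `nn.EmbeddingLookup`. Below is a minimal standalone sketch; the `vocab_size`/`embedding_size` values are illustrative only, and the exact constructor signature may differ between MindSpore versions.

```python
import numpy as np
import mindspore as ms
import mindspore.nn as nn
from mindspore import Tensor

# Illustrative sizes only; Wide&Deep uses its own vocabulary and embedding dims.
# target='CPU' keeps the (potentially huge) embedding table and the lookup on the host.
lookup = nn.EmbeddingLookup(vocab_size=16, embedding_size=4, target='CPU')
indices = Tensor(np.array([[1, 3], [0, 7]]), ms.int32)
out = lookup(indices)   # embeddings gathered on the host, shape (2, 2, 4)
print(out.shape)
```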
## Training the Model