Unverified · Commit 63897e07 authored by Chengmo, committed by GitHub

update cpu distribute practice (#1804)

* update cpu distribute practice
Parent f4a037bc
@@ -4,8 +4,8 @@
Best practices of distributed training on CPU
##############################################
To improve the training speed of CPU distributed training, consider the following aspects:
1) Improve the training speed, mainly by improving CPU utilization; 2) Improve the communication speed, mainly by reducing the amount of data transmitted in communication; 3) Improve the data IO speed; 4) Switch the distributed training strategy to improve distributed training speed.
Improve CPU utilization
=======================
@@ -56,3 +56,106 @@ For detailed API usage, refer to :ref:`cn_api_fluid_ParallelExecutor`; a simple example
Among the parameters above:
- :code:`is_sparse` : Configures the embedding to use sparse updates. If the dict_size of the embedding is large but only a small amount of data is fed each time, the sparse update method is recommended, as in the sketch below.
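A minimal sketch of enabling sparse updates on an embedding layer; the variable names and the dict_size/emb_dim values (100000 and 64) are illustrative and not part of the original example:
.. code-block:: python
import paddle.fluid as fluid
# word ids are fed as int64 ids; 100000 and 64 stand in for dict_size and emb_dim
word_ids = fluid.layers.data(name="word_ids", shape=[1], dtype="int64", lod_level=1)
emb = fluid.layers.embedding(
    input=word_ids,
    size=[100000, 64],  # [dict_size, emb_dim]
    is_sparse=True)     # enable sparse gradient updates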
Improve data IO speed
=====================
To improve the data IO speed of CPU distributed training, first consider using the dataset API to read data. Dataset is a multi-producer, multi-consumer data reading method; by default it couples the data reading threads with the training threads, and in multi-threaded training it shows a significant performance advantage.
For an introduction to the API, refer to: https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dataset_cn/QueueDataset_cn.html
For how to use dataset with a real network such as the CTR-DNN model, refer to: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
Finally, use the :code:`train_from_dataset` interface to train the network:
.. code-block:: python
import paddle.fluid as fluid
# create the dataset; it still has to be configured (feed variables, file list, ...)
# before training, see the configuration sketch below
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.train_from_dataset(program=fluid.default_main_program(), dataset=dataset)
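Before :code:`train_from_dataset` can run, the dataset itself has to be configured. A minimal configuration sketch follows, assuming a QueueDataset; the feed variables, pipe command, batch size, thread count, and file list are placeholders that depend on the actual model and data:
.. code-block:: python
import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
dataset.set_use_var([label, words])               # placeholder feed variables from the network
dataset.set_pipe_command("python my_reader.py")   # hypothetical preprocessing script
dataset.set_batch_size(32)
dataset.set_thread(10)                            # number of reader/training threads
dataset.set_filelist(["train_data/part-0", "train_data/part-1"])  # placeholder file list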
Change the distributed training strategy
========================================
The key to further improving CPU distributed training speed is choosing an appropriate distributed training strategy, covering for example the communication strategy, the compilation strategy, and the execution strategy. PaddlePaddle released the :code:`DistributedStrategy` API in v1.7, which makes it flexible and convenient to specify the distributed training strategy.
First, import the relevant libraries:
.. code-block:: python
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
Then specify the training strategy for CPU distributed training. Four strategies are currently available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training. For the details of each strategy, see the design document:
https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
The following code loads the default configuration of each strategy and runs CPU distributed training:
.. code-block:: python
# step1: choose the CPU distributed training strategy
# synchronous training strategy
strategy = DistributedStrategyFactory.create_sync_strategy()
# half-asynchronous training strategy
strategy = DistributedStrategyFactory.create_half_async_strategy()
# asynchronous training strategy
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO training strategy
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: define the role of this node
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
# step3: build the distributed training program
# learning_rate and loss are defined by the user's own network
optimizer = fluid.optimizer.SGD(learning_rate)  # take the SGD optimizer as an example
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
# step4.1: start the parameter server node (Server)
if fleet.is_server():
    fleet.init_server()
    fleet.run_server()
# step4.2: start the trainer node (Trainer)
elif fleet.is_worker():
    fleet.init_worker()
    exe = fluid.Executor(fluid.CPUPlace())  # executor used for training
    exe.run(fleet.startup_program)
    # Do training
    exe.run(fleet.main_program)
    fleet.stop_worker()
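When the dataset reader described above is used, the worker can combine the two techniques and pass :code:`fleet.main_program` to :code:`train_from_dataset` instead of the plain :code:`exe.run` call. A sketch under that assumption; the dataset object and the epoch count are placeholders:
.. code-block:: python
# worker side: train from the configured dataset instead of calling exe.run
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fleet.startup_program)
for epoch in range(10):  # illustrative epoch count
    exe.train_from_dataset(program=fleet.main_program, dataset=dataset)
fleet.stop_worker()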
PaddlePaddle also supports adjusting the details of a training strategy:
- The build_strategy and exec_strategy needed to create a compiled_program can be obtained directly from the strategy:
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
loss_name=loss.name,
build_strategy=strategy.get_build_strategy(),
exec_strategy=strategy.get_execute_strategy())
- Customize strategy details: DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy, and fluid.BuildStrategy can all be configured. Taking DistributeTranspilerConfig as an example, the configuration can be modified as follows:
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# Option 1 (recommended):
config = strategy.get_program_config()
config.min_block_size = 81920
# Option 2: call set_program_config to modify program-related settings; both DistributeTranspilerConfig and dict are supported
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
strategy.set_program_config(config)
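The execution and build strategies attached to a DistributedStrategy can be tuned in the same way. A small sketch follows; the use of :code:`get_execute_strategy` mirrors the compiled_program example above, and the thread count is purely illustrative:
.. code-block:: python
strategy = DistributedStrategyFactory.create_async_strategy()
# tune the executor settings carried by the strategy; 16 threads is an illustrative value
exec_strategy = strategy.get_execute_strategy()
exec_strategy.num_threads = 16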
@@ -7,7 +7,9 @@ Best practices of distributed training on CPU
To improve the training speed of CPU distributed training, we must consider the following aspects:
1. Improve the training speed, mainly by improving the utilization rate of the CPU;
2. Improve the communication speed, mainly by reducing the amount of data transmitted in communication;
3. Improve the data IO speed by using the dataset API;
4. Improve the distributed training speed by changing the distributed training strategy.
Improve CPU utilization
=============================
@@ -58,3 +60,105 @@ To reduce the amount of communication data and improve communication speed is ac
Among the parameters above:
- :code:`is_sparse`: Configures the embedding to use sparse updates. If the dict_size of the embedding is large but only a small amount of data is fed each time, it is recommended to use the sparse update method, as in the sketch below.
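A minimal sketch of enabling sparse updates on an embedding layer; the variable names and the dict_size/emb_dim values (100000 and 64) are illustrative and not part of the original example:
.. code-block:: python
import paddle.fluid as fluid
# word ids are fed as int64 ids; 100000 and 64 stand in for dict_size and emb_dim
word_ids = fluid.layers.data(name="word_ids", shape=[1], dtype="int64", lod_level=1)
emb = fluid.layers.embedding(
    input=word_ids,
    size=[100000, 64],  # [dict_size, emb_dim]
    is_sparse=True)     # enable sparse gradient updates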
Improve data IO speed
==============================
To improve the data IO speed of CPU distributed training, first consider using the dataset API as the data reader. Dataset is a multi-producer, multi-consumer data reading method; by default it couples the data reading threads with the training threads, and in multi-threaded training it shows a significant performance advantage.
Refer to this page for an introduction to the API: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html
For how to use dataset with a real network such as the CTR-DNN model, refer to: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
Finally, use :code:`train_from_dataset` to train the network:
.. code-block:: python
import paddle.fluid as fluid
# create the dataset; it still has to be configured (feed variables, file list, ...)
# before training, see the configuration sketch below
dataset = fluid.DatasetFactory().create_dataset()
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.train_from_dataset(program=fluid.default_main_program(), dataset=dataset)
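Before :code:`train_from_dataset` can run, the dataset itself has to be configured. A minimal configuration sketch follows, assuming a QueueDataset; the feed variables, pipe command, batch size, thread count, and file list are placeholders that depend on the actual model and data:
.. code-block:: python
import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
dataset.set_use_var([label, words])               # placeholder feed variables from the network
dataset.set_pipe_command("python my_reader.py")   # hypothetical preprocessing script
dataset.set_batch_size(32)
dataset.set_thread(10)                            # number of reader/training threads
dataset.set_filelist(["train_data/part-0", "train_data/part-1"])  # placeholder file list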
Change distributed training strategy
====================================
The core of further improving CPU distributed training speed is to choose an appropriate distributed training strategy, covering for example the communication strategy, the compilation strategy, and the execution strategy. PaddlePaddle released the :code:`DistributedStrategy` API in v1.7, which makes it flexible and convenient to specify the distributed training strategy.
First, import the relevant libraries:
.. code-block:: python
import paddle.fluid as fluid
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
At present, four training strategies are available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async), and GEO training. For the details of each strategy, see the design document:
https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
The following code loads the default configuration of each strategy and runs CPU distributed training:
.. code-block:: python
# step1: get distributed strategy
# Sync
strategy = DistributedStrategyFactory.create_sync_strategy()
# Half-Async
strategy = DistributedStrategyFactory.create_half_async_strategy()
# Async
strategy = DistributedStrategyFactory.create_async_strategy()
# GEO
strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
# step2: define role of node
role = role_maker.PaddleCloudRoleMaker()
fleet.init(role)
# step3: get distributed training program
# learning_rate and loss are defined by the user's own network
optimizer = fluid.optimizer.SGD(learning_rate)  # take the SGD optimizer as an example
optimizer = fleet.distributed_optimizer(optimizer, strategy)
optimizer.minimize(loss)
# step4.1: run parameter server node
if fleet.is_server():
    fleet.init_server()
    fleet.run_server()
# step4.2: run worker node
elif fleet.is_worker():
    fleet.init_worker()
    exe = fluid.Executor(fluid.CPUPlace())  # executor used for training
    exe.run(fleet.startup_program)
    # Do training
    exe.run(fleet.main_program)
    fleet.stop_worker()
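When the dataset reader described above is used, the worker can combine the two techniques and pass :code:`fleet.main_program` to :code:`train_from_dataset` instead of the plain :code:`exe.run` call. A sketch under that assumption; the dataset object and the epoch count are placeholders:
.. code-block:: python
# worker side: train from the configured dataset instead of calling exe.run
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fleet.startup_program)
for epoch in range(10):  # illustrative epoch count
    exe.train_from_dataset(program=fleet.main_program, dataset=dataset)
fleet.stop_worker()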
PaddlePaddle supports adjusting the details of the training strategy:
- The build_strategy and exec_strategy used to create the compiled_program can be obtained directly from the strategy:
.. code-block:: python
compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
loss_name=loss.name,
build_strategy=strategy.get_build_strategy(),
exec_strategy=strategy.get_execute_strategy())
- Training strategy details can be customized. PaddlePaddle supports customized configuration of DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy, and fluid.BuildStrategy. Taking DistributeTranspilerConfig as an example, the configuration can be modified as follows:
.. code-block:: python
strategy = DistributedStrategyFactory.create_sync_strategy()
# Mode 1 (recommended):
config = strategy.get_program_config()
config.min_block_size = 81920
# Mode 2: call set_program_config to modify program-related settings; both DistributeTranspilerConfig and dict are supported
config = DistributeTranspilerConfig()
config.min_block_size = 81920
# config = dict()
# config['min_block_size'] = 81920
strategy.set_program_config(config)
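The execution and build strategies attached to a DistributedStrategy can be tuned in the same way. A small sketch follows; the use of :code:`get_execute_strategy` mirrors the compiled_program example above, and the thread count is purely illustrative:
.. code-block:: python
strategy = DistributedStrategyFactory.create_async_strategy()
# tune the executor settings carried by the strategy; 16 threads is an illustrative value
exec_strategy = strategy.get_execute_strategy()
exec_strategy.num_threads = 16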