diff --git a/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst b/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst
index 5973f9d8bf58030709dc37df89ca5b51cfb74ce6..2386890a13fe133c45a6495cc274827389f4a875 100644
--- a/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst
+++ b/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice.rst
@@ -4,8 +4,8 @@
 Best practices of distributed CPU training
 ############################################
 
-To improve the training speed of CPU distributed training, two aspects should mainly be considered:
-1) improve the training speed, mainly by raising CPU utilization; 2) improve the communication speed, mainly by reducing the amount of data transmitted in communication.
+To improve the training speed of CPU distributed training, four aspects should mainly be considered:
+1) improve the training speed, mainly by raising CPU utilization; 2) improve the communication speed, mainly by reducing the amount of data transmitted in communication; 3) improve the data IO speed; 4) change the distributed training strategy to further speed up distributed training.
 
 Improve CPU utilization
 ========================
@@ -56,3 +56,106 @@ For detailed API usage, see :ref:`cn_api_fluid_ParallelExecutor`; a simple example
 Among the parameters above:
 
 - :code:`is_sparse`: configures the embedding to use sparse updates. If the dict_size of the embedding is large while each batch of data is small, the sparse update method is recommended.
+
+
+Improve data IO speed
+======================
+
+To improve the data IO speed of CPU distributed training, first consider using the dataset API for data reading. Dataset is a multi-producer, multi-consumer data reading method that by default couples the data reading threads with the training threads, so it shows a very high performance advantage in multi-threaded training.
+
+For an introduction to the API, see: https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/dataset_cn/QueueDataset_cn.html
+
+For how to integrate it into a real network, such as the CTR-DNN model, see: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
+
+Finally, use the :code:`train_from_dataset` interface to train the network:
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+
+    # the network is assumed to have been built in the default main program
+    dataset = fluid.DatasetFactory().create_dataset()
+    exe = fluid.Executor(fluid.CPUPlace())
+    exe.run(fluid.default_startup_program())
+    exe.train_from_dataset(program=fluid.default_main_program(), dataset=dataset)
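+
+Before :code:`train_from_dataset` is called, the dataset itself needs to be configured with the feed variables, batch size, thread number and file list. The following is a minimal sketch of such a configuration; the variable names, file paths and the :code:`dataset_generator.py` pipe command are placeholders that depend on your own network and data:
+
+.. code-block:: python
+
+    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
+    # the feed variables must match the inputs of your network (placeholders here)
+    dataset.set_use_var([dense_input, sparse_input, label])
+    dataset.set_batch_size(32)
+    # roughly one data file per reading thread works best
+    dataset.set_thread(10)
+    dataset.set_filelist(["./train_data/part-0", "./train_data/part-1"])
+    # the pipe command parses each line of raw data into the feed variables
+    dataset.set_pipe_command("python dataset_generator.py")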
+
+
+Change the distributed training strategy
+=========================================
+
+The key to further improving the speed of CPU distributed training is choosing an appropriate distributed training strategy, for example the communication strategy, the compilation strategy and the execution strategy. PaddlePaddle 1.7 released the :code:`DistributedStrategy` feature, which makes it very flexible and convenient to specify the distributed training strategy.
+
+First, import the relevant modules in the code:
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
+    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
+    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
+
+Then specify the training strategy for the CPU distributed run. Four configurations are currently available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async) and GEO training. For the details of each strategy, see the design document:
+https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
+
+The following code applies the default configuration of each strategy and runs CPU distributed training:
+
+.. code-block:: python
+
+    # step1: choose the CPU distributed training strategy (pick one of the following)
+    # synchronous training strategy
+    strategy = DistributedStrategyFactory.create_sync_strategy()
+    # half-asynchronous training strategy
+    strategy = DistributedStrategyFactory.create_half_async_strategy()
+    # asynchronous training strategy
+    strategy = DistributedStrategyFactory.create_async_strategy()
+    # GEO training strategy
+    strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
+
+    # step2: define the role of the node
+    role = role_maker.PaddleCloudRoleMaker()
+    fleet.init(role)
+
+    # step3: build the distributed training program
+    # loss is the loss variable of your network; SGD is used as an example optimizer
+    optimizer = fluid.optimizer.SGD(learning_rate)
+    optimizer = fleet.distributed_optimizer(optimizer, strategy)
+    optimizer.minimize(loss)
+
+    # step4.1: start the parameter server node (Server)
+    if fleet.is_server():
+        fleet.init_server()
+        fleet.run_server()
+
+    # step4.2: start the trainer node (Trainer)
+    elif fleet.is_worker():
+        exe = fluid.Executor(fluid.CPUPlace())
+        fleet.init_worker()
+        exe.run(fleet.startup_program)
+        # Do training
+        exe.run(fleet.main_program)
+        fleet.stop_worker()
+
+
+PaddlePaddle supports adjusting the details of the training strategy:
+
+- The build_strategy and exec_strategy needed to create a compiled_program can be obtained directly from the strategy:
+
+.. code-block:: python
+
+    compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
+        loss_name=loss.name,
+        build_strategy=strategy.get_build_strategy(),
+        exec_strategy=strategy.get_execute_strategy())
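+
+The compiled program obtained in this way can then be run on the worker in place of :code:`fleet.main_program` in the training step shown above. A minimal sketch, where :code:`feed_dict` is a placeholder for your own input data:
+
+.. code-block:: python
+
+    exe = fluid.Executor(fluid.CPUPlace())
+    fleet.init_worker()
+    exe.run(fleet.startup_program)
+    # run one training step with the compiled program
+    loss_val, = exe.run(program=compiled_program,
+                        feed=feed_dict,
+                        fetch_list=[loss.name])
+    fleet.stop_worker()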
+
+
+- The details of the training strategy can also be customized. DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy and fluid.BuildStrategy all support customized configuration. Take DistributeTranspilerConfig as an example; it can be modified as follows:
+
+.. code-block:: python
+
+    strategy = DistributedStrategyFactory.create_sync_strategy()
+
+    # Mode 1 (recommended):
+    config = strategy.get_program_config()
+    config.min_block_size = 81920
+
+
+    # Mode 2: call set_program_config to modify the program-related configuration;
+    # both DistributeTranspilerConfig and dict are supported
+    config = DistributeTranspilerConfig()
+    config.min_block_size = 81920
+    # config = dict()
+    # config['min_block_size'] = 81920
+    strategy.set_program_config(config)
\ No newline at end of file
diff --git a/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst b/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst
index c3fb3fa14231af73c7b19acba8046d6507994510..1ff88252456cd3f00bc268e22159efad9f1529b1 100644
--- a/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst
+++ b/doc/fluid/advanced_guide/performance_improving/multinode_training_improving/cpu_train_best_practice_en.rst
@@ -7,7 +7,9 @@ Best practices of distributed training on CPU
 
-To improve the training speed of CPU distributed training, we must consider two aspects:
+To improve the training speed of CPU distributed training, we must consider four aspects:
 
 1. Improve the training speed mainly by improving utilization rate of CPU;
-2. Improve the communication speed mainly by reducing the amount of data transmitted in the communication.
+2. Improve the communication speed mainly by reducing the amount of data transmitted in the communication;
+3. Improve the data IO speed by using the dataset API;
+4. Improve the distributed training speed by changing the distributed training strategy.
 
 Improve CPU utilization
 =============================
@@ -58,3 +60,105 @@ To reduce the amount of communication data and improve communication speed is ac
 
 Among the parameters above:
 
 - :code:`is_sparse`: Use sparse updates to configure embedding. If the dict_size of embedding is large but the number of data are very small each time, it is recommended to use the sparse update method.
+
+
+Improve data IO speed
+==============================
+
+To improve the data IO speed of CPU distributed training, first consider using the dataset API as the data reader. Dataset is a multi-producer, multi-consumer data reading method that by default couples the data reading threads with the training threads, so it shows a high performance advantage in multi-threaded training.
+
+Refer to this page for an introduction to the API: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html
+
+To learn more about how to use dataset with a real model such as CTR-DNN, see: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
+
+Finally, use :code:`train_from_dataset` to train the network:
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+
+    # the network is assumed to have been built in the default main program
+    dataset = fluid.DatasetFactory().create_dataset()
+    exe = fluid.Executor(fluid.CPUPlace())
+    exe.run(fluid.default_startup_program())
+    exe.train_from_dataset(program=fluid.default_main_program(), dataset=dataset)
+
+
+Change the distributed training strategy
+=========================================
+
+The key to further improving CPU distributed training speed is choosing an appropriate distributed training strategy, e.g. the communication strategy, the compilation strategy and the execution strategy. PaddlePaddle 1.7 released the :code:`DistributedStrategy` API, which makes it very flexible and convenient to specify the distributed training strategy.
+
+First, import the relevant modules in the code:
+
+.. code-block:: python
+
+    import paddle.fluid as fluid
+    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
+    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
+    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
+
+At present, four training strategies are available: synchronous training (Sync), asynchronous training (Async), half-asynchronous training (Half-Async) and GEO training. For the details of each strategy, see the design document:
+https://github.com/PaddlePaddle/Fleet/blob/develop/markdown_doc/transpiler/transpiler_cpu.md
+
+The following code applies the default configuration of each strategy and runs CPU distributed training:
+
+.. code-block:: python
+
+    # step1: get the distributed strategy (pick one of the following)
+    # Sync
+    strategy = DistributedStrategyFactory.create_sync_strategy()
+    # Half-Async
+    strategy = DistributedStrategyFactory.create_half_async_strategy()
+    # Async
+    strategy = DistributedStrategyFactory.create_async_strategy()
+    # GEO
+    strategy = DistributedStrategyFactory.create_geo_strategy(update_frequency=400)
+
+    # step2: define the role of the node
+    role = role_maker.PaddleCloudRoleMaker()
+    fleet.init(role)
+
+    # step3: get the distributed training program
+    # loss is the loss variable of your network; SGD is used as an example optimizer
+    optimizer = fluid.optimizer.SGD(learning_rate)
+    optimizer = fleet.distributed_optimizer(optimizer, strategy)
+    optimizer.minimize(loss)
+
+    # step4.1: run the parameter server node
+    if fleet.is_server():
+        fleet.init_server()
+        fleet.run_server()
+
+    # step4.2: run the worker node
+    elif fleet.is_worker():
+        exe = fluid.Executor(fluid.CPUPlace())
+        fleet.init_worker()
+        exe.run(fleet.startup_program)
+        # Do training
+        exe.run(fleet.main_program)
+        fleet.stop_worker()
+
+PaddlePaddle supports adjusting the details of the training strategy:
+
+- The build_strategy and exec_strategy used to create the compiled_program can be obtained directly from the strategy:
+
+.. code-block:: python
+
+    compiled_program = fluid.compiler.CompiledProgram(fleet.main_program).with_data_parallel(
+        loss_name=loss.name,
+        build_strategy=strategy.get_build_strategy(),
+        exec_strategy=strategy.get_execute_strategy())
+
+
+- The details of the training strategy can be customized. PaddlePaddle supports customized configuration of DistributeTranspilerConfig, TrainerRuntimeConfig, ServerRuntimeConfig, fluid.ExecutionStrategy and fluid.BuildStrategy. Take DistributeTranspilerConfig as an example; it can be modified as follows:
+
+.. code-block:: python
+
+    strategy = DistributedStrategyFactory.create_sync_strategy()
+
+    # Mode 1 (recommended):
+    config = strategy.get_program_config()
+    config.min_block_size = 81920
+
+
+    # Mode 2: call set_program_config to modify the program-related configuration;
+    # both DistributeTranspilerConfig and dict are supported
+    config = DistributeTranspilerConfig()
+    config.min_block_size = 81920
+    # config = dict()
+    # config['min_block_size'] = 81920
+    strategy.set_program_config(config)
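+
+- The execution details can be tuned in the same way. Below is a minimal sketch, assuming that :code:`get_execute_strategy` (already used above to build the compiled_program) returns a :code:`fluid.ExecutionStrategy` object that can be modified in place, analogous to Mode 1 above:
+
+.. code-block:: python
+
+    strategy = DistributedStrategyFactory.create_sync_strategy()
+
+    # assumed to behave like Mode 1 above: the returned object can be edited in place
+    exec_strategy = strategy.get_execute_strategy()
+    # tune the executor, e.g. the number of CPU training threads
+    exec_strategy.num_threads = 16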