@@ -7,7 +7,9 @@ Best practices of distributed training on CPU
To improve the training speed of CPU distributed training, we must consider the following four aspects:
1. Improve the training speed, mainly by raising the CPU utilization rate;
2. Improve the communication speed, mainly by reducing the amount of data transmitted during communication;
3. Improve the data IO speed with the dataset API;
4. Improve the distributed training speed by changing the distributed training strategy.
Improve CPU utilization
=============================
...
@@ -58,3 +60,105 @@ To reduce the amount of communication data and improve communication speed is ac
Among the parameters above:
- :code:`is_sparse`: Configure the embedding with sparse updates. If the dict_size of the embedding is large but the amount of data fed in each pass is small, the sparse update method is recommended (see the sketch below).
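A minimal sketch of configuring an embedding layer with sparse updates; the input variable, dict size, and embedding dimension are placeholder values for illustration:

.. code-block:: python

    import paddle.fluid as fluid

    # placeholder input: a sequence of word ids
    words = fluid.data(name="words", shape=[None, 1], dtype="int64", lod_level=1)

    # is_sparse=True switches the embedding to sparse gradient updates,
    # which reduces the amount of data sent to the parameter servers
    emb = fluid.layers.embedding(
        input=words,
        size=[100000, 64],   # [dict_size, embedding_dim], placeholder values
        is_sparse=True)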
Improve data IO speed
==============================
To improve the speed of CPU distributed training, you can first consider using the dataset API as the data reader. Dataset is a multi-producer, multi-consumer data reading method that by default couples the data-reading threads with the training threads, and it shows a significant performance advantage in multi-threaded training.
Refer to this page for an introduction to the API: https://www.paddlepaddle.org.cn/documentation/docs/en/api/dataset/QueueDataset.html
Combined with the real-world CTR-DNN model, you can learn more about how to use dataset: https://github.com/PaddlePaddle/models/tree/release/1.7/PaddleRec/ctr/dnn
Use :code:`train_from_dataset` for network training, as sketched below.
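A minimal sketch of driving training with :code:`QueueDataset` and :code:`train_from_dataset`; the feed variables, preprocessing command, and file list are placeholders for illustration:

.. code-block:: python

    import paddle.fluid as fluid

    dataset = fluid.DatasetFactory().create_dataset("QueueDataset")
    dataset.set_use_var([words, label])            # placeholder feed variables
    dataset.set_pipe_command("python reader.py")   # placeholder preprocessing script
    dataset.set_batch_size(32)
    dataset.set_thread(10)
    dataset.set_filelist(["train_data/part-0", "train_data/part-1"])  # placeholder files

    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(fluid.default_startup_program())
    exe.train_from_dataset(
        program=fluid.default_main_program(),
        dataset=dataset)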
Change distributed training strategy
========================================
The core of improving CPU distributed training speed is to choose an appropriate distributed training strategy, which covers the communication strategy, compile strategy, execution strategy, and so on. PaddlePaddle released the :code:`DistributedStrategy` API in version 1.7, which makes it very flexible and convenient to specify the distributed running strategy.
First, we need to import the relevant modules:
.. code-block:: python

    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
    import paddle.fluid.incubate.fleet.base.role_maker as role_maker
    from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy_factory import DistributedStrategyFactory
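As a rough sketch of how these imports fit together; the role maker choice, the tiny placeholder network, and the :code:`create_sync_strategy` factory method are assumptions for illustration:

.. code-block:: python

    import paddle.fluid as fluid

    # initialize fleet with a role maker that reads the cluster environment
    role = role_maker.PaddleCloudRoleMaker()
    fleet.init(role)

    # a tiny placeholder network
    x = fluid.data(name="x", shape=[None, 13], dtype="float32")
    y = fluid.data(name="y", shape=[None, 1], dtype="float32")
    pred = fluid.layers.fc(input=x, size=1)
    loss = fluid.layers.reduce_mean(fluid.layers.square_error_cost(input=pred, label=y))

    # pick a distributed strategy (assumed factory method; see the strategies below)
    strategy = DistributedStrategyFactory.create_sync_strategy()

    # wrap a normal optimizer into a distributed one and minimize the loss
    optimizer = fluid.optimizer.SGD(learning_rate=0.01)
    optimizer = fleet.distributed_optimizer(optimizer, strategy)
    optimizer.minimize(loss)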
At present, there are four kinds of training strategies: synchronous training, asynchronous training, half-asynchronous training, and GEO training. For details of the different strategies, you can view the design documents:
- Training strategy details can be customized. PaddlePaddle supports custom configuration of :code:`DistributeTranspilerConfig`, :code:`TrainerRuntimeConfig`, :code:`ServerRuntimeConfig`, :code:`fluid.ExecutionStrategy`, and :code:`fluid.BuildStrategy`. Take :code:`DistributeTranspilerConfig` as an example; one possible modification is sketched below.
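A minimal sketch, assuming the strategy object exposes :code:`get_program_config` / :code:`set_program_config` accessors for its :code:`DistributeTranspilerConfig`; the attribute values are placeholders:

.. code-block:: python

    # create a strategy (assumed factory method), then adjust its DistributeTranspilerConfig
    strategy = DistributedStrategyFactory.create_sync_strategy()

    program_config = strategy.get_program_config()   # assumed accessor returning DistributeTranspilerConfig
    program_config.slice_var_up = True               # split large variables across parameter servers
    program_config.min_block_size = 81920            # minimum block size when splitting variables
    strategy.set_program_config(program_config)      # assumed accessor writing the config back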