From 8a3999c848cf0442b94f1a355ed790a9b0508e5d Mon Sep 17 00:00:00 2001
From: "Mr.Lee" <37506361+ShangCambridge@users.noreply.github.com>
Date: Wed, 20 Feb 2019 21:03:28 +0800
Subject: [PATCH] upload cpu_train_best_practice_en.rst (#592)

* upload cpu_train_best_practice_en.rst

* Review

---
 .../cpu_train_best_practice_en.rst | 60 +++++++++++++++++++
 1 file changed, 60 insertions(+)
 create mode 100644 doc/fluid/api_guides/low_level/distributed/cpu_train_best_practice_en.rst

.. _api_guide_cpu_training_best_practice_en:

######################################################
Best practices of distributed training on CPU
######################################################

To improve the speed of CPU distributed training, two aspects must be considered:

1. Improve the training speed, mainly by raising CPU utilization;
2. Improve the communication speed, mainly by reducing the amount of data transmitted during communication.

Improve CPU utilization
=============================

CPU utilization mainly depends on :code:`ParallelExecutor`, which can make full use of the computing power of multiple CPUs to speed up the calculation.

For detailed API usage, please refer to :ref:`api_fluid_ParallelExecutor`. A simple example:

.. code-block:: python

    # Configure the execution strategy, mainly to set the number of threads
    exec_strategy = fluid.ExecutionStrategy()
    exec_strategy.num_threads = 8

    # Configure the build strategy; for CPU training, the Reduce mode should be used
    build_strategy = fluid.BuildStrategy()
    if int(os.getenv("CPU_NUM")) > 1:
        build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce

    pe = fluid.ParallelExecutor(
        use_cuda=False,
        loss_name=avg_cost.name,
        main_program=main_program,
        build_strategy=build_strategy,
        exec_strategy=exec_strategy)

Among the parameters above:

- :code:`num_threads`: the number of threads used for model training, preferably close to the number of physical CPU cores of the machine on which training runs.
- :code:`reduce_strategy`: for CPU training, choose :code:`fluid.BuildStrategy.ReduceStrategy.Reduce`.

Configuration of general environment variables:

- :code:`CPU_NUM`: the number of replicas of the model, preferably the same as :code:`num_threads`.

An end-to-end sketch that combines these settings with the sparse-update example below is given at the end of this guide.

Improve communication speed
==============================

Reducing the amount of communication data to improve communication speed is achieved mainly through sparse updates; at present, `sparse update <../layers/sparse_update_en.html>`_ is supported mainly by :ref:`api_fluid_layers_embedding`.

.. code-block:: python

    data = fluid.layers.data(name='ids', shape=[1], dtype='int64')
    fc = fluid.layers.embedding(input=data, size=[dict_size, 16], is_sparse=True)

Among the parameters above:

- :code:`is_sparse`: configure the embedding layer to use sparse updates. If the :code:`dict_size` of the embedding is large but the number of ids actually used in each mini-batch is small, sparse updates are recommended.
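
Putting it together
==============================

To make the two sections above concrete, below is a minimal end-to-end sketch following the Fluid 1.x API used in this guide. The toy classifier, :code:`dict_size`, the random batch, and the names :code:`ids`, :code:`label`, and :code:`avg_cost` are illustrative assumptions, not part of the original examples.

.. code-block:: python

    import os

    import numpy as np
    import paddle.fluid as fluid

    # Number of model replicas; set before ParallelExecutor is constructed.
    # Kept equal to num_threads, as recommended above.
    os.environ['CPU_NUM'] = '8'

    # Toy network: dict_size and the layer shapes are illustrative placeholders.
    dict_size = 10000
    ids = fluid.layers.data(name='ids', shape=[1], dtype='int64')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')
    emb = fluid.layers.embedding(input=ids, size=[dict_size, 16], is_sparse=True)
    probs = fluid.layers.fc(input=emb, size=2, act='softmax')
    avg_cost = fluid.layers.mean(fluid.layers.cross_entropy(input=probs, label=label))
    fluid.optimizer.SGD(learning_rate=0.01).minimize(avg_cost)

    # Initialize parameters once with a plain single-device executor.
    fluid.Executor(fluid.CPUPlace()).run(fluid.default_startup_program())

    # Strategies as described above: 8 threads, Reduce mode for multi-replica CPU runs.
    exec_strategy = fluid.ExecutionStrategy()
    exec_strategy.num_threads = 8
    build_strategy = fluid.BuildStrategy()
    if int(os.getenv("CPU_NUM")) > 1:
        build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce

    pe = fluid.ParallelExecutor(
        use_cuda=False,
        loss_name=avg_cost.name,
        build_strategy=build_strategy,
        exec_strategy=exec_strategy)

    # One training step on a random batch; the fed batch is split across the
    # replicas, so its size should be divisible by CPU_NUM.
    batch = 64
    feed = {
        'ids': np.random.randint(0, dict_size, (batch, 1)).astype('int64'),
        'label': np.random.randint(0, 2, (batch, 1)).astype('int64'),
    }
    loss, = pe.run(feed=feed, fetch_list=[avg_cost.name])
    print(loss)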
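As a design note on :code:`reduce_strategy`: roughly speaking, in :code:`Reduce` mode each replica aggregates the gradients of, and applies the optimizer update to, only a slice of the parameters, and the updated values are then broadcast to the other replicas, whereas in the default :code:`AllReduce` mode every replica aggregates all gradients and repeats every parameter update itself. On CPU, where all replicas share the same physical cores, avoiding that redundant update work is usually the better trade-off.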