We want Paddle to support training on a general-purpose cluster. On such a cluster, Paddle runs alongside the Web server (e.g., Nginx), the log collector (e.g., fluentd), the distributed queue service (e.g., Kafka), the log joiner, and other data processors written using Storm, Spark, and Hadoop MapReduce, as illustrated in the following graph:
![general purpose cluster](src/arch.png)
<img src="src/arch.png"/>
This poses new challenges for Paddle:
- Paddle needs to be fault tolerant.
- Input training data can be online data from real time logs or batched data from distributed file system.
- Input training data can be online data from real time logs or batch data from distributed file system.
- Users need a simple way to train models on Paddle cloud. Complexities such as job scheduling should be hidden from users.
## Training Job
...
...
@@ -22,7 +22,7 @@ A training job will be created once user asks Paddle cloud to train a model. The
A training job has exactly one master process and typically multiple trainer processes and parameter server processes. Their relationship is illustrated in the following graph: