Data preparation is done by docker image `paddledev/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code.
Data preparation is done by docker image `paddlepaddle/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code.
Run `kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job` to check training job status. It will complete in around 20 minutes.
The details for start `pserver` and `trainer` are hidden inside docker image `paddledev/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code.
The details for start `pserver` and `trainer` are hidden inside docker image `paddlepaddle/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code.
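The paired lines above differ only in the Docker Hub organization prefix: `paddledev/` becomes `paddlepaddle/`. As a minimal sketch of that substitution, assuming a POSIX shell with `sed` and a hypothetical helper named `rename_image` (the helper is illustrative and is not part of this change), the rename could be expressed as:

```shell
#!/bin/sh
# Hypothetical helper, not taken from the PaddlePaddle repository:
# rewrite an image reference from the old `paddledev/` organization
# to the new `paddlepaddle/` organization.
rename_image() {
    # Only a leading "paddledev/" prefix is replaced; the repository
    # name and the tag after it are left untouched.
    printf '%s\n' "$1" | sed 's#^paddledev/#paddlepaddle/#'
}

rename_image "paddledev/paddle-tutorial:k8s_data"
rename_image "paddledev/paddle-tutorial:k8s_train"
```

This sketch covers only the prefix change. Where the change also alters the tag (for example `paddledev/paddle:cpu-latest` becoming `paddlepaddle/paddle:latest`), the reference would still need to be edited by hand.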
@@ -4,15 +4,24 @@ In this article, we will introduce how to run PaddlePaddle training job on singl
...
## Build Docker Image
In distributed Kubernetes cluster, we will use Ceph or other shared storage system for storing training data so that all processes in the training job can retrieve data from Ceph. In this example, we will only demo training job on single machine. In order to simplify the requirement of the environment, we will directly put training data into PaddlePaddle's Docker Image, so we need to create a PaddlePaddle Docker image that already includes the training data.
PaddlePaddle's [Quick Start Tutorial](http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html) introduces how to download and train data by using script from PaddlePaddle's source code.
And `paddledev/paddle:cpu-demo-latest` image has the PaddlePaddle source code and demo. (Caution: Default PaddlePaddle image `paddledev/paddle:cpu-latest` doesn't include the source code, PaddlePaddle's different versions of image can be referred here: [Docker installation guide](http://www.paddlepaddle.org/doc/build/docker_install.html)), so we run this container and download the training data, and then commit the whole container to be a new Docker image.

In distributed Kubernetes cluster, we will use Ceph or other distributed
storage system for storing training related data so that all processes in
PaddlePaddle training can retrieve data from Ceph. In this example, we will
only demo training job on single machine. In order to simplify the requirement
of the environment, we will directly put training data into the PaddlePaddle Docker Image,
so we need to create a PaddlePaddle Docker image that includes the training data.

The production Docker Image `paddlepaddle/paddle:cpu-demo-latest` has the PaddlePaddle
source code and demo. (Caution: Default PaddlePaddle Docker Image `paddlepaddle/paddle:latest` doesn't include
the source code, PaddlePaddle's different versions of Docker Image can be referred here:
<p>Data preparation is done by docker image <code class="docutils literal"><span class="pre">paddledev/paddle-tutorial:k8s_data</span></code>, see <a class="reference internal" href="src/k8s_data/README.html"><span class="doc">here</span></a> for how to build this docker image and source code.</p>
<p>Data preparation is done by docker image <code class="docutils literal"><span class="pre">paddlepaddle/paddle-tutorial:k8s_data</span></code>, see <a class="reference internal" href="src/k8s_data/README.html"><span class="doc">here</span></a> for how to build this docker image and source code.</p>
</div>
</div>
<div class="section" id="start-training">
<div class="section" id="start-training">
<span id="start-training"></span><h4>Start Training<a class="headerlink" href="#start-training" title="Permalink to this headline">¶</a></h4>
<p><code class="docutils literal"><span class="pre">POD_NAME</span></code>: name of any pod (e.g., <code class="docutils literal"><span class="pre">paddle-cluster-job-cm469</span></code>).</p>
<p>Run <code class="docutils literal"><span class="pre">kubectl</span> <span class="pre">--kubeconfig=kubeconfig</span> <span class="pre">describe</span> <span class="pre">job</span> <span class="pre">paddle-cluster-job</span></code> to check training job status. It will complete in around 20 minutes.</p>
<p>The details for start <code class="docutils literal"><span class="pre">pserver</span></code> and <code class="docutils literal"><span class="pre">trainer</span></code> are hidden inside docker image <code class="docutils literal"><span class="pre">paddledev/paddle-tutorial:k8s_train</span></code>, see <a class="reference internal" href="src/k8s_train/README.html"><span class="doc">here</span></a> for how to build the docker image and source code.</p>
<p>The details for start <code class="docutils literal"><span class="pre">pserver</span></code> and <code class="docutils literal"><span class="pre">trainer</span></code> are hidden inside docker image <code class="docutils literal"><span class="pre">paddlepaddle/paddle-tutorial:k8s_train</span></code>, see <a class="reference internal" href="src/k8s_train/README.html"><span class="doc">here</span></a> for how to build the docker image and source code.</p>
</div>
</div>
<div class="section" id="inspect-training-output">
<div class="section" id="inspect-training-output">
<span id="inspect-training-output"></span><h4>Inspect Training Output<a class="headerlink" href="#inspect-training-output" title="Permalink to this headline">¶</a></h4>
<p>In this article, we will introduce how to run PaddlePaddle training job on single CPU machine using Kubernetes. In next article, we will introduce how to run PaddlePaddle training job on distributed cluster.</p>
<div class="section" id="build-docker-image">
<div class="section" id="build-docker-image">
<span id="build-docker-image"></span><h2>Build Docker Image<a class="headerlink" href="#build-docker-image" title="Permalink to this headline">¶</a></h2>
<p>In distributed Kubernetes cluster, we will use Ceph or other shared storage system for storing training data so that all processes in the training job can retrieve data from Ceph. In this example, we will only demo training job on single machine. In order to simplify the requirement of the environment, we will directly put training data into PaddlePaddle’s Docker Image, so we need to create a PaddlePaddle Docker image that already includes the training data.</p>
<p>PaddlePaddle’s <a class="reference external" href="http://www.paddlepaddle.org/docs/develop/documentation/en/getstarted/index_en.html">Quick Start Tutorial</a> introduces how to download and train data by using script from PaddlePaddle’s source code.
And <code class="docutils literal"><span class="pre">paddledev/paddle:cpu-demo-latest</span></code> image has the PaddlePaddle source code and demo. (Caution: Default PaddlePaddle image <code class="docutils literal"><span class="pre">paddledev/paddle:cpu-latest</span></code> doesn’t include the source code, PaddlePaddle’s different versions of image can be referred here: <a class="reference external" href="http://www.paddlepaddle.org/doc/build/docker_install.html">Docker installation guide</a>), so we run this container and download the training data, and then commit the whole container to be a new Docker image.</p>
<p>In distributed Kubernetes cluster, we will use Ceph or other distributed
storage system for storing training related data so that all processes in
PaddlePaddle training can retrieve data from Ceph. In this example, we will
only demo training job on single machine. In order to simplify the requirement
of the environment, we will directly put training data into the PaddlePaddle Docker Image,
so we need to create a PaddlePaddle Docker image that includes the training data.</p>
<p>The production Docker Image <code class="docutils literal"><span class="pre">paddlepaddle/paddle:cpu-demo-latest</span></code> has the PaddlePaddle
source code and demo. (Caution: Default PaddlePaddle Docker Image <code class="docutils literal"><span class="pre">paddlepaddle/paddle:latest</span></code> doesn’t include
the source code, PaddlePaddle’s different versions of Docker Image can be referred here:
so we run this Docker Image and download the training data, and then commit the whole
Container to be a new Docker Image.</p>
<div class="section" id="run-docker-container">
<div class="section" id="run-docker-container">
<span id="run-docker-container"></span><h3>Run Docker Container<a class="headerlink" href="#run-docker-container" title="Permalink to this headline">¶</a></h3>
<div class="highlight-default"><div class="highlight"><pre><span></span>$ docker run --name quick_start_data -it paddledev/paddle:cpu-demo-latest
<div class="highlight-default"><div class="highlight"><pre><span></span>$ docker run --name quick_start_data -it paddlepaddle/paddle:cpu-demo-latest
Data preparation is done by docker image `paddledev/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code.
Data preparation is done by docker image `paddlepaddle/paddle-tutorial:k8s_data`, see [here](src/k8s_data/README.md) for how to build this docker image and source code.
Run `kubectl --kubeconfig=kubeconfig describe job paddle-cluster-job` to check training job status. It will complete in around 20 minutes.
The details for start `pserver` and `trainer` are hidden inside docker image `paddledev/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code.
The details for start `pserver` and `trainer` are hidden inside docker image `paddlepaddle/paddle-tutorial:k8s_train`, see [here](src/k8s_train/README.md) for how to build the docker image and source code.
<p>Data preparation is done by docker image <code class="docutils literal"><span class="pre">paddledev/paddle-tutorial:k8s_data</span></code>, see <a class="reference internal" href="src/k8s_data/README.html"><span class="doc">here</span></a> for how to build this docker image and source code.</p>
<p>Data preparation is done by docker image <code class="docutils literal"><span class="pre">paddlepaddle/paddle-tutorial:k8s_data</span></code>, see <a class="reference internal" href="src/k8s_data/README.html"><span class="doc">here</span></a> for how to build this docker image and source code.</p>
<p><code class="docutils literal"><span class="pre">POD_NAME</span></code>: name of any pod (e.g., <code class="docutils literal"><span class="pre">paddle-cluster-job-cm469</span></code>).</p>
<p>Run <code class="docutils literal"><span class="pre">kubectl</span> <span class="pre">--kubeconfig=kubeconfig</span> <span class="pre">describe</span> <span class="pre">job</span> <span class="pre">paddle-cluster-job</span></code> to check training job status. It will complete in around 20 minutes.</p>
<p>The details for start <code class="docutils literal"><span class="pre">pserver</span></code> and <code class="docutils literal"><span class="pre">trainer</span></code> are hidden inside docker image <code class="docutils literal"><span class="pre">paddledev/paddle-tutorial:k8s_train</span></code>, see <a class="reference internal" href="src/k8s_train/README.html"><span class="doc">here</span></a> for how to build the docker image and source code.</p>
<p>The details for start <code class="docutils literal"><span class="pre">pserver</span></code> and <code class="docutils literal"><span class="pre">trainer</span></code> are hidden inside docker image <code class="docutils literal"><span class="pre">paddlepaddle/paddle-tutorial:k8s_train</span></code>, see <a class="reference internal" href="src/k8s_train/README.html"><span class="doc">here</span></a> for how to build the docker image and source code.</p>
</div>
</div>
<div class="section" id="inspect-training-output">
<div class="section" id="inspect-training-output">
<span id="inspect-training-output"></span><h4>Inspect Training Output<a class="headerlink" href="#inspect-training-output" title="永久链接至标题">¶</a></h4>