Unverified commit c8699fd5, authored by Yuan Tang, committed by GitHub

Update MPI Operator instructions to use v1alpha2 (#1580)

* Update MPI Operator instructions to use v1alpha2

* Update mpi.md

* Update mpi.md
Parent 396543cb
This guide walks you through using MPI for training.
The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out the list of adopters in [the MPI Operator repository on GitHub](https://github.com/kubeflow/mpi-operator/blob/master/ADOPTERS.md).
## Installation
You can deploy the operator with default settings without using `kustomize` by running the following commands:
```shell
git clone https://github.com/kubeflow/mpi-operator
cd mpi-operator
kubectl create -f deploy/mpi-operator.yaml
```
Alternatively, follow the [getting started guide](/docs/started/getting-started/) to deploy Kubeflow.
An alpha version of MPI support was introduced with Kubeflow 0.2.0. You must be using a version of Kubeflow newer than 0.2.0.
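Either way, you can verify the installation by checking that the `MPIJob` custom resource definition is registered with your cluster; this verification step is a small addition to the original instructions:

```shell
# List registered custom resource definitions; the output should
# include mpijobs.kubeflow.org.
kubectl get crd
```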
If it is not included you can add it as follows:
```bash
git clone https://github.com/kubeflow/manifests
cd manifests/mpi-job/mpi-operator
kustomize build base | kubectl apply -f -
```
## Creating an MPI Job
You can create an MPI job by defining an `MPIJob` config file. See the [TensorFlow benchmark example](https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml) config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.
```
cat examples/v1alpha2/tensorflow-benchmarks.yaml
```
Deploy the `MPIJob` resource to start training:
```
kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml
```
## Monitoring an MPI Job
Once the `MPIJob` resource is created, you should be able to see the created pods matching the specified number of GPUs. You can also monitor the job status from the status section. Here is sample output when the job has successfully completed.
```
kubectl get -o yaml mpijobs tensorflow-benchmarks
```
```
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  creationTimestamp: "2019-07-09T22:15:51Z"
  generation: 1
  name: tensorflow-benchmarks
  namespace: default
  resourceVersion: "5645868"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
spec:
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 2
  slotsPerWorker: 2
status:
  completionTime: "2019-07-09T22:17:06Z"
  conditions:
  - lastTransitionTime: "2019-07-09T22:15:51Z"
    lastUpdateTime: "2019-07-09T22:15:51Z"
    message: MPIJob default/tensorflow-benchmarks is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2019-07-09T22:15:54Z"
    lastUpdateTime: "2019-07-09T22:15:54Z"
    message: MPIJob default/tensorflow-benchmarks is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2019-07-09T22:17:06Z"
    lastUpdateTime: "2019-07-09T22:17:06Z"
    message: MPIJob default/tensorflow-benchmarks successfully completed.
    reason: MPIJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Launcher:
      succeeded: 1
    Worker: {}
  startTime: "2019-07-09T22:15:51Z"
```
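If you want to script against the job state rather than read the full YAML, the status conditions above can be queried directly with a JSONPath expression. This one-liner is a convenience sketch, not part of the upstream instructions:

```shell
# Prints "True" once the Succeeded condition is set on the MPIJob.
kubectl get mpijob tensorflow-benchmarks \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
```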
Training should run for 100 steps and take a few minutes on a GPU cluster. You can inspect the logs to see the training progress. When the job starts, access the logs from the `launcher` pod:
```
PODNAME=$(kubectl get pods -l mpi_job_name=tensorflow-benchmarks,mpi_role_type=launcher -o name)
kubectl logs -f ${PODNAME}
```
```
TensorFlow:  1.14
Model:       resnet101
Dataset:     imagenet (synthetic)
Mode:        training
Batch size:  128 global
Num batches: 100
Num epochs:  0.01
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
Data format: NCHW
Optimizer:   sgd
Variables:   horovod
...
40  images/sec: 154.4 +/- 0.7 (jitter = 4.0)  8.280
40  images/sec: 154.4 +/- 0.7 (jitter = 4.1)  8.482
50  images/sec: 154.8 +/- 0.6 (jitter = 4.0)  8.397
50  images/sec: 154.8 +/- 0.6 (jitter = 4.2)  8.450
60  images/sec: 154.5 +/- 0.5 (jitter = 4.1)  8.321
60  images/sec: 154.5 +/- 0.5 (jitter = 4.4)  8.349
70  images/sec: 154.5 +/- 0.5 (jitter = 4.0)  8.433
70  images/sec: 154.5 +/- 0.5 (jitter = 4.4)  8.430
80  images/sec: 154.8 +/- 0.4 (jitter = 3.6)  8.199
80  images/sec: 154.8 +/- 0.4 (jitter = 3.8)  8.404
90  images/sec: 154.6 +/- 0.4 (jitter = 3.7)  8.418
90  images/sec: 154.6 +/- 0.4 (jitter = 3.6)  8.459
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0)  8.372
100 images/sec: 154.2 +/- 0.4 (jitter = 4.0)  8.542
----------------------------------------------------------------
total images/sec: 308.27
```
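When you are finished, deleting the `MPIJob` resource also cleans up the pods the operator created for it. This cleanup step is an addition to the original walkthrough:

```shell
kubectl delete mpijob tensorflow-benchmarks
```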
## Docker Images