+++
title = "MXNet Training"
description = "Instructions for using MXNet"
weight = 25
+++

This guide walks you through using MXNet with Kubeflow.

## Installing MXNet Operator

If you haven't already done so please follow the [Getting Started Guide](https://www.kubeflow.org/docs/started/getting-started/) to deploy Kubeflow.

MXNet support was introduced in Kubeflow 0.2.0, so you must be using a version of Kubeflow newer than 0.2.0.

## Verify that MXNet support is included in your Kubeflow deployment

Check that the MXNet custom resource definition (CRD) is installed:

```
kubectl get crd
```

The output should include `mxjobs.kubeflow.org`

```
NAME                                           AGE
...
mxjobs.kubeflow.org                            4d
...
```

If it is not included, you can add it as follows:

```
git clone https://github.com/kubeflow/manifests
cd manifests/mxnet-job/mxnet-operator
kubectl kustomize base | kubectl apply -f -
```

Alternatively, you can deploy the operator with default settings, without using kustomize, by running the following from the repo:

```
git clone https://github.com/kubeflow/mxnet-operator.git
cd mxnet-operator
kubectl create -f manifests/crd-v1beta1.yaml 
kubectl create -f manifests/rbac.yaml 
kubectl create -f manifests/deployment.yaml
```
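With either installation method, you can script a quick post-install check. This is a sketch only: the deployment name `mxnet-operator` and the current-namespace assumption are guesses that may not match your install.

```shell
# Hedged post-install check: the deployment name "mxnet-operator" and the
# current-namespace assumption may differ in your cluster.
if kubectl get deployment mxnet-operator >/dev/null 2>&1; then
  op_status="found"
else
  op_status="not found"
fi
echo "mxnet-operator deployment: ${op_status}"
```

If the deployment is not found, check the namespace you deployed into with `kubectl get deployments --all-namespaces`.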

## Creating an MXNet training job


You create a training job by defining an MXJob with the MXTrain job mode and then creating it with:


```
kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
```
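For orientation, a minimal MXJob manifest has the shape below. This is a sketch distilled from the sample status output later in this guide; the image, port, and replica counts are illustrative and your example file will differ.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  name: mxnet-job
spec:
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
          - name: mxnet
            image: mxjob/mxnet:gpu
```

The Scheduler coordinates the job, Servers hold the parameter state, and Workers run the training script.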


## Creating a TVM tuning job (AutoTVM)


[TVM](https://docs.tvm.ai/tutorials/) is an end-to-end deep learning compiler stack, and you can easily run AutoTVM with mxnet-operator.
You create an auto-tuning job by defining an MXJob with the MXTune job mode and then creating it with:


```
kubectl create -f examples/v1beta1/tune/mx_job_tune_gpu.yaml
```


Before you use the auto-tuning example, some preparatory work needs to be done. To let TVM tune your network, you must build a Docker image that contains the TVM module. You also need an auto-tuning script that specifies which network to tune and sets the auto-tuning parameters; for more details, see https://docs.tvm.ai/tutorials/autotvm/tune_relay_mobile_gpu.html#sphx-glr-tutorials-autotvm-tune-relay-mobile-gpu-py. Finally, you need a startup script to launch the auto-tuning program: mxnet-operator exposes all the parameters as environment variables, and the startup script must read these variables and pass them on to the auto-tuning script. We provide an example under examples/v1beta1/tune/; the tuning result is saved in a log file, such as resnet-18.log in that example. You can refer to it for details.
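As an illustration of that flow, here is a minimal startup-script sketch. The environment variable names (`TUNER_NETWORK`, `TUNER_TARGET`, `TUNER_LOG_FILE`) and the `auto_tuning.py` script are hypothetical placeholders, not the operator's actual interface.

```shell
#!/bin/sh
# Hypothetical sketch: these variable names and auto_tuning.py are
# placeholders, not mxnet-operator's real interface.
NETWORK="${TUNER_NETWORK:-resnet-18}"         # which network to tune
TARGET="${TUNER_TARGET:-cuda}"                # TVM build target
LOG_FILE="${TUNER_LOG_FILE:-${NETWORK}.log}"  # where tuning results go

# Hand the values through to the tuning script (echoed here instead of run).
echo "python auto_tuning.py --network ${NETWORK} --target ${TARGET} --log-file ${LOG_FILE}"
```

A real startup script would `exec` the tuning program rather than echo the command.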


## Monitoring an MXNet Job


To get the status of your job:

```bash
kubectl get -o yaml mxjobs ${JOB}
```

Here is sample output for an example job:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: MXJob
metadata:
  creationTimestamp: 2019-03-19T09:24:27Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "3681685"
  selfLink: /apis/kubeflow.org/v1beta1/namespaces/default/mxjobs/mxnet-job
  uid: cb11013b-4a28-11e9-b7f4-704d7bb59f71
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2019-03-19T09:25:11Z
  conditions:
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:27Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:24:29Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-03-19T09:24:27Z
    lastUpdateTime: 2019-03-19T09:25:11Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2019-03-19T09:24:29Z
```
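If you only need the terminal condition rather than the whole object, a jsonpath query can pull it out. This sketch assumes `kubectl` access to the cluster and falls back to `unknown` otherwise.

```shell
JOB=mxnet-job
# Extract just the Succeeded condition's status; print "unknown" if the
# query fails (no cluster access, job not found, etc.).
succeeded="$(kubectl get mxjobs "${JOB}" \
  -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}' \
  2>/dev/null || echo unknown)"
echo "Succeeded: ${succeeded}"
```

For the sample job above this would print `Succeeded: True`; an empty value means the job has not reached that condition yet.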