+++
title = "Hyperparameter Tuning (Katib)"
description = "Using Katib to tune your model's hyperparameters on Kubernetes"
weight = 5
+++

The [Katib](https://github.com/kubeflow/katib) project is inspired by
[Google Vizier](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/bcb15507f4b52991a0783013df4222240e942381.pdf).
Katib is a scalable and flexible hyperparameter tuning framework that is tightly
integrated with Kubernetes. It does not depend on any specific deep learning
framework (such as TensorFlow, MXNet, or PyTorch).

## Installing Katib

To run Katib jobs, you must first install the required packages as shown in this
section. You can do so by following the Kubeflow [deployment guide](/docs/gke/deploy/),
or by installing Katib directly from its repository:
```
git clone https://github.com/kubeflow/katib
./katib/scripts/v1alpha2/deploy.sh
```

### Persistent Volumes
If you want to use Katib outside Google Kubernetes Engine (GKE) and you don't
have a StorageClass for dynamic volume provisioning in your cluster, you must
create a persistent volume (PV) to bind your persistent volume claim (PVC).

This is the YAML file for a PV:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: katib-mysql
  labels:
    type: local
    app: katib
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/katib
```

After deploying the Katib package, run the following command to create the PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml
```

## Running examples

After deploying Katib, you can run the following examples.

### Example using random algorithm

You can create an Experiment for Katib by defining an Experiment config file. See the
[random algorithm example](https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/random-example.yaml).

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/random-example.yaml
```

Running this command launches an Experiment. It runs a series of
training jobs to train models using different hyperparameters and saves the
results.

The configuration for the experiment (hyperparameter feasible space, optimization
parameter, optimization goal, suggestion algorithm, and so on) is defined in
[random-example.yaml](https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/random-example.yaml).

In this demo, hyperparameters are embedded as args.
You can embed hyperparameters in another way (for example, as environment variables)
by using the template defined in `TrialTemplate.GoTemplate.RawTemplate`.
The template is written in [Go template](https://golang.org/pkg/text/template/) format.
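
As a rough illustration of the two approaches, the following Python sketch shows how a training script might accept the same hyperparameters either as args or as environment variables. The function name `parse_hyperparameters`, the environment variable names (`LR`, `NUM_LAYERS`, `OPTIMIZER`), and the default values are assumptions made for this example. They are not defined by Katib and must match whatever your trial template actually passes in.

```python
import argparse
import os


def parse_hyperparameters(argv=None, environ=None):
    """Read hyperparameters from args, falling back to environment variables.

    The variable names LR, NUM_LAYERS, and OPTIMIZER are hypothetical; use
    the names that your own trial template sets.
    """
    env = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Toy training entry point")
    # A value given on the command line wins; otherwise the environment
    # variable is used; otherwise the hard-coded fallback applies.
    parser.add_argument("--lr", type=float,
                        default=float(env.get("LR", "0.01")))
    parser.add_argument("--num-layers", type=int,
                        default=int(env.get("NUM_LAYERS", "2")))
    parser.add_argument("--optimizer", choices=["sgd", "adam", "ftrl"],
                        default=env.get("OPTIMIZER", "sgd"))
    return parser.parse_args(argv)
```

Calling `parse_hyperparameters()` with no arguments reads `sys.argv` and `os.environ`, so the same script works whichever way the template injects the values.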

This demo randomly generates three hyperparameters:

* Learning rate (`--lr`) - type: double
* Number of neural network layers (`--num-layers`) - type: int
* Optimizer (`--optimizer`) - type: categorical
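
To make the idea of random generation concrete, here is a minimal Python sketch of random search over these three hyperparameters. It is not Katib's implementation: the uniform sampling, the dictionary layout, and the helper names `sample_assignment` and `to_args` are assumptions for illustration. The bounds and the optimizer list are copied from the sample `kubectl describe` output in this section.

```python
import random

# Feasible space for the three hyperparameters of the demo.
FEASIBLE_SPACE = {
    "--lr": {"type": "double", "min": 0.01, "max": 0.03},
    "--num-layers": {"type": "int", "min": 2, "max": 5},
    "--optimizer": {"type": "categorical", "list": ["sgd", "adam", "ftrl"]},
}


def sample_assignment(space, rng):
    """Draw one value per parameter, uniformly within its feasible space."""
    assignment = {}
    for name, spec in space.items():
        if spec["type"] == "double":
            assignment[name] = rng.uniform(spec["min"], spec["max"])
        elif spec["type"] == "int":
            # randint includes both endpoints.
            assignment[name] = rng.randint(spec["min"], spec["max"])
        elif spec["type"] == "categorical":
            assignment[name] = rng.choice(spec["list"])
        else:
            raise ValueError(f"unknown parameter type: {spec['type']}")
    return assignment


def to_args(assignment):
    """Render an assignment as args for a training container."""
    return [f"{name}={value}" for name, value in assignment.items()]


if __name__ == "__main__":
    rng = random.Random()
    for trial in range(3):
        print(trial, to_args(sample_assignment(FEASIBLE_SPACE, rng)))
```

Each call to `sample_assignment` corresponds to one set of parameter assignments for a trial.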

Check the experiment status:

```
$ kubectl -n kubeflow describe experiment random-example
Name:         random-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         Experiment
Metadata:
  Creation Timestamp:  2019-01-18T16:30:46Z
  Finalizers:
    clean-data-in-db
  Generation:        5
  Resource Version:  1777650
  Self Link:         /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/experiments/random-example
  UID:               687a67f9-1b3e-11e9-a0c2-c6456c1f5f0a
Spec:
  Algorithm:
    Algorithm Name:  random
    Algorithm Settings:
  Max Failed Trial Count:  3
  Max Trial Count:         100
  Objective:
    Additional Metric Names:
      accuracy
    Goal:                   0.99
    Objective Metric Name:  Validation-accuracy
    Type:                   maximize
  Parallel Trial Count:     10
  Parameters:
    Feasible Space:
      Max:           0.03
      Min:           0.01
    Name:            --lr
    Parameter Type:  double
    Feasible Space:
      Max:           5
      Min:           2
    Name:            --num-layers
    Parameter Type:  int
    Feasible Space:
      List:
        sgd
        adam
        ftrl
    Name:            --optimizer
    Parameter Type:  categorical
  Trial Template:
    Go Template:
      Template Spec:
        Config Map Name:       trial-template
        Config Map Namespace:  kubeflow
        Template Path:         mnist-trial-template
Status:
  Completion Time:  2019-06-20T00:12:07Z
  Conditions:
    Last Transition Time:  2019-06-19T23:20:56Z
    Last Update Time:      2019-06-19T23:20:56Z
    Message:               Experiment is created
    Reason:                ExperimentCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2019-06-20T00:12:07Z
    Last Update Time:      2019-06-20T00:12:07Z
    Message:               Experiment is running
    Reason:                ExperimentRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2019-06-20T00:12:07Z
    Last Update Time:      2019-06-20T00:12:07Z
    Message:               Experiment has succeeded because max trial count has reached
    Reason:                ExperimentSucceeded
    Status:                True
    Type:                  Succeeded
  Current Optimal Trial:
    Observation:
      Metrics:
        Name:   Validation-accuracy
        Value:  0.982483983039856
    Parameter Assignments:
      Name:          --lr
      Value:         0.026666666666666665
      Name:          --num-layers
      Value:         2
      Name:          --optimizer
      Value:         sgd
  Start Time:        2019-06-19T23:20:55Z
  Trials:            100
  Trials Succeeded:  100
Events:                 <none>
```

The demo starts an experiment and runs a series of trials with different
hyperparameters. When the most recent entry under `Status.Conditions` has type
`Succeeded` and status `True`, as in the output above, the experiment is
finished.
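
If you want to check for completion in a script rather than by reading the output, the logic can be sketched in Python as follows. This is an illustration under assumptions, not part of Katib: it expects the conditions as a list of dictionaries with lowercase `type` and `status` keys, as you would typically get from `kubectl -n kubeflow get experiment random-example -o json` under `status.conditions`, and it treats `Failed` as a second terminal type alongside the `Succeeded` type shown in the sample output.

```python
# Condition types treated as terminal. "Succeeded" appears in the sample
# output; "Failed" is an assumption added for this sketch.
TERMINAL_TYPES = {"Succeeded", "Failed"}


def experiment_state(conditions):
    """Return the type of the latest condition whose status is "True"."""
    for condition in reversed(conditions):
        if condition.get("status") == "True":
            return condition.get("type")
    return None


def is_finished(conditions):
    """Report whether the experiment has reached a terminal condition."""
    return experiment_state(conditions) in TERMINAL_TYPES


if __name__ == "__main__":
    # Conditions transcribed from the sample output above.
    sample = [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
        {"type": "Succeeded", "status": "True"},
    ]
    print(experiment_state(sample), is_finished(sample))
```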

### TensorFlow operator example

To run the TensorFlow operator example, you must first create a volume.

If you are using GKE and the default StorageClass, you must create this PVC:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tfevent-volume
  namespace: kubeflow
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```

If you are not using GKE and you don't have a StorageClass for dynamic volume
provisioning in your cluster, you must create a PVC and a PV:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml

kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml
```

Now you can run the TensorFlow operator example:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfjob-example.yaml
```

You can check the status of the experiment:

```
kubectl -n kubeflow describe experiment tfjob-example
```

### PyTorch example

This is an example for the PyTorch operator:

```
kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/pytorchjob-example.yaml
```

You can check the status of the experiment:

```
kubectl -n kubeflow describe experiment pytorchjob-example
```

## Monitoring results

You can monitor your results in the Katib UI. If you installed Kubeflow
using the deployment guide, you can access the Katib UI at:

```
https://<your kubeflow endpoint>/katib/
```

For example, if you deployed Kubeflow on GKE, your endpoint would be:

```
https://<deployment_name>.endpoints.<project>.cloud.goog/
```

Otherwise, you can set up port forwarding for the Katib UI service:

```
kubectl port-forward svc/katib-ui -n kubeflow 8080:80
```

Now you can access the Katib UI at this URL: `http://localhost:8080/katib/`.

## Cleanup

Delete the installed components:

```
./katib/scripts/v1alpha2/undeploy.sh
```

If you created a PV for Katib, delete it:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/manifests/v1alpha2/pv/pv.yaml
```

If you created a PV and PVC for the TensorFlow operator, delete them:

```
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pvc.yaml
kubectl delete -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha2/tfevent-volume/tfevent-pv.yaml
```

## Metrics collector

Katib has a metrics collector that gathers metrics from each trial by reading
the trial's stdout. Print each metric on its own line in the following
format: `{metrics name}={value}`. For example, when your objective value name
is `loss` and the additional metrics are `recall` and `precision`, your training
container should print output like this:

```
epoch 1:
loss=0.3
recall=0.5
precision=0.4

epoch 2:
loss=0.2
recall=0.55
precision=0.5
```
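
The following Python sketch shows one way a training script could emit lines in this format, together with a small parser that reads them back. The parser is only an illustration of how `name=value` lines can be picked out of mixed stdout; it is not Katib's metrics collector, and the names `format_metric`, `parse_metrics`, and `METRIC_LINE` are made up for this example.

```python
import re

# Matches a whole line of the form name=value, where value is a number.
METRIC_LINE = re.compile(
    r"^\s*([\w-]+)\s*=\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def format_metric(name, value):
    """Render one metric as a name=value line."""
    return f"{name}={value}"


def parse_metrics(stdout_text, wanted):
    """Collect the values reported for each wanted metric, in order."""
    history = {name: [] for name in wanted}
    for line in stdout_text.splitlines():
        match = METRIC_LINE.match(line)
        # Lines such as "epoch 1:" do not match and are skipped.
        if match and match.group(1) in history:
            history[match.group(1)].append(float(match.group(2)))
    return history


if __name__ == "__main__":
    # Values taken from the sample output above.
    epochs = [
        {"loss": 0.3, "recall": 0.5, "precision": 0.4},
        {"loss": 0.2, "recall": 0.55, "precision": 0.5},
    ]
    for number, metrics in enumerate(epochs, start=1):
        print(f"epoch {number}:")
        for name, value in metrics.items():
            print(format_metric(name, value))
        print()
```

Printing each metric on its own line, with nothing else on that line, keeps the output easy to match against the expected format.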

Katib periodically launches CronJobs to collect metrics from pods.