# ms-dist-train.yaml
# WIP example for distributed training
apiVersion: "kubeflow.org/v1"
kind: "MSJob"
metadata:
  name: "msjob-mnist"
spec:
  backend: "tcp"
  masterPort: "23456"
  replicaSpecs:
    - replicas: 1
      replicaType: MASTER
      template:
        spec:
          containers:
          - image: mindspore/mindspore-cpu:0.1.0-alpha
            imagePullPolicy: IfNotPresent
            name: msjob-mnist
            command: ["/bin/bash", "-c", "python /tmp/test/MNIST/lenet.py"]
            volumeMounts:
              - name: training-result
                mountPath: /tmp/result
              - name: ms-mnist-local-file
                mountPath: /tmp/test
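          volumes:
            - name: training-result
              emptyDir: {}
            # Backs the ms-mnist-local-file mount (/tmp/test) used by the container above.
            - name: ms-mnist-local-file
              hostPath:
                path: /root/gopath/src/gitee.com/mindspore/ms-operator/examples
            # Declared but not mounted by the container above.
            - name: entrypoint
              configMap:
                name: dist-train
                defaultMode: 0755
          restartPolicy: OnFailure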
    - replicas: 3
      replicaType: WORKER
      template:
        spec:
          containers:
          - image: mindspore/mindspore-cpu:0.1.0-alpha
            imagePullPolicy: IfNotPresent
            name: msjob-mnist
            command: ["/bin/bash", "-c", "python /tmp/test/MNIST/lenet.py"]
            volumeMounts:
              - name: training-result
                mountPath: /tmp/result
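              - name: ms-mnist-local-file
                mountPath: /tmp/test
          volumes:
            - name: training-result
              emptyDir: {}
            # Host directory mounted at /tmp/test by the container above.
            - name: ms-mnist-local-file
              hostPath:
                path: /root/gopath/src/gitee.com/mindspore/ms-operator/examples
            # Declared but not mounted by the container above.
            - name: entrypoint
              configMap:
                name: dist-train
                defaultMode: 0755
          restartPolicy: OnFailure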
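---
# Sketch only, not part of the original example. Both pod templates above
# declare an `entrypoint` volume backed by a ConfigMap named `dist-train`,
# which this file does not define; pods that reference a missing ConfigMap
# volume generally cannot start until it exists. A minimal placeholder,
# assumed to live in the same namespace as the MSJob, might look like this.
# The key `launch.sh` and its contents are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dist-train
data:
  launch.sh: |
    #!/bin/bash
    # Hypothetical entrypoint; the real script is not shown in the source.
    python /tmp/test/MNIST/lenet.py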