# Design Doc: Trainer Communication Library

For an overview of the trainer's role, please refer to the [distributed training design doc](README.md). This design doc discusses the trainer's communication library, which manages communication with the parameter servers and the [master server](master_server.md). The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.

## Go Interface

The Go interface is the basic abstraction for communicating with the master server and the parameter servers. Before the library is exposed through a [C interface](#c-interface), we will add another layer on top of it that adds retry logic and adapts the interface to C idioms.

```go
// MasterClient is the client to the master server.
type MasterClient struct{}

// GetTask reports the finished task to the master server and gets a new task.
// Use nil as the finished task when getting a task for the first time.
func (*MasterClient) GetTask(finished master.Task) (master.Task, error)

// ElementType is the type of elements of a Parameter.
type ElementType int

// Different element types.
const (
	Int32 ElementType = iota
	UInt32
	Int64
	UInt64
	Float32
	Float64
)

// Parameter is a piece of data to sync with the parameter servers.
type Parameter struct {
	Name        string
	ElementType ElementType
	Buffer      []byte
}

// Gradient is the gradient of the parameter.
type Gradient Parameter

// PServerClient is the client to the parameter servers.
type PServerClient struct{}

// UpdateRule specifies the rule for updating parameters with gradients.
type UpdateRule struct {
	UpdateMethod pserver.UpdateMethod
	LearningRate float32
}

// ParamInitChan returns a send channel for parameter initialization.
//
// ParamInitChan is called by multiple trainers, but only one of them should
// initialize the parameters on the parameter servers; the other trainers get
// the initialized parameters from the parameter servers using GetParam.
//
// If the returned send channel is not nil, this trainer has been selected to
// do the initialization. It sends the initial parameters on the channel and
// then closes the channel to signal that initialization has finished.
func (*PServerClient) ParamInitChan() (send chan<- Parameter, err error)

// SendGrad sends gradients to the parameter servers.
func (*PServerClient) SendGrad(method pserver.UpdateMethod, grads []Gradient) error

// GetParam gets parameters from parameter servers.
func (*PServerClient) GetParam(names []string) ([]Parameter, error)

// Save tells the parameter servers to save the parameters to the given path.
//
// Path needs to be the path to a distributed file system which is visible
// to all parameter servers.
func (*PServerClient) Save(path string) error
```
Please see the [master server design doc](master_server.md) for the definition of `master.Task`.
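
The nil-first convention of `GetTask` leads to a simple trainer loop. The following is a minimal sketch against an in-memory fake master; `Task`, `fakeMaster`, `errNoMoreTasks`, and `runTrainer` are hypothetical stand-ins, and the sketch assumes that the finished task is passed as a pointer (so that nil is valid) and that the master signals the end of work with an error.

```go
package main

import (
	"errors"
	"fmt"
)

// Task is a hypothetical stand-in for master.Task. A pointer is used so
// that nil can mean "no finished task yet".
type Task struct {
	ID int
}

// errNoMoreTasks is a hypothetical sentinel meaning all work is done.
var errNoMoreTasks = errors.New("no more tasks")

// fakeMaster is an in-memory stand-in for MasterClient.
type fakeMaster struct {
	todo     []Task // tasks not yet handed out
	finished []int  // IDs of tasks reported as finished
}

// GetTask records the finished task (if any) and hands out the next one.
func (m *fakeMaster) GetTask(finished *Task) (*Task, error) {
	if finished != nil {
		m.finished = append(m.finished, finished.ID)
	}
	if len(m.todo) == 0 {
		return nil, errNoMoreTasks
	}
	t := m.todo[0]
	m.todo = m.todo[1:]
	return &t, nil
}

// runTrainer fetches tasks until the master runs out, reporting each
// finished task on the following GetTask call.
func runTrainer(m *fakeMaster) error {
	var done *Task // nil on the first call
	for {
		task, err := m.GetTask(done)
		if errors.Is(err, errNoMoreTasks) {
			return nil
		}
		if err != nil {
			return err
		}
		// ... train on the data described by task ...
		done = task
	}
}

func main() {
	m := &fakeMaster{todo: []Task{{ID: 1}, {ID: 2}, {ID: 3}}}
	if err := runTrainer(m); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("finished:", m.finished) // finished: [1 2 3]
}
```

Because each call reports the previously finished task, the last task is reported by the call that finds no more work, so no completed task is left unreported.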

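The initialization protocol of `ParamInitChan` can be illustrated with an in-memory fake. This is a sketch under stated assumptions, not the parameter server's implementation: `fakePServer`, `newFakePServer`, `trainerInit`, and the parameter name `w` are hypothetical, the election simply picks the first caller, and `GetParam` is assumed to block until the selected trainer has closed its send channel.

```go
package main

import (
	"fmt"
	"sync"
)

// Parameter is a trimmed copy of the type above (ElementType omitted).
type Parameter struct {
	Name   string
	Buffer []byte
}

// fakePServer is a hypothetical in-memory stand-in for the parameter servers.
type fakePServer struct {
	mu      sync.Mutex
	elected bool                 // whether an initializer has been chosen
	params  map[string]Parameter // stored parameters
	ready   chan struct{}        // closed once initialization has finished
}

func newFakePServer() *fakePServer {
	return &fakePServer{params: make(map[string]Parameter), ready: make(chan struct{})}
}

// ParamInitChan returns a non-nil send channel to the first caller only.
func (s *fakePServer) ParamInitChan() (send chan<- Parameter, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.elected {
		return nil, nil
	}
	s.elected = true
	ch := make(chan Parameter)
	go func() {
		for p := range ch { // ends when the initializer closes ch
			s.mu.Lock()
			s.params[p.Name] = p
			s.mu.Unlock()
		}
		close(s.ready)
	}()
	return ch, nil
}

// GetParam waits for initialization to finish, then returns the named parameters.
func (s *fakePServer) GetParam(names []string) ([]Parameter, error) {
	<-s.ready
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Parameter, 0, len(names))
	for _, n := range names {
		p, ok := s.params[n]
		if !ok {
			return nil, fmt.Errorf("parameter %q not found", n)
		}
		out = append(out, p)
	}
	return out, nil
}

// trainerInit is what each (hypothetical) trainer runs at start-up.
func trainerInit(s *fakePServer) []Parameter {
	send, err := s.ParamInitChan()
	if err != nil {
		panic(err)
	}
	if send != nil {
		// Selected: push the initial values, then close the channel to
		// signal that initialization has finished.
		send <- Parameter{Name: "w", Buffer: []byte{1, 2, 3}}
		close(send)
	}
	params, err := s.GetParam([]string{"w"})
	if err != nil {
		panic(err)
	}
	return params
}

func main() {
	s := newFakePServer()
	results := make([][]Parameter, 3)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = trainerInit(s)
		}(i)
	}
	wg.Wait()
	for i, r := range results {
		fmt.Println(i, r[0].Name, r[0].Buffer)
	}
}
```

All three trainers end up with the same value for `w`, but only one of them sent it.
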
## C Interface

TODO