- Generally, it is easy to check whether the forward computation of an Operator is correct or not. However, backpropagation is a notoriously difficult algorithm to debug and get right:
- Generally, it is easy to check whether the forward computation of an Operator is correct or not. However, backpropagation is a notoriously difficult algorithm to debug and get right because of the following challenges:
1.you should get the right backpropagation formula according to the forward computation.
1. The backpropagation formula should be correct according to the forward computation.
2.you should implement it right in CPP.
2. The implementation of the above should be correct in CPP.
3.it's difficult to prepare test data.
3. It is difficult to prepare unbiased test data.
- Auto gradient checking gets a numerical gradient by forward Operator and use it as a reference of the backward Operator's result. It has several advantages:
- Auto gradient checking computes a numerical gradient using the forward Operator and uses it as a reference for the backward Operator's result. It has several advantages:
1.numerical gradient checker only need forward operator.
1.Numerical gradient checker only needs the forward operator.
2.user only need to prepare the input data for forward Operator.
2. The user only needs to prepare the input data for the forward Operator and need not worry about the backward Operator.
## Mathematical Theory
## Mathematical Theory
The following two document from Stanford has a detailed explanation of how to get numerical gradient and why it's useful.
The following documents from Stanford have a detailed explanation of how to compute the numerical gradient and why it is useful.
-[Gradient checking and advanced optimization(en)](http://deeplearning.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization)
-[Gradient checking and advanced optimization(en)](http://deeplearning.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization)
-[Gradient checking and advanced optimization(cn)](http://ufldl.stanford.edu/wiki/index.php/%E6%A2%AF%E5%BA%A6%E6%A3%80%E9%AA%8C%E4%B8%8E%E9%AB%98%E7%BA%A7%E4%BC%98%E5%8C%96)
-[Gradient checking and advanced optimization(cn)](http://ufldl.stanford.edu/wiki/index.php/%E6%A2%AF%E5%BA%A6%E6%A3%80%E9%AA%8C%E4%B8%8E%E9%AB%98%E7%BA%A7%E4%BC%98%E5%8C%96)
Get Numerical Gradient for the input of an operator.
:param op: C++ operator instance, could be an network
:param op: C++ operator instance, could be a network.
:param input_values: The input variables. Should be a dictionary, whose key is
:param input_values: The input variables. Should be a dictionary, whose key is
variable name, and value is numpy array.
variable name, and value is a numpy array.
:param output_name: The final output variable name.
:param output_name: The final output variable name.
:param input_to_check: The input variable with respect to which to compute the gradient.
:param input_to_check: The input variable with respect to which the gradient has to be computed.
:param delta: The perturbation value for numeric gradient method. The
:param delta: The perturbation value for numerical gradient method. The
smaller delta is, the more accurate result will get. But if that delta is
smaller the delta, the more accurate the result. But if the delta is too
too small, it will suffer from numerical stability problem.
small, it will suffer from the numerical stability problem.
:param local_scope: The local scope used for get_numeric_gradient.
:param local_scope: The local scope used for get_numeric_gradient.
:return: The gradient array in numpy format.
:return: The gradient array in numpy format.
"""
"""
```
```
### Explaination:
### Explanation:
- Why need`output_name`
- Why do we need `output_name`?
- An Operator may have multiple Output, one can get independent gradient from each Output. So caller should specify the name of the output variable.
- An Operator may have multiple Outputs, and an independent gradient can be computed from each Output, so the caller should specify the name of the output variable.
- Why need `input_to_check`
- Why do we need `input_to_check`?
- One operator may have multiple inputs. Gradient Op can calculate the gradient of these inputs at the same time. But Numeric Gradient needs to calculate them one by one. So `get_numeric_gradient` is designed to calculate the gradient for one input. If you need to compute multiple inputs, you can call `get_numeric_gradient` multiple times.
- One operator can have multiple inputs. The Gradient Op can calculate the gradients of all these inputs at the same time, but the numerical gradient has to be calculated for them one by one. So `get_numeric_gradient` is designed to calculate the gradient for a single input. If you need the gradients of multiple inputs, you can call `get_numeric_gradient` multiple times, each time with a different input.
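As a rough illustration of the "one input per call" design, the sketch below computes a central-difference gradient for a single named input and is simply called twice for a two-input operator. The helper `numeric_gradient`, the toy `mul_forward` operator, and the choice to reduce the output to a scalar with `sum()` are assumptions made for this example; they are not the actual `get_numeric_gradient` implementation.

```python
import numpy as np


def numeric_gradient(forward, inputs, input_to_check, delta=1e-3):
    """Central-difference gradient of sum(forward(inputs)) w.r.t. one input.

    `forward` and the sum() reduction are hypothetical stand-ins for running
    the forward Operator and reading its output.
    """
    x = inputs[input_to_check]
    grad = np.zeros_like(x, dtype=np.float64)
    flat_x = x.reshape(-1)        # view: writes are visible through `inputs`
    flat_grad = grad.reshape(-1)  # view into `grad`
    for i in range(flat_x.size):
        origin = flat_x[i]
        flat_x[i] = origin + delta
        y_pos = forward(inputs).sum()
        flat_x[i] = origin - delta
        y_neg = forward(inputs).sum()
        flat_x[i] = origin        # restore the original value
        flat_grad[i] = (y_pos - y_neg) / (2 * delta)
    return grad


def mul_forward(inputs):
    # Toy element-wise multiplication operator with two inputs.
    return inputs["X"] * inputs["Y"]


inputs = {
    "X": np.array([1.0, 2.0, 3.0]),
    "Y": np.array([4.0, 5.0, 6.0]),
}
# One call per input to check.
grad_x = numeric_gradient(mul_forward, inputs, "X")
grad_y = numeric_gradient(mul_forward, inputs, "Y")
```

For this toy operator the gradient with respect to `X` should match `Y` and vice versa, which makes it a convenient sanity check.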
### Core Algorithm Implementation
### Core Algorithm Implementation
```python
```python
# we only compute gradient of one element a time.
# we only compute the gradient of one element at a time.
# we use a for loop to compute the gradient of each element.
# we use a for loop to compute the gradient of each element.
for i in xrange(tensor_size):
for i in xrange(tensor_size):
    # get one input element by its index i.
    # get one input element using the index i.
    origin = tensor_to_check.get_float_element(i)
    original = tensor_to_check.get_float_element(i)
    # add delta to it, run op and then get the new value of the result tensor.
    # add delta to it, run the forward op and then
    x_pos = origin + delta
    # get the new value of the result tensor.
    x_pos = original + delta
    tensor_to_check.set_float_element(i, x_pos)
    tensor_to_check.set_float_element(i, x_pos)
    y_pos = get_output()
    y_pos = get_output()
    # plus delta to this element, run op and get the new value of the result tensor.
    # Subtract delta from this element, run the op again
    x_neg = origin - delta
    # and get the new value of the result tensor.
    x_neg = original - delta
    tensor_to_check.set_float_element(i, x_neg)
    tensor_to_check.set_float_element(i, x_neg)
    y_neg = get_output()
    y_neg = get_output()
    # restore old value
    # restore old value
    tensor_to_check.set_float_element(i, origin)
    tensor_to_check.set_float_element(i, original)
    # compute the gradient of this element and store it into a numpy array.
    # compute the gradient of this element and store
    # it into a numpy array.
    gradient_flat[i] = (y_pos - y_neg) / delta / 2
    gradient_flat[i] = (y_pos - y_neg) / delta / 2
# reshape the gradient result to the shape of the source tensor.
# reshape the gradient result to the shape of the source tensor.
3. GPU kernel gradient (if supported by the device)
The numerical gradient only relies on forward Operator. So we use the numerical gradient as the reference value. And the gradient checking is performed in the following three steps:
The numerical gradient only relies on the forward Operator, so we use the numerical gradient as the reference value. The gradient checking is performed in the following three steps:
1.calculate the numerical gradient
1.Calculate the numerical gradient
2.calculate CPU kernel gradient with the backward Operator and compare it with the numerical gradient
2.Calculate CPU kernel gradient with the backward Operator and compare it with the numerical gradient.
3.calculate GPU kernel gradient with the backward Operator and compare it with the numeric gradient (if supported)
3. Calculate GPU kernel gradient with the backward Operator and compare it with the numerical gradient (if supported).
#### Python Interface
#### Python Interface
...
@@ -109,26 +112,27 @@ The numerical gradient only relies on forward Operator. So we use the numerical
"""
"""
:param forward_op: used to create backward_op
:param forward_op: used to create backward_op
:param input_vars: numpy value of input variable. The following
:param input_vars: numpy value of input variable. The following
computation will use these variables.
computation will use these variables.
:param inputs_to_check: the input variable with respect to which to compute the gradient.
:param inputs_to_check: the input variable with respect to which the
gradient will be computed.
:param output_name: The final output variable name.
:param output_name: The final output variable name.
:param max_relative_error: The relative tolerance parameter.
:param max_relative_error: The relative tolerance parameter.
:param no_grad_set: used when create backward ops
:param no_grad_set: used to create backward ops
:param only_cpu: only compute and check gradient on cpu kernel.
:param only_cpu: only compute and check gradient on cpu kernel.
:return:
:return:
"""
"""
```
```
### How to check if two numpy array is close enough?
### How to check if two numpy arrays are close enough?
if `abs_numerical_grad` is nearly zero, then use abs error for numerical_grad
If `abs_numerical_grad` is nearly zero, use the absolute error instead of the relative error for `numerical_grad`.
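A minimal sketch of such a comparison is shown below. The function names, the `zero_threshold` of `1e-3`, and the default tolerance are assumptions chosen for illustration and may differ from what the checker actually uses.

```python
import numpy as np


def max_relative_diff(numeric_grad, analytic_grad, zero_threshold=1e-3):
    """Largest element-wise error between two gradients.

    Relative error is used where the numerical gradient is large enough;
    where it is nearly zero the denominator is set to 1, which turns the
    error into an absolute error. `zero_threshold` is an assumed value.
    """
    numeric_grad = np.asarray(numeric_grad, dtype=np.float64)
    analytic_grad = np.asarray(analytic_grad, dtype=np.float64)
    abs_numeric = np.abs(numeric_grad)
    denom = np.where(abs_numeric < zero_threshold, 1.0, abs_numeric)
    diff = np.abs(numeric_grad - analytic_grad) / denom
    return float(diff.max())


def grads_are_close(numeric_grad, analytic_grad, max_relative_error=0.005):
    # The two gradients are "close enough" if the worst element is within tolerance.
    return max_relative_diff(numeric_grad, analytic_grad) <= max_relative_error


err = max_relative_diff([1.0, 0.0], [1.001, 0.0005])
ok = grads_are_close([1.0, 0.0], [1.001, 0.0005])
bad = grads_are_close([1.0], [1.1])
```

Switching to the absolute error near zero avoids dividing by a tiny reference value, which would otherwise inflate harmless differences into large relative errors.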
The input data for the auto gradient checker should be reasonable to avoid numerical stability problems.
The input data for the auto gradient checker should be reasonable to avoid numerical stability problems.
#### Refs:
#### References:
-[Gradient checking and advanced optimization(en)](http://deeplearning.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization)
-[Gradient checking and advanced optimization(en)](http://deeplearning.stanford.edu/wiki/index.php/Gradient_checking_and_advanced_optimization)
-[Gradient checking and advanced optimization(cn)](http://ufldl.stanford.edu/wiki/index.php/%E6%A2%AF%E5%BA%A6%E6%A3%80%E9%AA%8C%E4%B8%8E%E9%AB%98%E7%BA%A7%E4%BC%98%E5%8C%96)
-[Gradient checking and advanced optimization(cn)](http://ufldl.stanford.edu/wiki/index.php/%E6%A2%AF%E5%BA%A6%E6%A3%80%E9%AA%8C%E4%B8%8E%E9%AB%98%E7%BA%A7%E4%BC%98%E5%8C%96)
@@ -42,7 +42,7 @@ The type *channel* is conceptually the blocking queue. In Go, its implemented i
The `select` operation has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They monitor multiple file descriptors to see if I/O is possible on any of them. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do the same in O(1) time. In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.
The `select` operation has been in OS kernels long before Go language. All Unix kernels implement system calls *poll* and *select*. They monitor multiple file descriptors to see if I/O is possible on any of them. This takes O(N) time. Since Linux 2.6, a new system call, *epoll*, can do the same in O(1) time. In BSD systems, there is a similar system call *kqueue*. Go's Linux implementation uses epoll.
It might be a good idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way, so we could focus on Python binding and the syntax.
It might be a good idea to implement Fluid's select using epoll too. In this design doc, we start from the O(N) way so that we could focus on Python binding and the syntax.
### Type Channel
### Type Channel
...
@@ -71,14 +71,14 @@ ch1 := make(chan int, 100) // a channel that can buffer 100 ints.
In Fluid, we should be able to do the same:
In Fluid, we should be able to do the same:
```python
```python
ch=fluid.make_chan(dtype=INT)
ch=fluid.make_channel(dtype=INT)
ch1=fluid.make_chan(dtype=INT,100)
ch1=fluid.make_channel(dtype=INT,100)
```
```
In addition to that, we want channels that can hold more complex element types, e.g., Tensors of float16:
In addition to that, we want channels that can hold more complex element types, e.g., Tensors of float16:
```python
```python
ch=fluid.make_chan(dtype=Tensor,etype=float16)
ch=fluid.make_channel(dtype=Tensor,etype=float16)
```
```
or Tensors of Tensors of float16 etc.
or Tensors of Tensors of float16 etc.
...
@@ -87,8 +87,136 @@ The point here is that we need a consistent way to compose types, like in C++ we
### Send and Recv
### Send and Recv
Go's CSP implementation depends on data type *channel*. There are two types of channels:
1. The unblocked channel, or buffered channel, is a blocking queue with a non-zero sized buffer. Sending to a buffered channel blocks if the buffer is full, and receiving blocks if the buffer is empty.
1. The blocked channel, or unbuffered channel, is a blocking queue with no buffer. Both sending and receiving block on an unbuffered channel until the other side is ready.
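The blocking behavior of a buffered channel can be approximated in Python with the standard library's `queue.Queue`. This is only an analogy for the semantics described above, not Fluid's or Go's implementation, and it does not model unbuffered channels (a `Queue` with `maxsize=0` is unbounded rather than zero-sized).

```python
import queue
import threading

# Stand-in for a buffered channel that can hold two elements.
ch = queue.Queue(maxsize=2)
ch.put(1)
ch.put(2)

# The buffer is now full: a non-blocking send fails, and a blocking
# send would wait until a receiver frees a slot.
try:
    ch.put_nowait(3)
    send_would_block = False
except queue.Full:
    send_would_block = True

received = []


def receiver():
    # Receiving one element frees one slot in the buffer.
    received.append(ch.get())


t = threading.Thread(target=receiver)
t.start()
ch.put(3)  # blocks until the receiver has taken an element
t.join()
```

After this runs, the receiver holds the first element and the queue again contains two elements, mirroring a full buffered channel.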
There are four types of actions with a channel:
1. Create a channel
```go
ch := make(chan int) // this is an unbuffered channel
ch := make(chan int, 100) // this is a buffered channel of 100 ints.
```
1. Send
```go
ch <- 111
```
1. Recv
```go
y, ok := <-ch
```
1. Close
```go
close(ch)
```
Please be aware that a closed channel is not a nil channel, which is `var ch chan int`.
There are some [axioms with channels](https://dave.cheney.net/2014/03/19/channel-axioms):
1. A send to a nil channel blocks forever
1. A receive from a nil channel blocks forever
1. A send to a closed channel panics
1. A receive from a closed channel returns the residual values and then zeros.
In Fluid, we have [buffered channels](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/details/buffered_channel.h) and [unbuffered channels](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/details/unbuffered_channel.h).
The following program illustrates the Python syntax for accessing Fluid channels.
```python
import fluid

buffer_size = 10
ch = fluid.make_channel(dtype=INT, buffer_size)

# Now write buffer_size elements to the channel
with fluid.while(steps=buffer_size):
    fluid.send(ch, step)

fluid.close_channel(ch)

with fluid.while(steps=buffer_size):
    fluid.print(fluid.recv(ch))
```
The following example shows that to avoid the always-blocking behavior of unbuffered channels, we need to use Fluid's goroutines.
```python
import fluid

ch = fluid.make_channel(dtype=INT)

with fluid.go():
    fluid.send(ch)

y = fluid.recv(ch)

fluid.close_channel(ch)
```
### Select
### Select
In Go, the `select` statement lets a goroutine wait on multiple communication operations. A `select` blocks until one of its cases can run, then it executes that case. It chooses one at random if multiple are ready.
```go
ch1 := make(chan int)
ch2 := make(chan int, 100)
x := 0
for {
    select {
    case ch1 <- x:
        x = x + 1
    case y := <-ch2:
        fmt.Println("Received on channel:", y)
    default:
        fmt.Println("Default")
    }
}
```
In Fluid, we should be able to do the same:
```python
ch1 = fluid.make_channel(dtype=INT)
ch2 = fluid.make_channel(dtype=INT, 100)

sel = fluid.select()

with sel.case(ch1, 'w', X):
    fluid.layers.increment(X)

with sel.case(ch2, 'r', Y):
    fluid.print("Received on Channel")

with sel.default():
    fluid.print("Default")
```
In the above code snippet, `X` and `Y` are variables. Now let us look at each of these statements one by one.
- `sel.case(ch1, 'w', X)`: This specifies that we are writing to `ch1` and that the integer in variable `X` is the value to write. The character `w` mirrors the write mode used in Python file I/O.
- `sel.case(ch2, 'r', Y)`: This specifies that we would like to read the result from `ch2` into variable `Y`. The character `r` mirrors the read mode used in Python file I/O.
- `sel.default()`: This is equivalent to `default` in Go's `select`. If none of the channels is ready for reading or writing, the Fluid code in the default block is executed.
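The O(N) approach mentioned earlier can be sketched in plain Python by polling each case in turn. Everything here is hypothetical: `queue.Queue` stands in for a channel, the `select` helper and its `(channel, mode, value, body)` tuples are invented for illustration, and unlike Go's `select` this version returns `None` instead of blocking when no case is ready and no default is given.

```python
import queue
import random


def select(cases, default=None):
    """Try each case once, in random order; run the first one that is ready.

    cases: list of (channel, mode, value, body) tuples, where mode is
    'w' (send `value`) or 'r' (receive), and body is called with the
    value that was sent or received.
    """
    order = list(range(len(cases)))
    random.shuffle(order)  # Go picks at random among ready cases
    for idx in order:
        ch, mode, value, body = cases[idx]
        try:
            if mode == "w":
                ch.put_nowait(value)
                item = value
            else:
                item = ch.get_nowait()
        except (queue.Full, queue.Empty):
            continue  # this case is not ready; poll the next one
        return body(item)
    return default() if default is not None else None


ch1 = queue.Queue(maxsize=1)
ch1.put(0)  # ch1 is full, so writing to it is not ready
ch2 = queue.Queue(maxsize=100)
ch2.put(7)  # ch2 has one element ready to be read

cases = [
    (ch1, "w", 1, lambda v: "sent {}".format(v)),
    (ch2, "r", None, lambda v: "received {}".format(v)),
]
first = select(cases, default=lambda: "default")
second = select(cases, default=lambda: "default")
```

On the first call only the read case is ready, so it runs; on the second call neither case is ready and the default branch is taken.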