A key challenge in the field of deep learning is to automatically derive the backward pass from the forward pass described algorithmically by researchers. Such a derivation, or a transformation of the forward pass program, has been long studied before the recent prosperity of deep learning in the field known as [automatic differentiation](https://arxiv.org/pdf/1502.05767.pdf).
## The Tape
Given the forward pass program (usually in Python in practices), there are two strategies to derive the backward pass:
1. from the forward pass program itself, or
1. from the execution trace of the forward pass program, which is often known as the *tape*.
This article surveys systems that follow the latter strategy.
## Dynamic Network
When we train a deep learning model, the tape changes every iteration as the input data change, so we have to re-derive the backward pass every iteration. This is known as *dynamic network*.
Deep learning systems that utilize the idea of dynamic network gained their popularities in recent years. This article surveys two representative systems: [PyTorch](https://pytorch.org/) and [DyNet](https://dynet.readthedocs.io/en/latest/).
## An Overview
Both frameworks record a ‘tape’ of the computation and interpreting (or run-time compiling) a transformation of the tape played back in reverse. This tape is a different kind of entity than the original program.[[link]](http://www.bcl.hamilton.ie/~barak/papers/toplas-reverse.pdf)
During the forward execution, a list of operators, in this case `matmul`, `matmul` and `softmax`, are recorded in the tape, along with the necessary information needed to do the backward such as pointers to the inputs and outputs. Then the tape is played in reverse order at `loss.backward()`.
The graph is composed of `Variable`s and `Function`s. During the forward execution, a `Variable` records its creator function, e.g. `h.creator = matmul`. And a Function records its inputs' previous/dependent functions `prev_func` through `creator`, e.g. `matmul.prev_func = matmul1`. At `loss.backward()`, a topological sort is performed on all `prev_func`s. Then the grad op is performed by the sorted order.
Chainer and Autograd uses the similar techniques to record the forward pass. For details please refer to the appendix.
## Design choices
### 1) Dynet's List vs Pytorch's Node Graph
What's good about List:
1. It avoids a topological sort. One only needs to traverse the list of operators in reverse and calling the corresponding backward operator.
1. It promises effient data parallelism implementations. One could count the time of usage of a certain variable during the construction list. Then in the play back, one knows the calculation of a variable has completed. This enables communication and computation overlapping.
What's good about Node Graph:
1. More flexibility. PyTorch users can mix and match independent graphs however they like, in whatever threads they like (without explicit synchronization). An added benefit of structuring graphs this way is that when a portion of the graph becomes dead, it is automatically freed. [[2]](https://openreview.net/pdf?id=BJJsrmfCZ) Consider the following example, Pytorch only does backward on SmallNet while Dynet does both BigNet and SmallNet.
```python
result=BigNet(data)
loss=SmallNet(data)
loss.backward()
```
### 2) Dynet's Lazy evaluation vs Pytorch's Immediate evaluation
Dynet builds the list in a symbolic matter. Consider the following example
The computation of `lookup`, `concat`, `matmul` and `softmax` didn't happen until the call of `loss_sym.value()`. This defered execution is useful because it allows some graph-like optimization possible, e.g. kernel fusion.
Pytorch chooses immediate evaluation. It avoids ever materializing a "forward graph"/"tape" (no need to explicitly call `dy.renew_cg()` to reset the list), recording only what is necessary to differentiate the computation, i.e. `creator` and `prev_func`.
## What can fluid learn from them?
TBD
# Appendix
### Overview
| Framework | Has Tape | Core in C++ | First Release Date |
[Backward code](https://github.com/HIPS/autograd/blob/442205dfefe407beffb33550846434baa90c4de7/autograd/core.py#L8-L40). In the forward pass, a graph of VJPNode is constructed.
# (1) Function Set definition, creates FunctionNode
model=FunctionSet(
l1=F.Linear(784,100),
l2=F.Linear(100,100),
l3=F.Linear(100,10)).to_gpu()
# (2) Optimizer Setup
opt=optimizers.SGD()
opt.setup(model)
# (3) Forward computation
defforward(x,t):
h1=F.relu(model.l1(x))
h2=F.relu(model.l2(h1))
y=model.l3(h2)
returnF.softmax_cross_entropy(y,t)
# (4) Training loop
forepochinxrange(n_epoch):
foriinxrange(0,N,b_size):
x=Variable(to_gpu(...))
t=Variable(to_gpu(...))
opt.zero_grads()
loss=forward(x,t)
loss.backward()
opt.update()
```
In `forward(x, t)`, a graph of [`VariableNode`](https://github.com/chainer/chainer/blob/master/chainer/variable.py#L110) and [`FunctionNode`](https://github.com/chainer/chainer/blob/a69103a4aa59d5b318f39b01dbcb858d465b89cf/chainer/function_node.py#L19) is constructed. Every output's `VariableNode.creator` is pointed to the `FunctionNode`.
`loss.backward()` will calculate the accumulated gradient of all variables. All the backward of `FunctionNode`s will be called based on the topological order.
```python
classVariableNode(object):
...
defbackward(self,retain_grad,loss_scale):
ifself.creator_nodeisNone:
return
cand_funcs=[]
seen_set=set()
grads={}
# Initialize error by 1, if this is a loss variable
ifself.data.size==1andself._grad_varisNone:
self.grad=numpy.ones_like(self.data)
grads[self._node]=self._grad_var
defadd_cand(cand):
ifcandnotinseen_set:
# Negate since heapq is min-heap. This is a global variable
[forward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L84-L158), [backward](https://github.com/clab/dynet/blob/740a9626a13a2732544de142e256ad0d0a166658/dynet/exec.cc#L166-L284). The trace is done by creating a tape of expressions in every iteration. Backward is done by traverse the tape in the reverse order.