Created by: reyoung
We should keep Paddle fast even when using Operators. To do that, we should provide a way to fuse many fine-grained functors into a single GPU kernel.
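To make the idea concrete, here is a minimal host-side C++ sketch of functor fusion. It is not Paddle's API: the names `Scale`, `AddBias`, `Relu`, `Compose`, `Fuse`, `ElementwiseApply`, and `RunFusedExample` are all hypothetical and exist only for illustration. The point is that composing the functors at compile time lets one loop (on a GPU, one kernel launch) do the work of three, with no intermediate buffers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fine-grained functors (illustrative only, not Paddle's API).
struct Scale {
  float factor;
  float operator()(float x) const { return x * factor; }
};

struct AddBias {
  float bias;
  float operator()(float x) const { return x + bias; }
};

struct Relu {
  float operator()(float x) const { return x > 0.0f ? x : 0.0f; }
};

// Compose<F, G> applies f first, then g, as one functor.
template <typename F, typename G>
struct Compose {
  F f;
  G g;
  float operator()(float x) const { return g(f(x)); }
};

// Helper so the composed type is deduced from the arguments.
template <typename F, typename G>
Compose<F, G> Fuse(F f, G g) {
  return Compose<F, G>{f, g};
}

// A single pass over the data. In a CUDA build, this loop body would be
// the body of one kernel, and the functors' operator() would need to be
// marked __host__ __device__ (an assumption about how this would be ported).
template <typename Functor>
void ElementwiseApply(const std::vector<float>& in, std::vector<float>* out,
                      Functor op) {
  out->resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    (*out)[i] = op(in[i]);
  }
}

// Example: relu(2 * x - 1) computed in one pass instead of three.
std::vector<float> RunFusedExample() {
  std::vector<float> in;
  in.push_back(-1.0f);
  in.push_back(0.5f);
  in.push_back(2.0f);
  std::vector<float> out;
  ElementwiseApply(in, &out,
                   Fuse(Fuse(Scale{2.0f}, AddBias{-1.0f}), Relu{}));
  return out;
}
```

Because the composition is resolved through templates, the compiler can inline every functor call into the loop body, which is what would let a fused GPU kernel avoid extra launches and extra reads and writes to global memory.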
Operator
The design doc should be written first.