* Fix bug and add traceback.
* Speed up data processing with multi-threading/multi-processing.
* Add profiling scripts.
* Use depthwise transposed conv2d.
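
As a rough illustration of the data-processing speed-up noted above, the following is a minimal sketch using Python's standard `concurrent.futures`. It is not the project's actual code: `preprocess`, `process_all`, and the `max_workers` default are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(sample):
    # Hypothetical per-sample step; stands in for real loading or augmentation.
    return sample * 2


def process_all(samples, max_workers=4):
    # Fan the samples out over a pool of worker threads.
    # executor.map yields results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(preprocess, samples))


if __name__ == "__main__":
    print(process_all(range(5)))
```

Threads tend to help when the per-sample work is I/O-bound (for example, reading files). For CPU-bound work, `ProcessPoolExecutor` from the same module offers the same `map` interface and can be swapped in, provided the worker function is picklable.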