[new feature request] half-precision support
Created by: chenjiasheng
As the DS2 paper mentions:
"Our deployment system evaluates RNNs in half-precision arithmetic, which has no measurable accuracy impact, but significantly improves efficiency. We wrote our own 16-bit matrix-matrix multiply routines for this task, substantially improving throughput for our relatively small batches."
So have you implemented this half-precision arithmetic? If so, does it take advantage of CUDA's half-precision support (https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/ https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/)?
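
To make the question concrete, here is a rough sketch of the kind of evaluation I mean. It is my own illustration, not code from the DS2 paper or from this repository: it only emulates 16-bit storage on the CPU by rounding inputs to IEEE 754 half precision with Python's `struct` module, while accumulating in full precision. The names `to_half`, `matmul`, and `max_abs_diff`, and the small matrices, are hypothetical; a real implementation would presumably run on the GPU using CUDA's half-precision types.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def matmul(a, b, quantize=lambda v: v):
    """Naive matrix multiply.

    Each input element is passed through `quantize` before use;
    the products are accumulated in ordinary Python floats.
    """
    inner, cols = len(b), len(b[0])
    return [
        [
            sum(quantize(row[k]) * quantize(b[k][j]) for k in range(inner))
            for j in range(cols)
        ]
        for row in a
    ]


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(
        abs(u - v) for row_x, row_y in zip(x, y) for u, v in zip(row_x, row_y)
    )


# Hypothetical small "weights" and "activations", chosen only for illustration.
weights = [[0.1, -0.25, 0.5], [0.3, 0.7, -0.9]]
activations = [[0.2, 0.4], [-0.6, 0.8], [1.0, -0.35]]

full = matmul(weights, activations)
half = matmul(weights, activations, quantize=to_half)
print("max abs difference:", max_abs_diff(full, half))
```

This only shows the rounding error introduced by storing values in 16 bits; it says nothing about the throughput gains the paper describes, which depend on native half-precision kernels.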