Created by: kexinzhao
Add an fp16 compute kernel to the mul op so that it can call the FP16 GEMM math function, which uses the cuBLAS fp16 kernel on GPU.
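
As a rough illustration of the type-based dispatch described above, here is a minimal, CPU-only C++ sketch. Every name in it (`Half`, `Gemm`, `Mul`) is a hypothetical stand-in and not the code in this PR; on GPU, the half-precision specialization would presumably forward to a cuBLAS half-precision routine such as `cublasHgemm` instead of the loop shown here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a 16-bit float type. It stores a float for
// simplicity; a real build would use the framework's own float16 type.
struct Half {
  float v;
};

// Primary template: plain row-major GEMM, C = A (m x k) * B (k x n).
template <typename T>
struct Gemm {
  static std::string Path() { return "generic"; }
  static void Run(const std::vector<T>& a, const std::vector<T>& b,
                  std::vector<T>* c, std::size_t m, std::size_t n,
                  std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
        T acc = T(0);
        for (std::size_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
        (*c)[i * n + j] = acc;
      }
    }
  }
};

// Specialization selected when the element type is Half. This is the slot
// where a GPU build would call the half-precision BLAS routine (assumption).
template <>
struct Gemm<Half> {
  static std::string Path() { return "fp16"; }
  static void Run(const std::vector<Half>& a, const std::vector<Half>& b,
                  std::vector<Half>* c, std::size_t m, std::size_t n,
                  std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
      for (std::size_t j = 0; j < n; ++j) {
        float acc = 0.0f;  // accumulate in float, then store as Half
        for (std::size_t p = 0; p < k; ++p)
          acc += a[i * k + p].v * b[p * n + j].v;
        (*c)[i * n + j] = Half{acc};
      }
    }
  }
};

// Hypothetical "mul op" kernel: the element type alone picks the GEMM path.
template <typename T>
std::vector<T> Mul(const std::vector<T>& x, const std::vector<T>& y,
                   std::size_t m, std::size_t n, std::size_t k) {
  std::vector<T> out(m * n);
  Gemm<T>::Run(x, y, &out, m, n, k);
  return out;
}
```

Calling `Mul<float>` takes the generic path, while `Mul<Half>` resolves to the specialized one at compile time, so the op body itself needs no fp16-specific branches.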