Execute operators with kernels implemented in MegDNN using the NCHW64 tensor format. Can only be used
on NVIDIA GPUs that natively support fast int4 Tensor Core inference.
)__usage__"
R"__usage__(
--layout-transform [cuda|x86|arm|opencl|unspec]
	Enable the global layout transform optimization for the computing graph. Users should specify the target device for the optimization, and a series of passes will then be applied to the computing graph. These passes benchmark the elapsed time of operators on different tensor layouts and select the fastest implementation for each operator, so the optimization process takes some time. The default target is unspec, in which case all available implementations for the operators are profiled, making the optimization time even longer.
--layout-transform-dump <dump_path>
The computing graph after global layout transform will be dumped to the given file path.
--layout-transform-verify
	After applying the layout transform optimization, the outputs of the computing graph before and after the layout transform passes will be compared to verify the correctness of the passes.
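Example (a hypothetical invocation sketch; the binary name load_and_run and the model paths are
assumptions for illustration, only the --layout-transform* options come from this usage text):

```
# Profile layouts for the CUDA target, dump the optimized graph,
# and verify its outputs against the original graph.
load_and_run ./model.mge \
    --layout-transform cuda \
    --layout-transform-dump ./model_layout_opt.mge \
    --layout-transform-verify
```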