New Features
- Add CUDA compute support for NCHW quantized data.
- Add gemv support to conv1x1.
- The NCHW4 layout is now supported on CPU (x86, ARM).
- An optimized nchw44-dot layout is available for the Armv8.2-a+dotprod instruction set.
- The nchw44 layout is added to optimize float computation, including but not limited to direct convolution, channel-wise convolution, hybrid-layout convolution, winograd, and mk4-matmul, as well as the pooling and elemwise algorithms.
- Graph optimization is reorganized so that some general-purpose transformations are supported in both the runtime and dump phases.
- Synchronized BN statistics are now available for multi-device training.
- PackAllReducePass is added to graph optimization for multi-device training. It packs the parameters that need AllReduce, reducing the number of inter-device communications.
- A Calibration quantization training interface is now available.
- QAT quantization training updates: ConvBn now supports composite operations of BN with fake quantization/observer; quantization is added for Conv and Linear; quantize_qat now lets you skip user-specified Modules.
- API adjustments: F.eye is moved from functional/nn.py to core/tensor_factory.py. F.add_axis and F.remove_axis now accept only an int axis; passing a list is no longer allowed.
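
As an illustration of the idea behind the PackAllReducePass item above, the following is a minimal sketch of packing tensors into groups so that each group needs only one AllReduce call. It is not MegEngine's implementation: the function name `pack_into_buckets`, the `max_bucket_bytes` cap, and the parameter names and sizes are all hypothetical and chosen only for the example.

```python
from typing import Dict, List


def pack_into_buckets(param_sizes: Dict[str, int], max_bucket_bytes: int) -> List[List[str]]:
    """Greedily group parameters so that each group can share one AllReduce call.

    Hypothetical helper for illustration; not part of MegEngine.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, nbytes in param_sizes.items():
        # Close the current bucket when adding this parameter would exceed the cap.
        if current and current_bytes + nbytes > max_bucket_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Made-up parameter sizes in bytes.
sizes = {
    "conv1.weight": 400,
    "conv1.bias": 16,
    "bn1.gamma": 16,
    "fc.weight": 900,
    "fc.bias": 40,
}
buckets = pack_into_buckets(sizes, max_bucket_bytes=1024)
print(len(sizes), "tensors ->", len(buckets), "AllReduce calls")
```

With these made-up sizes, five tensors are covered by two packed calls instead of five separate ones. A parameter larger than the cap simply ends up in a bucket of its own.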
Bug Fix
- The HSwish activation function is now supported in the FuseConvBiasWithZ pass: QFUSE_ADD_H_SWISH is folded into the conv bias operator to improve performance.
- Fix the CUDA "invalid parameter" error raised by cuda-TopK when the batch size exceeds 65535, which pushed the y dimension of the grid past its limit.
- Remove the path restriction on libcuda.so in cuda-stub.
- Fix a performance issue caused by conv1x1 mistakenly using the is_prefered method of its base class.
- In the ConvDirectUnrollBuffer algorithm, data fetched when loading src could unexpectedly become 0; adding a printf statement or removing the loop unroll optimization avoids the problem.
- Fix an issue where an endpoint was replaced twice when paramfuse processed midconst.
- Fix a bug in ReorderArithChainPass in gopt that has been present since 8.3.0 (inclusive).
- Fix cond op not supporting empty shapes.
- Fix an issue in SetMeshIndexing when multiple axes are used for indexing.
- Fix a typo in assert locator.device < sd.MAX_NR_DEVICE in CompNode (@zjd1988).
- Fix typos in voc and objects365.
- Fix an incorrect class name in voc.
- Fix the use of default_comp_graph in Tensor.
- Fix a graph optimization failure caused by saved_tensors in Function not being copyable in a static graph.
- Fix the scatter API documentation to avoid errors on GPU.
- Fix an unused var ins issue.
- Fix the non-str key error in Module fields.
- Fix scale and zero_point still being updated in eval mode for models trained with QAT.
- Disable the image algorithm on all Mali-series machines.
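
To make the cuda-TopK item above concrete: CUDA limits the y dimension of a grid to 65535, so a kernel that maps one batch row to one y index cannot be launched for larger batches. The sketch below shows one generic way to stay within that limit by splitting the batch into several launches. It is an assumption-laden illustration, not the fix actually applied in MegEngine, and `split_batch` is a hypothetical helper.

```python
MAX_GRID_Y = 65535  # CUDA limit for gridDim.y


def split_batch(batch: int, limit: int = MAX_GRID_Y):
    """Yield (offset, count) pairs covering `batch` rows, each count <= limit.

    Hypothetical helper for illustration; not part of MegEngine.
    """
    offset = 0
    while offset < batch:
        count = min(limit, batch - offset)
        yield offset, count
        offset += count


# A batch of 70000 rows needs two launches to respect the limit.
launches = list(split_batch(70000))
print(launches)  # [(0, 65535), (65535, 4465)]
```

Each `(offset, count)` pair would correspond to one kernel launch whose grid y dimension is `count`, operating on the rows starting at `offset`.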
Thanks to our Contributors
- Many thanks to @zjd1988 for submitting a PR to this release. We look forward to more developers joining us to build MegEngine together!