Breaking Changes
- Since the C++ serialization format adds a new opname field, files dumped by this version cannot be loaded by earlier releases.
- set/get_conv_execution_strategy is deprecated; use set/get_execution_strategy instead.
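The deprecated name can keep working through a thin forwarding wrapper. A minimal sketch of that pattern, with illustrative names and state (not MegEngine's actual implementation):

```python
import warnings

# Illustrative global holding the current strategy (stand-in for real state).
_execution_strategy = "HEURISTIC"

def set_execution_strategy(option):
    """New interface: set the global execution strategy."""
    global _execution_strategy
    _execution_strategy = option

def get_execution_strategy():
    """New interface: query the global execution strategy."""
    return _execution_strategy

def set_conv_execution_strategy(option):
    """Deprecated alias that warns and forwards to the new interface."""
    warnings.warn(
        "set_conv_execution_strategy is deprecated; "
        "use set_execution_strategy instead",
        DeprecationWarning,
    )
    set_execution_strategy(option)
```

Calling the old name still takes effect but emits a DeprecationWarning, giving downstream code a migration window.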
Additional Notes
- Some functionals are moved to new modules for better organization. Backward compatibility is guaranteed, so the change is not expected to affect existing usage. The moved functionals include:
- interpolate/roi_pooling/roi_align/nms/remap/warp_affine/warp_perspective/cvt_color are moved from functional.nn to functional.vision.
- sigmoid/hsigmoid/relu/relu6/hswish are moved from functional.elemwise to functional.nn.
- topk_accuracy is moved from functional.utils to functional.metric.
- copy is moved from functional.utils to functional.tensor.
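Backward compatibility for such moves is commonly achieved by re-exporting the moved function from its old location, so both import paths resolve to the same object. A toy sketch of the idea (the namespaces and the metric body are illustrative, not MegEngine source):

```python
from types import SimpleNamespace

def topk_accuracy(logits, labels, k=1):
    """Toy stand-in for the real metric, for illustration only."""
    return sum(l in row[:k] for row, l in zip(logits, labels)) / len(labels)

# New home of the function.
metric = SimpleNamespace(topk_accuracy=topk_accuracy)

# Old module re-exports the moved function, so the legacy path
# utils.topk_accuracy still resolves to the very same object.
utils = SimpleNamespace(topk_accuracy=metric.topk_accuracy)
```

Because the old attribute is the same object, existing call sites keep working unchanged.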
Bug Fixes
General components
- Fix incorrect shape inference in reshape that could cause trace errors.
- Fix a memory leak in trace.
- Fix trace error caused by linspace.
- Fix the bug in automatic differentiation which turns a scalar into a 1-dim tensor.
- Fix NCHW-to-NCHW4 layout transform in gopt.
- Fix a memory leak when the Python frontend dispatches asynchronous tasks much faster than the device executes them.
- Fix segfault caused by pyobject reference counting error.
- Fix the illegal memory access in ROIAlign operator.
- Fix load errors caused by CompNode reuse in some cases.
- Fix the graph optimization error of NormalizeArithChainPass and WarpFusion.
- Fix the device parameter in linspace.
Python API
- Fix the error raised when a scalar tensor is passed as the shape argument of F.full/F.ones/F.zeros.
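Conceptually, fixes of this kind amount to normalizing a scalar shape argument into a 1-tuple before the fill, instead of raising. A hypothetical pure-Python sketch of that normalization over nested lists (not MegEngine's actual code):

```python
def normalize_shape(shape):
    """Promote a plain scalar shape to a 1-d shape; pass sequences through."""
    if isinstance(shape, int):
        return (shape,)
    return tuple(shape)

def full(shape, value):
    """Toy dense fill over nested lists, accepting scalar or sequence shapes."""
    shape = normalize_shape(shape)
    if not shape:
        return value
    # Recurse over the remaining dimensions to build the nested structure.
    return [full(shape[1:], value) for _ in range(shape[0])]
```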
Quantization
- Fix comparison errors of quantized data types in some cases.
- Fix checkpoint loading error in quantized training.
- Fix parameters not being updated in TQT quantization-aware training.
- Fix gradient calculation in TQT.
- Fix user-defined quantized modules not being converted in quantized training.
Others
- Fix set_mgb_log_level not taking effect.
- Fix freeze parameter in batch normalization.
New Features
General components
- Support host computation for small tensors to reduce synchronization between host and device.
- Add fast profile mode for fastrun.
- Support recursive search in fastrun.
- Add matmul support in fastrun.
- Add disable-optimize-for-inference parameter to load_and_run.
- Add automatic naming of ops based on module structure during trace.
- Add static shape inference for reshape operator.
Python API
- Add new operators: cvt_color, matinv, resize, warp_affine, deformable_conv2d, deformable_psroi_pooling, repeat, tile, plus support for third-party hardware (TensorRT/Atlas/Cambricon).
- Enable tensor naming.
Distributed training
- Support scalar tensors for distributed operators.
Tools
- Add GraphInference in cgtools and support specifying output nodes.
- Support model visualization and parameter statistics from .mge files.
- Add python load_and_run.
Dataloader
- Support setting timeout and callback function after timeout in stream dataloader.
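The timeout-with-callback behavior can be pictured as a bounded wait on the stream's queue: if no item arrives in time, a user-supplied callback runs instead of blocking forever. A hypothetical sketch (class and parameter names are illustrative, not the actual Dataloader API):

```python
import queue

class StreamFetcher:
    """Toy model of a stream consumer with a timeout fallback."""

    def __init__(self, timeout=1.0, timeout_event=None):
        self.queue = queue.Queue()
        self.timeout = timeout
        # Callback invoked when no item arrives within `timeout` seconds.
        self.timeout_event = timeout_event or (lambda: None)

    def put(self, item):
        self.queue.put(item)

    def get(self):
        try:
            return self.queue.get(timeout=self.timeout)
        except queue.Empty:
            return self.timeout_event()
```

The callback lets users log, retry, or yield a sentinel batch rather than hanging the training loop on a stalled stream.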
ARM
- Automatically detect ARM platform features and enable corresponding optimizations.
- Support inference on ARM64 with CUDA.
Improvements
General components
- Support dict as returned value for traced function.
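One way a tracer can support dict return values (an assumed mechanism, not necessarily MegEngine's implementation) is to flatten the dict into a tuple with a stable key order for the traced graph, then rebuild the dict when handing results back to the caller:

```python
def flatten_output(out):
    """Flatten a dict result into (values, keys); pass other values through."""
    if isinstance(out, dict):
        keys = sorted(out)  # stable order so the graph outputs are consistent
        return tuple(out[k] for k in keys), keys
    return (out,), None

def unflatten_output(values, keys):
    """Rebuild the original return structure from flat graph outputs."""
    if keys is None:
        return values[0]
    return dict(zip(keys, values))
```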
Python API
- Add get/set_expand_structure to deal with complex key.
- Support list and dict in module repr methods.
Distributed training
- Add return values for distributed training.
Quantization
- Adjust the fake quantization strategy so that bias is fake-quantized only when both weight and activation are quantized.
- Support user-defined quantized data type in quantized training.
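The adjusted bias policy can be summarized in a few lines. In common quantization schemes the bias scale is the product of the weight and activation scales, so when either side is unquantized there is no meaningful bias scale and the bias stays in float. A minimal sketch of that rule (illustrative, not MegEngine code):

```python
def bias_fake_quant_scale(weight_scale, act_scale):
    """Return the bias scale, or None when bias should stay in float.

    Bias is fake-quantized only when both weight and activation are
    quantized; its scale is conventionally weight_scale * act_scale.
    """
    if weight_scale is None or act_scale is None:
        return None
    return weight_scale * act_scale
```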
ARM
- Add tiled Matmul kernels to improve performance for certain shapes.
Thanks to our Contributors
Many thanks to @jia-kai for the submitted PR; we warmly welcome more developers to build MegEngine together!