
Bug Fixes

  • Fix errors reported by ASAN
  • Fix copying across compute nodes on Cambricon
  • Fix the GPU out-of-memory error caused by profiling
  • Fix device memory not being reclaimed correctly on Cambricon
  • Fix the out-of-memory error on GPU 0 during distributed training caused by incorrectly set CUDA environment variables
  • Fix tensor split
  • Fix excessive memory usage in ARM test cases
  • Fix excessive GPU memory usage by Fastrun
  • Fix the issue where the batch size specified when dumping an Atlas model exceeds the model's maximum batch size
  • Fix MLIR failing to handle different shapes correctly
  • Fix a dangling pointer when MLIR executes on CUDA
  • Fix weight pre-processing to correctly handle ConvBias without a bias
  • Fix garbled logs caused by a second crash while printing the error stack trace

New Features

  • Perform a full sync when Python exits
  • Add sub-packages to MegEngine
  • Print a warning when the pooling window size is smaller than the padding size
  • Add an Atlas stub to support dumping Atlas models on x86 platforms
  • Add memory forwarding to the JITExecutor operator
  • Allow load_and_run to write results to stdout/stderr, not just files
  • Add the EasyQuant quantization method
  • Support tensor swap-in/swap-out and recomputation
  • Support in-place add_update in the optimizer
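
The pooling warning listed above can be illustrated with a minimal sketch in plain Python. This is not MegEngine's implementation: the function name `check_pooling_args` and its tuple arguments are hypothetical, chosen only to show the kind of check involved. The likely motivation is that when a window is smaller than the padding, the windows at the border cover only padded elements.

```python
import warnings


def check_pooling_args(window_size, padding):
    """Hypothetical helper: warn if any pooling window dimension is
    smaller than the padding in that dimension.

    Returns True if a warning was issued, False otherwise.
    """
    # Compare each spatial dimension of the window with its padding.
    suspicious = any(w < p for w, p in zip(window_size, padding))
    if suspicious:
        warnings.warn(
            f"pooling window size {window_size} is smaller than "
            f"padding size {padding}",
            RuntimeWarning,
            stacklevel=2,
        )
    return suspicious


# A 2x2 window with padding 3 triggers the warning; a 3x3 window with
# padding 1 does not.
check_pooling_args((2, 2), (3, 3))
check_pooling_args((3, 3), (1, 1))
```

The sketch only warns rather than raising, matching the release note, so existing models keep running while the suspicious configuration is surfaced.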


Improvements

  • Add pre-processing fusion optimizations for common video detection networks
  • Add fusion of DimShuffle and Reformat with ConvBias
  • Add fusion of WarpPerspective with DimShuffle
  • Improve performance by moving the tensor, gradient, and trace implementations from Python to C++
  • Modify the gradient rules of some operators to save GPU memory
  • Optimize the performance and memory usage of QAT and TQT quantization training
  • Adjust the CUDA chanwise Convolution algorithm selection strategy
  • Optimize the performance of NCHW32 pooling operator
  • Optimize the performance of CallbackCaller operator
  • Optimize CUDA IO communication

Compatibility Violations

  • None




