MegEngine v1.13.1

Highlights

  • CUDA_MODULE_LOADING is now enabled by default on the training side, removing the CUDA memory overhead of fatbin loading (effective for packages built against CUDA 11.8 and above), so more device memory is available to you. The fewer kernel types you use, the larger the saving, up to about 900 MB. A sketch of how to override this behaviour follows this list.
  • The most recent release lines, including this one (v1.12.4, v1.13), keep support for CUDA 10.1 and CUDA 11.4; subsequent releases will drop support for CUDA 10.1.
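
The following is a minimal sketch of how you could inspect or override this behaviour. CUDA_MODULE_LOADING is a standard CUDA 11.7+ driver environment variable rather than a MegEngine API, and the override path shown here is an assumption, not part of this release's documented interface.

```python
# Minimal sketch, not an official MegEngine API: CUDA_MODULE_LOADING is a
# standard CUDA >= 11.7 environment variable. This release already sets lazy
# loading by default, so setting it yourself is only needed to experiment or opt out.
import os

# Must be set before the CUDA driver is initialized, i.e. before importing
# megengine or creating any GPU tensor.
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")  # use "EAGER" to opt out

import megengine as mge  # noqa: E402

print("CUDA_MODULE_LOADING =", os.environ["CUDA_MODULE_LOADING"])
```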

Bugfix

Python API

  • Fixed the mismatched signatures of F.flatten and Tensor.flatten; both are now unified as flatten(start_axis, end_axis). See the example after this list.
  • Reworked the Python-level multiheadattention functional/module interface format, as groundwork for fixing issues in the original interface such as the intermediate attn matrix not being exposed and the qkvo projection bias options not being composable.
  • Reworked the C++-level multiheadattention functional/module interface format for the same reasons.
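
A minimal sketch of the unified signature; the shapes in the comments are only illustrative:

```python
# Both calls below use the unified flatten(start_axis, end_axis) signature.
import numpy as np
import megengine.functional as F
from megengine import Tensor

x = Tensor(np.random.rand(2, 3, 4, 5).astype("float32"))
y1 = F.flatten(x, start_axis=1, end_axis=2)  # merges axes 1..2 -> shape (2, 12, 5)
y2 = x.flatten(start_axis=1, end_axis=2)     # same result via the Tensor method
print(y1.shape, y2.shape)
```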

Dataloader

  • Fixed a warning in the dataloader caused by not closing the file opened to read the system memory size.

Common components

  • Fixed the unintuitive error message reported during trace when the return value of the traced_function is a complex nested type.
  • Fixed garbled error messages printed when logging in to Windows environments from GitLab.
  • Fixed a probabilistic crash in multi-GPU training when DTR is enabled.

Tools

  • Improved the environment dependencies of the whl package on the Windows platform.

ARM

  • Fixed a build failure on macOS aarch64 with fp16 enabled.

Documentation

  • Fixed typos in the README.

New Features

Python API

  • The profiler adds scopes for functional calls so that their call hierarchy is recorded (functional/module scopes are currently supported). See the sketch below.
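
A minimal sketch of how the recorded hierarchy can be exercised, assuming the Profiler context manager from megengine.utils.profiler; the module and tensor shapes here are illustrative only:

```python
# With the profiler enabled, functional calls made inside a module's forward
# are recorded as nested scopes, so the dumped timeline reflects the
# module/functional call hierarchy.
import numpy as np
import megengine as mge
import megengine.functional as F
import megengine.module as M
from megengine.utils.profiler import Profiler

class TinyNet(M.Module):
    def __init__(self):
        super().__init__()
        self.fc = M.Linear(8, 4)

    def forward(self, x):
        return F.relu(self.fc(x))  # both the module and the functional call get scopes

net = TinyNet()
x = mge.tensor(np.random.rand(2, 8).astype("float32"))
with Profiler():  # dumps a chrome-trace style profile that can be inspected afterwards
    y = net(x)
```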

CUDA

  • Added compilation support for CUDA 11.8 on aarch64.
  • Supported and improved the Windows CUDA 11.8 toolchain.
  • CUDA_MODULE_LOADING is enabled by default on the training side, removing the CUDA memory overhead of fatbin loading (effective for packages built against CUDA 11.8 and above), so more device memory is available.

Common components

  • The profiler adds two metrics to give you a more direct view of training performance (see the MR for details): gpu_usage_ratio, the proportion of the overall training time during which the GPU is busy, and train_time_ratio, the proportion of the overall training time actually spent in model.step (the sum over epochs of the time from the start of the first step to the end of the last step).
  • Improved the error log for unsupported oprs so that you can directly see which oprs in the input model are not implemented.
  • Added support for complex numbers, covering basic operations such as elementwise arithmetic, differentiation, and packing/unpacking of real and imaginary parts (new ops: F.polar, F.imag, F.real, F.complex; existing ops gaining complex support: add, sub, mul, negate, reshape). See the sketch after this list.
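
A minimal sketch using the complex-number ops listed above; the ops are named in this release, but their exact argument conventions are assumed here to mirror the familiar NumPy/PyTorch counterparts:

```python
import numpy as np
import megengine as mge
import megengine.functional as F

real = mge.tensor(np.array([1.0, 0.0], dtype="float32"))
imag = mge.tensor(np.array([0.0, 1.0], dtype="float32"))

z = F.complex(real, imag)        # build a complex tensor from real/imag parts
w = z * z + z                    # elementwise mul/add now accept complex tensors
print(F.real(w), F.imag(w))      # split back into real and imaginary parts

# F.polar is assumed to take (magnitude, angle), analogous to torch.polar.
r = mge.tensor(np.array([1.0, 2.0], dtype="float32"))
theta = mge.tensor(np.array([0.0, np.pi / 2], dtype="float32"))
p = F.polar(r, theta)            # r * (cos(theta) + i * sin(theta))
```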

Improvements

Common components

  • Improved the error message raised in symbolic trace when a tensor whose value cannot be statically inferred calls a numpy method, making it more complete and reasonable.

Quantization

  • Added quantization support for linear_bn and linear_bn_relu.

About

MegEngine is a fast, scalable, easy-to-use deep learning framework with automatic differentiation.

Upstream repository: https://github.com/MegEngine/MegEngine
