diff --git a/design/meps/MEP-AKG.md b/design/meps/mep-akg/MEP-AKG.md similarity index 98% rename from design/meps/MEP-AKG.md rename to design/meps/mep-akg/MEP-AKG.md index eeee4f2d054cdd3b98f081e35bba2388e9281ba6..6889284abbb3d75f43457ff99288007a1e55f62d 100644 --- a/design/meps/MEP-AKG.md +++ b/design/meps/mep-akg/MEP-AKG.md @@ -1,218 +1,218 @@ -| title | authors | owning-sig | participating-sigs | status | creation-date | reviewers | approvers | stage | milestone | -| ------- | -------------------------------- | ---------- | ------------------ | ----------- | ------------- | --------- | --------- | ----- | ------------- | -| MEP-AKG | @anyrenwei  @ckey_dou @dylangeng | akg | | provisional | 2020-06-16 | | TBD | beta | beta : "v0.5" | - -# MEP-AKG: Auto Kernel Generator - -## Table of Contents - - - -- [MEP-AKG: Auto kernel Generator](#mep-akg-auto-kernel-generator) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [User Stories](#user-stories) - - [Deep Graph Optimization](#deep-graph-optimization) - - [Optimize Dynamic Neural Network](#optimize-dynamic-neural-network) - - [Design Details](#design-details) - - [Test Plan](#test-plan) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - - [References](#references-optional) - - - -## Summary - - - -AKG is an optimizer for operators in Deep Learning Networks. It provides the ability to automatically fuse ops with specific patterns. AKG works with MindSpore-GraphKernel to improve the performance of networks running on different hardware backends - -## Motivation - - - -Fusion can improve the performance of Deep Learning networks significantly. The fusion pattern varies in different networks, it may also change even in the same network when the hyperparameters change. So it's hard to ahead-of-time cover all the fused operators manually. GraphKernel analyzes the graph and find out the opportunities to fuse according to pre-designed patterns. AKG generates high-performance target code for these patterns on different hardware backends. - -### Goals - - - -- Provide ability to fuse operators with specific patterns in resnet50 and bert. -- Provide ability to generate high-performance target code for these patterns automatically on different hardware backends. - -### Non-Goals - - -- None - -## Proposal - - - -AKG aims to generate high-performance target code for fusing operators with specific patterns on different hardware backends. So three basic processes should be contained in akg as follows. -- **Operator Expression.** - AKG defines several basic operators which can be used to compose a complicated fused operator. These basic operators have the same granularity with MindSpore's IR. We introduce json to expressed the relation of the basic operators in one fused operator which brings weak dependency between MindSpore and AKG. - -- **Schedule initialize based on polyhedral.** - - When akg obtained the dsl of operators which would be fused, it would transform the operator dsl into formularIR(now we use HalidIR as tvm) and then into isl schedule tree. Next the polyhedral schedule process begin. With the help of pluto algorithm and other optimizations the schedule tree will do some transformations including vectorization, loop tiling, mem promotion and loop distribution, which can help us to improve the parallel capability and data locality. 
- -- **Emit instructions on different hardware from IR.** - - In order to generate correctness and high-performance codes for different hardware, The IR should be optimized respectively, which consists of double buffer optimization, storage rewrite optimization and inject sync optimization. - - -### User Stories - - - -#### Deep Graph Optimization - -Since the network is becoming more deeper and larger, there are more opportunity to fused different operation into one to optimize network performance. -AKG tools has the ability to auto-generate target code based on composited dsl, without scheduling procedure. -After automatic operator fusion and operator re-composition in graph level, AKG tools can generates high-performance target code for these composited pattern. - -#### Optimize Dynamic Neural Network - -Networks are exhibiting more and more dynamism, especially in the fields of deep graph analysis and NLP. -Tensors in a model may have dynamic shapes such as batch size, image size, sequence length, etc. -Models are expressed with control-flow, such as recursion, conditionals and loops. -Within these different dynamic requirement, AKG can generate one general target code on davinci hardware(different hardware) using for different shape of one common operator. - -## Design Details - - - - - -AKG composes with four basic optimization module, normalization, auto schedule, instruction emit and backend optimization. -- **normalization.** The mainly optimization of normalization includes three address transform, common subexpression elimination, copy propagation and so on. -- **auto schedule.** The auto schedule module mainly have vectorization, loop tiling, mem promotion and loop distribution. -- **instruction emit.** The instruction emitting module has the optimization about loop normalization, auto pragma and emit instruction. -- **backend optimization.** The backend optimization module consists of double buffer optimization, storage rewrite optimization and inject sync optimization. - - - -When GraphKernel is enabled, ops are reconstructed in the graph level. The new ops described in the format of json will be translated into DSL in AKG and then compiled to the target binary. - - - - - - - -### Test Plan - - - -AKG employed pytests and nosetest to launch the testing process, and there are three types of testing strategies in AKG: - -- **Unit Test.** Every optimization or pass in AKG has its own unitest. - -- **System test**. The akg module has its own component testing. Basically we classify the testing into compilation verification, function verification and performance testing. - -- **Integration test or API test**. Akg provides certain number of APIs to MindSpore-GraphKernel. So in the integration test process we have to make sure the fusion of patterns meets the requirements from both correctness and performance. - -## Implementation History - - - -- Support auto scheduling and auto tuning -- Support auto pragma optimization and alignment optimization and auto emitinsn -- Support auto tiling optimization -- Support To ThreeAddr and CSE optimization for auto-diff -- Support dynamic shape for resnet inference -- Enhance fused operator performance for Deep Graph Optimization - -## Drawbacks - - -- The schedule generated directly by pluto algorithm during the polyhedral process would exist some issues on both correctness and performance in some scenarios. So some extra passes have to added before emitting instructions. 
-
-## Alternatives
-
-- Both TVM[1] and TC[2] are outstanding tools which can automatically synthesize high-performance machine learning kernel. However, neither of them could generate codes for Davinci cores(cce codes) as davinci cores have more complicated multi-level memory design(L0-A/B/C, L1 and UB) as well as specific dataflow constraint. Besides, TVM adopted schedule space model and had to write the schedule all by ourselves while akg used polyhedral techniques to initialize the schedule automatically, which referenced from the designing of TC.
-
-## References
-- [1] https://github.com/apache/incubator-tvm
-- [2] https://github.com/facebookresearch/TensorComprehensions
+| title | authors | owning-sig | participating-sigs | status | creation-date | reviewers | approvers | stage | milestone |
+| ------- | -------------------------------- | ---------- | ------------------ | ----------- | ------------- | --------- | --------- | ----- | ------------- |
+| MEP-AKG | @anyrenwei @ckey_dou @dylangeng | akg | | provisional | 2020-06-16 | | TBD | beta | beta : "v0.5" |
+
+# MEP-AKG: Auto Kernel Generator
+
+## Table of Contents
+
+- [MEP-AKG: Auto Kernel Generator](#mep-akg-auto-kernel-generator)
+  - [Table of Contents](#table-of-contents)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+    - [User Stories](#user-stories)
+      - [Deep Graph Optimization](#deep-graph-optimization)
+      - [Optimize Dynamic Neural Network](#optimize-dynamic-neural-network)
+  - [Design Details](#design-details)
+    - [Test Plan](#test-plan)
+  - [Implementation History](#implementation-history)
+  - [Drawbacks](#drawbacks)
+  - [Alternatives](#alternatives)
+  - [References](#references)
+
+## Summary
+
+AKG is an operator optimizer for deep learning networks. It provides the ability to automatically fuse ops that match specific patterns, and it works with MindSpore-GraphKernel to improve the performance of networks running on different hardware backends.
+
+## Motivation
+
+Fusion can significantly improve the performance of deep learning networks. The fusion pattern varies across networks and may even change within the same network when hyperparameters change, so it is impractical to cover all fused operators manually ahead of time. GraphKernel analyzes the graph and finds fusion opportunities according to pre-designed patterns, and AKG generates high-performance target code for these patterns on different hardware backends.
+
+### Goals
+
+- Provide the ability to fuse operators with specific patterns in ResNet-50 and BERT.
+- Provide the ability to automatically generate high-performance target code for these patterns on different hardware backends.
+
+### Non-Goals
+
+- None
+
+## Proposal
+
+AKG aims to generate high-performance target code for operators fused according to specific patterns on different hardware backends. To do so, AKG contains three basic processes:
+- **Operator expression.**
+
+  AKG defines a set of basic operators that can be composed into a complicated fused operator. These basic operators have the same granularity as MindSpore's IR. The relation between the basic operators inside one fused operator is expressed in JSON, which keeps the dependency between MindSpore and AKG weak (a hypothetical sketch of such a description is given after this list).
+
+- **Schedule initialization based on polyhedral compilation.**
+
+  Once AKG obtains the DSL of the operators to be fused, it transforms the operator DSL into a formula IR (currently HalideIR, as in TVM) and then into an isl schedule tree, after which the polyhedral scheduling process begins. With the help of the Pluto algorithm and other optimizations, the schedule tree undergoes transformations such as vectorization, loop tiling, memory promotion and loop distribution, which improve parallelism and data locality.
+
+- **Instruction emission for different hardware from the IR.**
+
+  To generate correct and high-performance code for different hardware, the IR is further optimized per target, including double-buffer optimization, storage rewrite optimization and synchronization injection.
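+
+As a hedged illustration of the operator-expression idea above, the snippet below builds a hypothetical JSON description of a small fused operator (a Mul followed by an Add). The field names (`fused_op`, `input_desc`, `op_desc`, `output_desc`) are illustrative assumptions and are not claimed to match the actual schema exchanged between MindSpore-GraphKernel and AKG.
+
+```python
+# Hypothetical sketch only: this schema is illustrative and NOT guaranteed to
+# match the JSON format actually exchanged between GraphKernel and AKG.
+import json
+
+def make_fused_op_desc():
+    """Describe a composite op computing out = (a * b) + c from basic ops."""
+    return {
+        "fused_op": "mul_add",
+        "input_desc": [
+            {"name": "a", "shape": [16, 16], "dtype": "float16"},
+            {"name": "b", "shape": [16, 16], "dtype": "float16"},
+            {"name": "c", "shape": [16, 16], "dtype": "float16"},
+        ],
+        "op_desc": [
+            {"op": "Mul", "inputs": ["a", "b"], "outputs": ["t0"]},
+            {"op": "Add", "inputs": ["t0", "c"], "outputs": ["out"]},
+        ],
+        "output_desc": [{"name": "out", "shape": [16, 16], "dtype": "float16"}],
+    }
+
+if __name__ == "__main__":
+    # The serialized string is what a graph-level fusion pass could hand to the
+    # kernel generator, keeping the two components loosely coupled.
+    print(json.dumps(make_fused_op_desc(), indent=2))
+```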
+
+### User Stories
+
+#### Deep Graph Optimization
+
+As networks become deeper and larger, there are more opportunities to fuse different operations into one to optimize network performance.
+AKG can auto-generate target code from a composite DSL without a hand-written scheduling procedure.
+After automatic operator fusion and operator re-composition at the graph level, AKG generates high-performance target code for these composite patterns.
+
+#### Optimize Dynamic Neural Network
+
+Networks are exhibiting more and more dynamism, especially in deep graph analysis and NLP.
+Tensors in a model may have dynamic shapes (batch size, image size, sequence length, etc.), and models are expressed with control flow such as recursion, conditionals and loops.
+For these dynamic requirements, AKG can generate one general piece of target code on Davinci (or other) hardware that serves different shapes of a common operator.
+
+## Design Details
+
+AKG consists of four basic optimization modules: normalization, auto schedule, instruction emit and backend optimization.
+- **normalization.** The main normalization optimizations include the three-address transform, common subexpression elimination, copy propagation and so on.
+- **auto schedule.** The auto-schedule module mainly performs vectorization, loop tiling, memory promotion and loop distribution; a simplified tiling sketch is given at the end of this section.
+- **instruction emit.** The instruction-emitting module covers loop normalization, auto pragma and instruction emission.
+- **backend optimization.** The backend optimization module consists of double-buffer optimization, storage rewrite optimization and synchronization injection.
+
+When GraphKernel is enabled, ops are reconstructed at the graph level. The new ops, described in JSON, are translated into DSL in AKG and then compiled to the target binary.
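+
+As a simplified, framework-independent illustration of the loop tiling mentioned above, the sketch below rewrites a naive elementwise kernel into a tiled form that processes one block at a time. The tile size of 32 and the Python formulation are assumptions made for clarity; AKG derives such transformations automatically from the isl schedule tree rather than by hand.
+
+```python
+# Simplified sketch of loop tiling; illustrative only, not AKG's implementation.
+def add_naive(a, b, out, n):
+    # Original schedule: one flat loop, poor locality for large n.
+    for i in range(n):
+        out[i] = a[i] + b[i]
+
+def add_tiled(a, b, out, n, tile=32):
+    # Tiled schedule: the outer loop walks tiles, the inner loop stays inside
+    # one tile; copying a tile into a local buffer mimics memory promotion.
+    for start in range(0, n, tile):
+        end = min(start + tile, n)
+        local_a = a[start:end]
+        local_b = b[start:end]
+        for i in range(end - start):
+            out[start + i] = local_a[i] + local_b[i]
+
+if __name__ == "__main__":
+    n = 100
+    a, b = list(range(n)), list(range(n))
+    out1, out2 = [0] * n, [0] * n
+    add_naive(a, b, out1, n)
+    add_tiled(a, b, out2, n)
+    assert out1 == out2  # both schedules compute the same result
+```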
+
+### Test Plan
+
+AKG uses pytest and nose to launch the testing process, and there are three types of testing strategies in AKG:
+
+- **Unit test.** Every optimization or pass in AKG has its own unit tests.
+
+- **System test.** The AKG module has its own component testing. Basically, we classify the testing into compilation verification, function verification and performance testing.
+
+- **Integration test or API test.** AKG provides a number of APIs to MindSpore-GraphKernel, so the integration tests have to make sure that the fused patterns meet the requirements for both correctness and performance.
+
+## Implementation History
+
+- Support auto scheduling and auto tuning
+- Support auto pragma optimization, alignment optimization and automatic instruction emission
+- Support auto tiling optimization
+- Support ToThreeAddr and CSE optimization for auto-diff
+- Support dynamic shape for ResNet inference
+- Enhance fused operator performance for deep graph optimization
+
+## Drawbacks
+
+- The schedule generated directly by the Pluto algorithm during the polyhedral process can have correctness and performance issues in some scenarios, so some extra passes have to be added before emitting instructions.
+
+## Alternatives
+
+- Both TVM[1] and TC[2] are outstanding tools that can automatically synthesize high-performance machine learning kernels. However, neither of them can generate code for Davinci cores (CCE code), since Davinci cores have a more complicated multi-level memory design (L0-A/B/C, L1 and UB) as well as specific dataflow constraints. Besides, TVM adopts a schedule-space model in which schedules have to be written by hand, while AKG uses polyhedral techniques to initialize the schedule automatically, a design inspired by TC.
+
+## References
+- [1] https://github.com/apache/incubator-tvm
+- [2] https://github.com/facebookresearch/TensorComprehensions
diff --git a/design/meps/akg-design.png b/design/meps/mep-akg/akg-design.png
similarity index 100%
rename from design/meps/akg-design.png
rename to design/meps/mep-akg/akg-design.png
diff --git a/design/meps/mep-mm/MEP-MM.md b/design/meps/mep-mm/MEP-MM.md
new file mode 100644
index 0000000000000000000000000000000000000000..49ef5ba87c207acfbd6e641a1bfade4667ba1f72
--- /dev/null
+++ b/design/meps/mep-mm/MEP-MM.md
@@ -0,0 +1,108 @@
+| title | authors | owning-sig | participating-sigs | status | creation-date | reviewers | approvers | stage | milestone |
+| ----- | ------- | ---------- | ------------------ | ------ | ------------- | ---------- | --------- | ----- | --------- |
+| MEP-MM | @helloyesterday @jz_90 @yang_lijiang | wg-mm | sig-compiler, sig-executor | provisional | 2020-08-20 | TBD | TBD | NA | beta: "v0.7" |
+
+# MEP-MM: MindSpore Molecular Modeling WG
+
+## Table of Contents
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+  - [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (optional)](#infrastructure-needed-optional)
+- [References (optional)](#references-optional)
+
+## Summary
+
+The MindSpore Molecular Modeling WG aims to build community collaboration around applying the deep learning framework to molecular modeling and simulation.
+
+## Motivation
+
+Deep learning is transforming many areas in science, and it has great potential in modeling molecular systems.
+However, unlike the mature deployment of deep learning in computer vision and natural language processing, its development in molecular modeling and simulations is still at an early stage, largely because the inductive biases of molecules are completely different from those of images or texts.[0]
+
+### Goals
+
+The goals of the Molecular Modeling WG are as follows:
+- Provide MM-specific requirements to related SIGs for each release cycle, and monitor their implementation via labeled issues/PRs.
+- Provide documentation on MM support in MindSpore.
+- Incubate SIGs responsible for developing MM-specific libraries when necessary and approved by the TSC.
+
+### Non-Goals
+
+- A full-stack MM software implementation is out of scope for this WG.
+
+## Proposal
+
+### User Stories
+
+For deep learning models and algorithms that handle molecules as geometric entities in 3D Euclidean space, the (fully connected) graph is a reasonable data structure for representing the 3D geometry of molecules, and the deep molecular models dealing with 3D structures are mostly physics-based and can be used to encode or predict configuration-dependent molecular properties.
+
+From a relatively orthogonal viewpoint, individual molecules (e.g., small organic compounds and structured macromolecules) can also be regarded as other data structures, for instance sequences (e.g., SMILES strings or sequences of amino acids) or sparse molecular graphs, etc. Deep learning models based on these data structures are mostly information-based; they can help researchers navigate the chemical space and change the way data mining is done in cheminformatics. As demonstrated by AlphaFold, we believe that combining physics-based deep molecular models with information-based deep learning methods will bring about new solutions to many long-standing problems in physics, chemistry, and biology.
+
+Last but not least, in order to democratize deep learning for molecular modeling and simulations, it is necessary to develop special-purpose hardware and software suitable for fast and user-friendly computation. For example, the molecular simulation community will definitely benefit from the auto-differentiation and parallel computation techniques that are the bedrock of modern large-scale deep learning. Also, as machine learners and molecular simulation practitioners usually work on different platforms, a proper linker or interface that can integrate molecular modeling software and deep learning software is highly desired. Indeed, researchers from both scientific and business communities, such as Google, are making efforts to this end.
+
+With the improvement of the infrastructure, deep learning is expected to bring a larger impact and more opportunities to molecular modeling and simulations in the near future.
+
+### Notes/Constraints/Caveats (optional)
+
+NA
+
+### Risks and Mitigations
+
+NA
+
+## Design Details
+
+NA
+
+### Test Plan
+
+NA
+
+### Graduation Criteria
+
+NA
+
+### Upgrade / Downgrade Strategy
+
+NA
+
+## Implementation History
+
+NA
+
+## Drawbacks
+
+NA
+
+## Alternatives
+
+NA
+
+## Infrastructure Needed (optional)
+
+Ascend hardware resources are needed.
+
+## References (optional)
+
+[0] A Perspective on Deep Learning for Molecular Modeling and Simulations.
https://dx.doi.org/10.1021/acs.jpca.0c04473 diff --git a/design/meps/mep-mm/mm-motivation.png b/design/meps/mep-mm/mm-motivation.png new file mode 100644 index 0000000000000000000000000000000000000000..cb7eef0f614a875af7926686922a7376ca29283d Binary files /dev/null and b/design/meps/mep-mm/mm-motivation.png differ diff --git a/design/meps/mep-mm/mm-usecase.png b/design/meps/mep-mm/mm-usecase.png new file mode 100644 index 0000000000000000000000000000000000000000..9e9c00f73b2bb83381b98b23bea5626daf826d32 Binary files /dev/null and b/design/meps/mep-mm/mm-usecase.png differ diff --git a/design/meps/MEP-MSLITE.md b/design/meps/mep-mslite/MEP-MSLITE.md similarity index 78% rename from design/meps/MEP-MSLITE.md rename to design/meps/mep-mslite/MEP-MSLITE.md index 035d13b71e30e83983a63e8bd3ac9051ac39b76b..2425697543c3c50a9e20e0d3da8906af3a2cca27 100644 --- a/design/meps/MEP-MSLITE.md +++ b/design/meps/mep-mslite/MEP-MSLITE.md @@ -1,135 +1,126 @@ - -| title | authors | owning-sig | participating-sigs | status | creation-date | reviewers | approvers | stage | milestone | -| ------- | -------------------------------- | ---------- | ------------------ | ----------- | ------------- | --------- | --------- | ----- | ------------- | -| MEP-mslite | @zhengli  @zhiqiangzhai @chaijun | mslite | | provisional | 2020-08-18 | | TBD | beta | beta : "v0.7" | - -# MEP-MSLITE: MindSpore Lite - -## Table of Contents - - - -- [MEP-MSLITE: MindSpore Lite](#mep-mslite-mindspore-lite) - - [Table of Contents](#table-of-contents) - - [Summary](#summary) - - [Motivation](#motivation) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Proposal](#proposal) - - [User Stories](#user-stories) - - [Generate a compact target model and low latency and low consumption runtime](#generate-a-compact-target-model-and-low-latency-and-low-consumption-runtime) - - [Design Details](#design-details) - - [Test Plan](#test-plan) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - - [References](#references-optional) - - -## Summary -MindSpore(MS) Lite is an extremely light-weight deep learning inference framework, -and designed for smart-phones and embedded devices, such as watches, headsets, and various IoT devices. -It supports Android and iOS, as well as Harmony os, and has industry leading performance. - -## Motivation -Since increased computing power and sensor data, intelligence is moving towards edge devices. Improved AI algorithms are driving the trend towards machine learning be run on the end device, such as smart-phones or automobiles, rather than in the cloud. -On-device AI can dramatically reduce latency, conserve bandwidth, improve privacy and enable smarter applications. - -### Goals -- Compatibility: supports MindSpore model, as well as mainstream third-party models, such as TensorFlow Lite, Caffe 1.0 and ONNX. -- High-performance: -generates small, low power consumption and fast inference target model for various hardware backends. - -- Versatility: supports Harmony, Android and iOS os. -- Light-weight: small shared library size, should be less than 1MB, and could be easily deployed on -resource limited devices. - -### Non-Goals -- None - -## Proposal - -MS Lite consists of converter and a runtime library. -The converter is an offline tool can handle most of the model translation work. -The runtime library deploys to device and executes online, -it has Lite RT and Lite Micro two modes. 
-Lite RT is for slightly resource limited devices, such as smart-phones, -while Lite Micro is for extremely resource limited devices, such as watches, headsets. - -- Compatibility - - provides an abundant of operator parsers for MindSpore, TensorFlow Lite, Caffe, ONNX, - and supports common neural networks in CV and NLP, 208+ CPU operators, and 60+ GPU operators. - -- High performance - - Many optimization methods, including graph optimizations, post training quantization, - are applied to model in offline converter, and generated target model is more compact. - Graph optimizations, such as operator fusion and constant folding, make model more compact. - Post training quantization transfers fp32 model into fix-point int8 model. - It brings nearly 4x smaller model size, low latency and low consumption for inference process. - MS Lite also applies a variety of optimization schemes to NN operations, including using Winograd -algorithm in convolution and deconvolution, Strassen algorithm in matrix multiplication. -Operations support fp64, fp32, fp16 and int8, and are highly optimized with acceleration by -neon instructions, hand-written assemble, multi-thread, memory reuse, heterogeneous computing, etc. - -- Versatility - - Supports Harmony, iOS and Android os, supports smart-phones, watches, headsets, and various IoT devices. - -- Light weight - - MS Lite is highly Optimized under GHLO and GLLO. It has small foot-print, - MS Lite runtime is about 800 kB, and MS Micro is less than 200 KB. - It is flexible and can easily deploy to mobile and a variety of embedded devices. -### User Stories - -#### Generate a compact target model and low latency and low consumption runtime - -Since devices has limited resource with few ROM, RAM, and power, how to deploy AI model to -device is very challenge. MS Lite aims to solve the challenge for users, and provides user-friendly, -flexible tool to help users to make their own models more slim and more efficiency. - -## Design Details - -MS Lite consists of converter and runtime. -The converter is an offline tool has three parts, frontend, IR, and backend. -Runtime deploys to device and executes online. - -- **Frontend.** Frontend aims to parse model from MindSpore, TensorFlow Lite, Caffe and ONNX in protobuf. -- **IR.** IR is to define ANF, including tensor, operations, and graph. -- **Backend.** Backend is an optimizer based ANF graph, including GHLO, GLLO, and quantization. `GHLO` is short for "graph high level optimization", common optimization methods, such as operators fusion, operator substitution, and constant folding, are included. `GLLO` is short for "graph low level optimization", low level optimization methods are related to hardware, such as layout adjustment, mixed-precision, etc. -- **Runtime.** Runtime has Lite RT and Lite Micro two modes. - - - -### Test Plan - -MS Lite employed pytests and nosetest to launch the testing process, -and there are two types of testing strategies in MS Lite: - -- **Unit Test.** Every operation, optimization or pass in MS has its own unitest. - -- **System test**. The ms Lite module has its own component testing. -Basically we classify the testing into compilation verification, -function verification and performance testing. - -## Implementation History -- Support high and low level graph optimization. -- Support post training quantization. -- Support Arm CPU and Mali GPU. -- Support fp64, fp32, fp16, int8 operations. - -## Drawbacks -- MS Lite does not support on-device training yet, it is coming soon... 
-
-## Alternatives
-- MNN[1], TF Lite[2] and TNN[3] are outstanding on-device AI frameworks.
-MS Lite is for on-device AI, and MS cloud is for on-cloud AI,
-both of them are in scope of Huawei's MindSpore AI framework.
-They share same IR, and optimization passes. MS Lite is more flexible.
-
-## References
-- [1] https://github.com/alibaba/MNN
-- [2] https://www.tensorflow.org/lite
-- [3] https://github.com/Tencent/TNN
+
+| title | authors | owning-sig | participating-sigs | status | creation-date | reviewers | approvers | stage | milestone |
+| ------- | -------------------------------- | ---------- | ------------------ | ----------- | ------------- | --------- | --------- | ----- | ------------- |
+| MEP-mslite | @zhengli @zhiqiangzhai @chaijun | mslite | | provisional | 2020-08-18 | | TBD | beta | beta : "v0.7" |
+
+# MEP-MSLITE: MindSpore Lite
+
+## Table of Contents
+
+- [MEP-MSLITE: MindSpore Lite](#mep-mslite-mindspore-lite)
+  - [Table of Contents](#table-of-contents)
+  - [Summary](#summary)
+  - [Motivation](#motivation)
+    - [Goals](#goals)
+    - [Non-Goals](#non-goals)
+  - [Proposal](#proposal)
+    - [User Stories](#user-stories)
+      - [Generate a compact target model and low latency and low consumption runtime](#generate-a-compact-target-model-and-low-latency-and-low-consumption-runtime)
+  - [Design Details](#design-details)
+    - [Test Plan](#test-plan)
+  - [Implementation History](#implementation-history)
+  - [Drawbacks](#drawbacks)
+  - [Alternatives](#alternatives)
+  - [References](#references)
+
+## Summary
+MindSpore (MS) Lite is an extremely light-weight deep learning inference framework designed for smartphones and embedded devices such as watches, headsets, and various IoT devices.
+It supports Android and iOS, as well as HarmonyOS, and has industry-leading performance.
+
+## Motivation
+With increasing computing power and sensor data, intelligence is moving towards edge devices. Improved AI algorithms are driving the trend of running machine learning on end devices, such as smartphones or automobiles, rather than in the cloud.
+On-device AI can dramatically reduce latency, conserve bandwidth, improve privacy and enable smarter applications.
+
+### Goals
+- Compatibility: supports MindSpore models, as well as mainstream third-party models such as TensorFlow Lite, Caffe 1.0 and ONNX.
+- High performance: generates small, low-power, fast-inference target models for various hardware backends.
+- Versatility: supports HarmonyOS, Android and iOS.
+- Light weight: small shared library size (less than 1 MB) that can be easily deployed on resource-limited devices.
+
+### Non-Goals
+- None
+
+## Proposal
+
+MS Lite consists of a converter and a runtime library.
+The converter is an offline tool that handles most of the model translation work.
+The runtime library is deployed to the device and executes online, in one of two modes: Lite RT and Lite Micro.
+Lite RT targets moderately resource-limited devices such as smartphones, while Lite Micro targets extremely resource-limited devices such as watches and headsets.
+
+- Compatibility
+
+  Provides a rich set of operator parsers for MindSpore, TensorFlow Lite, Caffe and ONNX, and supports common neural networks in CV and NLP with 208+ CPU operators and 60+ GPU operators.
+
+- High performance
+
+  Many optimization methods, including graph optimizations and post-training quantization, are applied to the model in the offline converter, so the generated target model is more compact.
+  Graph optimizations, such as operator fusion and constant folding, make the model more compact.
+  Post-training quantization converts an fp32 model into a fixed-point int8 model, bringing a nearly 4x smaller model size as well as lower latency and power consumption during inference (a minimal sketch of this fp32-to-int8 mapping follows this list).
+  MS Lite also applies a variety of optimization schemes to NN operations, including the Winograd algorithm for convolution and deconvolution and the Strassen algorithm for matrix multiplication.
+  Operations support fp64, fp32, fp16 and int8, and are highly optimized and accelerated with NEON instructions, hand-written assembly, multi-threading, memory reuse, heterogeneous computing, etc.
+
+- Versatility
+
+  Supports HarmonyOS, iOS and Android, on smartphones, watches, headsets, and various IoT devices.
+
+- Light weight
+
+  MS Lite is highly optimized by GHLO and GLLO. It has a small footprint: the MS Lite runtime is about 800 KB, and MS Micro is less than 200 KB.
+  It is flexible and can easily be deployed to mobile and a variety of embedded devices.
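+
+To make the quantization claim above concrete, here is a minimal sketch of the fp32-to-int8 mapping behind post-training quantization, assuming a simple symmetric, per-tensor scheme; MS Lite's actual calibration and quantization algorithms are more involved, so treat this as an illustration only.
+
+```python
+# Minimal sketch of symmetric per-tensor int8 quantization; illustrative only,
+# not MS Lite's actual post-training-quantization implementation.
+def quantize_int8(weights):
+    """Map a list of fp32 weights to int8 values plus a scale factor."""
+    max_abs = max(abs(w) for w in weights) or 1.0
+    scale = max_abs / 127.0
+    q = [max(-128, min(127, round(w / scale))) for w in weights]
+    return q, scale
+
+def dequantize_int8(q, scale):
+    return [v * scale for v in q]
+
+if __name__ == "__main__":
+    fp32 = [0.02, -1.3, 0.75, 0.0, 1.27]
+    q, scale = quantize_int8(fp32)
+    # Each weight now takes 1 byte instead of 4, which is where the roughly
+    # 4x model-size reduction mentioned above comes from.
+    print(q, scale, dequantize_int8(q, scale))
+```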
+
+### User Stories
+
+#### Generate a compact target model and low latency and low consumption runtime
+
+Since devices have limited ROM, RAM, and power, deploying AI models to devices is very challenging. MS Lite aims to solve this challenge for users and provides a user-friendly, flexible tool that helps users make their own models slimmer and more efficient.
+
+## Design Details
+
+MS Lite consists of a converter and a runtime.
+The converter is an offline tool with three parts: frontend, IR, and backend.
+The runtime is deployed to the device and executes online.
+
+- **Frontend.** The frontend parses models from MindSpore, TensorFlow Lite, Caffe and ONNX in protobuf.
+- **IR.** The IR defines ANF, including tensors, operations, and the graph.
+- **Backend.** The backend is an optimizer over the ANF graph, including GHLO, GLLO, and quantization. `GHLO` is short for "graph high-level optimization" and includes common optimization methods such as operator fusion, operator substitution, and constant folding (a toy illustration of such a rewrite is sketched after this list). `GLLO` is short for "graph low-level optimization" and covers hardware-related optimizations such as layout adjustment, mixed precision, etc.
+- **Runtime.** The runtime has two modes: Lite RT and Lite Micro.
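+
+To make the GHLO step more concrete, the sketch below runs a toy constant-folding pass over an invented list-of-nodes graph representation. The `Node` structure and the pass are assumptions made for illustration and do not mirror MS Lite's real ANF IR or passes.
+
+```python
+# Toy constant-folding pass over an invented graph representation; it only
+# illustrates the idea of a GHLO-style rewrite, not MS Lite's real passes.
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+@dataclass
+class Node:
+    name: str
+    op: str                                  # "const", "add", "mul" or "input"
+    inputs: List[str] = field(default_factory=list)
+    value: Optional[float] = None            # set for "const" nodes
+
+def fold_constants(graph):
+    """Replace add/mul nodes whose inputs are all constants with const nodes."""
+    by_name = {n.name: n for n in graph}
+    for node in graph:
+        if node.op in ("add", "mul"):
+            ins = [by_name[i] for i in node.inputs]
+            if all(i.op == "const" for i in ins):
+                vals = [i.value for i in ins]
+                node.value = sum(vals) if node.op == "add" else vals[0] * vals[1]
+                node.op, node.inputs = "const", []
+    return graph
+
+if __name__ == "__main__":
+    g = [
+        Node("c0", "const", value=2.0),
+        Node("c1", "const", value=3.0),
+        Node("t0", "mul", inputs=["c0", "c1"]),   # foldable at convert time
+        Node("x", "input"),
+        Node("y", "add", inputs=["x", "t0"]),     # kept, one input is dynamic
+    ]
+    print([(n.name, n.op, n.value) for n in fold_constants(g)])
+```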
+
+### Test Plan
+
+MS Lite uses pytest and nose to launch the testing process, and there are two types of testing strategies in MS Lite:
+
+- **Unit test.** Every operation, optimization or pass in MS Lite has its own unit tests.
+
+- **System test.** The MS Lite module has its own component testing.
+Basically, we classify the testing into compilation verification, function verification and performance testing.
+
+## Implementation History
+- Support high- and low-level graph optimization.
+- Support post-training quantization.
+- Support Arm CPU and Mali GPU.
+- Support fp64, fp32, fp16 and int8 operations.
+
+## Drawbacks
+- MS Lite does not support on-device training yet; it is coming soon.
+
+## Alternatives
+- MNN[1], TF Lite[2] and TNN[3] are outstanding on-device AI frameworks.
+MS Lite is for on-device AI and MS cloud is for on-cloud AI; both are part of Huawei's MindSpore AI framework.
+They share the same IR and optimization passes, and MS Lite is more flexible.
+
+## References
+- [1] https://github.com/alibaba/MNN
+- [2] https://www.tensorflow.org/lite
+- [3] https://github.com/Tencent/TNN
diff --git a/design/meps/ms-lite-arch.jpg b/design/meps/mep-mslite/ms-lite-arch.jpg
similarity index 100%
rename from design/meps/ms-lite-arch.jpg
rename to design/meps/mep-mslite/ms-lite-arch.jpg