diff --git a/paddle/fluid/operators/jit/README.en.md b/paddle/fluid/operators/jit/README.en.md
index 8670ec2ff28ac8353217e0ee2f8c9b784e488ac7..7d4dc6d47a512ee7ed75d99800968a38de98f090 100644
--- a/paddle/fluid/operators/jit/README.en.md
+++ b/paddle/fluid/operators/jit/README.en.md
@@ -1,7 +1,7 @@
 # JIT Kernel
 
 JIT(Just In Time) Kernel contains actually generated code and some other implemenations with the same logic.
-Each implementations has its own condition to use, defined in `UseMe`.
+Each implementation has its own condition to use, defined in `CanBeUsed`.
 They are combined together to get the best performance of one single independent function.
 They could be some very simple functions like vector multiply, or some complicated functions like LSTM.
 And they can be composed with some other exited jit kernels to build up a complex function.
@@ -42,35 +42,62 @@ All basical definations of jit kernels are addressed in `paddle/fluid/operators/
 
 ## How to use
 
-One simple function `jit::Get`, which is very easy to use, is supported to get the kernel.
-It can automatically return the expected function with best performance under the given attributes.
-All kernels are inlcuded in `paddle/fluid/operators/jit/kernels.h`, you can only include this one header to get all the registered kernels.
+We present these methods to get the functions:
+- `GetAllCandidateFuncs`. It returns all the supported implementations. All of them produce the same result, so you can run a runtime benchmark to choose which one should actually be used.
+- `GetDefaultBestFunc`. It returns only one default function pointer, which is tuned offline with some general configurations and attributes. This should cover most situations.
+- `KernelFuncs::Cache()`. It gets the default function and caches it for the next call with the same attribute.
+- `GetReferFunc`. It only gets the reference code on CPU; all other implementations have the same logic as this reference code.
+
+And here are some examples:
+
+Get from cache:
+
+```cpp
+    using T = float;
+    jit::seq_pool_attr_t attr(width, jit::SeqPoolType::kSum);
+    auto seqpool_func = jit::KernelFuncs<jit::SeqPoolTuple<T>, platform::CPUPlace>::Cache().At(attr);
+    seqpool_func(src_data, dst_data, &attr);
+```
+
+Get all implementations and run once:
+
+```cpp
+    using T = float;
+    jit::seq_pool_attr_t attr(width, jit::SeqPoolType::kSum);
+    auto funcs = jit::GetAllCandidateFuncsWithTypes<jit::SeqPoolTuple<T>, platform::CPUPlace>(attr);
+    for (auto f : funcs) {
+      LOG(INFO) << "Kernel implementation type: " << f.first;
+      f.second(src_data, dst_data, &attr);
+    }
+```
+
+All kernels are included in `paddle/fluid/operators/jit/kernels.h`, which is automatically generated at compile time; you only need to include this one header to get all the registered kernels.
 
 ## Solid Test
 
 - Unit Test
     All functions should be compared with the corresponding reference functions, including data tyep `float` and `double`.
 - Benchmark
-    All functions should be tested, and make sure the `jit::Get` function obtain the best performance with all attributes.
+    All functions should be tested, and make sure the `jit::GetDefaultBestFunc` function obtains the best performance with all attributes.
 
 # How to add new kernel
 
 ## Required
 
 1. Add `your_key` at `KernelType`.
-2. Add reference function of `your_key`.
+2. Add your new `KernelTuple`, which must include `your_key`. It should be a combination of the data type, attribute type and function type. You can refer to `SeqPoolTuple`.
+3. Add reference function of `your_key`.
 Note:
     - this should be run on CPU and do not depend on any third-party.
     - Add `USE_JITKERNEL_REFER(your_key)` in `refer/CmakeLists.txt` to make sure this code can be used.
-3. Add unit test in `test.cc`, and verfiy at least `float` and `double`.
+4. Add unit test in `test.cc`, and verify at least `float` and `double`.
 Test more data type for some special functions if necessary, for example `int8`.
-4. Add functions in `benchmark.cc` to test all function of same `KernelType`. Make sure `jit::Get` always get the best one.
+5. Add functions in `benchmark.cc` to test all functions of the same `KernelType`. Make sure `GetDefaultBestFunc` always gets the best one.
 
 ## Optional
 
 Add more implementations of `your_kery` for performance enhancement.
 
-1. Add functions based on generated code in `gen`. It should be derived from `JitCode` and should have corepsonding creator from `JitCodeCreator` which will be registered on the `your_key`.
-Note: Add new `KernelTuples` if necessary,your can refer to `XYZNTuples`.
-Specialie method `JitCodeKey` when add new attribute type。
-2. Add more functions in `more`,you can use any third party you wish, like mkl, mkldnn or intrinsic code to reach the best performance.
+1. Add functions based on generated code in `gen`. It should be derived from `JitCode` and should have a corresponding creator from `JitCodeCreator`, which will be registered on `your_key`.
+2. If a new attribute type is added, you should specialize `JitCodeKey` for this type.
+3. Add more functions in `more`. You can use any third party you wish, like mkl, mkldnn or intrinsic code, to reach the best performance.
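diff --git a/paddle/fluid/operators/jit/README.md b/paddle/fluid/operators/jit/README.md
index cc19f09f56ddf6a7c74d6605ab3f1bd059f19bb8..770548c5260f73f038f52e0b06b77ba698851997 100644
--- a/paddle/fluid/operators/jit/README.md
+++ b/paddle/fluid/operators/jit/README.md
@@ -1,7 +1,7 @@
 # JIT Kernel
 
 Combines function templates and JIT to generate the required kernel functions.
-The kernels here are operator units at a finer granularity than the kernels in an Operator, with more focus on performance across different hardware. Multiple third-party library implementations may exist, and each implementation has its own `UseMe` function that decides under which conditions it can be called.
+The kernels here are operator units at a finer granularity than the kernels in an Operator, with more focus on performance across different hardware. Multiple third-party library implementations may exist, and each implementation has its own `CanBeUsed` function that decides under which conditions it can be called.
 The functions implemented here can be very fine-grained, such as Vector MUL, or complex logic such as LSTM. Complex logic can also be assembled from the library's own lower-level functions.
 Currently only high-performance computation on CPU is supported.
 
@@ -39,27 +39,55 @@ PaddlePaddle/Paddle/paddle/fluid/
 
 ## Dynamic retrieval
 
-Provides a `jit::Get` method that retrieves a kernel by kernel type. Each implementation has its own range of use, and the required kernel function is selected dynamically based on that range and the current conditions.
+- Provides the `GetAllCandidateFuncs` method, which returns all function implementations that satisfy the requirements for the given kernel type. All implementations are guaranteed to produce the same result but differ in speed, so you can benchmark at runtime for the specific input attribute sizes and manually choose the best function.
+- Provides the `GetDefaultBestFunc` method, which returns a default best implementation. It is the result of offline tuning with some common configurations and gives the best result in most cases.
+- Provides the `KernelFuncs::Cache()` method, which returns the default best function and caches its function pointer. When the same attribute appears again, the cached pointer is returned directly; if no entry exists, a new one is created for that attribute.
+- Provides the `GetReferFunc` method, which returns the kernel's original reference logic. It is independent of the kernel's input size and attributes, and has exactly one implementation, on CPU. It represents the kernel's original logic, and all other implementations stay consistent with it.
+
+### Examples
+
+To call any kernel, you only need to include `"paddle/fluid/operators/jit/kernels.h"`, which is generated automatically at compile time.
+
+Get the default best function directly from the cache.
+
+```cpp
+    using T = float;
+    jit::seq_pool_attr_t attr(width, jit::SeqPoolType::kSum);
+    auto seqpool_func = jit::KernelFuncs<jit::SeqPoolTuple<T>, platform::CPUPlace>::Cache().At(attr);
+    seqpool_func(src_data, dst_data, &attr);
+```
+
+Run every implementation once and print its implementation type.
+
+```cpp
+    using T = float;
+    jit::seq_pool_attr_t attr(width, jit::SeqPoolType::kSum);
+    auto funcs = jit::GetAllCandidateFuncsWithTypes<jit::SeqPoolTuple<T>, platform::CPUPlace>(attr);
+    for (auto f : funcs) {
+      LOG(INFO) << "Kernel implementation type: " << f.first;
+      f.second(src_data, dst_data, &attr);
+    }
+```
 
 ## Testing
 
 - Logic test
     All implementations must be compared against the refer code and meet the precision requirements, for both the float and double data types.
 - Performance test
-    Compare the performance of all implementations, and compare them against the final `jit::Get` method; the function it returns must be the fastest under all conditions.
+    Compare the performance of all implementations, and compare them against the final `jit::GetDefaultBestFunc` method; the function it returns must be the fastest under all conditions.
 
 # How to add a new operator
 
-- Add `your_key` to `KernelType`.
-- Implement the reference logic. It must be a CPU implementation and must not depend on any third-party library. After implementing it, add `USE_JITKERNEL_REFER(your_key)` in `refer/CmakeLists.txt` to use the kernel.
-- (optional) Implement more algorithms under the `more` directory; these may depend on third-party libraries such as mkl, intrinsic or mkldnn.
-- (optional) Implement Xbyak-based generated code under the `gen` directory. The jitcode needs its own `JitCodeCreator`, registered on the same `KernelType` as the refer implementation.
-- Add a new `KernelTuples` when necessary, referring to `XYZNTuples`; a newly added Attr type needs a specialization of the `JitCodeKey` method.
-- Add unit tests in `test.cc`, covering at least the `float` and `double` data types; support additional data types where necessary, for example functions for `int8`.
-- Add the corresponding performance comparison in `benchmark.cc`. All implementations of the same kernel must be compared, and the implementation returned by `jit::Get` must always be the fastest.
+1. Add `your_key` to `KernelType`.
+2. Implement the reference logic. It must be a CPU implementation and must not depend on any third-party library. After implementing it, add `USE_JITKERNEL_REFER(your_key)` in `refer/CmakeLists.txt` to use the kernel.
+3. (optional) Implement more algorithms under the `more` directory; these may depend on third-party libraries such as mkl, intrinsic or mkldnn.
+4. (optional) Implement Xbyak-based generated code under the `gen` directory. The jitcode needs its own `JitCodeCreator`, registered on the same `KernelType` as the refer implementation.
+5. Add a new `KernelTuple`, which must correspond one-to-one with a `KernelType`. It packs all the types together: the data type, the attribute type and the returned function type. You can refer to `SeqPoolTuple`. A newly added Attr type needs a specialization of the `JitCodeKey` method.
+6. Add unit tests in `test.cc`, covering at least the `float` and `double` data types; support additional data types where necessary, for example functions for `int8`.
+7. Add the corresponding performance comparison in `benchmark.cc`. All implementations of the same kernel must be compared, and the implementation returned by `GetDefaultBestFunc` must always be the fastest.
 
 # Advantages
-- A unified Get method with a simple interface.
+- A convenient interface that can be called flexibly.
 - The same logic can have multiple implementations that depend on different third-party libraries without affecting each other.
 - The directory structure is clear, which avoids the poor readability caused by many macro definitions in a single file.
 - Optimization is easy: you can optimize specifically for one attribute without affecting performance under other attributes.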
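
The two ideas this patch documents, benchmarking all candidate implementations at runtime and caching the chosen function per attribute, can be illustrated with a standalone sketch. It does not use Paddle's API: `SumFunc`, `Candidate`, `SumRefer`, `SumUnrolled`, `AllCandidates`, `PickFastest` and `CachedSum` are hypothetical names invented for this illustration, and the input length stands in for the kernel attribute.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical kernel signature (not a Paddle type): sums n floats.
using SumFunc = std::function<float(const float*, std::size_t)>;
using Candidate = std::pair<std::string, SumFunc>;

// Reference implementation: a plain loop with no third-party dependency.
float SumRefer(const float* x, std::size_t n) {
  float s = 0.f;
  for (std::size_t i = 0; i < n; ++i) s += x[i];
  return s;
}

// Alternative implementation of the same logic, unrolled four ways.
float SumUnrolled(const float* x, std::size_t n) {
  float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += x[i];
    s1 += x[i + 1];
    s2 += x[i + 2];
    s3 += x[i + 3];
  }
  for (; i < n; ++i) s0 += x[i];  // leftover elements
  return (s0 + s1) + (s2 + s3);
}

// Returns every implementation together with a label for its type.
std::vector<Candidate> AllCandidates() {
  std::vector<Candidate> candidates;
  candidates.emplace_back("refer", SumRefer);
  candidates.emplace_back("unrolled", SumUnrolled);
  return candidates;
}

// Times each candidate on the given input and returns the fastest one.
Candidate PickFastest(const std::vector<Candidate>& candidates,
                      const std::vector<float>& input, int repeats = 100) {
  assert(!candidates.empty());
  std::size_t best = 0;
  auto best_time = std::chrono::steady_clock::duration::max();
  volatile float sink = 0.f;  // keeps the timed calls from being optimized away
  for (std::size_t c = 0; c < candidates.size(); ++c) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
      sink = candidates[c].second(input.data(), input.size());
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    if (elapsed < best_time) {
      best_time = elapsed;
      best = c;
    }
  }
  (void)sink;
  return candidates[best];
}

// Caches the chosen function per input length, so the benchmark runs only
// once per key. Not thread-safe; this is only a sketch.
const SumFunc& CachedSum(std::size_t n) {
  static std::unordered_map<std::size_t, SumFunc> cache;
  auto it = cache.find(n);
  if (it == cache.end()) {
    const std::vector<float> probe(n, 1.0f);
    it = cache.emplace(n, PickFastest(AllCandidates(), probe).second).first;
  }
  return it->second;
}
```

Calling `CachedSum(n)(data, n)` benchmarks the candidates on the first call for a given `n` and reuses the stored choice afterwards. Which candidate wins depends on the machine and compiler flags, so the sketch makes no claim about the outcome.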