# JIT
An optimization for MegBrain based on just-in-time compilation.
JIT reduces global memory accesses by fusing multiple elemwise kernels into a
single, larger fusion kernel to improve performance.

For some common expressions such as *a * b + c* and *a * b + c * d*, MegBrain
already has the FMA3_FUSE and FMA4_FUSE optimizations. With JIT, MegBrain can now
speed up arbitrary elemwise expressions.

## Benchmark Results
1. a * b * c

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|100% | 150%          |

2. a * b + c

    |        |opt0| opt2(with fma3)| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|150% | 150%          |

3. AlexNet with Adam

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|103% | 114%          |

4. ResNet with Adam, training

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|122% | 124%          |


## What does JIT do
Detecting the subgraphs that can be fused and compiling each such subgraph into a
fusion kernel are the two most important parts of JIT.

The detection is implemented in [impl/fusion_pass.cpp](impl/fusion_pass.cpp);
the main detection logic is in the function *Fusion::Impl::on_opr*. Compared to nnvm
fusion, our fusion logic can fuse more operators into a single fusion kernel.
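
The core idea of the detection can be sketched as a greedy pass that grows fusion
groups out of connected elemwise operators. The Python sketch below is only a
conceptual illustration, not the actual C++ logic of *Fusion::Impl::on_opr*;
`oprs`, `is_elemwise`, and `inputs_of` are hypothetical stand-ins for the
graph-traversal interface:

``` python
def detect_fusion_groups(oprs, is_elemwise, inputs_of):
    """Greedily group connected elemwise operators (conceptual sketch)."""
    group_of = {}  # operator -> fusion group id
    groups = []    # group id -> list of operators
    for opr in oprs:  # oprs is assumed to be in topological order
        if not is_elemwise(opr):
            continue
        # join the group of an already-grouped elemwise producer, if any
        parent = next((i for i in inputs_of(opr) if i in group_of), None)
        gid = group_of[parent] if parent is not None else len(groups)
        if gid == len(groups):
            groups.append([])
        group_of[opr] = gid
        groups[gid].append(opr)
    # every group with more than one operator becomes one fusion kernel
    return [g for g in groups if len(g) > 1]
```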

For now, JIT supports CUDA via HALIDE or NVRTC, CPU via MLIR, and OpenCL via
TINYOPENCL; it also reserves an interface for extending to more platforms.

## How to enable JIT
You can set `graph_opt_level` to 3 to enable JIT.

In Python:
``` python
import megbrain as mgb  # assuming the MegBrain Python binding is importable as `mgb`

cg = mgb.comp_graph()
cg.set_option('graph_opt_level', 3)  # opt level 3 turns on the JIT fusion pass
```

### Selection of Backend

You can set the environment variable `MGB_JIT_BACKEND` to select the JIT backend.

|  Backend   | Platform | Reduction support | Kernel binary cache | Kernel reuse | Noncontiguous input |
|------------|----------|-------------------|---------------------|--------------|---------------------|
| HALIDE     | CUDA     | Yes               | No                  | Shape        | No                  |
| NVRTC      | CUDA     | No                | Via PersistentCache | Bcast type   | Monotone            |
| MLIR       | CPU      | No                | No                  | Kernel hash  | Monotone            |
| TINYOPENCL | OpenCL   | No                | Via OpenCL cache    | Kernel hash  | Monotone            |
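
For example, a minimal sketch of selecting the NVRTC backend from Python; the
environment variable must be set before the graph is compiled, and any backend
name from the table above can be used instead:

``` python
import os

# Choose the JIT backend before the computing graph is built/compiled.
os.environ['MGB_JIT_BACKEND'] = 'NVRTC'  # or 'HALIDE', 'MLIR', 'TINYOPENCL'
```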

To enable fusion of Reduce oprs, set `graph_opt.jit = 2` in graph options.
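
A minimal sketch of enabling Reduce fusion, assuming the option is passed through
the same `set_option` interface as `graph_opt_level` above (the exact spelling may
differ between versions):

``` python
import megbrain as mgb  # assuming the MegBrain Python binding is importable as `mgb`

cg = mgb.comp_graph()
# jit = 2 also allows Reduce oprs to be fused into JIT kernels; the option
# name below assumes the dotted form is accepted by set_option.
cg.set_option('graph_opt.jit', 2)
```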

### Working Directory

JIT may produce temporary files. The default working directory is a temporary
directory; it can be changed via the `MGB_JIT_WORKDIR` environment variable. Set
`MGB_JIT_KEEP_INTERM` to keep intermediate files (such as generated sources and
object files) for debugging.
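
For example, a minimal sketch of redirecting the working directory and keeping the
intermediate files from Python (the path is only illustrative, and the exact value
expected by `MGB_JIT_KEEP_INTERM` is assumed not to matter):

``` python
import os

# Keep generated sources and object files under a known path for inspection.
os.environ['MGB_JIT_WORKDIR'] = '/tmp/mgb_jit_debug'  # illustrative path
os.environ['MGB_JIT_KEEP_INTERM'] = '1'               # keep intermediate files
```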

### Other options

* `MGB_HALIDE_DEBUG`: enable debug printing for the Halide backend.