# JIT
An optimization for MegBrain based on just-in-time compilation.
JIT reduces global memory accesses by fusing multiple elemwise kernels into a
single, larger fusion kernel to improve performance.

For some common expressions such as *a * b + c* and *a * b + c * d*, MegBrain
already has the FMA3_FUSE and FMA4_FUSE optimizations. With JIT, MegBrain can now
speed up arbitrary elemwise expressions.

## Benchmark Results
1. a * b * c

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|100% | 150%          |

2. a * b + c

    |        |opt0| opt2(with fma3)| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|150% | 150%          |

3. AlexNet with Adam

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|103% | 114%          |

4. ResNet with Adam, training

    |        |opt0| opt2| opt3(with jit)|
    |--------|----|-----|---------------|
    |speed   |100%|122% | 124%          |


## What does JIT do
Detecting the subgraphs that can be fused and compiling each such subgraph into a
fusion kernel are the two most important parts of JIT.

The detection is implemented in [impl/fusion_pass.cpp](impl/fusion_pass.cpp);
the main detection logic is in the function *Fusion::Impl::on_opr*. Compared to nnvm
fusion, our fusion logic can fuse more operators into a single fusion kernel.
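
The core idea of the detection can be sketched as a greedy pass that grows fusion
groups out of connected elemwise operators. The Python sketch below is only a
conceptual illustration, not the actual C++ logic of *Fusion::Impl::on_opr*;
`oprs`, `is_elemwise`, and `inputs_of` are hypothetical stand-ins for the
graph-traversal interface:

``` python
def detect_fusion_groups(oprs, is_elemwise, inputs_of):
    """Greedily group connected elemwise operators (conceptual sketch)."""
    group_of = {}  # operator -> fusion group id
    groups = []    # group id -> list of operators
    for opr in oprs:  # oprs is assumed to be in topological order
        if not is_elemwise(opr):
            continue
        # join the group of an already-grouped elemwise producer, if any
        parent = next((i for i in inputs_of(opr) if i in group_of), None)
        gid = group_of[parent] if parent is not None else len(groups)
        if gid == len(groups):
            groups.append([])
        group_of[opr] = gid
        groups[gid].append(opr)
    # every group with more than one operator becomes one fusion kernel
    return [g for g in groups if len(g) > 1]
```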

For now, JIT supports CUDA via HALIDE or NVRTC, CPU via MLIR, and OpenCL via
TINYOPENCL; it also reserves an interface for extending to more platforms.

## How to enable JIT
You can set `graph_opt_level` to 3 to enable JIT.

In Python:
``` python
import megbrain as mgb  # assuming the MegBrain Python binding is importable as `mgb`

cg = mgb.comp_graph()
cg.set_option('graph_opt_level', 3)  # opt level 3 turns on the JIT fusion pass
```

### Selection of Backend

You can set the environment variable `MGB_JIT_BACKEND` to select the JIT backend.

|  Backend   | Platform | Reduction support | Kernel binary cache | Kernel reuse | Noncontiguous input |
|------------|----------|-------------------|---------------------|--------------|---------------------|
| HALIDE     | CUDA     | Yes               | No                  | Shape        | No                  |
| NVRTC      | CUDA     | No                | Via PersistentCache | Bcast type   | Monotone            |
| MLIR       | CPU      | No                | No                  | Kernel hash  | Monotone            |
| TINYOPENCL | OpenCL   | No                | Via OpenCL cache    | Kernel hash  | Monotone            |
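
For example, a minimal sketch of selecting the NVRTC backend from Python; the
environment variable must be set before the graph is compiled, and any backend
name from the table above can be used instead:

``` python
import os

# Choose the JIT backend before the computing graph is built/compiled.
os.environ['MGB_JIT_BACKEND'] = 'NVRTC'  # or 'HALIDE', 'MLIR', 'TINYOPENCL'
```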

To enable fusion of Reduce oprs, set `graph_opt.jit = 2` in graph options.
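
A minimal sketch of enabling Reduce fusion, assuming the option is passed through
the same `set_option` interface as `graph_opt_level` above (the exact spelling may
differ between versions):

``` python
import megbrain as mgb  # assuming the MegBrain Python binding is importable as `mgb`

cg = mgb.comp_graph()
# jit = 2 also allows Reduce oprs to be fused into JIT kernels; the option
# name below assumes the dotted form is accepted by set_option.
cg.set_option('graph_opt.jit', 2)
```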

### Working Directory

JIT may produce temporary files. The default working directory is a temporary
directory; it can be changed via the `MGB_JIT_WORKDIR` environment variable. Set
`MGB_JIT_KEEP_INTERM` to keep intermediate files (such as generated sources and
object files) for debugging.
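
For example, a minimal sketch of redirecting the working directory and keeping the
intermediate files from Python (the path is only illustrative, and the exact value
expected by `MGB_JIT_KEEP_INTERM` is assumed not to matter):

``` python
import os

# Keep generated sources and object files under a known path for inspection.
os.environ['MGB_JIT_WORKDIR'] = '/tmp/mgb_jit_debug'  # illustrative path
os.environ['MGB_JIT_KEEP_INTERM'] = '1'               # keep intermediate files
```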

### Other options

* `MGB_HALIDE_DEBUG`: enable debug printing for the Halide backend.