Created by: panyx0718
Timeline has been a go-to place for TF developers doing performance profiling. It visualizes multi-device execution as separate time series. Arrows can be generated for cross-device data transfers. Additional features, such as showing memory allocations and deallocations, are also very useful.
This is the first PR for the timeline feature.
- It collects CUDA kernel execution stats with user-defined names.
- It stores the stats in a proto for future analysis.
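
To illustrate the collect-then-store flow described above, here is a minimal sketch in Python. It is not the code in this PR: the `KernelStat` and `KernelStatsLog` names, the field names, and the use of JSON as a stand-in for the proto are all assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List


@dataclass
class KernelStat:
    """One kernel execution record (hypothetical schema, not the PR's proto)."""
    name: str         # user-defined kernel name
    device: str       # device the kernel ran on
    start_us: int     # start timestamp in microseconds
    duration_us: int  # execution time in microseconds


@dataclass
class KernelStatsLog:
    """Accumulates kernel stats and serializes them for later analysis."""
    stats: List[KernelStat] = field(default_factory=list)

    def record(self, name: str, device: str, start_us: int, end_us: int) -> None:
        # Reject intervals that end before they start.
        if end_us < start_us:
            raise ValueError("end_us must not be earlier than start_us")
        self.stats.append(KernelStat(name, device, start_us, end_us - start_us))

    def serialize(self) -> str:
        # JSON stands in for the protobuf wire format in this sketch.
        return json.dumps({"stats": [asdict(s) for s in self.stats]})

    @classmethod
    def deserialize(cls, payload: str) -> "KernelStatsLog":
        data = json.loads(payload)
        return cls([KernelStat(**s) for s in data["stats"]])
```

A caller would invoke `record("MatMul", "/gpu:0", start, end)` once per kernel and call `serialize()` at the end of the run; the kernel and device names here are placeholders.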
Near-term next steps:
- Do more cleanup and collect other CUDA events such as memcpy.
- Generate the timeline visualization from the protobuf.
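
As a rough idea of what the visualization step could look like, the sketch below converts stored kernel stats into the Chrome trace event format, which can be loaded in `chrome://tracing`. The choice of that output format, the `to_chrome_trace` function, and the input field names are assumptions for illustration; this PR does not define them.

```python
import json


def to_chrome_trace(stats):
    """Convert kernel stat dicts into a Chrome trace event JSON string.

    Each input dict is assumed (hypothetically) to have the keys
    "name", "device", "start_us", and "duration_us".
    """
    device_ids = {}  # maps device name -> integer pid, one row group per device
    events = []
    for stat in stats:
        pid = device_ids.setdefault(stat["device"], len(device_ids))
        events.append({
            "name": stat["name"],
            "ph": "X",                   # "X" marks a complete event with a duration
            "ts": stat["start_us"],      # timestamps are in microseconds
            "dur": stat["duration_us"],
            "pid": pid,
            "tid": 0,
        })
    return json.dumps({"traceEvents": events})
```

Mapping each device to its own `pid` keeps every device on a separate row, matching the one-time-series-per-device view described in the summary.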