[XLA:GPU] Elide tuple roots of the entry computation
The tuple buffer is never read, so stop emitting code to fill it. A typical root tuple consists of a H2D memcpy and a host callback, both of which are somewhat slow. This helps tiny models and inference benchmarks, where the host/device syncs can be a significant part of the runtime of the entire computation. PiperOrigin-RevId: 216968475
Showing
想要评论请 注册 或 登录