Commit 0f13eb91, authored by Benjamin Kramer, committed by TensorFlower Gardener

[XLA:GPU] Elide tuple roots of the entry computation

The tuple buffer is never read, so stop emitting code to fill it. Filling a
typical root tuple takes an H2D memcpy and a host callback, both of which are
somewhat slow.

This helps tiny models and inference benchmarks, where the host/device syncs
can be a significant part of the runtime of the entire computation.

PiperOrigin-RevId: 216968475
Parent 3e94e19e
......@@ -1728,6 +1728,14 @@ Status IrEmitterUnnested::HandleReduce(HloInstruction* reduce) {
}
Status IrEmitterUnnested::HandleTuple(HloInstruction* tuple) {
  // For the root node of the entry computation we can elide writing the tuple
  // buffer. We can always figure out the contents of the tuples from buffer
  // assignment because we insert copies to ensure non-ambiguous output buffers.
  // GpuExecutable never reads the tuple buffer.
  if (tuple ==
      tuple->parent()->parent()->entry_computation()->root_instruction()) {
    return Status::OK();
  }
  bool all_tuple_elements_have_buffer =
      absl::c_all_of(tuple->operands(), [&](HloInstruction* tuple_element) {
        return ir_emitter_context_->buffer_assignment()
......
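The early return in the hunk above can be illustrated with a small standalone sketch. It does not use XLA: `Instruction`, `Computation`, `Module`, `ToyEmitter`, and `EmittedTuplesForToyModule` are hypothetical stand-ins that model only the instruction -> computation -> module -> entry root chain the check walks. The real `HandleTuple` returns a `Status` and emits GPU thunks; the toy version just records which tuples would have had fill code emitted.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for XLA's HLO classes; they model only the
// parent/root relationships that the elision check relies on.
struct Computation;
struct Module;

struct Instruction {
  std::string name;
  Computation* parent = nullptr;
};

struct Computation {
  Module* parent = nullptr;
  Instruction* root = nullptr;
};

struct Module {
  Computation* entry = nullptr;
};

// Records the tuples for which buffer-filling code would have been emitted.
struct ToyEmitter {
  std::vector<std::string> emitted;

  // Returns true if fill code was emitted, false if the tuple was elided.
  bool HandleTuple(Instruction* tuple) {
    // Same shape as the check in the diff: is this the entry root?
    if (tuple == tuple->parent->parent->entry->root) {
      return false;  // Elided: nothing reads the root tuple buffer.
    }
    emitted.push_back(tuple->name);
    return true;
  }
};

// Builds a toy module with one non-root tuple and one root tuple in the entry
// computation, then returns the names of the tuples that were not elided.
std::vector<std::string> EmittedTuplesForToyModule() {
  Module module;
  Computation entry;
  entry.parent = &module;
  module.entry = &entry;

  Instruction inner{"inner_tuple", &entry};
  Instruction root{"root_tuple", &entry};
  entry.root = &root;

  ToyEmitter emitter;
  emitter.HandleTuple(&inner);
  emitter.HandleTuple(&root);
  return emitter.emitted;
}
```

In this sketch only the non-root tuple is recorded. The root tuple of the entry computation is skipped, which mirrors the commit's observation that its buffer is never read.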