[XLA:GPU] Elide tuple roots of the entry computation

The tuple buffer is never read, so stop emitting code to fill it. Writing a typical root tuple involves an H2D memcpy and a host callback, both of which are somewhat slow. This helps tiny models and inference benchmarks, where host/device syncs can be a significant part of the total runtime of the computation.

PiperOrigin-RevId: 216968475
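The check this change relies on can be sketched in isolation. The snippet below is only an illustration under stated assumptions: `Instruction`, `Computation`, `Module`, and `IsEntryRootTuple` are hypothetical stand-ins, not XLA's real `HloInstruction`, `HloComputation`, and `HloModule` classes. It mirrors only the pointer chain used in the diff: instruction, then parent computation, then parent module, then the root of the entry computation.

```cpp
// Hypothetical, simplified stand-ins for XLA's HLO classes (assumption:
// only the accessors used by the check are modeled here).
struct Computation;
struct Module;

struct Instruction {
  Computation* parent_ = nullptr;
  // The computation that contains this instruction.
  Computation* parent() const { return parent_; }
};

struct Computation {
  Module* parent_ = nullptr;
  Instruction* root_ = nullptr;
  // The module that contains this computation.
  Module* parent() const { return parent_; }
  // The instruction whose value the computation returns.
  Instruction* root_instruction() const { return root_; }
};

struct Module {
  Computation* entry_ = nullptr;
  // The computation executed when the module is run.
  Computation* entry_computation() const { return entry_; }
};

// Returns true only when `tuple` is the root instruction of the entry
// computation of the module it belongs to. In that case the emitter
// could skip writing the tuple buffer, as the commit describes.
bool IsEntryRootTuple(const Instruction* tuple) {
  return tuple ==
         tuple->parent()->parent()->entry_computation()->root_instruction();
}
```

With these stand-ins, a tuple that is the root of a nested (non-entry) computation, or a non-root tuple inside the entry computation, does not satisfy the check, so its buffer would still be written.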

0f13eb91 · Benjamin Kramer · TensorFlower Gardener · 3e94e19e · 0f13eb91

Showing 1 changed file with 8 additions and 0 deletions

tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc +8 -0

--- a/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
+++ b/tensorflow/compiler/xla/service/gpu/ir_emitter_unnested.cc
@@ -1728,6 +1728,14 @@ Status IrEmitterUnnested::HandleReduce(HloInstruction* reduce) {
 }

 Status IrEmitterUnnested::HandleTuple(HloInstruction* tuple) {
+  // For the root node of the entry computation we can elide writing the tuple
+  // buffer. We can always figure out the contents of the tuples from buffer
+  // assignment because we insert copies to ensure non-ambiguous output buffers.
+  // GpuExecutable never reads the tuple buffer.
+  if (tuple ==
+      tuple->parent()->parent()->entry_computation()->root_instruction()) {
+    return Status::OK();
+  }
  bool all_tuple_elements_have_buffer =
      absl::c_all_of(tuple->operands(), [&](HloInstruction* tuple_element) {
        return ir_emitter_context_->buffer_assignment()