- Sep 14, 2023: 6 commits
-
-
Committed by Jieying Luo
PiperOrigin-RevId: 565092976
-
Committed by Marcello Maggioni
[XLA] Rework dot() sharding propagation to look ahead at instruction shardings and choose a sharding for dot() that agrees with its users when possible.
PiperOrigin-RevId: 565086052
-
Committed by Oleg Shyshkov
PiperOrigin-RevId: 565083641
-
Committed by Marcello Maggioni
Small collectives might be better off when sunk, and there are other potential use cases. Also fix a bug where we were accepting reuse of the data that we were storing, and change the tests that used that pattern to match the fix.
PiperOrigin-RevId: 565080772
-
Committed by Bixia Zheng
When decomposing collective-permute, stop chaining the decomposed Send and Recv instructions through control dependence. This is because the generated HLO program is correct even without the control dependence chaining. The purpose of the control dependence chaining is to support a scheduler, such as the latency hiding scheduler, so it will instead be added in the latency hiding scheduler preparation pass. Not producing the control dependence chaining while decomposing collective-permute also simplifies the implementation of the collective-pipeliner in pipelining Send and Recv instructions.
PiperOrigin-RevId: 565073772
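For context, a minimal sketch of the chaining that a scheduler-preparation pass might restore; the helper name and surrounding code are assumptions, but `HloInstruction::AddControlDependencyTo` is XLA's existing mechanism for ordering instructions that have no data dependency:

```cpp
#include <vector>

#include "absl/status/status.h"
#include "tsl/platform/errors.h"
#include "xla/hlo/ir/hlo_instruction.h"

namespace xla {

// Hypothetical helper: force a sequential order on the Send/Recv
// instructions produced by collective-permute decomposition by chaining
// them with control dependencies. This does not change program semantics;
// it only constrains the (latency hiding) scheduler.
absl::Status ChainWithControlDependencies(
    const std::vector<HloInstruction*>& ordered_instructions) {
  for (size_t i = 1; i < ordered_instructions.size(); ++i) {
    // A control dependency means "the predecessor must be scheduled first".
    TF_RETURN_IF_ERROR(ordered_instructions[i - 1]->AddControlDependencyTo(
        ordered_instructions[i]));
  }
  return absl::OkStatus();
}

}  // namespace xla
```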
-
Committed by Benjamin Kramer
Updates LLVM usage to match [8ebe1d1cc1e4](https://github.com/llvm/llvm-project/commit/8ebe1d1cc1e4).
PiperOrigin-RevId: 565069541
-
- Sep 13, 2023: 34 commits
-
-
Committed by A. Unique TensorFlower
Without this additional bazelrc file, TSL will be built with the system compiler, which happens to be GCC.
PiperOrigin-RevId: 565060297
-
Committed by Peter Hawkins
This change was merged upstream in https://github.com/openai/triton/pull/2068 but hasn't made it to the OpenXLA fork yet. It improves build time in OSS by not building a number of MLIR dialects/passes that are not needed.
PiperOrigin-RevId: 565058458
-
Committed by Oleg Shyshkov
PiperOrigin-RevId: 565025664
-
Committed by George Karpenkov
PiperOrigin-RevId: 565024582
-
Committed by Tamás Danyluk
Unify iterator names to *_it; use `auto*` instead of `auto` when the type is a pointer type.
PiperOrigin-RevId: 565016627
-
Committed by Johannes Reifferscheid
This is another prefactoring to make Triton fusions compatible with FusionInterface and partially fused HLO. For the former, we need the LaunchDimension computation to be a separate function. For the latter, we change the launch-dimension function signatures to no longer take an HloComputation, because we don't yet have one during fusion (at least not a complete one). For now, this change is a no-op, since we do not yet have any boundary functions for non-fusion ops.
PiperOrigin-RevId: 565015567
-
Committed by Fergus Henderson
Note that the corresponding FlatBuffer C API type can also be used instead of the FlatBuffer C++ API type for the TFLiteSettings FlatBuffer parameter.
PiperOrigin-RevId: 565008668
-
Committed by Son Tuan Vu
From the original code at commit 03d304b, we are only supposed to scale the reduction tile size of dimX by the unroll factor for column reductions, so the check when creating `ReductionCodegenInfo` is only valid for column reductions.
PiperOrigin-RevId: 564991695
-
Committed by Ilia Sergachev
Add a comparison operator for fragments, improve encapsulation, and fix comments.
PiperOrigin-RevId: 564990063
-
Committed by Johannes Reifferscheid
LaunchDimensions are being decoupled from codegen, but shared memory requirements are only known after codegen. Currently, this feature is only used for Triton fusions; all other fusions that use shared memory allocate it within the kernel.
PiperOrigin-RevId: 564981685
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 564978233
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 564978209
-
Committed by A. Unique TensorFlower
The upcoming CUDA 12 upgrade requires TensorRT 8.6, and this version has a new set of headers, which requires an update to the bazel configure script. I also had to change `find_cuda_config.py`, because previously it only reported the major version of TensorRT back to the configure script, but we also need the minor version to distinguish between TensorRT 8.5 and below and TensorRT 8.6+.
PiperOrigin-RevId: 564927510
-
Committed by TensorFlower Gardener
PiperOrigin-RevId: 564914384
-
Committed by Ziyin Huang
PiperOrigin-RevId: 564909053
-
Committed by David Majnemer
GCC is missing some intrinsics; expand them by hand.
PiperOrigin-RevId: 564899448
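As an illustration of the kind of hand expansion involved (a hypothetical example; the commit does not say which intrinsics were affected): older GCC releases lack `_mm_loadu_si32`, which can be rebuilt from a `memcpy` load plus `_mm_cvtsi32_si128`:

```cpp
#include <immintrin.h>

#include <cstring>

// Hand-expanded stand-in for _mm_loadu_si32 (missing from older GCC):
// memcpy performs an unaligned, strict-aliasing-safe 4-byte load that
// compilers fold into a single mov; _mm_cvtsi32_si128 then zero-extends
// the value into the low lane of an XMM register.
static inline __m128i LoadU32AsVector(const void* p) {
  int v;
  std::memcpy(&v, p, sizeof(v));
  return _mm_cvtsi32_si128(v);
}
```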
-
Committed by Russell Power
PiperOrigin-RevId: 564893903
-
Committed by A. Unique TensorFlower
Create metrics:
1) '/pjrt/compiler/is_compiling_computation' to record whether the PjRt compiler is compiling computations.
2) '/pjrt/compiler/is_compiling_module' to record whether the PjRt compiler is compiling modules.
PiperOrigin-RevId: 564891869
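A minimal sketch of how such boolean metrics are typically defined and toggled; the file placement and `CompileComputation` wrapper are assumptions, but the `tsl::monitoring::Gauge` API shown is TSL's standard mechanism:

```cpp
#include "tsl/lib/monitoring/gauge.h"

namespace {

// Assumed wiring: a process-wide boolean gauge for the first metric name
// from the commit message (the module variant would mirror this).
auto* is_compiling_computation = tsl::monitoring::Gauge<bool, 0>::New(
    "/pjrt/compiler/is_compiling_computation",
    "Whether the PjRt compiler is currently compiling a computation.");

}  // namespace

void CompileComputation(/* ... */) {
  is_compiling_computation->GetCell()->Set(true);
  // ... run the actual compilation ...
  is_compiling_computation->GetCell()->Set(false);
}
```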
-
Committed by Jake Harmon
PiperOrigin-RevId: 564886962
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 564881319
-
Committed by A. Unique TensorFlower
This is for internal error logging only.
PiperOrigin-RevId: 564878979
-
Committed by Yu Feng
PiperOrigin-RevId: 564871149
-
Committed by Matt Callanan
PiperOrigin-RevId: 564864652
-
Committed by David Majnemer
For an 8x8 uint32_t transpose, we had:
- 4x `vinsertf128 ymm, ymm, xmm`
- 4x `vperm2f128`
These are very expensive instructions because they cross the 128-bit lane boundary. Now we have 8x `vinsertf128`, but crucially, the inserted operand comes from memory. This is important because modern x86 hardware can easily broadcast on load, which means that `vinsertf128` turns into a blend instead of a shuffle.
We use the same trick for handling matrices which are smaller than the vector width to accelerate the transpose: we still require a cross-lane step, but we cut all the other shuffles in half compared to SSE2.
While we are here, don't claim to support kernels which don't exist; claiming them makes the transpose system choose unoptimized implementations.
PiperOrigin-RevId: 564860657
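A minimal sketch of the memory-operand trick described above (the helper name is hypothetical, not the actual kernel). Because the inserted 128-bit operand comes straight from a load, compilers can emit `vinsertf128 ymm, ymm, m128, 1`, which decodes to a cheap load-plus-blend rather than a lane-crossing shuffle:

```cpp
#include <immintrin.h>

#include <cstdint>

// Build one 8-wide row from two 4-wide rows in memory. Compile with -mavx.
static inline __m256i CombineRows(const uint32_t* lo, const uint32_t* hi) {
  // Low 128 bits: plain load of the first row.
  __m256i v = _mm256_castsi128_si256(
      _mm_loadu_si128(reinterpret_cast<const __m128i*>(lo)));
  // High 128 bits: the insert's operand is a fresh load, so the compiler
  // can fold it into `vinsertf128 ymm, ymm, m128, 1` (a blend on modern
  // x86 cores) instead of a register-to-register lane-crossing shuffle.
  return _mm256_insertf128_si256(
      v, _mm_loadu_si128(reinterpret_cast<const __m128i*>(hi)), 1);
}
```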
-
Committed by A. Unique TensorFlower
This check can be removed since tf2xla can run ops with non-const input even if the CompileTimeConstant attribute is set, with the help of value inference.
PiperOrigin-RevId: 564851049
-
Committed by Ralf W. Grosse-Kunstleve
PiperOrigin-RevId: 564845580
-
Committed by Benjamin Kramer
Updates LLVM usage to match [c1796be93fe5](https://github.com/llvm/llvm-project/commit/c1796be93fe5).
PiperOrigin-RevId: 564842806
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 564840225
-
Committed by Antonio Sanchez
PiperOrigin-RevId: 564825677
-
Committed by Fiona Lang
Without this additional bazelrc file, TSL will be built with the system compiler, which happens to be GCC. I also had to disable a warning that was raised by Clang. It puzzles me a bit that this is not needed for the TensorFlow build, which definitely uses Clang.
PiperOrigin-RevId: 564822820
-
Committed by Peter Hawkins
I've seen this file take over 5 minutes to build. Shard it by type.
PiperOrigin-RevId: 564820851
-
Committed by Matt Callanan
PiperOrigin-RevId: 564813085
-
Committed by A. Unique TensorFlower
This fixes a UB issue which occurs with newer versions of Clang (17+). The fix has also been upstreamed through https://github.com/NVIDIA/nccl/pull/916. In addition, I'm changing the handling of `enqueue.cc`, which needs to be compiled in CUDA mode under Clang; the previous solution of just passing the `-x cuda` option fails with CUDA 12+. I'm also correcting the version number that we set in the patch: not sure if this version is reported in some logs, but if it is, it should be correct.
PiperOrigin-RevId: 564811002
-
Committed by A. Unique TensorFlower
Add CopyToMemorySpace to the PjRtBuffer API. This CL does not implement any instance of the method, but adds the ability to do so in follow-up CLs.
PiperOrigin-RevId: 564807735
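A minimal sketch of what the new virtual might look like on `PjRtBuffer`; the exact signature and the default-unimplemented body are assumptions inferred from the commit message, not the actual API:

```cpp
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"

namespace xla {

class PjRtMemorySpace;  // Represents a device-addressable memory space.

class PjRtBuffer {
 public:
  virtual ~PjRtBuffer() = default;

  // Hypothetical shape of the new method: copy this buffer into
  // `dst_memory_space`, returning the new buffer. Default-unimplemented so
  // existing PjRtBuffer subclasses keep compiling until they opt in, which
  // matches a CL that adds the method without implementing any instance.
  virtual absl::StatusOr<std::unique_ptr<PjRtBuffer>> CopyToMemorySpace(
      PjRtMemorySpace* dst_memory_space) {
    return absl::UnimplementedError("CopyToMemorySpace is not implemented.");
  }
};

}  // namespace xla
```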
-