- 20 Jul, 2023 40 commits
-
Committed by Nicolas Perez
Take relevant tests from conv_ops_test and conv_ops_3d_test to test the general conv Op. Refactor test cases to be parameterized instead of running in for loops. PiperOrigin-RevId: 549458691
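As a hedged illustration of the refactor described above (a minimal sketch, not the actual code from conv_ops_test): with `absl.testing.parameterized`, each parameter set becomes its own named test case, replacing a hand-rolled for loop inside a single test method. The test name and values below are invented for the sketch.

```python
import tensorflow as tf
from absl.testing import parameterized


class ConvOpSketchTest(tf.test.TestCase, parameterized.TestCase):

  # Each named parameter set runs as a separate test case (e.g.
  # test_conv2d_all_ones_float32), instead of iterating dtypes in a loop.
  @parameterized.named_parameters(
      ("float32", tf.float32),
      ("float64", tf.float64),
  )
  def test_conv2d_all_ones(self, dtype):
    x = tf.ones([1, 4, 4, 1], dtype=dtype)
    k = tf.ones([2, 2, 1, 1], dtype=dtype)
    y = tf.nn.conv2d(x, k, strides=1, padding="VALID")
    # A 2x2 all-ones kernel over all-ones input sums 4 elements per window.
    self.assertAllClose(y, tf.fill([1, 3, 3, 1], tf.cast(4, dtype)))


if __name__ == "__main__":
  tf.test.main()
```

Parameterized cases also fail independently, so one broken dtype no longer masks the others, which a for loop inside one test method would.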
-
Committed by Jieying Luo
Note that the TF runtime side is already set up in xla_launch_util. PiperOrigin-RevId: 549456738
-
Committed by Rahul Joshi
- Add option `xla_gpu_enable_pipelined_reduce_scatter` to enable forward pipelining of reduce-scatter instructions. PiperOrigin-RevId: 549452899
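A hedged usage sketch: `xla_gpu_*` debug options are conventionally toggled through the `XLA_FLAGS` environment variable before XLA initializes. That this particular option is exposed that way is an assumption based on the convention, not something the commit states.

```python
import os

# Assumed usage: append the new option to XLA_FLAGS before XLA initializes.
# The flag name comes from the commit; the env-var mechanism is the usual
# convention for xla_gpu_* debug options.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_enable_pipelined_reduce_scatter=true"
).strip()
```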
-
Committed by Yishuang Pang
PiperOrigin-RevId: 549451568
-
Committed by Skye Wanderman-Milne
PiperOrigin-RevId: 549440289
-
Committed by A. Unique TensorFlower
Move the pseudo-constant from the while loop argument to the while loop body in `tfl_while_outline` to avoid a memory penalty at runtime. PiperOrigin-RevId: 549440221
-
Committed by Skye Wanderman-Milne
It's not possible for tuple buffers (yet?). The eventual goal is for ML frameworks to only call individual getters instead of using PjRtBuffer::{logical_}on_device_shape, since passing around xla::Shapes is expensive and often includes more information than is necessary or even meaningful. We'd like to eventually remove PJRT_Buffer_OnDeviceTrimmedShape from the PJRT C API altogether ({logical_}on_device_shape will likely stay for non-ML framework usage). PiperOrigin-RevId: 549438520
-
Committed by Yu Feng
Also tested that relayout to a ragged layout works out of the box! PiperOrigin-RevId: 549438291
-
Committed by A. Unique TensorFlower
http://github.com/tensorflow/runtime/commit/2a7a9bde82ee99f382b26c75769ece54464a210d. PiperOrigin-RevId: 549436957
-
Committed by Scott Zhu
tf.device(CPU:0) is not respected by DTensor, so we need to explicitly convert the tensor values to DTensor and make sure they are placed on the proper mesh (the CPU host mesh) for logging. Also update the test to mimic the current production behavior. PiperOrigin-RevId: 549432976
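A minimal sketch of the placement idea, assuming the public `tf.experimental.dtensor` API; this is not the commit's actual code, and the mesh dimension name ("batch") and logged value are invented for the example.

```python
import tensorflow as tf
from tensorflow.experimental import dtensor

# Build a single-device CPU host mesh and a replicated layout on it.
host_mesh = dtensor.create_mesh([("batch", 1)], device_type="CPU")
layout = dtensor.Layout.replicated(host_mesh, rank=1)

loss = tf.constant([0.25])
# Explicitly place the value on the CPU host mesh instead of relying on
# tf.device("CPU:0"), which DTensor does not respect.
loss_on_host = dtensor.copy_to_mesh(loss, layout)
tf.print("loss:", loss_on_host)
```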
-
Committed by A. Unique TensorFlower
[AutoSharding] Make sure that the 1D device mesh in the cluster environment matches the assumptions made by `ReshardingCostMixedMeshShape` in auto_sharding_utils. PiperOrigin-RevId: 549428578
-
Committed by Rahul Joshi
PiperOrigin-RevId: 549422754
-
Committed by Rahul Joshi
- Specify the HLO opcode in the pipeliner config, and use that to derive a more descriptive pass name
- Some minor code cleanup
PiperOrigin-RevId: 549422627
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 549422464
-
Committed by Fiona Lang
PiperOrigin-RevId: 549421317
-
Committed by Yishuang Pang
PiperOrigin-RevId: 549420699
-
Committed by Kanglan Tang
The following configs are removed:
- v1
- avx2_win
- avx2_linux
- native_arch_linux
- numa
- libc++
- ios_i386
- stackdriver_support
- rbe_lite_linux
- rbe_linux_cuda_nvcc
- rbe_gpu_linux
- rbe_linux_cuda11.2_nvcc_py3.8, rbe_linux_cuda_nvcc_py38
- rbe_linux_cuda11.2_nvcc_py3.10, rbe_linux_cuda_nvcc_py310
- rbe_linux_rocm_py3.7, rbe_linux_rocm_py3.8, rbe_linux_rocm_py3.10
- rbe_linux_cuda_clang_base, rbe_linux_cuda_clang_py**
- rbe_win_py37, rbe_win_py310

If the removal of a config breaks your workflow, you can add it back as a command line option. If you think a config was removed by mistake, please open an issue on GitHub. PiperOrigin-RevId: 549419745
-
Committed by Justin Szaday
PiperOrigin-RevId: 549417829
-
Committed by A. Unique TensorFlower
[AutoSharding] Ensure that strategies are generated for custom call ops with user shardings. Previously, no sharding strategies were generated for such ops. PiperOrigin-RevId: 549414995
-
Committed by A. Unique TensorFlower
Updates LLVM usage to match [3cd3f11c174b](https://github.com/llvm/llvm-project/commit/3cd3f11c174b) PiperOrigin-RevId: 549413700
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549411619
-
Committed by Yash Katariya
PiperOrigin-RevId: 549410478
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549410198
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549408597
-
Committed by Rahul Joshi
PiperOrigin-RevId: 549406453
-
Committed by A. Unique TensorFlower
PiperOrigin-RevId: 549404348
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549401555
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549400109
-
Committed by Russell Power
PiperOrigin-RevId: 549397949
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549397765
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549396252
-
Committed by TJ Xu
PR #4019: [NVIDIA XLA:GPU] Introducing training support for cudnn fused mha (LHLO lowering and thunk, 2nd out of 3)

Imported from GitHub PR https://github.com/openxla/xla/pull/4019

This is a follow-up PR to https://github.com/openxla/xla/pull/3886. In total, the fused MHA training changes are broken into 3 PRs; this is the 2nd of 3. The other PRs:
- 1st: stream executor changes, https://github.com/openxla/xla/pull/3886 (merged)
- 3rd: rewriter changes, https://github.com/openxla/xla/pull/3726

This PR contains changes to lower the LHLO fused MHA custom call to the cudnnFMHA thunk and runner, for both forward and backward FMHA calls. It also adds an activation output to forward calls in the thunk, runner, and stream executor to support training. All functional and unit testing will be in the 3rd and final PR.

Copybara import of the project:
-- 3a5165f03b8a64b05486eaadc1c677b6da9346b9 by TJ <tjx@nvidia.com>: Introducing LHLO lowering logic for fused MHA. This PR contains changes to lower the LHLO fused MHA custom call to the cudnnFMHA thunk and runner for both forward and backward FMHA calls.
-- ca9b0d9ec59e02eb4d8718575b78fc17373f4fcc by TJ <tjx@nvidia.com>: Address PR comments
-- af688151a3a628cfa115ce4ff50df0f549181054 by TJ <tjx@nvidia.com>: Make uid initialization consistent between backward and forward calls

Merging this change closes #4019

PiperOrigin-RevId: 549395389
-
Committed by Armando Ugalde Velasco
PiperOrigin-RevId: 549394586
-
Committed by Armando Ugalde Velasco
Declare the MultipleIterationsAutoScaler class, which is needed to estimate the optimal number of workers for all Iterations running in a tf.data service cluster. Also delete the UpdateOptimalNumberOfWorkers metric method from AutoScaler, since it will be moved to MultipleIterationsAutoScaler. PiperOrigin-RevId: 549392975
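The commit only declares the class; as a loose illustration of the aggregation problem it targets, here is a hypothetical Python sketch. The real class lives in the tf.data service C++ code, its method names differ, and its actual aggregation policy is not stated in this message; taking the maximum across iterations is an assumption made for the sketch.

```python
class MultipleIterationsAutoScalerSketch:
  """Hypothetical sketch: per-iteration estimates rolled up to one number."""

  def __init__(self):
    self._optimal_workers = {}  # iteration_id -> estimated worker count

  def report_estimate(self, iteration_id, workers):
    # Each running iteration reports its own AutoScaler-style estimate.
    self._optimal_workers[iteration_id] = workers

  def optimal_number_of_workers(self):
    # The cluster must keep up with its most demanding iteration, so one
    # plausible roll-up (an assumption here) is the maximum estimate.
    return max(self._optimal_workers.values(), default=0)
```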
-
Committed by TensorFlower Gardener
PiperOrigin-RevId: 549391535
-
Committed by Matt Callanan
PiperOrigin-RevId: 549387461
-
Committed by Skye Wanderman-Milne
This change adds the following interfaces:
* PJRT_Buffer_DynamicDimensionIndices
* PjRtBuffer::dynamic_dimensions
* PjRtBuffer::has_dynamic_dimensions

PjRtBuffer::has_dynamic_dimensions is the only function actually needed by callers. However, I implemented it by adding PJRT_Buffer_DynamicDimensionIndices to the C API, rather than a C function that just returns whether there are dynamic dimensions, since returning which dimensions are dynamic gives the same information and is more future-proof in case we do need the extra info.

To implement the C function, I added PjRtBuffer::dynamic_dimensions, which is only implemented for non-C PJRT clients for now, although it could be implemented in PjRtCApiClient if needed (it's a bit annoying because the interface is that of xla::Shape instead of PJRT_Buffer_DynamicDimensionIndices; I made them different because the PJRT_Buffer_DynamicDimensionIndices interface is better for implementing has_dynamic_dimensions).

The eventual goal is for ML frameworks to only call individual getters instead of using PjRtBuffer::{logical_}on_device_shape, since passing around xla::Shapes is expensive and often includes more information than is necessary or even meaningful. We'd like to eventually remove PJRT_Buffer_OnDeviceTrimmedShape from the PJRT C API altogether ({logical_}on_device_shape will likely stay for non-ML framework usage).

PiperOrigin-RevId: 549381647
-
Committed by Peter Hawkins
PiperOrigin-RevId: 549381271
-
Committed by Antonio Sanchez
Previously, CUDA FFT plans were cached based solely on the FFT parameters. However, this breaks in multi-GPU setups, since `cufftHandle`s are device-specific: it led to garbage outputs when an FFT plan created on one device was used on another. Added the GPU device ID to the plan cache key. Fixes #60926. PiperOrigin-RevId: 549375830
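A minimal sketch of the caching idea, assuming nothing about TensorFlow's actual C++ cache beyond what the commit says: plan handles are only valid on the device that created them, so the cache key must include the device id alongside the FFT parameters. All names below are hypothetical.

```python
_plan_cache = {}

def get_fft_plan(device_id, shape, fft_type):
  # Before the fix, the key was effectively just (shape, fft_type), so a
  # plan created on GPU 0 could be handed to GPU 1 and produce garbage.
  key = (device_id, shape, fft_type)
  if key not in _plan_cache:
    _plan_cache[key] = _create_plan(device_id, shape, fft_type)
  return _plan_cache[key]

def _create_plan(device_id, shape, fft_type):
  # Hypothetical stand-in for device-specific plan creation (cufftPlan*).
  return ("plan", device_id, shape, fft_type)
```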
-
Committed by Eugene Zhulenev
PiperOrigin-RevId: 549373738
-