---
title: "Up to 5x less communication and 3.4x faster training through 1-bit Adam"
excerpt: ""
date: 2020-09-09 00:00:00
tags: training English
toc: false
---

Adam is an effective and probably the most widely used optimizer for training many large-scale deep learning models. However, Adam is generally not compatible with communication-efficient optimization algorithms, so communication cost can become a bottleneck when scaling across distributed devices. We introduce a new algorithm, 1-bit Adam, and its efficient implementation in DeepSpeed. 1-bit Adam offers the ***same convergence*** as Adam, incurs up to ***5x less communication***, and enables up to ***3.5x higher throughput for BERT-Large pretraining*** and up to ***2.7x higher throughput for SQuAD fine-tuning*** on bandwidth-limited clusters.

* For a brief overview, see our [press release]({{ site.press_release_v3 }}).
* For a detailed technology deep dive, see our [blog post](https://www.deepspeed.ai/2020/09/08/onebit-adam-blog-post.html).
* For a tutorial on how to reproduce our results, see our [1-bit Adam tutorial](/tutorials/onebit-adam/); a minimal configuration sketch is also included below.
* The source code for 1-bit Adam can be found in the [DeepSpeed repo](https://github.com/microsoft/deepspeed). The implementation of 1-bit Adam is in [onebit_adam.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/fp16/onebit_adam.py) and CUDA-aware communication for 1-bit Adam is in [custom_collectives.py](https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/custom_collectives.py). Example code to try this feature can be found in the [DeepSpeedExamples repo](https://github.com/microsoft/deepspeedexamples) as shown in the [tutorial](/tutorials/onebit-adam/).
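
As a rough illustration of what switching from Adam to 1-bit Adam looks like, here is a minimal sketch of a DeepSpeed JSON configuration. The `OneBitAdam` optimizer type and the `freeze_step` and `cuda_aware` parameters follow the 1-bit Adam tutorial; the numeric values below are placeholders, not recommended settings, so please refer to the tutorial for values appropriate to your workload.

```python
# Minimal sketch (not from the announcement): build a DeepSpeed config that
# selects 1-bit Adam instead of the standard Adam optimizer.
# Key names follow the 1-bit Adam tutorial; all values are illustrative only.
import json

ds_config = {
    "train_batch_size": 4096,
    "fp16": {"enabled": True},
    "optimizer": {
        "type": "OneBitAdam",        # 1-bit Adam optimizer
        "params": {
            "lr": 4e-4,              # placeholder learning rate
            "freeze_step": 23000,    # full-precision Adam warmup steps before compression starts
            "cuda_aware": True       # use CUDA-aware communication, if your MPI stack supports it
        }
    }
}

# Write the config to disk; a training script would then consume it, for example
# via the --deepspeed_config flag used by the DeepSpeedExamples scripts.
with open("onebit_adam_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

The `freeze_step` warmup reflects how 1-bit Adam works: it runs uncompressed Adam for an initial phase, then freezes the variance term and uses it as a fixed preconditioner while communicating 1-bit compressed updates for the rest of training. The tutorial and blog post linked above explain how to choose this value.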