Last year, a team from Microsoft Research Asia created BitNet, the first 1-bit QAT method for LLMs, which proved more energy efficient than PTQ methods. In February, the team announced BitNet b1.58, whose parameters take only the values -1, 0, or 1, making it faster and more energy efficient than full-precision networks. Meanwhile, a team from Harbin Institute of Technology developed a method called OneBit that combines elements of both PTQ and QAT and achieves competitive results while using far less memory. However, current hardware cannot take full advantage of these models; realizing their full benefit will require new hardware that can natively represent each parameter as a -1 or 1 (or 0).
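The article does not spell out how full-precision weights get pushed down to -1, 0, or 1, but the idea fits in a few lines. Below is a minimal NumPy sketch of ternary quantization using an absolute-mean scale; the function name `ternary_quantize` and the exact scaling rule are illustrative assumptions, not code from BitNet b1.58.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Map full-precision weights to {-1, 0, 1} plus one shared scale.

    Illustrative sketch: scale by the mean absolute value, round, and
    clip so the original matrix is approximated by scale * w_ternary.
    """
    scale = np.mean(np.abs(w)) + eps              # one scalar per weight matrix
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_t, s = ternary_quantize(w)
print(w_t)                                        # entries are only -1, 0, or 1
print(np.abs(w - s * w_t).mean())                 # average approximation error
```

Because every stored weight is then -1, 0, or 1, multiplying activations by the weight matrix needs only additions, subtractions, and skipped zeros rather than full multiplications, which is part of why such models can run faster and draw less energy than their full-precision counterparts.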
Key takeaways:
- Large language models (LLMs) are becoming larger and more energy-intensive, prompting researchers to find ways to make them smaller and more efficient. One method being explored is to drastically round off the high-precision numbers that store their memories (their parameters, or weights) to just 1 or -1.
- Two general approaches to this are post-training quantization (PTQ) and quantization-aware training (QAT). PTQ has been more popular among researchers; one team introduced a PTQ method called BiLLM that uses 1 bit for most parameters and 2 bits for a few key weights (a code sketch contrasting PTQ and QAT follows this list).
- Microsoft Research Asia developed BitNet, the first 1-bit QAT method for LLMs, which is roughly 10 times as energy efficient as full-precision networks. A newer version, BitNet b1.58, uses parameters that can equal -1, 0, or 1 (about 1.58 bits of information per weight, hence the name), making it even more efficient.
- Quantized models have multiple advantages: they fit on smaller chips, require less data movement between memory and processors, and allow for faster processing. However, current hardware can't fully exploit these models, suggesting a need for new hardware specifically optimized for 1-bit LLMs.
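To make the PTQ/QAT distinction concrete, here is a minimal PyTorch sketch, assuming simple sign-plus-scale binarization and a straight-through estimator; `binarize` and `BinaryLinear` are hypothetical names, not the BiLLM or BitNet implementations. PTQ rounds an already trained weight matrix once, after the fact, while QAT applies the same rounding inside the forward pass during training so the network can adapt to it.

```python
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    """Sign binarization with one scale per matrix: w ≈ alpha * sign(w)."""
    alpha = w.abs().mean()
    return alpha * torch.sign(w)

# PTQ: quantize once, after training has finished.
def post_training_quantize(weight: torch.Tensor) -> torch.Tensor:
    return binarize(weight.detach())

# QAT: quantize inside the forward pass and train through it.
class BinaryLinear(torch.nn.Module):
    """Linear layer that uses binarized weights in the forward pass while
    keeping full-precision shadow weights for the gradient updates."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(0.02 * torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = binarize(self.weight)
        # Straight-through estimator: forward uses w_q, backward treats the
        # quantizer as identity so gradients still reach self.weight.
        w_ste = self.weight + (w_q - self.weight).detach()
        return x @ w_ste.t()

layer = BinaryLinear(8, 4)
x = torch.randn(2, 8)
layer(x).sum().backward()                   # gradients flow to layer.weight via the STE
w_ptq = post_training_quantize(layer.weight)  # one-shot PTQ of the same matrix
```

In the PTQ case, accuracy depends entirely on how well the one-shot rounding approximates the trained weights, which is why BiLLM reserves 2 bits for a few key weights; the QAT layer instead learns weights that already tolerate the rounding, at the cost of quantizing throughout training.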