The researchers' model uses about 10 times less memory and runs about 25% faster than comparable models on standard GPUs. The team also prototyped its custom hardware on a field-programmable gate array (FPGA), which let them fully exploit the energy-saving features built into the neural network. On this custom hardware, the model generated text faster than a human can read it while drawing just 13 watts of power, more than 50 times the efficiency of GPUs. The researchers believe that with further development the technology could become even more energy efficient.
Key takeaways:
- Researchers from UC Santa Cruz have developed a method to eliminate the most energy-consuming element of running large language models, matrix multiplication, while maintaining performance.
- The new method involves forcing all numbers within the matrices to be ternary, meaning each can only be -1, 0, or +1, which reduces computation to summing numbers rather than multiplying them, and adjusting the strategy for how matrices communicate with each other (see the sketch after this list).
- With this approach and custom hardware, they were able to power a billion-parameter-scale language model on just 13 watts, more than 50 times more efficient than typical hardware.
- Despite the reduced energy consumption, the new open-source model matches the performance of state-of-the-art models such as Meta’s Llama.
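To illustrate the core idea, here is a minimal sketch (not the team's actual implementation) of why ternary weights remove the need for multiplication: when every weight is -1, 0, or +1, a matrix-vector product collapses into additions and subtractions. The function name and example values below are purely illustrative.

```python
import numpy as np

def ternary_matvec(W, x):
    """Apply a ternary weight matrix W (entries in {-1, 0, +1}) to a vector x
    using only additions and subtractions, with no multiplications."""
    y = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:      # +1 entry: add the input element
                y[i] += x[j]
            elif W[i, j] == -1:   # -1 entry: subtract the input element
                y[i] -= x[j]
            # 0 entries contribute nothing and are skipped entirely
    return y

# Toy example: a small ternary weight matrix and an input vector
W = np.array([[ 1, 0, -1],
              [-1, 1,  0]])
x = np.array([2.0, 3.0, 5.0])

print(ternary_matvec(W, x))  # [-3.  1.]
print(W @ x)                 # same result via an ordinary matrix multiplication
```

The zero entries are simply skipped, which is part of why this style of computation maps so well onto custom low-power hardware: the most expensive arithmetic units (multipliers) are not needed at all.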