Researchers upend AI status quo by eliminating matrix multiplication in LLMs

Jun 25, 2024 - arstechnica.com
Researchers from the University of California, Santa Cruz, UC Davis, LuxiTech, and Soochow University have developed a new method to run AI language models more efficiently by eliminating matrix multiplication from the computation. This could reduce the environmental impact and operational costs of AI systems. The researchers created a custom 2.7 billion parameter model without matrix multiplication that performs similarly to conventional large language models. They also demonstrated running a 1.3 billion parameter model at 23.8 tokens per second on a GPU accelerated by a custom-programmed FPGA chip that uses about 13 watts of power.

The researchers argue that their approach could make large language models more accessible, efficient, and sustainable, particularly for deployment on resource-constrained hardware like smartphones. The work has not yet been peer-reviewed, but the researchers claim it challenges the prevailing paradigm that matrix multiplication operations are indispensable for building high-performing language models. The limitations of BitNet, a previous technique that still relied on matrix multiplications in its self-attention mechanism, motivated them to develop a completely "MatMul-free" architecture.
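The core idea behind ternary-weight approaches like BitNet and the MatMul-free line of work is that when weights are restricted to {-1, 0, +1}, a matrix-vector product collapses into additions and subtractions. The following NumPy sketch is a minimal illustration of that principle only; it is not the authors' implementation (which includes GPU kernels and an FPGA design), and the ternary_matvec helper and quantization threshold are invented for demonstration.

import numpy as np

# Illustration of the ternary-weight idea: with weights in {-1, 0, +1},
# a matrix-vector product needs no multiplications, only adds/subtracts.

def ternary_matvec(W_ternary, x):
    """Compute W_ternary @ x using only additions and subtractions."""
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Toy usage: quantize a dense weight matrix to ternary values, then apply it.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
threshold = 0.5 * np.abs(W).mean()            # illustrative threshold choice
W_t = np.sign(W) * (np.abs(W) > threshold)    # entries are -1, 0, or +1
x = rng.normal(size=8)
print(ternary_matvec(W_t, x))                 # add/subtract path
print(W_t @ x)                                # same result via ordinary matmul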

Key takeaways:

  • Researchers have developed a new method to run AI language models more efficiently by eliminating matrix multiplication, which could significantly reduce the environmental impact and operational costs of AI systems.
  • The new model, described in a paper titled "Scalable MatMul-free Language Modeling", features similar performance to conventional large language models but uses significantly less power.
  • The researchers argue that their approach could make large language models more accessible, efficient, and sustainable, especially for deployment on resource-constrained hardware like smartphones.
  • The paper also mentions BitNet, a precursor to their work, which demonstrated the viability of using binary and ternary weights in language models, but still relied on matrix multiplications in its self-attention mechanism.
