The implementation currently supports FP16 only, and while the multiplications themselves are fast, overall inference still lags slightly because some non-essential parts need improvement. The algorithm also lets users decide dynamically how much of the model to load into memory, essentially skipping the least important weights at load time. Despite some implementation overhead, the initial results seem robust enough to warrant publication. The author encourages readers to test the 0.0.1B version of the algorithm and provides a deep dive into its workings.
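To make the adjustable-effort idea concrete, here is a minimal sketch of what an effort knob over a matrix-vector product could look like, assuming "effort" means performing only a top fraction of multiplications chosen from a precomputed magnitude ranking. The function names, the pure weight-magnitude ranking, and the NumPy setting are illustrative assumptions, not the author's Swift/Metal implementation.

```python
import numpy as np

def precompute_weight_order(W: np.ndarray) -> np.ndarray:
    # One-off precomputation (assumed): for each output row, rank the
    # weights by absolute magnitude, largest first. The ordering is
    # reused for every forward pass.
    return np.argsort(-np.abs(W), axis=1)

def effort_matvec(W: np.ndarray, order: np.ndarray, x: np.ndarray,
                  effort: float) -> np.ndarray:
    # Approximate y = W @ x using only the top `effort` fraction of
    # weights in each row. effort=1.0 reproduces the exact product;
    # smaller values trade accuracy for fewer multiplications.
    n_keep = max(1, int(round(effort * W.shape[1])))
    rows = np.arange(W.shape[0])[:, None]   # row index, broadcast over kept columns
    cols = order[:, :n_keep]                # kept column indices per row
    return np.einsum('ij,ij->i', W[rows, cols], x[cols])

# Example: a 1024x1024 layer evaluated at 30% effort.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)
order = precompute_weight_order(W)

exact = W @ x
approx = effort_matvec(W, order, x, effort=0.3)
print(np.corrcoef(exact, approx)[0, 1])  # rough agreement check
```

The effort value can be changed per call, which is what makes the trade-off adjustable at inference time rather than fixed at conversion time.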
Key takeaways:
- A new algorithm for LLM inference allows the calculation effort to be adjusted in real time, with the ability to skip loading the least important weights.
- The algorithm is implemented for Mistral and should work for all other models without retraining, only requiring conversion to a different format and some precomputation.
- The implementation lets users decide dynamically how much of the model to load into memory, effectively enabling ad-hoc distillation (see the sketch after this list).
- While the algorithm is fast, there is still room for improvement in non-essential parts such as softmax, and the author is seeking help from a Swift/Metal engineer to fix implementation overhead issues.
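For the ad-hoc distillation point above, the following is a hedged sketch of what loading only a fraction of each weight matrix might look like: only the highest-magnitude entries per row are kept in memory as (index, value) pairs. The helper names and the magnitude-based selection criterion are assumptions for illustration, not the precomputed format the author actually uses.

```python
import numpy as np

def load_fraction(W: np.ndarray, keep: float):
    # Assumed loading scheme: keep only the top `keep` fraction of
    # weights in each row by absolute magnitude, stored as
    # (column indices, values); the rest is simply never loaded,
    # shrinking the in-memory footprint.
    n_keep = max(1, int(round(keep * W.shape[1])))
    idx = np.argsort(-np.abs(W), axis=1)[:, :n_keep]
    vals = np.take_along_axis(W, idx, axis=1)
    return idx.astype(np.int32), vals

def partial_matvec(idx: np.ndarray, vals: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Multiply the partially loaded matrix with an input vector.
    return np.einsum('ij,ij->i', vals, x[idx])

# Loading 50% of a layer roughly halves its memory footprint (plus the
# index overhead) while still approximating the full product.
rng = np.random.default_rng(1)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)
idx, vals = load_fraction(W, keep=0.5)
print(np.corrcoef(W @ x, partial_matvec(idx, vals, x))[0, 1])
```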