The article also provides performance comparisons of llamafile on various hardware platforms, including enterprise hardware (HP Intel® Core™ i9-9900), hobbyist hardware (Raspberry Pi v5 and v4), gaming hardware (Intel® Core™ i9-14900K), professional hardware (AMD Ryzen Threadripper PRO 7995WX), and Apple hardware (Mac Studio with a 24-core M2 Ultra CPU). The author notes that while llamafile is designed to assist those without access to high-end GPUs, it also offers a first-class experience for high-end users. The performance gains are attributed to the new kernels and the use of Cosmopolitan Libc to package llama.cpp as a single-file cross-platform binary.
Key takeaways:
- The author has written 84 new matrix multiplication kernels for llamafile, which enable it to read prompts/images faster. As a result, prompt processing should run between 30% and 500% faster than in llama.cpp when using F16 and Q8_0 weights on CPU.
- The improvements are most dramatic for ARMv8.2+ (e.g. Raspberry Pi 5), Intel (e.g. Alder Lake), and AVX512 (e.g. Zen 4) computers. The kernels are 2x faster than MKL for matrices that fit in L2 cache.
- Performance gains were observed across enterprise, hobbyist, gaming, Apple, and professional hardware. The AMD Ryzen Threadripper PRO 7995WX showed significant gains and offers 7x more raw compute power than the ARM-based M2 Ultra.
- The author warns that many people who bought the Threadripper ran into issues with sketchy RAM. The author had to return the first DIMMs bought for this computer, because most of them died and performance was significantly reduced.
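
The takeaway about matrices that fit in L2 cache can be made concrete with a rough size estimate. The sketch below is a back-of-the-envelope illustration, not code from llamafile: the function names, the 2 MiB L2 size, and the choice to count all three matrices as the working set are assumptions made for the example. The only format detail taken as given is that a Q8_0 block stores 32 int8 values plus one 16-bit scale, which works out to 34 bytes per 32 weights.

```python
# Rough estimate of whether one matrix multiplication fits in L2 cache.
# Hypothetical helper names; the 2 MiB default is an assumed L2 size,
# so substitute the value for your own CPU.

BYTES_PER_ELEMENT = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 34 / 32,  # 32 int8 values + one 16-bit scale per block
}


def matmul_working_set_bytes(m, n, k, weight_type="F16",
                             activation_type="F32", output_type="F32"):
    """Approximate bytes touched by C[m,n] = A[m,k] @ B[k,n].

    A holds the weights, B the activations, C the output.
    """
    a = m * k * BYTES_PER_ELEMENT[weight_type]
    b = k * n * BYTES_PER_ELEMENT[activation_type]
    c = m * n * BYTES_PER_ELEMENT[output_type]
    return int(a + b + c)


def fits_in_l2(m, n, k, l2_bytes=2 * 1024 * 1024, **types):
    """True if the estimated working set is no larger than l2_bytes."""
    return matmul_working_set_bytes(m, n, k, **types) <= l2_bytes


if __name__ == "__main__":
    for size in (256, 512):
        total = matmul_working_set_bytes(size, size, size)
        print(size, total, fits_in_l2(size, size, size))
```

Under these assumptions a 256x256x256 multiplication with F16 weights stays well inside a 2 MiB L2 cache, while a 512x512x512 one does not, which gives a feel for the matrix sizes the 2x-over-MKL claim applies to.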