The proposed methods make it possible to run models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU compared to naive loading approaches. The integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design enables effective inference of LLMs on devices with limited memory. This research could pave the way for more efficient use of LLMs in resource-constrained environments.
Key takeaways:
- The paper presents a method for efficiently running large language models (LLMs) that exceed the available DRAM capacity by storing the model parameters on flash memory and loading them into DRAM on demand.
- The method builds an inference cost model aligned with flash memory behavior, which guides optimization along two axes: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks.
- Two main techniques are introduced: 'windowing', which reduces data transfer by reusing neurons activated for recent tokens, and 'row-column bundling', which stores associated rows and columns of the feed-forward layers together so that each flash read fetches a larger contiguous chunk.
- The methods enable running models up to twice the size of the available DRAM, with a significant increase in inference speed compared to naive loading approaches. This paves the way for effective inference of LLMs on devices with limited memory.
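The intuition behind the cost model can be sketched with a toy formula: each flash read pays a fixed latency plus a bandwidth-proportional term, so larger contiguous chunks amortize the latency and raise effective throughput. The latency and bandwidth figures below are illustrative assumptions, not numbers from the paper.

```python
def read_time_s(chunk_bytes, latency_s=1e-4, bandwidth_bps=2e9):
    """Toy flash read-cost model (assumed numbers): a fixed per-read
    latency plus a size/bandwidth transfer term."""
    return latency_s + chunk_bytes / bandwidth_bps

def effective_throughput_bps(chunk_bytes, **kwargs):
    """Bytes delivered per second of wall time for one read of this size.
    Grows with chunk size because the fixed latency is amortized."""
    return chunk_bytes / read_time_s(chunk_bytes, **kwargs)
```

Under these assumed parameters, a 4 MiB read achieves far higher effective throughput than a 4 KiB read, which is why the optimization favors fewer, larger, contiguous reads.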
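The windowing idea can be simulated in a few lines: keep the neurons activated for the last few tokens resident in DRAM, fetch only newly activated neurons from flash, and evict neurons that fall out of the window. This is a toy simulation of the bookkeeping, not the paper's implementation; the class and parameter names are invented for illustration.

```python
from collections import deque

class NeuronWindow:
    """Toy simulation of 'windowing': cache FFN neurons activated for the
    last `window_size` tokens so only newly activated neurons are read
    from flash. (Illustrative, not the paper's implementation.)"""

    def __init__(self, window_size=4):
        self.window = deque(maxlen=window_size)  # per-token sets of activated neuron ids
        self.cached = set()   # neuron ids currently resident in DRAM
        self.flash_reads = 0  # total neurons fetched from flash

    def step(self, activated):
        """Process one token's activated-neuron set; return the ids that
        actually required a flash read."""
        activated = set(activated)
        new = activated - self.cached  # only these are missing from DRAM
        self.flash_reads += len(new)
        if len(self.window) == self.window.maxlen:
            # The oldest token is about to leave the window: evict its
            # neurons unless some newer token in the window still uses them.
            oldest = self.window[0]
            survivors = set().union(*list(self.window)[1:], activated)
            self.cached -= (oldest - survivors)
        self.window.append(activated)  # deque drops the oldest entry itself
        self.cached |= activated
        return new
```

For example, with a window of two tokens, a neuron activated by consecutive tokens is fetched from flash only once and served from DRAM afterwards.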
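Row-column bundling exploits the fact that FFN neuron i needs both row i of the up projection and column i of the down projection, so storing them back-to-back roughly doubles the chunk obtained per flash access. A minimal sketch of that storage layout (function names and shapes are assumptions for illustration):

```python
import numpy as np

def bundle_ffn(W_up, W_down):
    """Lay out FFN weights so neuron i's up-projection row and
    down-projection column sit contiguously: W_up is (d_ff, d_model),
    W_down is (d_model, d_ff); the result is (d_ff, 2 * d_model)."""
    return np.concatenate([W_up, W_down.T], axis=1)

def load_neuron(bundles, i, d_model):
    """One contiguous read of row i returns both pieces for neuron i."""
    chunk = bundles[i]
    return chunk[:d_model], chunk[d_model:]  # up-projection row, down-projection column
```

A single row read from the bundled array replaces two scattered reads, which pairs naturally with the cost model's preference for larger contiguous chunks.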