The proposed methods make it possible to run models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU compared to naive loading approaches. The integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design enables effective inference of LLMs on devices with limited memory. This research could pave the way for more efficient use of LLMs in resource-constrained environments.
Key takeaways:
- The paper presents a method for efficiently running large language models (LLMs) that exceed the available DRAM capacity by storing the model parameters on flash memory and loading them into DRAM on demand.
- The method builds an inference cost model aligned with flash memory behavior, which guides optimization along two axes: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks.
- Two main techniques are introduced: 'windowing', which reduces data transfer by reusing neurons activated for recent tokens, and 'row-column bundling', which stores associated rows and columns of the feed-forward layers together so that each flash read fetches a larger contiguous chunk.
- The methods enable running models up to twice the size of the available DRAM, with a significant increase in inference speed compared to naive loading approaches. This paves the way for effective inference of LLMs on devices with limited memory.
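The intuition behind the cost model can be sketched with a toy formula: each flash read pays a fixed latency plus a bandwidth-proportional term, so larger contiguous chunks amortize the latency and raise effective throughput. The latency and bandwidth figures below are illustrative assumptions, not numbers from the paper.

```python
def read_time_s(chunk_bytes, latency_s=1e-4, bandwidth_bps=2e9):
    """Toy flash read-cost model (assumed numbers): a fixed per-read
    latency plus a size/bandwidth transfer term."""
    return latency_s + chunk_bytes / bandwidth_bps

def effective_throughput_bps(chunk_bytes, **kwargs):
    """Bytes delivered per second of wall time for one read of this size.
    Grows with chunk size because the fixed latency is amortized."""
    return chunk_bytes / read_time_s(chunk_bytes, **kwargs)
```

Under these assumed parameters, a 4 MiB read achieves far higher effective throughput than a 4 KiB read, which is why the optimization favors fewer, larger, contiguous reads.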
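The windowing idea can be simulated in a few lines: keep the neurons activated for the last few tokens resident in DRAM, fetch only newly activated neurons from flash, and evict neurons that fall out of the window. This is a toy simulation of the bookkeeping, not the paper's implementation; the class and parameter names are invented for illustration.

```python
from collections import deque

class NeuronWindow:
    """Toy simulation of 'windowing': cache FFN neurons activated for the
    last `window_size` tokens so only newly activated neurons are read
    from flash. (Illustrative, not the paper's implementation.)"""

    def __init__(self, window_size=4):
        self.window = deque(maxlen=window_size)  # per-token sets of activated neuron ids
        self.cached = set()   # neuron ids currently resident in DRAM
        self.flash_reads = 0  # total neurons fetched from flash

    def step(self, activated):
        """Process one token's activated-neuron set; return the ids that
        actually required a flash read."""
        activated = set(activated)
        new = activated - self.cached  # only these are missing from DRAM
        self.flash_reads += len(new)
        if len(self.window) == self.window.maxlen:
            # The oldest token is about to leave the window: evict its
            # neurons unless some newer token in the window still uses them.
            oldest = self.window[0]
            survivors = set().union(*list(self.window)[1:], activated)
            self.cached -= (oldest - survivors)
        self.window.append(activated)  # deque drops the oldest entry itself
        self.cached |= activated
        return new
```

For example, with a window of two tokens, a neuron activated by consecutive tokens is fetched from flash only once and served from DRAM afterwards.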
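Row-column bundling exploits the fact that FFN neuron i needs both row i of the up projection and column i of the down projection, so storing them back-to-back roughly doubles the chunk obtained per flash access. A minimal sketch of that storage layout (function names and shapes are assumptions for illustration):

```python
import numpy as np

def bundle_ffn(W_up, W_down):
    """Lay out FFN weights so neuron i's up-projection row and
    down-projection column sit contiguously: W_up is (d_ff, d_model),
    W_down is (d_model, d_ff); the result is (d_ff, 2 * d_model)."""
    return np.concatenate([W_up, W_down.T], axis=1)

def load_neuron(bundles, i, d_model):
    """One contiguous read of row i returns both pieces for neuron i."""
    chunk = bundles[i]
    return chunk[:d_model], chunk[d_model:]  # up-projection row, down-projection column
```

A single row read from the bundled array replaces two scattered reads, which pairs naturally with the cost model's preference for larger contiguous chunks.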