
Paper page - LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Dec 20, 2023 - huggingface.co
This paper presents a method for efficiently running large language models (LLMs) that exceed the available DRAM capacity by storing the model parameters on flash memory and bringing them into DRAM as needed. The method involves constructing an inference cost model that reflects the behavior of flash memory, which guides optimization in two key areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
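
To make the trade-off concrete, below is a minimal sketch of such a cost model, assuming a fixed per-read overhead and a hypothetical peak sequential bandwidth; the numbers and function names are placeholders for illustration, not the paper's actual formulation. It shows why both reading fewer bytes and issuing larger, more contiguous reads reduce load time.

```python
def effective_read_bandwidth(chunk_bytes: int) -> float:
    """Assumed throughput curve: small random reads pay a fixed per-read
    overhead, large sequential reads approach peak bandwidth. Returns bytes/s."""
    peak_bytes_per_s = 6e9      # hypothetical peak sequential read bandwidth
    per_read_overhead_s = 1e-4  # hypothetical fixed latency per read request
    transfer_s = chunk_bytes / peak_bytes_per_s
    return chunk_bytes / (transfer_s + per_read_overhead_s)

def load_cost_seconds(total_bytes: int, chunk_bytes: int) -> float:
    """Estimated time to bring `total_bytes` of parameters from flash to DRAM
    when reads are issued in chunks of `chunk_bytes`."""
    return total_bytes / effective_read_bandwidth(chunk_bytes)

# Larger, more contiguous chunks amortize the per-read overhead,
# and transferring fewer bytes helps regardless of chunk size:
print(load_cost_seconds(2 * 1024**3, 4 * 1024))   # many 4 KiB reads -> tens of seconds
print(load_cost_seconds(2 * 1024**3, 1024**2))    # 1 MiB reads      -> under a second
```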

Two main techniques are introduced within this framework. The first, "windowing", strategically reduces data transfer by reusing previously activated neurons. The second, "row-column bundling", increases the size of the data chunks read from flash, playing to flash memory's strength at sequential access. Together, these methods enable running models up to twice the size of the available DRAM, with a 4-5x speedup on CPU and a 20-25x speedup on GPU over naive loading. The paper's integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design could pave the way for effective inference of LLMs on devices with limited memory.
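
As a rough illustration of the windowing idea (the cache class, the per-token sets of active neurons, and the `read_neuron_from_flash` helper are assumptions made for this sketch, not the paper's implementation), the code below keeps the neurons activated over the last few tokens resident in DRAM and only fetches the ones that are newly needed:

```python
from collections import deque

class NeuronWindowCache:
    """Sketch: keep FFN neurons activated in the last `window_size` tokens in DRAM."""

    def __init__(self, window_size: int):
        self.window_size = window_size
        self.recent_tokens = deque()           # per-token sets of active neuron ids
        self.resident: dict[int, object] = {}  # neuron id -> weights held in DRAM

    def step(self, active_neurons: set[int], read_neuron_from_flash):
        # Load only neurons that are active now but not already resident.
        for nid in active_neurons - self.resident.keys():
            self.resident[nid] = read_neuron_from_flash(nid)

        # Slide the window forward by one token.
        self.recent_tokens.append(active_neurons)
        if len(self.recent_tokens) > self.window_size:
            self.recent_tokens.popleft()

        # Evict neurons no longer referenced by any token in the window.
        still_needed = set().union(*self.recent_tokens)
        for nid in list(self.resident):
            if nid not in still_needed:
                del self.resident[nid]

        return {nid: self.resident[nid] for nid in active_neurons}
```

Because consecutive tokens tend to activate overlapping sets of neurons, only a small incremental amount of weight data has to cross the flash-to-DRAM boundary per token.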

Key takeaways:

  • The paper presents a method for efficiently running large language models (LLMs) that exceed the available DRAM capacity by storing the model parameters on flash memory and bringing them into DRAM on demand.
  • The method involves constructing an inference cost model that reflects the behavior of flash memory, which guides optimization in two areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks.
  • Two main techniques are introduced: "windowing", which strategically reduces data transfer by reusing previously activated neurons, and "row-column bundling", which increases the size of the data chunks read from flash memory (see the sketch after this list).
  • The proposed methods enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively.
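
To make the bundling takeaway concrete, here is a minimal sketch (the storage layout, file name, and helper functions are illustrative assumptions, not the paper's code) that stores each FFN neuron's up-projection column and down-projection row contiguously, so one sequential read retrieves both when the neuron is activated:

```python
import numpy as np

def bundle_ffn_weights(w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """w_up: (d_model, d_ff), w_down: (d_ff, d_model).
    Returns an array of shape (d_ff, 2 * d_model): one contiguous bundle per neuron."""
    return np.concatenate([w_up.T, w_down], axis=1)

def read_neuron_bundle(bundles: np.memmap, neuron_id: int, d_model: int):
    """One sequential read per activated neuron instead of two scattered reads."""
    row = np.array(bundles[neuron_id])          # contiguous slice of 2 * d_model values
    return row[:d_model], row[d_model:]         # up-projection column, down-projection row

# Example: bundle, persist as a memory-mapped file, and fetch one neuron.
d_model, d_ff = 8, 32
w_up, w_down = np.random.randn(d_model, d_ff), np.random.randn(d_ff, d_model)
bundles = bundle_ffn_weights(w_up, w_down).astype(np.float32)
bundles.tofile("ffn_bundles.bin")
mm = np.memmap("ffn_bundles.bin", dtype=np.float32, mode="r", shape=(d_ff, 2 * d_model))
up_col, down_row = read_neuron_bundle(mm, neuron_id=5, d_model=d_model)
```

Without bundling, the same lookup would require two separate, non-contiguous reads of d_model values each; the bundled layout doubles the chunk size per read, which suits flash memory's preference for larger sequential accesses.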