
QuIP#

Dec 09, 2023 - cornell-relaxml.github.io
Researchers have developed a method called QuIP# that compresses large language models (LLMs) without significantly affecting their performance. The method uses lattice codebooks and incoherence processing to create 2-bit quantized models, which are dramatically smaller than their 16-bit counterparts. For instance, Llama 2 70B, which has 70 billion parameters and requires about 140GB of memory at 16-bit precision, could fit on a single 24GB GPU using this method. The researchers claim that QuIP# significantly closes the performance gap between 2-bit quantized LLMs and unquantized 16-bit models.
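The memory figures above follow from simple arithmetic, sketched here as a back-of-the-envelope Python calculation (the function name is illustrative, and activations, the KV cache, and quantization metadata overhead are ignored):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Storage needed for the model weights alone, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

llama2_70b = 70e9
print(weight_memory_gb(llama2_70b, 16))  # 16-bit weights: 140.0 GB
print(weight_memory_gb(llama2_70b, 2))   # 2-bit weights: 17.5 GB, under a 24GB GPU
```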

QuIP# relies on two main components: incoherence processing and lattice codebooks. Incoherence processing makes the weights of the model more suitable for quantization by making them approximately Gaussian-distributed. Lattice codebooks, on the other hand, are used to quantize these weights. The researchers used a new lattice codebook based on the E_8 lattice for this purpose. The results showed that QuIP# achieved near-native performance at 2 bits across all Llama 1 and 2 models on both language modeling and zero-shot tasks.
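The rotation idea behind incoherence processing can be sketched as a toy example. This is an illustration under assumptions, not the QuIP# implementation: the actual method uses structured randomized transforms for efficiency, while here a dense random orthogonal matrix stands in.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def max_over_rms(M):
    # Crude "spikiness" measure: largest entry relative to the RMS entry.
    return np.abs(M).max() / np.sqrt((M ** 2).mean())

n = 256
# Synthetic weight matrix with outlier columns (hard to quantize directly).
W = rng.standard_normal((n, n)) * np.geomspace(1, 100, n)
U, V = random_orthogonal(n), random_orthogonal(n)

# Rotating on both sides spreads outliers out; entries look more Gaussian.
W_inc = U @ W @ V.T
print(max_over_rms(W), max_over_rms(W_inc))  # spikiness typically shrinks

# The rotation is exactly invertible, so no information is lost.
assert np.allclose(W, U.T @ W_inc @ V)
```

The rotated matrix is what gets quantized; at inference time the rotations are undone (or folded into adjacent layers).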

Key takeaways:

  • The researchers introduced QuIP#, a method that combines lattice codebooks with incoherence processing to create state-of-the-art 2-bit quantized models for large language models (LLMs).
  • QuIP# significantly closes the gap between 2-bit quantized LLMs and unquantized 16-bit models, allowing even large models like Llama 2 70B to fit on a single 24GB GPU.
  • The method relies on two main components: incoherence processing and lattice codebooks. Incoherence processing makes the weight and Hessian matrices incoherent, improving quantization performance, while lattice codebooks take advantage of the "roundness" of incoherence-processed weights for efficient quantization.
  • QuIP# achieves near-native performance at 2 bits on language modeling and zero-shot tasks across all Llama 1 and 2 models, providing a solution to the challenges of storing and serving large language models.
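To give a concrete sense of what quantizing to a lattice means, here is a minimal sketch of the classic Conway-Sloane nearest-point routine for E_8, which uses the fact that E_8 is the union of D_8 (integer vectors with even coordinate sum) and D_8 shifted by 1/2 in every coordinate. This is an illustrative decoder, not QuIP#'s codebook code, which additionally compresses the chosen lattice points into 2-bit indices.

```python
import numpy as np

def nearest_D8(x):
    # Nearest point in D_8: round each coordinate; if the coordinate sum is
    # odd, re-round the coordinate with the largest rounding error the other
    # way, which flips the parity at minimal extra cost.
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] - f[i] >= 0 else -1.0
    return f

def nearest_E8(x):
    # E_8 = D_8 union (D_8 + 1/2): decode into both cosets, keep the closer.
    a = nearest_D8(x)
    b = nearest_D8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

x = np.array([0.3, -0.2, 0.6, 0.1, -0.4, 0.45, 0.2, -0.1])
print(nearest_E8(x))  # quantizes 8 weights at once to one lattice point
```

Quantizing eight coordinates jointly like this is what exploits the "roundness" of incoherence-processed weights; scalar rounding, by contrast, treats each weight independently.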
