
GitHub - IST-DASLab/QUIK: Repository for the QUIK project, enabling the use of 4bit kernels for generative inference

Nov 07, 2023 - github.com
The repository contains the code for QUIK, a method that quantizes the majority of weights and activations to 4-bit post-training. The method is detailed in a paper available on arXiv. Installation requires dependencies such as cmake, a C++ compiler, and nvcc; the instructions involve cloning the repository and installing it with pip.

The repository provides examples, such as a LLaMA example and linear layer benchmarks. To adapt a model to QUIK, one first quantizes the model weights using the GPTQ algorithm, then creates QUIK Linear layers that replace the original Linear layers. The repository also includes a fake quantization example. The full paper detailing QUIK is available on arXiv, and a citation is provided.
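
To illustrate the adaptation step, here is a minimal Python sketch of replacing nn.Linear modules with 4-bit quantized equivalents. The QuantLinear class and swap_linear_layers helper are hypothetical stand-ins written for this summary, not the repository's actual API, and the simple round-to-nearest scheme shown stands in for the GPTQ weight quantization the project actually uses.

```python
import torch
import torch.nn as nn


class QuantLinear(nn.Module):
    """Placeholder for a 4-bit QUIK-style linear layer (illustrative only)."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        # A real implementation would store packed 4-bit weights plus scales;
        # int8 storage here keeps the sketch simple.
        self.register_buffer("qweight", torch.empty(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scales", torch.ones(out_features, 1))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    @classmethod
    def from_float(cls, linear: nn.Linear):
        """Build a quantized layer from an existing float nn.Linear."""
        layer = cls(linear.in_features, linear.out_features, linear.bias is not None)
        w = linear.weight.data
        # Toy per-row symmetric round-to-nearest into the 4-bit range [-8, 7];
        # the paper's method quantizes weights with GPTQ instead.
        scales = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
        layer.qweight.copy_((w / scales).round().clamp(-8, 7).to(torch.int8))
        layer.scales.copy_(scales)
        if linear.bias is not None:
            layer.bias.data.copy_(linear.bias.data)
        return layer

    def forward(self, x):
        # Dequantize-and-matmul stand-in; the real kernels compute in 4-bit.
        w = self.qweight.float() * self.scales
        return nn.functional.linear(x, w, self.bias)


def swap_linear_layers(model: nn.Module) -> nn.Module:
    """Recursively replace nn.Linear modules with quantized equivalents."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QuantLinear.from_float(child))
        else:
            swap_linear_layers(child)
    return model
```

In this sketch, swap_linear_layers(model) would walk a loaded LLaMA-style model and swap in the quantized layers; with the actual library, the replacement layer would instead wrap QUIK's 4-bit kernels and GPTQ-quantized weights.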

Key takeaways:

  • QUIK is a method for quantizing the majority of weights and activations to 4-bit post-training; it is described in a paper available on arXiv.
  • The repository contains the code for QUIK and provides detailed instructions on how to install and use it, including dependencies and command lines.
  • Examples are provided for different use cases, such as the LLaMA example, linear layer benchmarks, model adaptation to QUIK, and fake quantization (a generic sketch of the fake-quantization idea follows this list).
  • The full citation for the paper describing QUIK is provided; it is authored by Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh, and is dated 2023.
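
As referenced in the list above, fake quantization simulates the 4-bit rounding in floating point so accuracy can be studied without the custom CUDA kernels. The snippet below is a generic Python illustration of that idea, not the repository's exact procedure.

```python
import torch


def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    """Simulate 4-bit weight quantization in floating point.

    Per-row symmetric round-to-nearest: values are scaled into the signed
    4-bit integer range [-8, 7], rounded, then scaled back, so the tensor
    stays in floating point but carries the 4-bit rounding error.
    """
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    return (w / scales).round().clamp(-8, 7) * scales


if __name__ == "__main__":
    weight = torch.randn(4, 8)
    error = (weight - fake_quantize_4bit(weight)).abs().max()
    print(f"max absolute rounding error: {error:.4f}")
```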