ExLlamaV2 has been benchmarked against V1 and shows faster generation across different models and GPUs. The library relies on a Torch C++ extension for its CUDA functions, which is compiled at runtime. Future plans for ExLlamaV2 include porting over remaining features from V1, a PyPI package with prebuilt extensions, LoRA support, a web UI, a web server, and more samplers. A few EXL2-quantized models have been uploaded to HuggingFace for users to experiment with.
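The announcement doesn't detail ExLlamaV2's build setup, but PyTorch's standard mechanism for compiling a C++/CUDA extension at runtime is `torch.utils.cpp_extension`. The minimal sketch below uses `load_inline` to JIT-compile a trivial C++ function on first use; the module name and function are illustrative only, not ExLlamaV2's actual sources.

```python
# Minimal sketch of runtime compilation of a Torch C++ extension,
# the same general mechanism a library can use for its CUDA kernels.
# The extension name and function here are purely illustrative.
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>

// Trivial example function: returns the input tensor doubled.
torch::Tensor double_tensor(torch::Tensor x) {
    return x * 2;
}
"""

ext = load_inline(
    name="example_ext",            # hypothetical module name
    cpp_sources=cpp_source,
    functions=["double_tensor"],   # auto-generates the Python bindings
    verbose=True,                  # print the build log on first compile
)

print(ext.double_tensor(torch.ones(4)))  # tensor([2., 2., 2., 2.])
```

On first import the extension is built with the system compiler and cached, so subsequent runs skip the compile step.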
Key takeaways:
- ExLlamaV2 is the initial release of an inference library for running local LLMs on modern consumer GPUs; it still needs a lot of testing and tuning.
- Compared to V1, ExLlamaV2 offers faster kernels, a cleaner and more versatile codebase, and support for a new quant format.
- ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format that allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight (see the sketch after this list).
- Several features still need to be ported over from V1, and other planned additions include a PyPI package with prebuilt extensions, LoRA support, an example web UI, a web server, and more samplers.
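To illustrate what "any average bitrate" means in practice: if each layer is quantized at its own bit width, the model's overall bitrate is simply the weight-count-weighted average of those choices. The layer names, sizes, and bit widths below are made up for illustration and are not EXL2's actual measurement or optimization pass.

```python
# Hypothetical per-layer quantization choices: (layer name, weight count, bits per weight).
layers = [
    ("attn.q_proj",   16_777_216, 4.0),
    ("attn.k_proj",   16_777_216, 3.0),
    ("mlp.up_proj",   45_088_768, 2.5),
    ("mlp.down_proj", 45_088_768, 3.5),
]

# Average bitrate = total bits stored / total number of weights.
total_bits = sum(count * bpw for _, count, bpw in layers)
total_weights = sum(count for _, count, _ in layers)
print(f"average bitrate: {total_bits / total_weights:.3f} bits per weight")
```

Because the per-layer bit widths can be varied independently, the average can land anywhere in the supported 2 to 8 bits-per-weight range rather than only at fixed steps like 3, 4, or 8 bits.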