ExLlamaV2 has been benchmarked against V1 and shows faster generation across different models and GPUs. The library relies on a Torch C++ extension for its CUDA functions, which is compiled at runtime. Future plans for ExLlamaV2 include porting over remaining features from V1, a PyPI package with prebuilt extensions, LoRA support, a web UI, a web server, and more samplers. A few EXL2-quantized models have been uploaded to HuggingFace for users to experiment with.
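The announcement doesn't detail ExLlamaV2's build setup, but PyTorch's standard mechanism for compiling a C++/CUDA extension at runtime is `torch.utils.cpp_extension`. The minimal sketch below uses `load_inline` to JIT-compile a trivial C++ function on first use; the module name and function are illustrative only, not ExLlamaV2's actual sources.

```python
# Minimal sketch of runtime compilation of a Torch C++ extension,
# the same general mechanism a library can use for its CUDA kernels.
# The extension name and function here are purely illustrative.
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>

// Trivial example function: returns the input tensor doubled.
torch::Tensor double_tensor(torch::Tensor x) {
    return x * 2;
}
"""

ext = load_inline(
    name="example_ext",            # hypothetical module name
    cpp_sources=cpp_source,
    functions=["double_tensor"],   # auto-generates the Python bindings
    verbose=True,                  # print the build log on first compile
)

print(ext.double_tensor(torch.ones(4)))  # tensor([2., 2., 2., 2.])
```

On first import the extension is built with the system compiler and cached, so subsequent runs skip the compile step.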
Key takeaways:
- ExLlamaV2 is the initial release of an inference library for running local LLMs on modern consumer GPUs; it still needs a lot of testing and tuning.
- Compared to V1, ExLlamaV2 offers faster kernels, a cleaner and more versatile codebase, and support for a new quant format.
- ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format that allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight (see the sketch after this list).
- Several features still need to be ported over from V1, and other planned additions include a PyPI package with prebuilt extensions, LoRA support, an example web UI, a web server, and more samplers.
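To illustrate what "any average bitrate" means in practice: if each layer is quantized at its own bit width, the model's overall bitrate is simply the weight-count-weighted average of those choices. The layer names, sizes, and bit widths below are made up for illustration and are not EXL2's actual measurement or optimization pass.

```python
# Hypothetical per-layer quantization choices: (layer name, weight count, bits per weight).
layers = [
    ("attn.q_proj",   16_777_216, 4.0),
    ("attn.k_proj",   16_777_216, 3.0),
    ("mlp.up_proj",   45_088_768, 2.5),
    ("mlp.down_proj", 45_088_768, 3.5),
]

# Average bitrate = total bits stored / total number of weights.
total_bits = sum(count * bpw for _, count, bpw in layers)
total_weights = sum(count for _, count, _ in layers)
print(f"average bitrate: {total_bits / total_weights:.3f} bits per weight")
```

Because the per-layer bit widths can be varied independently, the average can land anywhere in the supported 2 to 8 bits-per-weight range rather than only at fixed steps like 3, 4, or 8 bits.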