The API was created out of frustration with vLLM, which caused problems when running large models across multiple cards. EricLLM is multi-threaded and rivals state-of-the-art solutions for throughput and model loading time. It uses the ExllamaV2 engine and can be run inside the text-generation-webui environment. It is still in development and has some rough edges to polish. Although it currently runs at roughly one third the speed of vLLM, it remains one of the fastest batching APIs available and supports the exl2 format with variable bitrate.
Key takeaways:
- EricLLM is a fast batching API designed to serve LLM models, offering a feature-compatible alternative to vLLM with better performance than text-generation-webui.
- The API can be launched with multiple workers, and in a dual-GPU setup throughput can be increased further with the --gpu_balance switch.
- EricLLM supports options for setting the model directory, host, port, maximum sequence length, maximum input length, GPU split, and the number of worker processes to use (see the launch sketch after this list).
- Despite being slower than vLLM, EricLLM is one of the fastest batching APIs currently available, supporting the exl2 format with variable bitrate and loading models faster than vLLM.
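
As a rough illustration of the launch options above, here is a minimal sketch of starting the server. Apart from --gpu_balance, which is named explicitly, the exact flag spellings (--model_directory, --host, --port, --max_seq_len, --max_input_len, --gpu_split, --num_workers) and the script name ericLLM.py are assumptions inferred from the options described, so check the project's README for the actual interface.

```bash
# Hypothetical launch sketch: flag names other than --gpu_balance are assumed,
# not confirmed here; consult the EricLLM README for the real options.
# --gpu_split gives a per-GPU memory allocation for a dual-GPU setup, and
# --gpu_balance spreads the worker processes across both cards.
python ericLLM.py \
    --model_directory /models/my-model-exl2 \
    --host 0.0.0.0 \
    --port 8000 \
    --max_seq_len 4096 \
    --max_input_len 2048 \
    --gpu_split 20,24 \
    --num_workers 4 \
    --gpu_balance
```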